Okay guys, so I have only 15 minutes, so I will try to go as fast as possible. Today we are going to discuss web scraping and data analysis using Selenium and Python. Before I start, may I know how many of you are working with Selenium and Python? Okay, not many. I'm going to do this entire exercise in an IPython notebook; I will tell you later what the IPython notebook is. So we are going to discuss web scraping using Selenium. So far we have seen Selenium used as a web browser automation tool in the field of automation testing. But in this session, we are going to see how Selenium can be used as a web scraping tool and how we can extract data with it. I'm going to do a small exercise: if time permits, I'm going to take an IMDb page, extract the data from that page, and we will see some interesting facts coming out of our analysis.
Okay, so data science is one of the areas which has got a lot of attention in the last couple of years, and data scientists use a different set of tools in their day-to-day work. Some of the tools are listed here. These are basically Python-oriented tools. You will see there is always a fight between Python and R, but both of them pretty much do the same thing. A lot of time is spent on extracting and cleaning the data. If you ever get a chance to talk to a data scientist, you'll learn that around 60 to 70% of their time is spent on extracting or cleaning the data. The reason is that data is collected from different sources, whether it's the web, a database, a CSV or an Excel file; then it is thoroughly cleaned, and later on it is used for exploration, because the main objective of any data science project is to find the hidden trends and patterns within the data.
Okay, so coming to web data extraction, one of the best ways to extract data from the web is through APIs, because APIs provide the data in a more structured format, JSON or XML. So if you have to run a sentiment analysis on Twitter data, you can use the Twitter APIs to get the data from Twitter. But what if there are no APIs for your website? What will you do in that case? Web scraping is an excellent way to extract the unstructured data from the web and then transform it into a structured format: Excel, CSV, a database or a text file. After that you can use it for your purpose, like analysis of the data or anything else. So in our session today, we are going to see how we can scrape the data with the help of Selenium, then we'll convert the data into a structured, more meaningful format, and then we'll do some analysis on top of that.
Okay, so let's talk about the IPython notebook. The IPython notebook is basically a Python interpreter that is tightly integrated with your operating system shell. It provides a web environment where you can execute your Python code. It is also integrated with the Matplotlib library, which helps you plot graphs and visualize your data in the web browser. Let me see if I can open my GitHub account. Actually it's not my computer, so I'm having a hard time using it. Okay, so it opened. You can see it provides a cell-like structure. You can write your Python code here, and all the results will be displayed beneath the cell. It also helps you plot beautiful graphs with the help of Matplotlib. I will just go down and show you; yeah, you can see here. So you can import your data, do analysis on it, and then visualize it using Matplotlib.
Okay, so now going back to our presentation, next we're going to use Pandas. Pandas is a data analysis library in Python, and it provides rich data structures in the form of Series and DataFrame. In our session we are going to use data frames, and we'll see how, using Pandas functions, we can analyze the data that we are going to extract from the web. We talked about Matplotlib: it's a visualization library, and it helps you plot graphs in the IPython notebook.
Okay, so there are four simple steps that I'm going to do today. First, I'm going to extract the data from an IMDb page with the help of Selenium. Then I'm going to clean that data and put it into a more structured format. Then I'm going to consume that data with the help of a Pandas data frame, and then we'll do some analysis on top of it using Pandas functions. Sorry, I'm going fast.
Okay, so the data which I'm going to use today for our analysis is all the movies which have won Best Picture at the Filmfare Awards in the last 65 years. For anyone who doesn't know about Filmfare, it's one of the highest awards in the area of film and music in India. So I'm taking the data for all the movies between 1955 and 2015, and we'll do some analysis on that. Okay, so as we discussed, I'm going to import the web driver, Matplotlib, and Pandas for our exercise today. I don't have any control over this. Okay, is it okay now? Okay, fine. So I'm going to open a Chrome driver and... Sorry? Previous slide, okay. Here it is. Yep. Okay, so I'm going to open the Chrome driver. Wait, let me show you this page also, if I can... Okay, I'm not sure if I can show you this web page, so I will tell you what it contains. It contains the list of the movies which have won Best Picture at Filmfare in the last 65 years.
And it contains data like the name of the movie, the year in which it was released, the director of the movie, ratings, votes, and a lot of other things. Okay, so I'm going to open a Chrome driver here; I'm sorry, a Chrome browser. If you want, you can also do it in a headless browser such as PhantomJS. And I'm going to open this IMDb page. Now the first thing we have to do is get the data from that web page. So this is a small Python function that I have written to get the data from that page. If you see, it is very simple to extract the data using Selenium: you just have to use one function here, find element by XPath. You just provide the address of the data that you want to fetch, and it does everything for you. The data which I'm fetching today is the movie name, the rating, the runtime (the duration of the movie), the number of votes, the director and the genre. So these are the data that I'm going to fetch from that web page.
Okay, so if you compare Selenium with other libraries which are also used for web scraping, like BeautifulSoup, Requests or Scrapy, the major difference is that Selenium works better with web pages that rely on JavaScript. That is because Selenium drives a real browser and waits for the page to load, whereas the other libraries work on the static HTML only. And one more important thing is that Selenium is very easy to use. You can see that with the help of just one function, I'm able to extract everything with Selenium. Okay, so once I have extracted the data, it looks like this. You have all the data in Python lists now: there is an individual list for all the movie names, one for all the ratings, one for all the directors, and so on. Okay, but these lists are not linked to each other.
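The extraction step described above might look roughly like the following sketch. It is not the speaker's function: the XPath expressions are placeholders, since the talk does not show them, and the names `XPATHS`, `column_text` and `scrape_columns` are made up for illustration. The talk's "find element by XPath" matches the older `find_elements_by_xpath` style; current Selenium 4 releases spell the same lookup as `find_elements(By.XPATH, ...)`, and the sketch passes the string `"xpath"`, which is the value of `By.XPATH`.

```python
# Hypothetical XPath expressions; the real ones depend on the page markup.
XPATHS = {
    "name": "//h3[@class='title']/a",
    "rating": "//span[@class='rating']",
    "votes": "//span[@name='votes']",
}


def column_text(driver, xpath):
    """Return the visible text of every element matched by an XPath expression."""
    # "xpath" is the string behind Selenium's By.XPATH constant.
    return [element.text for element in driver.find_elements("xpath", xpath)]


def scrape_columns(url, xpaths):
    """Open the page in Chrome and return one list of strings per field."""
    # Imported lazily so that column_text can be tried without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # needs a local Chrome installation
    try:
        driver.get(url)
        return {field: column_text(driver, xp) for field, xp in xpaths.items()}
    finally:
        driver.quit()  # always close the browser, even if a lookup fails
```

Calling `scrape_columns(url, XPATHS)` with the address of the page would return one list per field, which is the "separate lists" shape the talk shows at this point.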
So for a given movie, if I want to know the year in which it was released, its rating, or who its director is, I cannot get that from this data. So we have to link this data now. The best way to do it is to store it in a Python dictionary. A Python dictionary is basically a set of key-value pairs: you give it a key and it gives you the corresponding value. So I'm going to put all this data into a dictionary now. After putting it into a dictionary, this is how it looks. Now it makes sense.
Okay, but something is still wrong with this data. If you look at the director field, I have some unwanted text, "Director:". If you look at the votes, there is a comma in between the digits and the value is in string format. So this is not in the right format for me to do any data manipulation or aggregation. To get it into the right shape, I have to clean this data. There is a small piece of code that I've written for that. We are also taking care of the null values: for some of the movies, some of the data is missing, so for those movies we are going to put null values there. Okay, so after cleaning the data, this is how it looks. Now it looks better. You have the votes, runtime, genre, movie name, rating, everything in the right format now. Okay, I have five minutes only, so I will try to go fast. I just wanted to show you how the entire data looks in a dictionary, but I think I cannot show you. Sorry for that. Okay, so now we have the data in the right format; we have corrected everything and cleaned the data. Now we have to get this data into a pandas data frame to do the data analysis.
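A minimal sketch of the linking and cleaning steps, assuming the scraped values arrive as parallel lists of strings. The sample values, the field names and the helper names (`link_records`, `to_number`, `clean_record`) are invented for illustration; the speaker's actual cleaning code is not shown in the talk.

```python
import re


def link_records(columns):
    """Turn {"name": [...], "votes": [...]} into one dictionary per movie."""
    fields = list(columns)
    # zip stops at the shortest list, so the lists are assumed to be aligned.
    return [dict(zip(fields, row)) for row in zip(*columns.values())]


def to_number(text, cast):
    """Drop thousands separators and units; return None when no number is found."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return cast(match.group()) if match else None


def clean_record(record):
    """Strip the unwanted label and convert numeric strings to numbers."""
    return {
        "name": record["name"].strip(),
        "director": record["director"].replace("Director:", "").strip() or None,
        "votes": to_number(record["votes"], int),
        "runtime": to_number(record["runtime"], int),
        "rating": to_number(record["rating"], float),
    }


# Made-up sample in the shape the talk describes: one list per field.
raw = {
    "name": ["Movie A", "Movie B"],
    "director": ["Director: Person X", "Director: Person Y"],
    "votes": ["10,234", "82,001"],
    "runtime": ["172 min", ""],  # second runtime is missing
    "rating": ["8.4", "7.9"],
}

movies = [clean_record(r) for r in link_records(raw)]
print(movies[0])
```

Missing values come out as `None`, which pandas later treats as a null value when the records are loaded into a data frame.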
A pandas data frame is basically a spreadsheet-like structure which gives you rows and columns where you can put your data, and then using its functions you can do some analysis on that. Once you import your data into a pandas data frame, this is how it looks. You can see it is in a tabular format, and whatever fields you saw in the dictionary are now imported into this tabular format. Okay, now there are some missing values in this data, and pandas provides a beautiful way to handle missing or null values. We can see that these three movies don't have any value for the runtime. So what should we do? We are going to take the mean of whatever data is available for this particular column, runtime, and replace the null values with that mean. Once you do that, this is how it looks: I replaced them with the mean runtime across the data set.
Okay, so now we are ready to do our analysis, because the data is in a pandas data frame and in the right format; the missing values and the data cleaning have all been taken care of. So let's do some analysis on this data. I have around 61 movies and data for all of them. Now I'm going to see which movies have the highest rating. Okay, so these are the top five movies by rating. One interesting thing to note here is the year of the movie: most of these movies are from before 1985 or 1990. There is only one movie which falls in the 2000s, in the present day. So you see, in the earlier days, in the classical days, we used to make much better movies than what we make in the present day. This is what we can get from this data.
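As a rough illustration of loading the records into a data frame, filling missing runtimes with the column mean and listing the highest-rated movies, here is a small sketch. The three records are made-up stand-ins for the scraped data, and the column names are assumptions.

```python
import pandas as pd

# Made-up records; None marks a missing runtime.
records = [
    {"name": "Movie A", "year": 1960, "rating": 8.4, "runtime": 170.0},
    {"name": "Movie B", "year": 1975, "rating": 8.1, "runtime": None},
    {"name": "Movie C", "year": 2001, "rating": 7.2, "runtime": 190.0},
]

df = pd.DataFrame(records)  # None in a numeric column becomes NaN

# Series.mean() skips NaN by default, so only the known runtimes are averaged.
mean_runtime = df["runtime"].mean()
df["runtime"] = df["runtime"].fillna(mean_runtime)

# Highest-rated movies first (the talk shows the top five).
top_rated = df.nlargest(2, "rating")
print(top_rated[["name", "year", "rating"]])
```

Filling with the mean keeps every row available for later aggregation, at the cost of slightly flattening the runtime distribution.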
So if you compare votes and ratings, the movie which has the highest rating here has around 10,000 votes, and the one in the third row has 82,000 votes. So a higher number of votes does not always mean a higher rating; the rating doesn't correspond to the vote count. Okay, now I'm plotting a graph for the trend: I want to see the trend of the ratings over the last 65 years. If you look at this graph, you can clearly say that before 1985 or 1990 we used to make much better quality movies than we make at present. You can look at the peaks of the graph.
Okay, we talked about the movies with the highest ratings; now let's talk about the movies with the lowest ratings. So these are the movies with the lowest ratings. Again, if you look at the year here, nearly all the movies are after 1990. So we started making poorer quality movies after 1990. There is only one movie, the fourth one, which is before 1980.
Okay, let's look at the movies with the maximum runtime. We have 61 movies, and these are the top 10 with the maximum runtime. Again, if you look at the year, these are all after 1990. So what we can conclude from here is that as the ratings of the movies are going down, the runtime is increasing. So we tend to watch longer movies, but quality-wise those movies are not as good as in the earlier days. Again, if you look at the trend here, the peaks are all in the 2000s. So in the earlier days, we used to make shorter movies of better quality. Now if you look at the average runtime of these 61 movies, it comes to around 162 minutes, which is a little over two and a half hours. So usually we Indians spend around two and a half hours watching a movie.
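The rankings and averages discussed above can be sketched with a few pandas calls. The data frame below is invented sample data with assumed column names, and the correlation at the end is an extra check that the talk itself does not perform.

```python
import pandas as pd

# Invented sample data standing in for the 61 scraped movies.
df = pd.DataFrame({
    "name": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "year": [1960, 1975, 1995, 2010],
    "rating": [8.4, 8.1, 7.0, 6.8],
    "runtime": [150, 160, 185, 190],
})

lowest_rated = df.nsmallest(2, "rating")   # lowest-rated movies
longest = df.nlargest(2, "runtime")        # movies with the maximum runtime
average_runtime = df["runtime"].mean()     # mean runtime in minutes

# Rating by year in chronological order, ready for a line plot.
trend = df.sort_values("year").set_index("year")["rating"]

# Not in the talk: a negative value means longer movies tend to rate lower.
runtime_vs_rating = df["runtime"].corr(df["rating"])
```

In a notebook, `trend.plot()` would draw the rating trend as a line chart, assuming Matplotlib is installed.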
Okay, let's look at the IMDb ratings of these movies. There are around 56 movies out of 61 which have a rating greater than seven, which is quite good. Now if we visualize this with a bar graph, we can see that there are around 32 movies with a rating between seven and eight. There are 25 movies which have a rating of eight or greater. And there is only a small percentage of movies between six and seven. Any IMDb rating greater than seven or eight is considered very good. Here I'm visualizing it with a pie chart, and I can see that around 90% of the movies have a rating of more than seven.
So let's look at the movies by genre. What we can see here is that two genres stand out, drama and musical. Of all the movies which have won Best Picture, most fall into the drama and musical categories. Since these are Bollywood movies, we can say that music is obvious here. And we are less likely to watch family and comedy movies; you can see that from this graph. Okay, so let's look at some of the best directors, who have won the Filmfare more than once. These are the directors whose movies have won the Filmfare more than once. So next time they make a movie, we can go directly to watch it without looking at the reviews.
Okay, so we have done the entire journey of extracting the data from a web page, cleaning that data, putting it into a structured format, consuming it with a pandas data frame, and then doing analysis on it. So what have we concluded finally from all this exercise? Any movie which is likely to be selected for Best Picture at Filmfare has a rating greater than seven, has a runtime of more than two hours, and falls in the drama or musical category.
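The rating buckets, genre counts and repeat-winner list described above could be computed along these lines. This is only a sketch on invented data: the column names are assumptions, and it assumes one genre per movie, which real IMDb listings often do not satisfy.

```python
import pandas as pd

# Invented sample data; one genre per movie is a simplifying assumption.
df = pd.DataFrame({
    "director": ["Person X", "Person Y", "Person X", "Person Z", "Person Y"],
    "genre": ["Drama", "Musical", "Drama", "Comedy", "Drama"],
    "rating": [8.4, 7.5, 6.9, 7.1, 8.0],
})

# Left-closed bins: [6, 7), [7, 8), [8, 10).
buckets = pd.cut(
    df["rating"],
    bins=[6, 7, 8, 10],
    right=False,
    labels=["6 to 7", "7 to 8", "8 and above"],
)
rating_counts = buckets.value_counts().sort_index()

# Number of winning movies per genre, most common first.
genre_counts = df["genre"].value_counts()

# Directors with more than one winning movie.
wins = df["director"].value_counts()
repeat_winners = wins[wins > 1]
```

In a notebook, `rating_counts.plot(kind="bar")` and `rating_counts.plot(kind="pie")` would give bar and pie charts of the same buckets, assuming Matplotlib is installed.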
So next time you see a movie which falls under these three criteria, it is likely to get a Filmfare award. So that's all I have for today, and thanks for watching. I don't think I have time for questions and answers. Thank you all.