So, let's start. What we are going to be covering in this session is handling data: partly programmatically, partly manually, and I am going to focus more on the programmatic methods. If you take the cycle of data, first you need to get it, and once you've got it you've got to format it a bit, and that's when you start your analysis. We will be talking about ways in which we can find data and ways in which we can format data, but not ways in which we can analyze data. That's something we will start on in the next session. Now, if you want to find data, the easiest way is to ask, if someone knows where it is, because that will be the quickest way of getting it. Or you could search for it: I imagine most of you would do a Google search, or search through any special websites that you may have access to. Or you could crawl the data. Crawling is the term for pulling data off of multiple web pages; that's what most search engines do, page after page, and you could do something similar. If, for instance, you have a series of tables sitting on a website, then you could go to each page one by one, copy and paste the data into Excel, and get the data. That's the manual version of crawling. If you wrote a program to do that, that program would be doing your crawling. And associated with that is scraping. Scraping is the part where, when you paste into Excel, it recognizes that it's a table and puts it into a structure. If it were an HTML table and you opened it in, say, Google Chrome and copied and pasted that table into Excel, it would recognize it as a table, which is fine. Sometimes the data isn't in the form of a table. It's just all over the place; it's maybe a list of items.
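Since the point above is that a program can do the crawling and scraping for you, here is a minimal sketch in Python (not something shown in the session itself) of the scraping half: pulling the rows of an HTML table into a structured list. The sample page is made up; a real crawler would fetch many such pages with urllib.request and feed each one to a parser like this.

```python
from html.parser import HTMLParser

# A tiny sample page standing in for a real site. A crawler would
# download many pages and feed each page's HTML to this parser.
PAGE = """
<table>
  <tr><td>The Godfather</td><td>9.2</td></tr>
  <tr><td>Casablanca</td><td>8.5</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collects each <tr> as a list of its <td> cell texts."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []          # start a fresh row
        elif tag == "td":
            self.in_cell = True    # remember we're inside a cell

    def handle_endtag(self, tag):
        if tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:           # only keep text that sits inside <td>
            self.row.append(data.strip())

scraper = TableScraper()
scraper.feed(PAGE)
print(scraper.rows)  # [['The Godfather', '9.2'], ['Casablanca', '8.5']]
```

The same rows could then be written out as CSV and opened in Excel, which is exactly the manual copy-paste step, automated.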
So if you had, say, a list of movies, the first line may be the movie title, the second line the director's name, the third line the list of actors, the fourth line the year of release, and then the next line will be the next movie. It's not quite the same kind of structure, so you have to convert it into a format that is more usable, more structured, and that's part of what scraping does. You could do this manually, line by line, or you could automate it. Lastly, sometimes data is available through an API. An API is an application programming interface, which really means that there's a program sitting at the other end saying: tell me what data you want and how you want it, send me a request, and I'll get you the data back in a structured form. These are the kinds of sources from which you can get structured data for your projects. For example, you could get Twitter data this way. You could say: give me all the tweets that contain the word India in the last 24 hours. Or, as the tweets come in, tell me which people are tweeting from an iPhone. Or you could query the Internet Movie Database and say: give me all the movies between 1970 and 1990 that begin with the word "the" and are not in the English language. These are things that an API would support. A database is a classic form of an API: you use SQL, the structured query language, to query databases. So if you've got data in a database, that's an easy way of extracting it. These are roughly in increasing order of structure. When you ask somebody, the data that you get is possibly the least structured. When you go to an API, the data that you get is likely to be the most structured, and there are varying levels of structure in between. If the data is perfectly structured, you don't really have to do much to format it; it's in the form that you want.
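To make the API idea concrete, here is a small sketch in Python of how such a request might be put together. The endpoint and parameter names (api.example.com, year_from and so on) are entirely made up for illustration; a real service such as Twitter's API documents its own base URL and parameters.

```python
from urllib.parse import urlencode

def build_query(base, **params):
    """Return a request URL with the parameters encoded in the query string.

    Sorting the parameters just makes the output deterministic; an API
    does not care about their order.
    """
    return base + "?" + urlencode(sorted(params.items()))

# "Give me all the movies between 1970 and 1990 that begin with 'the'
# and are not in English" expressed as a (hypothetical) API call:
url = build_query("https://api.example.com/movies",
                  year_from=1970, year_to=1990,
                  title_starts="the", language_not="en")
print(url)
```

Sending that URL with urllib.request and parsing the JSON reply is all the client side of an API usually amounts to.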
The data that you get could be in any of many forms. It could be a book, in which case you'd potentially have to scan it and, after scanning, use OCR to recognize the text, which is a somewhat tough problem even in English and a lot tougher outside of it. Or it could be in the form of PDFs. PDFs are slightly better in the sense that at least you can copy and paste text; you don't have to recognize it. But it's still far from structured, because if you try copying and pasting a table from a PDF file into Excel, you wouldn't necessarily get the structure you want. There are ways around it, slightly painful, but at least it's a step ahead of having to scan a book. Or the data could be HTML, which is slightly better, slightly more structured than a PDF. We can analyze its internals; we can copy and paste from HTML into Excel and it generally works fine. And even if it were not in the form of a table, it's possible to extract specific chunks of it and put them in the structure we want. Excel files are even better. I use "Excel files" as a general term for any tabular text files: files in which you have a series of rows and a series of columns, and you know exactly what each cell represents. The columns have distinct meanings; the rows have distinct meanings. This is among the most structured forms of data that we get today. Even more structured would be data that you get from a database: you can do all your querying on it, slice it any way you want, even do the analysis directly in the database, and the work is simple. So ideally you want data in the more structured of these forms; you often get it in the less structured ones. This session is going to be about how you automate some of this. Obviously there's no point worrying about automating the bottom bits: there's nothing to automate, you've got it in the form you want.
The top bits are rather tough. So we're going to start with the middle part: we're going to talk about how we can automate searching, crawling and scraping, and how we can automate getting data out of PDFs and HTML files. That's what we're going to try and cover today. Let's start with the easy bits. This class will have two themes, or sort of two kinds of lessons: one which can be done without any programming, and the other which does require a programming background. Some of you already have a programming background; some of you have taken the Python course recently; some of you have done neither. And irrespective of this, some of the things that I'll be covering today will go over your head. What I'm hoping is that there will be at least one thing today that is new to each of you, and potentially at least one thing which none of you will understand as well. But that's all right. The aim is to cater to a very wide audience, so don't worry if you don't get everything. Let's start with the simple bits. How do we automate a search? Let me rephrase the question. Let's say I am curious about whether there is any new mention of the Mahabharata anywhere on the internet. Option one: wake up in the morning, do a Google search for Mahabharata, and go page after page to see if there's anything that I haven't seen before. Assume that that's your worst case. How does one improve on this? Google Alerts, someone says — subscribe to it. That's a good suggestion; let's walk through this. I'm going to start with a search for the Mahabharata. The base option was to do this every day and look to see if anything has changed. Given that the results appear in order of relevance, there's a good chance that the second day you'll see the same results, the third day you'll see the same results, and you'll have to wade through the text to see if something new has emerged. But there are more search tools here.
The first suggestion was to use the time filters. I can search for just the things that happened in the past hour, or rather, pages that changed in the past hour. Somebody is talking about the Mahabharata on a blogspot.com blog; somebody has posted a question; these are all things that happened in the last hour. Or you could expand that to the past 24 hours. If you're looking at it every day, then this is a good option: you can see what's happening every day. It's still a long list, but you can at least wade through it knowing that these are pages that have changed in the last 24 hours. They may not necessarily be pages that have new information about the Mahabharata. For example, if somebody made a correction, if somebody changed an ad, or if this was a long page to which people kept adding stuff — in all of these cases the page would show up here, but there isn't any new information about the Mahabharata itself. So this isn't perfect, but it's a start. What it's doing is transferring the burden of identifying new pages onto Google, which is obviously far more efficient. The second suggestion was Google Alerts. Who has used Google Alerts? That's less than half the class. Search for Google Alerts, and that takes you to a site that says "monitor the web for interesting new content", which is almost exactly what we're trying to do here. So if I search for Mahabharata, it shows me a preview of what the alert might contain. I can filter by result type: I can say I just want news about the Mahabharata, or only blog posts, or discussions, or books about it; for now I'm going to say everything. I can choose how often: as it happens, once a day, or once a week; whether I want only the best results or all the results; and where to deliver it. There are multiple options here: first a list of email IDs — those are my email IDs — and another option that says feed.
If I select an email ID and click on create alert, I will get an email at the chosen frequency, and the contents of the email will look like what I have on the right side. If I select feed, then there's no "how often" option — it's always as it happens — and feed is what I'll be demonstrating here. So I create the alert, and let's see what happens. Okay, I've got a whole bunch of results, and there's a link here that says open it in Google Reader. Google Reader is a feed reader, and a feed is just like an alert, except that instead of coming to your email application, it comes to a different application called a feed reader. If you are already using a feed reader, then this makes perfect sense. If you are not, it's a great way of staying up to date with news. How many of you are using a feed reader? That's most of you. For those of you who are not, Google Reader is a good place to start: that's reader.google.com, or as always, do a Google search for Google Reader. The other application you might want to consider is Microsoft Outlook, where you can add these feeds if you happen to be using Outlook already; it's a good way of managing emails and feeds in the same place. It works rather like email: you've got a bunch of folders, and inside each folder you've got a bunch of subscriptions. In this particular case, I've got a folder for data, inside which I've got a bunch of feeds. Here's one: these are the last few alerts from the people that I'm watching on GitHub. GitHub is a site where you can share code, and these are changes being made by people. So this is my alert. I can choose to go and see what happened and what changed, mark an item as read, or I can just choose to ignore the whole thing and mark them all as read. That's how you manage feeds.
Now that we've created this feed for the Mahabharata, I can click on this link and it automatically opens in Google Reader, which starts by confirming that I've subscribed to the feed and will receive new alerts, and so on. The next time I'm in Google Reader, there'd be an unread item that says here's your new Google Alert for the Mahabharata — if there were new results. If there were none, it just wouldn't appear. There is another way you might want to look at this. If you knew that there was a page that carried news about the Mahabharata, and you just wanted to be alerted every time that page changes, that's possible as well. You can set up a feed that you can capture in Google Reader or Microsoft Outlook, using a site called Page2RSS — just do a Google search for page2rss.com. Say I take the Wikipedia page for the Mahabharata, and I want to know if anyone ever changes that page, because that's possibly where new information will appear. This link is now something that I can add to Google Reader or Outlook. I copy it, go to Google Reader, click on Subscribe, and paste it. Right at the bottom I have a new entry, which will get updated any time somebody makes an update to the Mahabharata page. So if you have a source page to which new information keeps getting added, this is a good way of tracking it. What are other examples of pages we might want to track this way? Take the RBI site. They have a publications page where they keep adding new publications, and you want to know every time they do that — they do that, let's say, every month, every week, whatever. You don't have to go to the site every time to check. This is a big deal; this is practical stuff.
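What a page-change tracker of this kind does internally can be sketched in a few lines of Python: keep a fingerprint of the page and raise a flag when the fingerprint changes. The fetching step is stubbed out with plain strings here; a real monitor would download the page on a schedule with urllib.request and then publish the change as a feed item.

```python
import hashlib

def fingerprint(page_text):
    """A short, stable digest of the page's content."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

def has_changed(page_text, last_seen_fingerprint):
    """True if the page no longer matches the stored fingerprint."""
    return fingerprint(page_text) != last_seen_fingerprint

# Simulate two visits to the same page.
old = fingerprint("The Mahabharata is an ancient Indian epic.")
print(has_changed("The Mahabharata is an ancient Indian epic.", old))            # False
print(has_changed("The Mahabharata is an ancient Indian epic. (updated)", old))  # True
```

A practical version would hash only the main content area, since ads and timestamps change on every fetch — exactly the "changed page but no new information" problem mentioned earlier.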
You go to your RSS reader, your feed reader, and you can see changes for any site: changes to the Mahabharata search, changes on the RBI site, whatever. Think of it as your inbox for any kind of news. This is one way of automating search. The next way of automating search is that you don't need to type in the search in the first place. Google provides a search API, which means that we can write a program that sends it a query for Mahabharata, and it returns the top results, sorted in various ways. You can say: just give me the new results for the day. So you can write a program that does this. To learn more about it, search for Google search API. Yahoo offers a similar API; you can take pretty much any search engine, do a search for its search API, and go ahead and use it, assuming you want to do some programming with this. Now I'm going to try out one exercise. Let's say you've got some data and you want to process it, because it's not quite in clean form. The example that I'm going to take is movie data, taken from the Internet Movie Database. Let's do an exercise where we take the data and try to answer the question: have the ratings of movies improved over the years? The first thing to do is figure out where this data is. Let's try a query. I want a query that tells me where I can find the Internet Movie Database's list of movie ratings. What query would you use? Internet movie database ratings, somebody says. That leads us to IMDb's top-rated movies list. Would this be enough to answer the question accurately? Not entirely, but we can take the years and the ratings, and group the ratings by year.
If there is more than one movie in a year, take the average. Let's try and do that. One way, of course, is doing it by hand; obviously, that is not what we're going to do. Do you think I would be able to copy and paste this into Excel? Yes. I place the mouse exactly where I want it, shift-click at the end to select across the whole list, Control+C to copy — or right click and Copy — go to Excel, and paste. Now we need to extract the movie names. The first thing I do is make the column a little wider so that I can see the whole name. The text is wrapped, which looks a bit ugly. To fix that: right click, Format Cells, and in the Alignment section there's a Wrap Text option; we'll turn that off. Now each entry is on a single line and I can adjust the width. Next, we need to extract the year. How do we do that? Split it into a different column. One suggestion: Excel has functions for this — find where the open bracket starts, split there, and take the information on its right side into another column. Another suggestion: convert text to columns — that's the Text to Columns feature. Using what separator? The open bracket. Now, what if there were a movie that had brackets in its title? Is there a way to handle that? There are, in fact, wonderful movies whose titles start with an open bracket and a number. That's not possible to handle using Excel's Text to Columns. There might be a few of you who are not following this dialogue — don't worry about it, I'm going to go through this step by step. The way around it is to work from the right: find the rightmost open bracket, and then find the matching closing bracket.
That requires a little bit of programming — effectively, the use of functions. Now, it also appears that the year is always at the end. The good part about the year is that it's always four digits, four characters, and the last character is always a closing bracket. So if we take the last five characters and ignore the final one, which is always going to be the closing bracket, we should have the year. But let's walk through these approaches one by one. The first approach was: split it into columns using Text to Columns. I select the column and go to Data, Text to Columns. This allows me to break up this one column into multiple columns using different approaches. I can say I want to break it up wherever there is a delimiter — in this case, the delimiter would be an open bracket. So let's try that: I give the open bracket as a custom delimiter. It gives me a preview of what it will look like, and so far things seem fine; there isn't a third column. But the top-rated movies keep changing, and there have in fact been times when a bracket was part of a movie title. So while this works today, there's no guarantee it will keep working in the future. Let's go with it for now and see what kind of problems we run into. I'm going to click on Finish. At this point it's supposed to create a new column, but you might be wondering where it's going to put it. My data is in column C; the result will go into column D. Let's see what it says. It asks whether I want to replace the contents of the destination cells, which in this case is column D. I'm not using what's in column D for this analysis, so in this particular case I'm going to say OK. But what you may want to do is insert a blank column between C and D so that nothing gets overwritten, in case you need what's in column D.
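For those comfortable with a little programming, the "find the rightmost bracket" idea can be written directly: match the last "(dddd)" group in the line, so that brackets inside a title don't break the extraction. This is a Python aside, not part of the Excel walkthrough, and it also tolerates IMDb's slash suffix for same-title movies.

```python
import re

# Match a 4-digit year in brackets at the END of the line, optionally
# followed by a /I, /II, ... disambiguator inside the same brackets.
YEAR_AT_END = re.compile(r"\((\d{4})(?:/[IVX]+)?\)\s*$")

def movie_year(title_line):
    """Return the year from lines like 'The Artist (2011/I)', or None."""
    m = YEAR_AT_END.search(title_line)
    return int(m.group(1)) if m else None

print(movie_year("The Godfather (1972)"))         # 1972
print(movie_year("(500) Days of Summer (2009)"))  # 2009 -- bracket in title is fine
print(movie_year("The Artist (2011/I)"))          # 2011
```

Anchoring the pattern to the end of the string is what Text to Columns cannot do, which is exactly the limitation being discussed.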
So the year ends up in its own column, but with a closing bracket at the end. Let's remove that. How do I do that? Use the RIGHT function, somebody says; somebody else says find and replace. Two approaches — I'll come to the use of functions in a bit; both are valid. In this particular case, find and replace is probably going to be easier, because I just have to select the column, open find and replace, type the closing bracket, leave the replacement empty, and replace all. Let me do that: replace the closing bracket with nothing, Replace All. It says it has made 240 replacements. It should have made 250, which is an indication of something wrong. In this case it just turns out to be a keyboard mistake: I accidentally appear to have copied something below. So I'll do a few undos until that gets sorted, go back to the table, select the column again, and replace the closing brackets. Now it works. Find and replace gives you a message saying how many replacements it made, so it's worth doing a quick check to see whether that matches what you expect. Okay, is this fine? Do we now have clean years? Let's go through the list. This one says 2011 slash I. There are sometimes movies with the same title in the same year, in which case IMDb marks the first movie released that year, the second movie released that year, and so on. So how do we get rid of that? One possibility: since it's always going to end with slash-something, we could just take the first four characters. The other possibility is to do Text to Columns again, with slash as the delimiter, and move the extra part to the next column.
So we do Text to Columns again: Data, Text to Columns, slash as the delimiter, and Finish. The slash parts go into a new column E — that happened in a few rows, such as The Artist — and everything else is clean. Now we can do the analysis. Let's say we just want to see which decades have the best ratings. Getting the decade is easy. Take the first three digits, somebody says. Take the third digit alone, somebody else says. That almost works, but it would be confused between, for instance, 1910 and 2010 — and there were a couple of movies released in 1910, so you would want to factor that in. So we take the first three digits, not just one. Any other way? Divide by 10 and multiply by 10. That's a possibility; in fact, if you just divide by 10 and round down, grouping gives the same result — both are parallel approaches. So I could say the decade is the year divided by 10, rounded down to the nearest whole number using the FLOOR formula, and copy that formula down. Or I could use the LEFT function, which gives me the first three characters, with the same result. It doesn't matter which one; let me enter both formulas. And now comes the question: what is the average rating by decade? For each decade we want to calculate the average rating. One option: sort it by the decade and manually do the calculation for each decade. Let's try the options in reverse order. Sort by decade — so now the 1920s, 1930s, 1940s and so on are grouped together. If I want the average for a decade, I can just select the rows that belong to it and read the average off the status bar.
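The two decade formulas just described — FLOOR of year/10 times 10, and the first three characters — can be written out in Python to check that they agree. This is a small aside under the same logic as the spreadsheet, not something from the session itself.

```python
def decade_by_division(year):
    """Like FLOOR(year/10)*10 in Excel: 1994 -> 1990."""
    return (year // 10) * 10

def decade_by_prefix(year):
    """Like LEFT(year, 3) in Excel, scaled back up: '199' -> 1990."""
    return int(str(year)[:3]) * 10

# The two approaches agree for any four-digit year.
assert decade_by_division(1994) == decade_by_prefix(1994) == 1990

# Keeping three digits distinguishes 1910 from 2010; keeping only the
# THIRD digit (the suggestion that almost worked) would confuse them.
print(decade_by_division(1910), decade_by_division(2010))  # 1910 2010
```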
That's an average of 8.18 — so the 1920s are at 8.18. Then I go to the next block, the 1930s: that's 8.21. And so on. There's obviously a problem: this is tedious. Somebody asks: as you're doing this, couldn't you automate the calculation? Absolutely. You could do something with functions, but you probably don't even need to go as far as functions; a pivot table will do. So let me go to Insert, Pivot Table, and it automatically selects the range that I want. What we want is by decade, so I move decade to the rows and the rating to the values. It gives me the sum by default, so I go to the value field settings, select Average, and now I have the average rating per decade. I just want one decimal, so let's format it that way. The easiest way to figure out which of these is the largest is to use a little insert bar chart, like this. So clearly the 1990s have the best-rated movies — at least as far as the top 250 movies go. What if we did the same analysis with the full data set? The process is almost exactly the same. The one tricky part is getting the list of all movies. One option is to write a program to crawl and scrape it, and things like that. But the best option really is to ask somebody, and in this particular case I will tell you that there is a raw data download. A search leads to a page on imdb.com which describes ways to access IMDb content locally, with copies of the data directly on your system. Let's click on that. It takes us to a bunch of sources from where you can get the data; I'm going to click on the Swedish mirror, which is usually fast. The list of ratings for all the movies happens to be in a file called ratings.list.gz, which is a 7.6 MB file. You can download it, and you can open that file.
Here is where the file is; you unzip it, and now I'm going to open the result, ratings.list. When I open a file like this, Excel recognizes that it's a text file, but it's not in a standard format, and therefore it launches the text import wizard. The structure of this file is somewhat different from the earlier one. It's got some initial description — in fact it's got a fairly long preamble at the beginning describing how the rank is computed, how the model works and so on — but after a while the movie details start. In this case I can't split it by a separator: there is no separator between the columns. Instead, these are all fixed widths, and the wizard has another option for exactly that: fixed width. It's very easy to use, actually. Once you select fixed width and click on Next, it allows you to position vertical bars wherever you want to split the columns; you can drag them around, which makes it quite simple. Once you've done that, just click on Finish. This takes a while — it's a large file. Now, how large a file can Excel accept? Roughly a million rows — 1,048,576, to be precise — is one constraint in current versions, so the file has to be less than that, and this one is under a million rows. How many columns? The limit is 16,384 columns in current versions; older versions allowed only 256 columns and about 65,000 rows. There may also be a restriction on the total number of cells; I don't remember exactly what it is. But there's a good chance that if you have a file that comes near these limits, you're already looking at something that is going to be slow to process, so it might not be worth the effort. What I've found is that, generally, people's estimate of what Excel can open is a lot lower than what it really can open. So, as a good rule of thumb, just try opening it in Excel; you'll find that it can handle more than you might have thought.
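The import step can also be scripted. Here is a hedged Python sketch that splits one ratings line into its fields; the sample line mimics the ratings.list layout (distribution, votes, rating, title) from memory, so treat the exact columns as illustrative. Rather than measuring fixed column positions, it splits on runs of whitespace, which works here because only the title field contains spaces.

```python
# A line shaped like the ones in ratings.list (layout is illustrative):
# vote distribution, number of votes, rating, then the title with year.
SAMPLE = "      0000000125  1797012   9.2  The Shawshank Redemption (1994)"

def parse_rating_line(line):
    """Split one ratings line into (votes, rating, title)."""
    # maxsplit=3 keeps the title, spaces and all, as one final piece.
    distribution, votes, rating, title = line.split(None, 3)
    return int(votes), float(rating), title

votes, rating, title = parse_rating_line(SAMPLE)
print(votes, rating, title)  # 1797012 9.2 The Shawshank Redemption (1994)
```

Looping this over every line of the file gives you the same table the fixed-width wizard produces.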
This rule of thumb works for most people — I fall for it myself. I didn't think this file was going to open in Excel this morning, but it's open and now we have it. Here's where the movie data starts; I'm going to delete everything above that. Towards the end there's a footer of stuff I don't care about — some 400,000 rows into the file — and I'm going to delete that too. Now we've got these 400,000 rows, which have pretty much all the movies. And we have the same problem as before: the year of the movie is in brackets. So let's do the same thing as before — except this time we know for sure we're going to have a problem, because there will be movies with brackets in the title, so we can't just split on the bracket. And it's not just that: some entries have extra markers after the year, so I can't even be sure the year is in the last four characters. Entries can end with markers like (V), (TV) or (VG) — video, TV and video-game releases. Will the year always be right at the end? Maybe, maybe not. One possibility is to go through and remove everything that has one of these markers near the end. That doesn't quite cover every case, but the good part is that if you go through this data set, you won't find more than half a dozen kinds of things with something like this at the end. And the numbers are large enough that I'm not interested in a perfect analysis: I don't care whether a difference comes out as 0.32 or 0.3. This attitude is actually extremely important. There's a talk on YouTube called Street-Fighting Mathematics; you really want to take a look at it. What we're doing here is very much in that spirit. Part of what stops us from getting things done is the insistence on doing everything the perfectly right way.
What's most important is getting a rough answer quickly: roughly accurate information you can act on. You don't need the perfect answer; you just want to be able to take a step forward and not get stuck. In this particular case, I'm going to say I don't care about those edge cases. I'm going to take the last five characters — which is usually the year plus the closing bracket — and then ignore the closing bracket by taking the first four of those with the LEFT function. I could have done it other ways; it's a nice exercise to find other ways of doing it. Copy this down, and that's good enough. Obviously some of these are not years. That's fine; let's consider it good enough for now and see what kind of results we get. This is important, because once you get a sense of the results, you'll know whether the direction you're taking makes sense. The rows we're dropping are effectively the TV shows and such — anything with a (V) or (TV) or (VG) at the end. That's fine: we're doing this analysis for movies, not TV shows. Let's proceed the same way as before. Now let's get the decade. Dividing by 10 works, but let me take the first three characters of the four-character year. Once this is done, let's build a pivot table in which I put the decade in the rows and get the average rating as the values. And here I see there's a lot of junk: things that don't look like 1900s or 2000s at all. No idea where this came from, but let's ignore all of that and focus only on the entries that look like real decades.
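The whole rough-and-ready pipeline — skip the (V)/(TV)/(VG) entries, pull out the year, average by decade — fits in a few lines of Python. The data below is a made-up stand-in for the real ratings list, so the numbers mean nothing; the shape of the computation is the point.

```python
import re
from collections import defaultdict

# Stand-in data: (title line, rating) pairs like the real file yields.
DATA = [
    ("Movie A (1994)", 8.1),
    ("Movie B (1995)", 7.9),
    ("Movie C (2004)", 6.5),
    ("Straight-to-video thing (1994) (V)", 3.0),  # dropped below
]

YEAR = re.compile(r"\((\d{4})")            # first bracketed 4-digit year
SKIP = re.compile(r"\((V|TV|VG)\)\s*$")    # video / TV / video-game marker

ratings_by_decade = defaultdict(list)
for title, rating in DATA:
    if SKIP.search(title):                 # the rough filter from above
        continue
    m = YEAR.search(title)
    if m:
        decade = int(m.group(1)) // 10 * 10
        ratings_by_decade[decade].append(rating)

by_decade = {d: sum(r) / len(r) for d, r in ratings_by_decade.items()}
print(by_decade)
```

This is the pivot table's "average of rating by decade", minus the spreadsheet.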
Those could be real years: 1910, 1920, 1930 and so on. Filtering manually is one way of doing it; it's not the only way. Another is to say: I'll only count a decade if it had at least a thousand movies; if in any decade there were fewer, I drop it. So I get the average of the ratings, which is the same thing we did last time; let's trim the decimals and add a bar chart. Looks like we had a good time with movies in the 1920s, then things kept getting worse decade by decade until around the 2000s, and then they started getting better; in the 2010s, this decade, things are looking better still. That didn't take too long. We could have gotten stuck cleaning the data perfectly for a long time, and I bet you that when you actually do the thorough analysis, the results would not be any different — at least it won't make any difference to the conclusion. Just a couple of things that I wanted you to take away from this. First: using minimal formulas and using Excel, you can do a fair bit of processing with text-based data, probably much more than you might think — and that includes the volume of data; Excel can probably handle a lot more than you think it can. Second: try to get to results quickly; it doesn't matter how rough they are. Start with one version of the results, and then, as you work with it, get a better version if that is required. You may find that you have more interesting analyses that you want to follow, or you might just run out of time and have to submit what you have. Get the first iteration done as quickly as you can; then improve it if that's required. That's what I wanted to cover with this example. Let's
take a short break, maybe 5 minutes, and when we come back we will go through programmatic examples where we will be crawling data as well as scraping data.