So, what we are going to do now is walk through an example of programmatic extraction of data and programmatic transformation of data. When I say programmatic, I mean end to end: it will be something that can be automated. I am expecting that half of you will not follow a thing and will probably not be doing any of this in the foreseeable future, and that is just fine; I am hoping that the rest of you will still find something new to learn from it. So we are going to get into some fairly hardcore programming and crawling.

What we will do is try and find the correlation between currency exchange rates over the last three months. I am not actually going to get to the answer to that part; let us leave it for next week, and if you are curious enough you may want to do it yourself. What I am going to do is take us as far as getting the data that is required to do this analysis. That is what this session is about: getting the data.

The first question is, where does one find currency data, or rather historical currency rates? I have a site open here that happens to provide it. This was not something I found after a lot of searching; it was the first hit when I searched for it a few months ago. So if I want to find, for instance, what the value of the US dollar is with respect to the Indian rupee, it says 55.2150 as of today. What about historical exchange rates? If I go to the historical exchange rates page, it says USD to INR, it can also convert other currencies, it takes a range of 30 days from a start date to an end date, and it gives me a graph. It also gives me the data. Now the question is, how do I get this data extracted? There is a download link here; let's see what happens when I actually click on it. I get a file called data.csv. I can just click on it and open it. What this file has is, for each day, the USD-INR rate.

What I want is USD versus INR, Euro versus INR, GBP versus INR, and a whole series of exchange rates over a longer period. I can change the period out here to, say, 90 days, and download it again. If you look at it, it runs up to today's date, starting from three months ago, so it has the right period. But there is still only one currency, and I can't see anywhere to change the currency. The INR side, the denominator, is fine; I want that constant. But I've got to change the other currency, and I want at least four or five kinds of currencies, not just this one. Also, I want this to run on a schedule every day, so I can't go to each page, click on download, save it as a file, and repeat the whole process by hand. Now, if by clicking on this link I am able to download a file, maybe I can just find out what that link is and write a program that downloads it. That's the idea.

So the first question is: is there a way of downloading URLs? This is something that is provided on every Unix and Linux-like system, which means on Mac and Linux you already have it. On Windows you don't have it by default, and the easiest way around that is to install a product called Cygwin. I'll be referring to Cygwin a number of times during this course. Cygwin is a way of getting most of the Linux commands onto Windows. You can go to cygwin.com, and there is a program called setup that you download. Just click next through almost everything and you'll have it set up on your machine. The full install is a large download; the setup program itself is a small one.
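(A quick sanity check once that install completes, before going further. This is nothing OANDA-specific, just a way to confirm that the two commands we are about to use actually made it onto the machine; on Cygwin, if either is missing, re-run setup.exe and select the curl and wget packages.)

    # check that curl and wget are installed and reachable from the terminal
    curl --version
    wget --version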
But once you've done set up on a small one, it then asks you for a number of things and then it will actually work out well. And that is a fairly large one. And I just did this in the morning and now I am going to go to the process. There are two commands that Sigwin originally provides that you have a set up to get. One is called kub and the other is called wget. Once you install Sigwin, you get a sync or sigwin term. You can open that. And from now on, this box has most of the Linux commands. So I can say ns-insert here. And that's my list of commands. The command curve gets a URL and prints it to the console. So you can say htd-codes-wwp-analysis.nl That's connecting to the page and printing the full contents of the htm on the console. If I wanted to download it, I could do that using the curve. Or there's another command which is almost exactly the same as the curve. The only primary difference is that the default option is to download the file, which is called wget. So wget htd-codes-analysis.nl It says save in the index or htm. And it starts. Now if I open this index or htm, let me go to the index or htm. I have the index or htm. If I open it, it will take a few seconds to load the image. So you download this web page. And that's what we do. Pretty much any web page. What I can do then is write a series of such commands on, say, Notepad. And so if, for instance, I had a link. I say wget htd-codes-wp-analysis.nl u-s-d.c-s-d u-s-d.c-s-d u-s-d.c-s-d And so on. All I have to do is every day copy and paste this into the console. Just copy the whole thing, paste it on that batch terminal. And it does the whole thing for that thing. It's also possible to save this as something like a batch program. As a batch file. And just type the command and leave the link every time. That's what I did. But the question is, what is the link that I have to provide to get to that u-s-d-c-s-d? So let's go back to kohadna.com and see historical names. And let's see if we can figure out what u-s-d-c-s-d is. Now, here's a trick. If the u-r-r is not visible, I think that's pretty good. Can I click on this download? It just gets me the data.c-s-d. A way of figuring that out is using Chrome's inspection tools. And go to settings bar here. Good tools. And there's a link that says developer tools. Click on developer tools. You'll get this box right here at the bottom. Which shows you all the stuff. One of those tells you what the network activities are. So any interaction with data will be shown here. And one of the things that will show you is the link of the file that is being accessed. For example, if I click on this download, it's sending some messages now. So behind the scenes, what the stage is doing is sending messages to this AI or PHP. And the response. So that was a request. It looks like some SWF, which is a flash fight, was transmitted. Here's some of the information that was transmitted. We don't know what was happening here. I'm going to clear all of this. Now let's click on download and see what happens. When I click on download, this is the link that I've accessed. In some cases, that's possible. In this particular case, when I hover over it, the link that you see is JavaScript colon. So you'll see the link out here when I hover over it. The link is JavaScript colon void 0. Let me try copying it. Right click. I will see the link copy. Copy it. Now let's paste that and see what we get. Which, well, let's see if that works. 
If I just open a new page and paste it there — the best way to find out is simply to try it and see what works. What happened in Chrome is that when I copied the whole thing and pasted it, only void(0) came through. Let's try typing javascript:void(0) — it does nothing. So what's happening is that some JavaScript runs behind the scenes and finally sends the request to a particular URL. That request URL is shown here in the Network tab. I'm going to copy that and see what it looks like. It may well be wrong, but let's dig in and use it. So I have http, colon, slash slash, the site, then a page, then a question mark. Anything after the question mark is a set of parameters passed to that page. Let's see what parameters it has: the currency is USD, the end date is something, the start date is something, the period equals daily, display equals absolute, and so on. My guess is that if I just change this USD to something else, there's a good chance it will give me that currency instead. Let's try it.

So if I use wget on this, it will do what I want, but I'm not sure what file name it will save it under. So I'm going to give it a file name explicitly: I'll use curl instead, and say greater-than usd.csv. What that does is run curl, which fetches this entire URL, and puts the output inside usd.csv. Now, when I run this, it throws an error — or at least I'm not quite sure it's giving me the right thing. What's happening to all those terms after the question mark is that the ampersand sign has a special meaning in Unix: it means run the process in the background. So if you have a URL with ampersands in it, you want to put quotes around it. As a general rule of thumb, when you pass anything unusual in Unix, just put quotes around it, especially around URLs. Now I run the same thing with the URL quoted, and paste it. It seems to receive some data. Let's see what we have. I've got a usd.csv; let's open it. It has been retrieved for a 30-day period, and I want it for 90 days.

The way to get 90 days is to go back there, change the range from 30 days to 90 days, click on the download link again, and watch what happens in the Network tab. That earlier entry was the last download, for 30 days, and here is the new one. If you do a careful examination of the second URL, you'll see what's changed: the date range parameter has changed from d30 to d90. So I'm just going to take the second URL and run it with curl — let's not forget the quotes — and send it into usd.csv. In this case the dates are hard coded: there is a start date and an end date written into the URL. Ideally we would write a program that changes those each day, but that's not today's goal. I've got this URL, and now I want the same data for the other currencies.

First, let's check what we downloaded. Opening usd.csv — it complains because I still had it open in Excel, so I close it there and run the download again — and now let's open it. It starts on 09-03 and ends on 06-06; that's roughly 90 days of data. So we've got it for this one currency. How do I get it for all the other currencies? An easy way is to create a series of commands that substitute each currency in place of USD. And here's a clever way of doing it: use Excel to write your code.
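(So the command being built has roughly this shape. The host, path, and parameter names — quote_currency, date_range and so on — are my reconstruction of the sort of thing the Network tab shows, not the exact OANDA URL; the real one should be copied from the Request URL field.)

    # Quotes around the URL are essential: the shell would otherwise treat each &
    # as "run this in the background" and truncate the request.
    curl "http://www.example.com/fxhistory?quote_currency=USD&date_range=d90&period=daily&display=absolute" > usd.csv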
Up here in Excel I'm going to type in a formula, a simple string formula: equals, open quotes, and then the command. In a column I've got some currency codes — let's say I want USD, GBP, the Canadian dollar, the Japanese Yen, the Chinese Yuan, a handful of currencies. I want to do a replace of USD in the command, putting each of the other currencies there, and I can write a formula for that. Equals, open quotes: the curl part stays as it is, and we use doubled double-quotes to escape the quotes inside the string. Where I have USD, I break the string and put a cell reference instead: cell A2 contains USD, so the formula substitutes whatever is in A2. Then I say ampersand, open quotes, and the rest of the string continues. I go all the way down to the usd.csv at the end, escape the quotes there the same way, and replace that USD with the cell reference too, so the file name also picks up the contents of A2, followed by .csv. It looks exactly the same as before, but now when I fill this formula downwards, notice that the currencies change row by row. You could have done this with a search and replace in many cases; the formula is just faster when you have lots of currencies. There are programmatic ways of doing this as well — I'll show a small sketch of one right after this part.

Now this is your program. Let's save it. Like I said, you could either save it as a batch file, or simply copy and paste it into the console manually and let it run. It takes no more than a few seconds, and as it runs you can see the files coming through, one per currency. Now that we have these files, it is a fairly simple matter to open them one by one, paste the columns next to each other, and then build the correlation matrix to see whether the first currency is correlated with the second, and so on. But can we automate that as well?

Let's see what one of these files looks like. I'm going to use the command less, which says: show me the file usd.csv. There is the first line, second line, third line, fourth line — header and description lines. I also now know the structure of the file: the first column is the date and the exchange rate is in the second column, so I don't even need the header. If I could skip all of those header lines, take just the rate column from each file, and paste them together somehow, that would make the job easier. There is a set of Unix commands for exactly that, and I'm going to work through them.

Let's jot down the steps we want. The first step is: get rid of the first five lines. In addition to the first five lines, we may also have to get rid of some stuff at the bottom — the last four lines, because those are averages. So that's the second step: get rid of the last four lines. Then, let's assume we always want the second column — the exact position may depend on the file, as I was saying earlier — so the third step is: get the second column. For each of these, there is a Unix command. To get rid of the first few lines, or in other words to get the bottom portion of the file, the command is tail. And as you might have guessed, the opposite of that is head.
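(Here is the programmatic alternative to the Excel trick mentioned above: a small shell loop that substitutes each currency code where USD appeared. The URL is the same placeholder as in the earlier sketch, not the real one.)

    # generate and run one curl per currency instead of building the lines in Excel
    for cur in USD GBP CAD JPY CNY; do
        curl "http://www.example.com/fxhistory?quote_currency=$cur&date_range=d90" > "$cur.csv"
    done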
So, head will give you the first so many lines, tail will give you the last so many lines, and getting a specific column is done using a command called cut. I'm going to talk only about these three commands today, but there are three other commands you will find very useful. One is grep, which is for finding stuff. Another is sed, for changing stuff — think of it as search and replace. And there is something called awk, which is effectively programming for this kind of stuff. Those are in increasing order of complexity of usage; you would probably be better off learning Python than awk, from a learning-curve perspective, because awk is already getting to the point where it is practically a programming language. But these tools are extremely fast, extremely powerful, and you can do a lot of data work with them faster than with most programming languages. My entire working process involves more or less these six commands, and that's all we're going to use now.

Let's start. The first thing is to use tail to get rid of the first five lines. In Unix, if you want to know how a command works, you say man, for manual, then the command name. So: man tail. After reading a bit, you eventually come to the line that says -n outputs the last K lines instead of the last 10. Ah, okay — so by default tail gives you the last 10 lines. Let's try it: tail usd.csv — that's the last 10 lines. Now, instead of the last 10 lines, we want everything except the first five. Back to man tail: it says you can give -n +K to output lines starting with the Kth. We want output starting from the 6th line, skipping the first five. So if I say -n +6, this ought to be right. I am not expecting you to be able to reproduce this, or even to fully follow it if you haven't tried it before. The only thing I want you to take away is that there is a command called tail, which you can find out about using a command called man. Both are intuitive to anyone who has played around a little with Unix; for those who have not, it just takes some getting used to. So: tail -n +6 usd.csv. If I just printed the whole file it would scroll past too fast to see, but look at what this did: it started printing from 09-03, which is the first data row. So it's chopping off the header lines.

Now I've got to chop off the last four lines, for which we have head. What I can do is use this thing called the pipe symbol, which says: take the output of one program and pass it along to another program. So take the output of this tail and pass it to the program head, to which I give -n with a negative number — let's do a man head to check — which prints everything except the last so many lines. Let's see: with -0 it chops off nothing and the last line is still 06-06; if I do -3, there's still a blank line at the end; so it chops off at the right point if I say -4. So I've got rid of the first few lines and the last few lines. Now I want the second column. For that, the command was cut. Do a man cut, and it tells you cut needs two things. One is a delimiter — in this case our delimiter is a comma — and the other is the field number, which here is two. That gives me just the prices. And if I don't like the quotes around the values, I can get rid of the quotes as well.
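(Put together on the downloaded file, the three steps form this pipeline — GNU tail and head, which is what Cygwin ships:)

    # step 1: tail -n +6  -> output starting at line 6, i.e. drop the first 5 header lines
    # step 2: head -n -4  -> print everything except the last 4 (average) lines
    # step 3: cut -d, -f2 -> keep the 2nd comma-separated field, the exchange rate
    tail -n +6 usd.csv | head -n -4 | cut -d, -f2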
I can use sed for that. If I say s, slash, a double-quote character, slash, slash — the whole expression wrapped in single quotes — it substitutes the quote character with nothing. Let's see... yes, that does it. So I've gotten rid of the quotes as well.

Now, I can feed this whole thing from curl onwards through the pipeline. Meaning, at the beginning of the command I say curl whatever-the-URL-is, pipe it through all of this, and then send the output into usd.csv. And I have to do the same for every other CSV, so that each file goes through the same chain. Let's try it. I take this entire string back into Excel, inside the formula. Instead of just greater-than usd.csv, I say pipe tail, and so on, pipe sed, and then send it into usd.csv. It does take a bit of struggling with the quotes — it's a long string, which is the downside of building it in Excel — so I break it into parts. Now let's run this command and make sure it goes into usd.csv. So it's fetching the page, chopping off the first few lines and the last few lines, getting the second column, removing the quotes, and putting the result into the file. Now let's look at usd.csv: it has just the one column of rates.

Okay, now I want to put these side by side. There is a command for that: paste. If I say paste usd.csv cad.csv, the columns just go in side by side. Let's see what CSV files I have — I've got six CSV files — so I just say paste, star dot csv, and redirect it. Now I've got something; let me call it currencies.csv, which I can open in Excel and start computing correlations from. The only thing missing, of course, is the name of each currency. We could add that back, but I'm going to leave it as it is. This column was supposed to be the US dollar, and this is the order in which the five currencies were taken: the first is the Canadian dollar, the second is the Chinese Yuan. So let's look at the correlation between the Canadian dollar and the Chinese Yuan over the last three months: it has been about 20% or 30%. Not that strongly correlated, but still positively correlated. We could do this for every pair.

The point is, with this stuff, you have automated the extraction down to just one copy-paste, and every time you need to do it again, the saving gets bigger. Would it have been faster to do this manually the first time? Yes, absolutely. It is from the second time onwards that the automation starts paying off. So automate only if you are going to do something more than once; if you know you're going to do it only once, don't bother — the automation itself takes time. I was going to cover scraping, but we have only a short amount of time left, so let's skip that. I will point you to a few links, and we will take up scraping when we actually need to use it. For now, we have covered crawling: a way of automatically extracting data, and basic transformations of that data. Just remember, number one, Unix commands are extremely powerful — learn at the very least how to use the first five of these six commands: tail, head, cut, grep, and sed. Secondly, you can download URLs using the commands curl and wget and pipe the results through these tools. That can help you automate the data extraction step end to end.
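(End to end, then, each generated line has roughly the shape below, and one paste at the end stitches the single-column files together. The URL and file names are the same placeholders as in the earlier sketches; using a comma delimiter for paste is a choice — its default is a tab — made so the result opens cleanly as a CSV in Excel.)

    # one such line per currency, generated from Excel or from the loop above
    curl "http://www.example.com/fxhistory?quote_currency=USD&date_range=d90" \
      | tail -n +6 | head -n -4 | cut -d, -f2 | sed 's/"//g' > usd.csv

    # after all the downloads, combine the columns side by side
    paste -d, usd.csv gbp.csv cad.csv jpy.csv cny.csv > currencies.csv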
You'll be using all of these commands in at least half of the projects that we discussed earlier. So do plan to at least install them and play around with them when you get the chance.

See, I think what would be really useful — this is, of course, the workflow for Unix-to-Excel, right? It might be interesting to conceptualize the workflow more abstractly and say, here, roughly speaking, are the steps, and have slots under each step which are the specific commands for Unix or Excel. Then if you replace Unix with something else and Excel with something else — because this idea of using delimiters and essentially going command by command from the data to the output file should work for lots of sources and targets — it might be useful to say, okay, here is the overall conceptualization, and here are the specific commands in Unix and in Excel, but you could fill the slots with something else.

That's right. Ben Fry, one of the authors of Processing, has written about the stages that data analysis goes through, but let me talk you through it at a very high level. What you are trying to do is: get the data; convert it to a different format; perform statistical transformations on it, which could be finding averages and so on, and potentially extracting insights from it; and finally visualizing it. Treat those as the four steps. Each of these steps can be done with a different set of tools.

The first: getting the data. Use your browser. Use Unix commands like curl. Use Excel — Excel can download files. Use R, use Python, use whatever programming language you like. But that is invariably the first stage: getting the data from somewhere. It doesn't have to be a link; you may want to extract it from a PDF file, and there are tools to do that, or you may want to extract it from a book. That's one stack of pieces; let's put it on a slide. You start with getting data.

Then you convert the data, and we've seen some of the tool set used in this conversion step. One is the Unix commands: if you've got a set of text files and you want to chop off the first few lines, chop off the last few lines, or do a search and replace — a common case — that's one set of conversions you can do just with Unix tools. Obviously, every programming language has its own set of conversion tools as well. You've seen almost exactly the same kind of process done in Excel, using formulas. You can do this in R, you can do this in SAS; a large variety of tools supports it.

The third part is the analysis. By which I mean — the analysis really happens in your head. What happens in the tool set is that you turn it into a series of questions or transformations to which there can be concrete answers: the kinds of things we're doing here are averages and correlations. This is not something I would suggest using the Unix tools for. You can compute averages with them, but you are usually better off doing these with a programming language or a statistical package. Statistical packages are things like R, SAS, SPSS — Excel as well — and all programming languages can do it. And if you're using a programming language, you're better off using libraries: do not write the statistical code yourself. Never, unless you're implementing it for the first time in the world. Do not. You will make mistakes — far more mistakes than you can conceive of.
Once you've got that analysis done, then you're usually talking about visualizing it. The tool set here is often different again. If you look for a single tool set that covers things end to end, it will always be a programming language; nothing else covers the spectrum all the way through. The next best are statistical packages, which cover most of the spectrum. R, for instance, has its visualization libraries, R can get data from the internet, it can do the conversion, it can do the analysis; so can SAS, so can SPSS. So statistical packages are your next best bet. Excel also spans most of the spectrum, so I club it under the statistical packages. Unix tools fit into a niche: they can do the first two — getting the data and converting it — but not the others. The reason I use them is that they are extremely fast, very versatile even in areas outside data analysis, and something many people already have on their systems. More than anything else, they are fast: they are fantastic for exploratory analysis, because you see what's happening as you type, and that exploration immediately translates into automation, because you just copy and paste the commands, put them in a shell script, and re-run them at any point. But in terms of the breadth of their coverage, you're restricted to getting the data and converting the data, not the analysis or the visualization. So those are the three categories of tools, with differing breadth. If you want to go all the way with Python, you can.

Shouldn't there be an if-then rule which basically says: if so-and-so, use Unix; if so-and-so, use Excel; if so-and-so, use something else? I think that was sort of answered in the first class, where, for instance, someone asked "why Java?" — because I know Java — or "why Visual Basic?" — because I know Visual Basic. That is usually a bigger deciding factor than anything else. There is almost certainly some tool you already know, and unless that tool does not do the job — and it is unlikely to be the Unix tools alone — use it. If you know a statistical package, it has enough in it to do this range of analysis; if you know a programming language, it has enough to do this analysis. What we'll be focusing on in this course are techniques rather than tools — this class is probably one of the few exceptions — and even here, the focus is on knowing the logical chunks by which one extracts the data and the chunks by which one processes it, rather than on memorizing these commands. There are other ways of getting the same things done; this is just one way to get started, for those who want the commands. That's it.