 As we begin our discussion of coding and data science actually want to begin with something that's not coding. I want to talk about applications or programs that are already created that allow you to manipulate data. And we're going to begin with the most basic of these spreadsheets. We're going to do the rows and columns and cells of Excel. And the reason for this is you need spreadsheets. Now you may be saying to yourself, no, no, no, no, not me, because you know what, I'm fancy. I'm working in my big set of servers, I got fancy things going on. But you know what, you too, fancy people, you need spreadsheets as well. There's a few reasons for this. Most importantly, spreadsheets can be the right tool for data science in a lot of circumstances. There are a few reasons for that. Number one, spreadsheets, they're everywhere. They're ubiquitous. They're installed on a billion machines around the world. And everybody uses them. They probably have more data sets in spreadsheets than anything else. And so it's a very common format. Importantly, it's probably your client's format. A lot of your clients are going to be using spreadsheets for their own data. I've worked with billion dollar companies that keep all of their data in spreadsheets. And so when you're working with them, you need to know how to manipulate that and how to work with it. So regardless of what you're doing, spreadsheets, or specifically CSV comma separated value files, are sort of the lingua franca, the universal interchange format for data transfer to allow you to take it from one program to another. And then truthfully, in a lot of situations, they're really easy to use. And if you want a second opinion on this, let's take a look at this ranking. There's a survey of data mining experts, it's the Katie Nuggets data mining poll. And these are the tools they most use in their own work. And look at this, lowly Excel is fifth on the list. And in fact, what's interesting about it, it's above Hadoop and Spark two of the major big data fancy tools. And so Excel really does have place of pride in a toolkit for a data analyst. Now, since we're going to go into sort of the low tech end of things, let's talk about some of the things that you can do with a spreadsheet. At this point, they're really good for data browsing, you actually get to see all the data in front of you, which isn't true if you're doing something like our Python, they're really good for sorting data, sort by this column, then this column, then this column, they're really good for rearranging columns and cells and moving things around, they're good for finding and replacing and seeing what happens. So you know that it worked right. Some more uses, they're really good for formatting, especially conditional formatting. They're good for transposing data, switching the rows and the columns, they make that really easy. They're good for tracking changes. Now it's true if you're a big fancy data scientist, you're probably using GitHub, but for everybody else in the world, spreadsheets and the tracking changes is a wonderful way to do it. You can make pivot tables that allows you to explore the data in a very hands on way, in a very intuitive way. And they're also really good for arranging the output for consumption. Now, when you're working with spreadsheets, however, there's one thing you need to be aware of, they're really flexible. But that flexibility can be a problem. And that when you're working in data science, you specifically want to be concerned about something called tidy data. That's a term I borrowed from Hadley Wickham, very well known developer in the R world. Tidy data is for transferring data and making it work well. There's a few rules here that undo some of the flexibility inherent in spreadsheets. Number one, what you want to do is have a column be equivalent to the same thing as a variable, columns, variables, they are the same thing. And then rows are equal, exactly the same thing as cases. And then you have one sheet per file, and then you have one level of measurement, say individual, then organization, then state per file. Again, this is undoing some of the flexibility that's inherent in spreadsheets, but it makes it really easy to move the data from one program to another. Let me show you how all this works. You can try this in Excel. If you've downloaded the files for this course, we simply want to open up this spreadsheet. Let me go to Excel and show you how it works. So when you open up the spreadsheet, what you get is totally fictional data here that I made up, but it's showing sales over time of several products at two locations, like if you're selling stuff at a baseball field. And this is the way spreadsheets often appear. We've got blank rows and columns. We've got stuff arranged in a way that makes it easy for the person to process it. And we've got totals here and with formulas putting them all together. And that's fine. That works well for the person who made it. And then that's for one month, and then we have another month right here, and then we have another month right here, and then we combine them all for the first quarter of 2014. We've got some headers here, and we've got some conditional formatting and changes. And if we come to the bottom, we've got a very busy line graphic that eventually loads. It's not a good graphic, by the way. But similar to what you will often find. So this is the stuff that, while it may be useful for the client's own personal use, you can't feed this into R or Python. It'll just choke and it won't know what to do with it. And so you need to go through a process of tidying up the data. And what this involves is undoing some of this stuff. So for instance, here's data that is almost tidy. Here we have a single column for the date, a single column for the day, a column for the site, so we have two locations, A and B. And then we have six columns for the six different things that are sold and how many were sold on each day. Now in certain situations, you would want the data laid out exactly like this. If you're doing, for instance, the time series, you'll do something vaguely similar to this. But for true tidy stuff, we're going to collapse it even further. Let me come here to the tidy data. And now what I've done is I've created a new column that says, what is the item being sold? And so by the way, what this means is that we've got a really long data set now. It's got over 1,000 rows. Come back up to the top here. But what that shows you is that now it's in a format that's really easy to import from one program to another that makes it tidy and you can reminipulate it however you want once you get to each of those. So let's sum up our little presentation here in a few lines. Number one, no matter who you are, no matter what you're doing in data science, you need spreadsheets. And the reason for that is that spreadsheets are often the right tool for data science. Keep one thing in mind though, and that is as you're moving back and forth from one language to another, tidy data or well-formatted data is going to be important for exporting data into your analytical program or language of choice.