Welcome to Data Science: An Introduction. I'm Barton Poulson, and what we're going to do in this course is have a brief, accessible, and non-technical overview of the field of data science. Now, some people, when they hear "data science," start thinking about data and picture piles of equations and numbers, and then they throw "science" on top of that and picture people working in a lab, and they start to say, "That's not for me. I'm not really a technical person, and that just seems much too techy." Well, here's the important thing to know. While a lot of people get really fired up about the technical aspects of data science, the important thing is that data science is not so much a technical discipline as a creative one. And really, that's true. The reason I say that is because in data science you use tools that come from coding and statistics and math, but you use those to work creatively with data. The idea is that there's always more than one way to solve a problem or answer a question and, most importantly, to get insight, because the goal, no matter how you go about it, is to get insight from your data. And what makes data science unique compared to so many other things is that you try to listen to all of your data, even when it doesn't fit in easily with your standard approaches and paradigms. You're trying to be much more inclusive in your analysis. And the reason you want to do that is because everything signifies: everything carries meaning, and everything can give you additional understanding and insight into what's going on around you. And so in this course, what we're trying to do is give you a map to the field of data science and how you can use it. And so now you have the map in your hands, and you can get ready to get going with data science.
Welcome back to Data Science: An Introduction. We're going to begin this course by defining data science. That makes sense. But we're going to do it in kind of a funny way.
The first thing I'm going to talk about is the demand for data science. So let's take a quick look. Now, data science can be defined in a few ways. I'm going to give you some short definitions. Take one on my definition is that data science is coding, math, and statistics in applied settings. That's a reasonable working definition. But if you want to be a little more concise, I've got take two on a definition: data science is the analysis of diverse data, or data that you didn't think would fit into standard analytic approaches. A third way to think about it is that data science is inclusive analysis. It includes all of the data, all of the information that you have, in order to get the most insightful and compelling answer to your research questions. Now, you may say to yourself, "Wait, that's it?" Well, if you're not impressed, let me show you a few things. First off, let's take a look at this article. It says "Data Scientist: The Sexiest Job of the 21st Century." And please note that this is coming from Harvard Business Review. So this is an authoritative source, and it's the official source of this saying that data science is sexy. Now, again, you may be saying to yourself, "Sexy? I hardly think so." Well, oh yeah, it's sexy. And the reason data science is sexy is because, first, it has rare qualities, and second, it has high demand. Let me say a little more about those. The rare qualities are that data science takes unstructured data, then finds order, meaning, and value in the data. Those are important, but they're not easy to come across. Second, high demand. Well, the reason it's in high demand is because data science provides insight into what's going on around you. And critically, it provides competitive advantage, which is a huge thing in business settings. Now, let me go back and say a little more about demand. Let's take a look at a few other sources.
So for instance, the McKinsey Global Institute published a very well-known paper, and you can get at it with this URL. And if you go to that web page, this is what's going to come up. We're going to take a quick look at this one, the executive summary, the PDF that you can download. And if you open that up, you'll find this page. Let's take a look at the bottom right corner. There are two numbers here, and I'm going to zoom in on those. The first one is that they are projecting a need in the next few years for somewhere between 140,000 and 190,000 deep analytical talent positions. So this means actual practicing data scientists. That's a huge number. But almost 10 times as high is the 1.5 million more data-savvy managers who will be needed to take full advantage of big data in the United States. So that's people who aren't necessarily doing the analysis but have to understand it, who have to speak data. And that's one of the main purposes of this particular course: to help people who may or may not be the practicing data scientists learn to understand what they can get out of data and some of the methods used to get there.
Let's take a look at another article, from LinkedIn. Here's a shortcut URL for it. And that will bring you to this web page, the 25 hottest job skills that got people hired in 2014. And take a look at number one here: statistical analysis and data mining, very closely related to data science. And just to be clear, this was number one in Australia and Brazil and Canada and France and India and the Netherlands and South Africa and the United Arab Emirates and the United Kingdom. Everywhere. And if you need a little more, let's take a look at Glassdoor, which published an article this year, 2016. It's about the 25 best jobs in America. And look at number one right here: it's data scientist. We can zoom in on this information.
It says there's going to be 1,700 job openings, with a median base salary of over $116,000 and fabulous career opportunities and job scores. So if you want to take all of this together, the conclusion you can reach is that data science pays. And I can show you a little more about that. So for instance, here's the list of the top 10 highest-paying jobs that I got from US News. We have physicians or doctors, dentists, lawyers, and so on. Now, if we add data scientists to this list, using data from O'Reilly.com, we have to push things around a little. Data science goes in third, with an average total salary (not the base that we had in the other one, but the total compensation) of about $144,000 a year. That's extraordinary. So in sum, what do we get from all this? First off, we learned that there is a very high demand for data science. Second, we learned that there is a critical need for both specialists, those are the practicing data scientists, and generalists, people who speak the language and know what can be done. And of course, there's excellent pay. All together, this makes data science a compelling career alternative and a way of making you better at whatever you're doing.
Back here in data science, we're going to continue our attempt to define data science by looking at something that's really well known in the field: the data science Venn diagram. Now, if you want to, you can think of this in terms of what the ingredients of data science are. Well, we're going to first say thanks to Drew Conway, the guy who came up with this. And if you want to see the original article, you can go to this address. But what Drew said is that data science is made of three things. And we can put them as overlapping circles, because it's the intersection that's important. Here on the top left is coding, or computer programming, or as he calls it, hacking. On the top right is stats, or stats and mathematics, or quantitative abilities in general.
And on the bottom is domain expertise, or intimate familiarity with a particular field of practice: business or health or education or something like that. And the intersection here in the middle, that is data science. So it's the combination of coding and statistics and math and domain knowledge.
Now let's say a little more about coding. The reason coding is important is that it helps you gather and prepare the data, because a lot of the data comes from novel sources and it's not necessarily ready for you to gather. And it can be in very unusual formats. And so coding is important because it can require some real creativity to get the data from the sources and put it into your analysis. Now, a few kinds of coding are important. For instance, there's statistical coding. A couple of major languages in this are R and Python, two open-source, free programming languages. R is specifically for data; Python's general purpose but well adapted to data. The ability to work with databases is important too. The most common language there is SQL, usually pronounced "sequel," which stands for Structured Query Language, and it matters because that's where the data is. Also, there's the command line interface, or if you're on a Mac, people just call it the terminal. The most common language there is Bash, which actually stands for Bourne Again Shell. And then searching is important, and regex, or regular expressions. There's not a huge amount to learn there, it's a small little field, but it's sort of like super-powered wildcard searching that makes it possible for you to both find the data and reformat it in ways that are going to be helpful for your analysis.
Now let's say a few things about the math. You're going to need things like a little bit of probability, some algebra, and of course regression, a very common statistical procedure. Those things are important.
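The "super-powered wildcard searching" that regular expressions provide can be sketched in a few lines of Python. This is only an illustration and not something shown in the course: the sample records, the two date patterns, and the helper name `normalize_date` are all assumptions made up for the example. It finds dates written in two different styles and reformats them into one consistent style.

```python
import re

# Hypothetical raw records: the same kind of date written inconsistently.
raw_lines = [
    "order 1041 shipped 2016-03-07",
    "order 1042 shipped 3/9/2016",
    "order 1043 pending",
]

# Two assumed patterns: YYYY-MM-DD and month/day/year.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
US_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def normalize_date(line):
    """Return the date found in a line as YYYY-MM-DD, or None if there is none."""
    match = ISO_DATE.search(line)
    if match:
        return match.group(0)
    match = US_DATE.search(line)
    if match:
        month, day, year = match.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"
    return None


dates = [normalize_date(line) for line in raw_lines]
print(dates)  # ['2016-03-07', '2016-03-09', None]
```

The same idea, finding a pattern and then rewriting it, carries over to most data-cleaning tasks where the source format is irregular.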
The reason you need the math is that it's going to help you choose the appropriate procedures to answer the question with the data that you have. And probably even more importantly, it's going to help you diagnose problems when things don't go as expected. And given that you're trying to do new things with new data in new ways, you're probably going to come across problems. And so the ability to understand the mechanics of what's going on is going to give you a big advantage.
The third element of the data science Venn diagram is some sort of domain expertise. Think of it as expertise in the field that you're in. Business settings are common. You need to know about the goals of that field, the methods that are used, and the constraints that people come across. And it's important because whatever your results are, you need to be able to implement them well. Data science is very practical, and it's designed to accomplish something. And your familiarity with a particular field of practice is going to make it that much easier and more impactful when you implement the results of your analysis.
Now let's go back to our Venn diagram here just for a moment. Because this is a Venn, we also have these intersections of two circles at a time. At the top is machine learning. At the bottom right is traditional research. And on the bottom left is what Drew Conway called the danger zone. Let me talk about each of these. First off, machine learning, or ML. The idea here is that it represents coding, or statistical programming, and mathematics without any real domain expertise. Sometimes these are referred to as black box models: you kind of throw data in, and you don't even necessarily have to know what it means or what language it's in, and it'll just kind of crunch through it all and give you some regularities. That can be very helpful.
But machine learning is considered slightly different from data science because it doesn't involve the particular applications in a specific domain. Also, there's traditional research. This is where you have math or statistics and you have domain knowledge, often very intensive domain knowledge, but without the coding or programming. Now, you can get away with that because the data that you use in traditional research is highly structured. It comes in rows and columns, it's typically complete, and it's typically ready for analysis. That doesn't mean your life is easy, because now you have to expend an enormous amount of effort on the method, on designing the project, and on the interpretation of the data. So it's still very heavy intellectual, cognitive work, but it comes in a different place. And then finally, there's what Conway called the danger zone. That's the intersection of coding and domain knowledge, but without math or statistics. Now, he says it's unlikely to happen, and that's probably true. On the other hand, I can think of some common examples. One is what are called word counts, where you take a large document or series of documents and you count how often each word appears in there. That can actually tell you some important things. Another is drawing maps and showing how things change across place and maybe across time. You don't necessarily have to have the math, but it can be very insightful and helpful.
So let's think about a couple of backgrounds that people come from here. First is coding. You can have people who are coders who can do math, stats, and business. So you get the three things. And this is probably the most common; most of the people come from a programming background. On the other hand, there's also stats, or statistics. You can get statisticians who can code and who also can do business. That's less common, but it does happen. And finally, there are people who come into data science from a particular domain.
These are, for instance, business people who can code and do numbers. And they're the least common. But all of these are important to data science. And so in sum, here's what we can take away. First, several fields make up data science. Second, diverse skills and backgrounds are important, and they're needed in data science. And third, there are many roles involved, because there are a lot of different things that need to happen. We'll say more about that in our next movie.
The next step in our data science introduction and our definition of data science is to talk about the data science pathway. I like to think of this as when you're working on a major project: you've got to do one step at a time to get from here to there. In data science, you can take the various steps and put them into a couple of general categories. First, there are the steps that involve planning. Second, there's the data prep. Third, there's the actual modeling of the data. And fourth, there's the follow-up. There are several steps within each of these, and I'll explain each of them briefly.
First, let's talk about planning. The first thing you need to do is define the goals of your project, so you know how to use your resources well, and also so you know when you're done. Second, you need to organize your resources. You might have data from several different sources, you might have different software packages, and you might have different people, which gets us to the third one: you need to coordinate the people so they can work together productively. If you're doing a handoff, it needs to be clear who's going to do what and how their work is going to go together. And then, really to state the obvious, you need to schedule the project so things can move along smoothly and you can finish in a reasonable amount of time. Next is the data prep, which is like food prep: you're getting the raw ingredients ready.
First, of course, you need to get the data, and it can come from many different sources and be in many different formats. You need to clean the data. And the sad thing is, this tends to be a very large part of any data science project. That's because you're bringing in unusual data from a lot of different places. You also want to explore the data, that is, really see what it looks like: how many people are in each group, what the shapes of the distributions are like, what's associated with what. And you may need to refine the data. That means choosing variables to include, choosing cases to include or exclude, and making any transformations to the data that you need to do. And of course, these steps can kind of bounce back and forth from one to the other.
The third group is modeling, or statistical modeling. This is where you actually want to create the statistical model. So for instance, you might do a regression analysis, or you might do a neural network. But whatever you do, once you create your model, you have to validate the model. You might do that with a holdout validation, or you might do it with a very small replication, if you can. You also need to evaluate the model. So once you know that the model is accurate, what does it actually mean, and how much does it tell you? And then finally, you need to refine the model. So for instance, there may be variables you want to throw out, and there may be additional ones you want to include. You may want to, again, transform some of the data. You may want to get it so it's easier to interpret and apply.
And that gets us to the last part of the data science pathway, and that's follow-up. Once you've created your model, you need to present the model, because it's usually work that's being done for a client; it could be in house, it could be a third party. But you need to take the insights that you got and share them in a meaningful way with other people. You also need to deploy the model.
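The "create the model, then validate it with a holdout" steps mentioned above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the course's own procedure: the data are simulated, the 75/25 split is an arbitrary choice, and the helper names `fit_line` and `rmse` are made up for the example. The point is only that the model is fit on one part of the data and then checked on cases it never saw.

```python
import random

random.seed(0)

# Hypothetical data: y depends linearly on x, plus random noise.
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 3.0 + 2.0 * x + random.gauss(0, 1)) for x in xs]

# Hold out 25% of the cases; the model never sees them while fitting.
random.shuffle(data)
split = int(len(data) * 0.75)
train, holdout = data[:split], data[split:]


def fit_line(points):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(points, intercept, slope):
    """Root mean squared prediction error of the fitted line on a set of points."""
    squared_errors = [(y - (intercept + slope * x)) ** 2 for x, y in points]
    return (sum(squared_errors) / len(squared_errors)) ** 0.5


intercept, slope = fit_line(train)
train_rmse = rmse(train, intercept, slope)
holdout_rmse = rmse(holdout, intercept, slope)

print(f"intercept={intercept:.2f} slope={slope:.2f}")
print(f"train RMSE={train_rmse:.2f} holdout RMSE={holdout_rmse:.2f}")
```

If the holdout error is much larger than the training error, that is a sign the model is fitting quirks of the training cases rather than something that generalizes.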
A model is usually built in order to accomplish something. So for instance, if you're working with an e-commerce site, you may be developing a recommendation engine that says people who bought this and this might buy this. You need to actually stick it on the website and see if it works the way you expected it to. Then you need to revisit the model, because a lot of times the data that you worked on is not necessarily all of the data. Things can change when you get out in the real world, or things just change over time. And so you have to see how well your model is working. And then, just to be thorough, you need to archive the assets: document what you have and make it possible for you or for others to repeat the analysis or develop off of it in the future. So those are the general steps of what I consider the data science pathway. And in sum, what we get from this is three things. First, data science isn't just a technical field. It's not just coding. Things like planning and presenting and implementing are just as important. Second, contextual skills, knowing how it works in a particular field and knowing how it will be implemented, matter as well. And third, as you got from this whole thing, there are a lot of things to do. If you go one step at a time, there'll be less backtracking, and you'll ultimately be more productive in your data science projects.
We'll continue our definition of data science by looking at the roles that are involved in data science, the ways that different people can contribute to it. That's because it tends to be a collaborative thing, and it's nice to be able to say that we're all working together towards a single goal. So let's talk about some of the roles involved in data science and how they contribute to the projects. First off, let's take a look at engineers. These are people who focus on the back-end hardware, for instance the servers, and the software that runs them. This is what makes data science possible.
And it includes people like software developers or database administrators, and they provide the foundation for the rest of the work. Next, you can also have people who are big data specialists. These are people who focus on computer science and mathematics. They may do machine learning algorithms as a way of processing very large amounts of data. And they often create what are called data products. So a thing that tells you what restaurant to go to, or that says you might know these friends, or that provides ways of linking up photos: those are data products, and they often involve a huge amount of very technical work behind them. There are also researchers. These are people who focus on domain-specific research, so for instance physics or genetics or whatever. These people tend to have very strong statistics, and they can use some of the procedures and some of the data that come from the other people, like the big data researchers. But they focus on the specific questions. Also, in the data science realm, you'll find analysts. These are people who focus on the day-to-day tasks of running a business. So for instance, they might do web analytics, like Google Analytics, or they might pull data from a SQL database. This information is very important and good for business, and so analysts are key to the day-to-day functioning of business. But, you know, they may not exactly be data science proper, because most of the data they're working with is going to be pretty structured. Nevertheless, they play a critical role in business in general. And then, speaking of business, you have the actual business people, the men and women who organize and run businesses. These people need to be able to frame business-relevant questions that can be answered with the data. Also, the business person manages the project and the efforts and the resources of others.
And while they may not actually be doing the coding, they must speak data. They must know how the data works, what it can answer, and how to implement it. You can also have entrepreneurs. So you might have, for instance, a data startup; they're starting their own little social network or their own little web search platform. An entrepreneur needs data and business skills. And truthfully, they have to be creative at every step along the way, usually because they're doing it all themselves at a smaller scale. Then we have in data science something known as the full stack unicorn. This is a person who can do everything at an expert level. And they're called a unicorn because, truthfully, they may not actually exist. I'll have more to say about that later. But for right now, we can sum up what we got out of this video in three things. Number one, data science is diverse. There are a lot of different people who go into it. Number two, they have different goals for their work, and they bring in different skills and different experiences and different approaches. And number three, they tend to work in very different contexts. An entrepreneur works in a very different place from a business manager, who works in a very different place from an academic researcher. But all of them are connected in some way to data science and make it a richer field.
The last thing I want to do in Data Science: An Introduction, where I'm trying to define data science, is to talk about teams in data science. The idea here is that data science has many different tools, and different people are going to be experts in each one of them. Now, you have, for instance, coding, and you have statistics. Also, you have fields like design or business and management that are involved. And the question, of course, is who can do all of it? Who's able to do all of these things at the level that we need? Well, that's where we get this saying; I've mentioned it before. It's the unicorn.
And just like in ancient history, the unicorn is a mythical creature with magical abilities. In data science, it works a little differently: it is a mythical data scientist with universal abilities. The trouble is, as we know from the real world, there are really no unicorn animals, and there are really not many unicorns in data science. Really, there's just people. And so we have to find out how we can do the projects even though we don't have this one person who can do everything for everybody. So let's take a hypothetical case just for a moment. I'm going to give you some fictional people. Here is my fictional person Otto, who has strong visualization skills and good coding but limited analytic or statistical ability. And if we graph out his abilities, here we've got five things that we need to have happen. For the project to work, they all have to happen at a level of at least eight on the zero-to-10 scale. If we take his coding ability, he's almost there. Statistics, not quite halfway. Graphics, yes, he can do that. And then business, all right, and project, pretty good. So what you can see here is that in only one of these five areas is Otto sufficient on his own. On the other hand, let's pair him up with somebody else. Let's take a look at Lucy. Lucy has strong business training and good tech skills but limited graphics. And so if we get her profile on the same things that we saw, there's coding, pretty good; statistics, pretty good; graphics, not so much; business, good; and projects, okay. Now, the important thing here is that we can make a team. So let's take our two fictional people, Otto and Lucy, and put together their abilities. Now, I actually have to change the scale here a little bit to accommodate the both of them. But our criterion is still at eight; we need a level of eight in order to do the project competently. And we combine them. Oh, look, coding is now past eight.
Statistics is past eight. Graphics is way past. Business, way past. And then the projects, they're there too. And so when we combine their skills, we're able to get the level that we need for everything. Or, to put it another way, we have now created a unicorn by team, and that makes it possible to do the data science project. So in sum, you usually can't do data science on your own; that's a very rare individual. Or more specifically, people need people. And in data science, you have the opportunity to take several people and make collective unicorns, so you can get the insight that you need in your project and you can get the things done that you want.
In order to get a better understanding of data science, it can be helpful to look at contrasts between data science and other fields. Probably the most informative is with big data, because these two terms are actually often confused. It makes me think of situations where you have two things that are very similar but not the same, like we have here in the Piazza San Carlo in Turin, Italy. Part of the problem stems from the fact that data science and big data both have Venn diagrams associated with them. So for instance, Venn number one, for data science, is something we've seen already. We have three circles: we have coding, we have math, and we have some domain expertise, and put together they give you data science. On the other hand, Venn diagram number two is for big data. It also has three circles: we have the high volume of data, the rapid velocity of data, and the extreme variety of data. Take those three V's together, and you get big data. Now, we can also combine these two, if we want, in a third Venn diagram. We call it big data and data science. This time it's just two circles, with big data on the left and data science on the right. And the intersection there in the middle is big data science, which is actually a real term.
But if you want to do a compare and contrast, it kind of helps to look at how you can have one without the other. So let's start by looking at big data without data science. These are situations where you may have the volume or velocity or variety of data but don't need all the tools of data science. So we're just looking at the left side of the diagram right now. Now, truthfully, this only works if you can have big data without all three V's. Some say you have to have the volume, velocity, and variety for it to count as big data. I basically say anything that doesn't fit into a standard machine is probably big data. I can think of a couple of examples here of things that might count as big data but maybe don't count as data science. One is machine learning, where you can have very large and probably very complex data sets, but it doesn't require much domain expertise. So that may not be data science. Another is word counts, where you have an enormous amount of data and it's actually a pretty simple analysis. Again, it doesn't require much sophistication in terms of quantitative skills or even domain expertise. So maybe, maybe not data science. On the other hand, to do any of these, you're going to need to have at least two skills: you're going to need to have the coding, and you will probably have to have some sort of quantitative skills as well.
So how about data science without big data? That's the right side of this diagram. Well, to make that happen, you're probably talking about data with just one of the three V's from big data. So either volume or velocity or variety, but singly. So for instance, genetics data. You have a huge amount of data, it comes in a very set structure, and it tends to come in at once. So you've got a lot of volume. It's a very challenging thing to work with, and you have to use data science, but it may or may not count as big data.
Similarly, there's streaming sensor data, where you have data coming in very quickly, but you're not necessarily saving it; you're just looking at these windows in it. That's a lot of velocity. It's difficult to deal with, and it takes data science, the full skill set, but it may not require big data per se. Or facial recognition, where you have enormous variety in the data because you're getting photos or videos that are coming in. Again, it's very difficult to deal with and requires a lot of ingenuity and creativity, and it may or may not count as big data, depending on how much of a stickler you are about definitions. Now, if you want to combine the two, we can talk about big data science. In that case, we're looking right here at the middle. This is a situation where you have volume and velocity and variety in your data. And truthfully, if you have the three of those, you are going to need the full data science skill set. You're going to need coding and statistics and math, and you're going to have to have domain expertise, primarily because of the variety you're dealing with. But taken all together, you do have to have all of it. So in sum, here's what we get. Big data is not equal to, is not identical to, data science. Now, there's common ground, and a lot of people who are good at big data are good at data science and vice versa, but they are conceptually distinct. On the other hand, there is the shared middle ground of big data science that unifies the two separate fields.
Another important contrast you can make in trying to understand data science is to compare it with coding, or computer programming. Now, this is where you're trying to work with a machine, and you're trying to talk to that machine to get it to do things. In one sense, you can think of coding as just giving task instructions, how to do something. It's a lot like a recipe when you're cooking. You get some sort of user input or other input.
And then maybe you have if-then logic, and you get output from it. To take an extremely simple example, if you're programming in Python version 2, you write print and then, in quotes, "hello world," and that will put the words "hello world" on the screen. So you gave it some instructions, and it gave you some output. Very simple programming. Now, coding with data gets a little more complicated. So for instance, there are word counts, where you take a book or a whole collection of books, you take the words, and you count how many there are in there. Now, this is a conceptually simple task, and domain expertise and really math and statistics are not vital. But to make valid inferences and generalizations in the face of variability and uncertainty in the data, you need statistics, and by extension, you need data science. It might help to compare the two by looking at the tools of the respective trades. So for instance, there are tools for coding, or generic computer programming, and there are tools that are specific to data science. What I have right here is a list from the IEEE of the top 10 programming languages in 2015. It starts at Java and C and goes down to shell. And some of these are also used for data science. So for instance, Python and R and SQL are used for data science. But the other ones aren't major ones in data science. So let's, in fact, take a look at a different list, of the most popular tools for data science. You see that things move around a little bit. Now R's at the top, SQL's there, Python's there. But for me, what's most interesting on this list is that Excel is number five. It would never be considered programming per se, but it is in fact a very important tool for data science. And that's one of the ways that we can compare and contrast computer programming with data science. In sum, we can say this: data science is not equal to coding. They're different things.
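The two coding examples described above, printing "hello world" and counting words, can be sketched in a few lines of Python. This is just an illustration: the sample sentence is made up, the pattern used to split words is one assumption among many possible ones, and the code uses Python 3, where `print` is a function, rather than the Python 2 statement form mentioned in the talk.

```python
import re
from collections import Counter

# Python 3 form; in Python 2 this was written: print "hello world"
print("hello world")

# Hypothetical text standing in for a book or a collection of books.
text = "the quick brown fox jumps over the lazy dog and the dog barks back"

# Split into lowercase words (letters and apostrophes only) and count them.
words = re.findall(r"[a-z']+", text.lower())
word_counts = Counter(words)

print(word_counts.most_common(2))  # [('the', 3), ('dog', 2)]
```

The counting itself needs no statistics at all, which is the point made above; statistics comes in when you want to generalize from those counts.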
On the other hand, they share some of the tools, and they share some practices, specifically when coding for data. But there is one very big difference, in that statistical ability is one of the major separators between general purpose programming and data science programming. When we talk about data science and we're contrasting it with other fields, another pair that a lot of people get confused and think are the same thing is data science and statistics. Now, I'll tell you, there's a lot in common, but we can talk a little bit about the different focuses of each. And we also get into the issue of sort of definitionalism, that data science is different because we define it differently, even when there's an awful lot in common between the two. It helps to take a look at some of the things that go on in each field. So let's start here with statistics, put a little circle here, and we'll put data science. And to borrow a term from Stephen Jay Gould, we can call these non-overlapping magisteria, or NOMA. So you think of them as separate fields that are sovereign unto themselves, with nothing to do with each other. But, you know, that doesn't seem right. And part of that is, if we go back to the data science Venn diagram, statistics is one part of it. There it is in the top corner. So now what do we do? What's the relationship? It doesn't make sense to say these are totally separate areas. Maybe, because data science and statistics share procedures, data science is a subset or a specialty of statistics, more like this. But if data science were just a subset or specialty within statistics, then it would follow that all data scientists would first be statisticians. And interestingly, that's just not so. Say, for instance, we take a look at the data science stars, the superstars in the field. We go to a rather intimidating article. It's called The World's Seven Most Powerful Data Scientists, from Forbes.com.
And you can see the article if you go to this URL. There's actually more than seven people on the list, because sometimes the article brings them up in pairs. But let's check their degrees and see what their academic training is in. If we take all the people on this list, we have five degrees in computer science, three in math, two in engineering, and one each in biology, economics, law, speech pathology, and statistics. And so that tells us, of course, that these major people in data science are not trained as statisticians. Only one of them has formal training in that. So that gets us to the next question: where do these two fields, statistics and data science, diverge? Because they seem like they should have a lot in common, but they don't have a lot in common in training. Specifically, we can look at the training: most data scientists are not trained formally as statisticians. Also, in practice, things like machine learning and big data, which are central to data science, are not generally shared with most of statistics. And so they have separate domains there. And then there's the really important issue of context. Data scientists tend to work in different settings than statisticians. Specifically, data scientists very often work in commercial settings, where they're trying to get recommendation engines or ways of developing a product that will make them money. So maybe instead of having data science as a subset of statistics, we can think of it more as these two fields having different niches. They both analyze data, but they do different things in different ways. So maybe it's fair to say they overlap, they both have the analysis of data in common, but otherwise they are ecologically distinct. So in sum, what we can say here is that data science and statistics both use data and they analyze it. But the people in each tend to come from different backgrounds, and they tend to function with different goals and contexts.
And in that way, render them to be conceptually distinct fields, despite the apparent overlap. As we work to get a grasp on data science, there's one more contrast I want to make explicitly. And that's between data science and business intelligence or BI. The idea here is that business intelligence is data in real life. It's very, very applied stuff. The purpose of BI is to get data on internal operations on market competitors and so on, and make justifiable decisions, as opposed to just sitting in the bar and doing whatever comes to your mind. Now, data science is involved with this, except, you know, really, there's no coding in BI, there's using apps that already exist. And the statistics in business intelligence tend to be very simple. They tend to be counts and percentages and ratios. And so it's simple. The light bulb is simple. It just does this one job. There's nothing super sophisticated there. Instead, the focus in business intelligence is on domain expertise, and on really useful direct utility. It's simple, it's effective. And it provides insight. Now, one of the main associations with business intelligence is what are called dashboards or data dashboards, they look like this, it's a collection of charts and tables that go together to give you a very quick overview of what's going on your business. And while a lot of data scientists may let's say look down their nose upon dashboards, I'll say this, most of them are very well designed and you can learn a huge amount about user interaction, and the accessibility of information from dashboards. So really, where does data science come into this? What's the connection between data science and business intelligence? Well, data science can be useful to BI, in terms of setting it up identifying data sources and creating or setting up the framework for something like a dashboard or a business intelligence system. Also, data science can be used to extend it. 
Data science can help get past the easy questions and the easy data to get to the questions that are actually most useful to you, even if they sometimes require data that's hard to wrangle and work with. And also, there's an interesting interaction here that goes the other way. Data science practitioners can learn a lot about design from good business intelligence applications. So I strongly encourage anybody in data science to look at them carefully and see what they can learn. In sum, business intelligence, or BI, is very goal oriented. Data science perhaps prepares the data and sets up the form for business intelligence. But also, data science can learn a lot about usability and accessibility from business intelligence. And so it's always worth taking a close look. Data science has a lot of really wonderful things about it. But it is important to consider some ethical issues. And I'll specifically call this do no harm in your data science projects. And for that, we can say thanks to Hippocrates, the guy who gave us the Hippocratic oath of do no harm. Let's talk very briefly about some of the important ethical issues that come up in data science. Number one is privacy. Data tells you a lot about people, and you need to be concerned about confidentiality. If you have private information about people, their names, their social security numbers, their addresses, their credit scores, their health, that's private, that's confidential. And you shouldn't share that information unless they specifically gave you permission. Now, one of the reasons this presents a special challenge in data science is that, as we'll see later, a lot of the sources that are used in data science were not intended for sharing. If you scrape data from a website or from PDFs, you need to make sure that it's okay to do that. But it was originally created without the intention of sharing.
So privacy is something that really falls upon the analyst to make sure they're doing it properly. Next is anonymity. One of the interesting things we find is that it's really not hard to identify people in data. If you have a little bit of GPS data, and you know where a person was at four different points in time, you have about a 95% chance of knowing exactly who they are. You look at things like HIPAA, that's the Health Insurance Portability and Accountability Act. Before HIPAA, it was really easy to identify people from medical records. Since then, it has become much more difficult to identify people uniquely. That's an important thing, really, for people's well being. And then there's also proprietary data. If you're working for a client, a company, and they give you their own data, that data may have identifiers. You may know who the people are, and they're not anonymous anymore. So anonymity may or may not be there. There are major efforts to make data anonymous. But really, the primary thing is that even if you do know who they are, you still maintain the privacy and confidentiality of the data. Next, there's the issue of copyright, where people try to lock down information. Now, just because something is on the web doesn't mean that you're allowed to use it. Scraping data from websites is a very common and useful way of getting data for projects. You can get data from web pages, from PDFs, from images, from audio, from really a huge number of things. But again, the assumption that because it's on the web, it's okay to use it, is not true. You always need to check copyright and make sure that it's acceptable for you to access that particular data. Next, in our very ominous picture, is data security. And the idea here is that when you go through all the effort to gather data, to clean it up and prepare it for an analysis, you've created something that's very valuable to a lot of people.
And you have to be concerned about hackers trying to come in and steal the data, especially if the data is not anonymous and it has identifiers in it. And so there is an additional burden placed on the analyst to ensure, to the best of their ability, that the data is safe and cannot be broken into and stolen. And that can include very simple things, like a person who was on the project but no longer is, and who took the data on a flash drive. You have to find ways to make sure that that can't happen as well. There's a lot of possibilities. It's tricky. But it's something that you have to consider thoroughly. Now, there are two other things that come up in terms of ethics but don't usually get addressed in these conversations. Number one is potential bias. The idea here is that the algorithms or the formulas that are used in data science are only as neutral and bias free as the rules and the data that they get. And so if you have rules that address something that is associated with, for instance, gender or age or race or economic standing, you might unintentionally be building in those factors, which, say for instance under Title IX, you're not supposed to do, and you'd be building them into the system without being aware of it. And an algorithm has this sheen of objectivity, and people can place confidence in it without realizing that it's replicating some of the prejudices that may happen in real life. Another issue is overconfidence. And the idea here is that analyses are limited simplifications. They have to be, that's just what they are. And because of this, you still need humans in the loop to help interpret and apply them. The problem is when people run an algorithm and get out a number, say to 10 decimal places, and they say this must be true, and treat it as written in stone, absolutely unshakable truth.
When in fact, if the data were biased going in, if the algorithms were incomplete, if the sampling was not representative, you can have enormous problems and go down the wrong path with too much confidence in your own analyses. So once again, humility is in order when doing data science work. In sum, data science has enormous potential, but it also has significant risks involved in the projects. Part of the problem is that analyses can't be neutral, that you have to look at how the algorithms are associated with the preferences, prejudices and biases of the people who made them. And what that means is that no matter what, good judgment is always vital to the quality and success of a data science project. Data science is a field that is strongly associated with its methods or procedures. In this section of videos, we're going to provide a brief overview of the methods that are used in data science. Now, just as a quick warning, in this section, things can get kind of technical and that can cause some people to sort of freak out. But this course is a non technical overview. The technical hands on stuff is in the other courses. And it's really important to remember that tech is simply the means to doing data science. Insight or the ability to find meaning in your data. That's the goal tech only helps you get there. And so we want to focus primarily on insight and the tools and the tech as they serve to further that goal. Now there's a few general categories we're going to talk about again, with an overview for each of these. The first one is sourcing or data sourcing, that is how to get the data that goes into data science, the raw materials that you need. The second is coding that again is computer programming that can be used to obtain and manipulate and analyze the data. After that, a tiny bit of math that is the mathematics behind data science methods that really form the foundations of the procedures. 
And then stats, the statistical methods that are frequently used to summarize and analyze data, especially as applied to data science. And then there's machine learning ML. This is a collection of methods for finding clusters in the data for predicting categories or scores on interesting outcomes. And even across these five things, even then, the presentations aren't too techy crunchy, they're basically still friendly. And you know, really, that's the way it is. And so that is the overview of the overviews. In sum, we need to remember that data science includes tech, but data science is greater than tech, it's more than those procedures. And above all, that tech while important to data science is still simply a means to insight in data. The first step in discussing data science methods is to look at the methods of sourcing or getting data that's used in data science. You can think of this as getting the raw materials that go into your analysis. Now you've got a few different choices when it comes to this in data science. You can use existing data, you can use something called data APIs, you can scrape web data, or you can make data. We'll talk about each of those very briefly in a non technical manner. But right now, let me say something about existing data. This is data that already is at hand and it might be in house data. So if you work for a company, it might be your company records. Or you might have open data, for instance, many governments, many scientific organizations make their data available to the public. And then there's also third party data, this is usually data that you buy from a vendor, but it exists and it's very easy to plug it in and go. You can also use API's. Now that stands for application programming interface. And this is something that allows various computer applications to communicate directly with each other. It's like phones for your computer programs. 
It's the most common way of getting web data, and the beautiful thing about it is that it allows you to import that data directly into whatever program or application you're using to analyze the data. Next is scraping data. And this is where you want to use data that's on the web, but they don't have an existing API. What that usually means is data that's in HTML web tables and pages, maybe PDFs, and you can do this either by using specialized applications for scraping data, or you can do it in a programming language like R or Python and write the code to do the data scraping. Or another option is to make data, and this lets you get exactly what you need. You can be very specific. You can do something like interviews, or you can do surveys, or you can do experiments. There's a lot of approaches. Most of them require some specialized training in terms of how to gather quality data. And that's actually important to remember, because no matter what method you use for getting or making new data, you need to remember this one little aphorism you may have heard from computer science. It goes by the name of GIGO, and that stands for garbage in, garbage out. It means that if you have bad data that you're feeding into your system, you're not going to get anything worthwhile, any real insights, out of it. Consequently, it's important to pay attention to metrics, or methods for measuring, and to the meaning, exactly what it is that they tell you. There's a few ways you can do this. For instance, you can talk about business metrics. You can talk about KPIs, which means key performance indicators, also used in business settings, or SMART goals, which is a way of describing goals that are actionable and timely and so on. You can also talk, in a measurement sense, about classification accuracy. And I'll discuss each of those in a little more detail in a later movie.
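Just to give a rough sense of what classification accuracy means, here's a minimal sketch in Python. It simply takes the share of cases where the predicted label matches the actual label. The function name classification_accuracy and the spam and ham labels are made up for illustration, and real projects usually look at more than this one number.

```python
def classification_accuracy(actual, predicted):
    """Return the fraction of cases where the predicted label equals the actual one."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one case to compute accuracy")
    # Count the matches, then divide by the total number of cases.
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Made-up labels: five messages, one of which is misclassified.
actual = ["spam", "ham", "spam", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "ham"]

accuracy = classification_accuracy(actual, predicted)
print(f"Accuracy: {accuracy:.0%}")
```

And garbage in, garbage out applies here too. If the actual labels were recorded badly, a high accuracy number doesn't tell you much.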
But for right now, in sum, we can say this: data sourcing is important because you need to get the raw materials for your analysis. The nice thing is there's many possible methods, many ways that you can get the data for data science. But no matter what you do, it's important to check the quality and the meaning of the data so you can get the most insight possible out of your project. The next step we need to talk about in data science methods is coding. And I'm going to give you a very brief, non technical overview of coding in data science. The idea here is that you're going to get in there and you are going to be king of the jungle, master of your domain, and make the data jump when you need it to jump. Now, if you remember when we talked about the data science Venn diagram in the beginning, coding is up here on the top left. And while we often think about people typing lines of code, which is very frequent, it's more important to remember that when we talk about coding, or just computers in general, what we're really talking about is any technology that lets you manipulate the data in the ways you need, to perform the procedures you need, to get the insight that you want out of your data. Now, there are three very general categories that we'll be discussing here at Datalab. The first is apps. These are specialized applications or programs for working with data. The second is data, or specifically data formats. There are special formats for web data. I'll mention those in a moment. And then code. There are programming languages that give you full control over what the computer does and how you interact with the data. Let's take a look at each one very briefly. In terms of apps, there are spreadsheets like Excel or Google Sheets. These are the fundamental data tools of probably the majority of the world.
There are specialized applications like Tableau for data visualization, or SPSS, a very common statistical package in the social sciences and in business. And one of my favorites, JASP, which is a free, open source analog of SPSS, which actually I think is a lot easier to use and to replicate research with. And there are tons of other choices. Now, in terms of web data, it's helpful to be familiar with things like HTML and XML and JSON and other formats that are used to encapsulate data on the web, because those are the things that you're going to have to program around and interact with when you get your data. And then there are actual coding languages. R is probably the most common, along with Python, a general purpose language that's been well adapted for data use. There's SQL, the structured query language for databases, and very basic languages like C and C++ and Java, which are used more in the back end of data science. And then there's Bash, the most common command line interface, and regular expressions. And we'll talk about all of these in other courses here at Datalab. But remember this: tools are just tools. They're only one part of the entire data science process. They are a means to the end. And the end, the goal, is insight. You need to know where you're trying to go, and then simply choose the tools that help you reach that particular goal. That's the most important thing. So in sum, here's a few things. Number one, use your tools wisely. Remember, your questions need to drive the process, not the tools themselves. Also, I'll just mention that a few tools is usually enough. You can do an awful lot with Excel and R. And then the most important thing is to focus on your goal and choose your tools, and even your data, to match the goal, so you can get the most useful insights from your data. The next step in our discussion of data science methods is mathematics. And I'm going to give a very brief overview of the math involved in data science.
Now, the important thing to remember is that math really forms the foundation of what we're going to do. If you go back to the data science Venn diagram, we've got stats up here in the right corner, but really, it's math and stats or quantitative ability in general. But we'll focus on the math part right here. And probably the most important question is how much math is enough to do what you need to do, or to put it another way. Why do you need math at all? Because you've got a computer to do it. Well, I can think of three reasons you don't want to rely on just the computer, but it's helpful to have some sound mathematical understanding. Here they are. Number one, you need to know which procedures to use and why. So you have your question, you have your data, you need to have enough of an understanding to make an informed choice. That's not terribly difficult. Two, you need to know what to do when things don't work right. Sometimes you get impossible results. I know in statistics, you can get a negative adjusted R squared, that's not supposed to happen. And it's good to know the mathematics that go into calculating that so you can understand how something apparently impossible can work. Or you're trying to do a factor analysis or principal component to get a rotation that won't converge. It helps to understand what it is about the algorithm that's happening and why that won't work in that situation. And number three, interestingly, some procedures, some math is easier and quicker to do by hand than by firing up the computer. And I'll show you a couple of examples in later videos where that can be the case. Now, fundamentally, there's a nice sort of analogy here. Math is to data science. As for instance, chemistry is to cooking, kinesiology is to dancing and grammar is to writing. The idea here is that you can be a wonderful cook without knowing any chemistry. But if you know some chemistry, it's going to help. 
You can be a wonderful dancer without knowing kinesiology, but it's going to help. And you can probably be a good writer without having an explicit knowledge of grammar, but it's going to make a big difference. The same thing is true of data science: you will do it better if you have some of the foundational information. So the next question is, what kinds of math do you need for data science? Well, there's a few answers to that. Number one is algebra. You need some elementary algebra. That's basically the simple stuff. You may have to do some linear or matrix algebra, because that's the foundation of a lot of the calculations. And you can also have systems of linear equations, where you're trying to solve several equations all at once. It's a tricky thing to do in theory, but this is one of the things that's actually easier to do by hand sometimes. Now, there's more math. You can get some calculus. You can get some big O, which has to do with the order of a function, which has to do with sort of how fast it works. Probability theory can be important. And then Bayes' theorem, which is a way of getting what's called a posterior probability, can also be a really helpful tool for answering some fundamental questions in data science. So in sum, a little bit of math can help you make informed choices when planning your analyses. Very significantly, it can help you find the problems and fix them when things aren't going right. It's the ability to look under the hood that makes a difference. And then, truthfully, some mathematical procedures, like systems of linear equations, can even be done by hand, sometimes faster than you can do them with a computer. So you can save yourself some time and some effort and move ahead more quickly towards your goal of insight. Now, data science wouldn't be data science, and its methods wouldn't be complete, without a little bit of statistics. So I'm going to give you a brief statistics overview here of how things work in data science.
Now you can think of statistics as really an attempt to find order in chaos, to find patterns in an overwhelming mess, sort of like trying to see the forest and the trees. Now let's go back to our little Venn diagram here. We recently had math and stats here in the top corner, and we're going to go back to talking about stats in particular. One thing you're trying to do here is to explore your data. You can have exploratory graphics, because we're visual people and it's usually easiest to see things. You can have exploratory statistics, a numerical exploration of the data. And you can have descriptive statistics, which are the things that most people would have talked about when they took a statistics class in college, if they did that. Next, there's inference. I've got smoke here because you can infer things about the wind and the air movement by looking at patterns in smoke. The idea here is that you're trying to take information from samples and infer something about a population. You're trying to go from one source to another. One common version of this is hypothesis testing. Another common version is estimation, sometimes called confidence intervals. There are other ways to do it, but all of these let you go beyond the data at hand to make larger conclusions. Now, one interesting thing about statistics is you're going to have to be concerned with some of the details and arranging things just so. For instance, you get to do something like feature selection, that's picking the variables that should be included, or combinations of them. And there are problems that can come up. They're frequent problems, and I'll address some of those in later videos. There's also the matter of validation. When you create a statistical model, you have to see if it actually is accurate. Hopefully you have enough data that you can have a holdout sample and do that, or you can replicate the study.
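Here's a rough sketch of what validating with a holdout sample can look like, written in plain Python. You set some of the data aside, fit the model on the rest, and then check the error on the part the model never saw. Everything in it is an assumption for illustration: the data are made up, the model is a simple least squares line, and the names holdout_split, fit_line, and mean_abs_error are hypothetical helpers, not part of any library.

```python
import random


def holdout_split(rows, holdout_fraction=0.25, seed=0):
    """Shuffle the rows and set aside a fraction of them as a holdout sample."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def fit_line(rows):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(rows, slope, intercept):
    """Average absolute gap between the actual and the predicted y values."""
    return sum(abs(y - (slope * x + intercept)) for x, y in rows) / len(rows)


# Made-up data: y is roughly 2x + 1 plus a little noise.
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, 0.2, -0.2]
data = [(x, 2 * x + 1 + e) for x, e in zip(range(12), noise)]

train, holdout = holdout_split(data)
slope, intercept = fit_line(train)  # fit only on the training rows
holdout_error = mean_abs_error(holdout, slope, intercept)  # check on unseen rows
print(f"slope={slope:.2f}, intercept={intercept:.2f}, holdout error={holdout_error:.2f}")
```

The point is that the holdout rows play no part in fitting the line, so the error you get on them is a fairer picture of how the model would do on new data.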
Then there's the choice of estimators that you use, how you actually get the coefficients or the combinations in your model. And then there's ways of assessing how well your model fits the data. All of these are issues that I'll address briefly when we talk about statistical analysis at greater length. Now, I do want to mention one thing in particular here. And I just call this beware the trolls. There are people out there who will tell you that if you don't do things exactly the way they say to do it, your analysis is meaningless, your data is junk, and you've wasted all your time. You know what, they're trolls. So the idea here is, don't listen to that. You can make enough of an informed decision on your own to go ahead and do an analysis that is still useful. Probably one of the most important things to think about in this is this wonderful quote from a very famous statistician that says, all models, or all statistical models, are wrong, but some are useful. And so the question isn't whether you're technically right or you have some sort of level of intellectual purity, but whether you've done something that is useful. That, by the way, comes from George Box. And I like to think of it basically as this: wave your flag, wave your do it yourself flag, and just take pride in what you're able to accomplish, even when there are people who may be criticizing it. Go ahead, you're doing something, go do it. And so in sum, statistics allow you to explore and describe your data, and they allow you to infer things about the population. There's a lot of choices available, a lot of procedures. But no matter what you do, the goal is useful insight. Keep your eyes on that goal, and you will find something meaningful and useful in your data to help you in your own research and projects. Let's finish our data science methods overview by getting a brief overview of machine learning.
Now I've got to admit, when you say the term machine learning, people start thinking about something like the robot overlords are going to take over the world. That's not what it is. Instead, let's go back to our Venn diagram one more time. In the intersection at the top, between coding and stats, is machine learning, or as it's commonly called, just ML. The goal of machine learning is to go and work in a data space. So you can, for instance, take a whole lot of data, we've got tons of books here, and then you can reduce the dimensionality, that is, take a very large, scattered data set and try to find the most essential parts of that data. And then you can use these methods to find clusters within the data. Like goes with like. You can use methods like k-means. You can also look for anomalies, or unusual cases that show up in the data space. Or, if we go back to categories, again, I talked about like for like, you can use things like logistic regression or k-nearest neighbors, KNN. You can use naive Bayes for classification, or decision trees, or SVM, which is support vector machines, or artificial neural nets. Any of those will help you find the patterns and the clumping in your data, so you can get similar cases next to each other and get the cohesion that you need to make conclusions about these groups. Also, a major element of machine learning is prediction. You're going to point your way down the road. The most common approach here, the most basic, is linear regression, or multiple regression. There's also Poisson regression, which is used for modeling count or frequency data. And then there's the issue of ensemble models, where you create several models, and you take the predictions from each of those, and you put them together to get an overall more reliable prediction. Now, I'll talk about each of these in a little more detail in later courses. But for right now, I mostly just want you to know that these things exist.
And that's what we mean when we refer to machine learning. So in sum, machine learning can be used to categorize cases and to predict scores on outcomes. And there's a lot of choices, many choices and procedures available. But again, as I said with statistics, and I'll say again many times after this, no matter what, the goal is not that I'm going to do an artificial neural network or an SVM. The goal is to get useful insight into your data. Machine learning is a tool, so use it to the extent that it helps you get that insight that you need. In the last several videos, I've talked about the role of technical things in data science. On the other hand, communicating is also central to the practice. And the first thing I want to talk about there is interpretability. The idea here is that you want to be able to lead people through a path in your data. You want to tell a data driven story. And that's the entire goal of what we're doing with data science. Now, another way to think about this is that when you're doing your analysis, what you're trying to do is solve for value. You're making an equation, you take the data, and you're trying to solve for value. The trouble is this: a lot of people get hung up on analysis, but they need to remember that analysis is not the same thing as value. Instead, I like to think of it this way, that analysis times story is equal to value. Now, please note, that's multiplicative, not additive. And so one consequence of that is, when you go back to analysis times story equals value, well, if you have zero story, you're going to have zero value, because as you recall, anything times zero is zero. So instead of that, let's go back to this and say that what we really want to do is maximize the story, so that we can maximize the value that results from our analysis. Again, maximum value is the overall goal here. The analysis, the tools, the tech are simply methods for getting to that goal.
So let's talk about goals, for instance, analysis is goal driven, you're trying to accomplish something as specific. And so the story or the narrative or the explanation you give about your project should match those goals. If you're working for a client and they had a specific question that they wanted you to answer, then you have a professional responsibility to answer those questions clearly and unambiguously. So they know whether you said yes or no, and they know why you said yes or no. Now part of the problem here is the fact that the client isn't you and they don't see what you do. And as I show here, simply covering your face doesn't make things disappear. You have to worry about a few psychological abstractions. You have to worry about egocentrism. And I'm not talking about being vain. I'm talking about the idea that you think other people see and know and understand what you know. That's not true. Otherwise they wouldn't have hired you in the first place. And so you have to put it in terms that the client works with and that they understand. And you're going to have to get out of your own center in order to do that. Also, there's the idea of false consensus, the idea that well, everybody knows that. And again, that's not true. Otherwise they wouldn't have hired you. You need to understand that they're going to come from a different background with a different range of experience and interpretation. You're going to have to compensate for that. A funny little thing is the idea about anchoring. When you give somebody an initial impression, they use that as an anchor and then they adjust away from it. So if you're going to try to flip things over on their heads, watch out for giving a false impression at the beginning, unless you absolutely need to. But most importantly, in order to bridge the gap between the client and you, you need to have clarity and explain yourself at each step. You can also think about the answers. 
When you're explaining the project to the client, you might want to start in a very simple procedure, state the question that you're answering. Give your answer to that question. And if you need to qualify as needed, and then go in order top to bottom. So you're trying to make it as clear as possible what you're saying what the answer is and make it really easy to follow. Now, in terms of discussing your process, how you did this all, most of the time it's probably the case that they don't care. They just want to know what the answer is and that you used a good method to do that. So in terms of discussing process or the technical details, only when absolutely necessary, that's something to keep in mind. The process here is to remember that analysis, which means breaking something apart. This, by the way, is a mechanical typewriter broken into its individual components. Analysis means to take something apart. An analysis of data is an exercise in simplification. You're taking the overall complexity, sort of the overwhelmingness of the data, and you're boiling it down and finding the patterns that make sense and serve the needs of your client. Now, let's go to a wonderful quote from our friend Albert Einstein here, who said, Everything should be made as simple as possible, but not simpler. That's true in presenting your analysis. Or if you want to go see the architect and designer Ludwig Mies van der Rohe, who said less is more. It's actually Robert Browning who originally said that, but Mies van der Rohe popularized it. Or if you want another way of putting a principle that comes from my field, I'm actually a psychological researcher. They talk about being minimally sufficient just enough to adequately answer the question. If you're in commerce, you know about a minimal viable product is sort of the same idea with an analysis here, the minimum viable analysis. So here's a few tips. When you're giving a presentation, more charts, less text, great. 
And then simplify the charts, remove everything that doesn't need to be in there. Generally, you want to avoid tables of data because those are hard to read. And then one more time, because I want to emphasize it: less text. Again, charts and tables can usually carry the message. And so let me give you an example here. I'm going to give a very famous data set, Berkeley admissions. Now these are not stairs to Berkeley, but it gives the idea of trying to get into something that's far off and distant. Here's the data. This is graduate school admissions in 1973. So it's, you know, it's over 40 years ago. But the idea is that men and women were both applying for graduate school at the University of California at Berkeley. And what we found is that 44% of the men who applied were admitted, that's the part in green, and that of the women, only 35% were admitted when they applied. So really, at first glance, this is bias and it actually led to a lawsuit. It was a major issue. So what Berkeley then tried to do is find out, well, which programs are responsible for this bias. And then you got a very curious set of results. If you break the applications down by program, and here we're just calling them A through F, six different programs, what you find actually is that in each of these, male applicants are on the left, female applicants are on the right. If you look at program A, women actually got accepted at a higher rate. And the same is true for B. And the same is true for D. And the same is true for F. And so this is a very curious set of responses. And it's something that requires explanation. Now in statistics, this is known as Simpson's paradox. But here's the paradox: bias may be negligible at the department level. And in fact, as we saw in four of the departments, there was a possible bias in favor of women. And the problem is that women applied to more selective programs, programs with lower acceptance rates.
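If you'd like to see the paradox in numbers, here's a small sketch in Python. The counts are an assumption on my part: they are the six-department figures as I recall them from the `UCBAdmissions` data set that ships with R, so please verify them before relying on them. They cover only those six programs, so the overall rates come out a little different from the 44% and 35% quoted above, which, as far as I know, refer to all applicants. The function names are made up for this illustration.

```python
# (admitted, applied) per program and group.
# Assumed counts, recalled from R's UCBAdmissions data set; verify before reuse.
ADMISSIONS = {
    "A": {"men": (512, 825), "women": (89, 108)},
    "B": {"men": (353, 560), "women": (17, 25)},
    "C": {"men": (120, 325), "women": (202, 593)},
    "D": {"men": (138, 417), "women": (131, 375)},
    "E": {"men": (53, 191), "women": (94, 393)},
    "F": {"men": (22, 373), "women": (24, 341)},
}


def program_rate(program, group):
    """Acceptance rate for one group within one program."""
    admitted, applied = ADMISSIONS[program][group]
    return admitted / applied


def overall_rate(group):
    """Acceptance rate for one group pooled across all six programs."""
    admitted = sum(ADMISSIONS[p][group][0] for p in ADMISSIONS)
    applied = sum(ADMISSIONS[p][group][1] for p in ADMISSIONS)
    return admitted / applied


def programs_favoring_women():
    """Programs where women were accepted at a higher rate than men."""
    return [
        p for p in sorted(ADMISSIONS)
        if program_rate(p, "women") > program_rate(p, "men")
    ]
```

Pooled together, men have the higher acceptance rate, yet `programs_favoring_women()` picks out A, B, D and F, the same four programs mentioned above. The pooled gap comes from where people applied, not from how each program treated them.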
Now, some people stop right here and say, therefore, nothing's going on. Nothing's going on. But you know, that's still ending the story a little bit early. There are other questions that you can ask. And in producing a data driven story, this is stuff that you would want to do. So for instance, you may want to ask, why do the programs vary in overall class size? Why do the acceptance rates differ from one program to the other? Why do men and women apply to different programs? And you might want to look at things like the admissions criteria for each of the programs, the promotional strategies, how they advertise themselves to students. You might want to look at the kinds of prior education students have in each of the programs. And you really want to look at funding levels for each of the programs. And so really, you get one answer, it leads to more questions, maybe some more answers and more questions, and you need to address enough of this to provide a comprehensive overview and solution to it for your client. In sum, let's say this, stories give value to data analysis. And when you tell the story, you need to make sure that you are addressing your client's goals in a clear, unambiguous way. And the overall principle here is be minimally sufficient. Get to the point, make it clear, say what you need to, but otherwise be concise and make your message clear.
The next step in discussing data science and communicating is to talk about actionable insights or information that can be used productively to accomplish something. Now to give sort of a bizarre segue here, you look at a game controller, it may be a pretty thing, it may be a nice object. But remember, game controllers exist to do something, they exist to help you play the game, and to do it as effectively as possible. They have a function, they have a purpose. Same way, data is for doing. Now that's a paraphrase from one of my favorite historical figures. And this is William James, the father of American psychology and pragmatism and philosophy. And he has this wonderful quote, he said, my thinking is first and last and always for the sake of my doing. And the idea applies to analysis, your analysis and your data is for the sake of your doing. And so you're trying to get some sort of specific insight in how you should proceed. What you want to avoid is the opposite of this from one of my other favorite cultural heroes, the famous Yankees catcher, Yogi Berra, who said, we're lost, but we're making good time. And so the idea here is that frantic activity does not make up for a lack of direction. You need to understand what you're doing so you can reach the particular goal. And your analysis is supposed to do that. So when you're giving your analysis, you're going to try to point the way. Remember, why was the project conducted? The goal is usually to direct some kind of action reach some kind of goal for your client. And that the analysis should be able to guide that action in an informed way. One thing you want to do is you want to be able to give the next steps to your client, give the next steps, tell them what they need to do now. You want to be able to justify each of those recommendations with the data and your analysis. As much as possible, be specific, tell them exactly what they need to do. 
Make sure it's doable by the client that it's within their range of capability, and that each step should build on the previous step. Now, that being said, there is one really fundamental sort of philosophical problem here. And that's the difference between correlation and causation. Basically, it goes this way, your data gives you correlation, you know that this is associated with that. But your client doesn't simply want to know what's associated, they want to know what causes something, because if they're going to do something, that's an intervention is designed to produce a particular result. So really, how do you get from the correlation, which is what you have in the data to the causation, which is what your client wants? Well, there's a few ways to do that. One is experimental studies. These are randomized controlled trials. Now, that's theoretically the simplest path to causality, but it can be really tricky in the real world. There are quasi experiments. And these are methods, a whole collection of methods that use non randomized data, usually observational data, adjusted in particular ways to get an estimate of causal inference. Or there's the theory and experience. And this is research based theory and domain specific experience. And this is where you actually get to rely on your client's information. They can help you interpret the information, especially if they have greater domain expertise than you do. Another thing to think about are the social factors that affect your data. Now, you remember the data science Venn diagram, we've looked at it lots of times, it's got these three elements. Some people have proposed adding a fourth circle to this Venn diagram and we'll kind of put that in there and say that social understanding is also important critical really to valid data science. Now, I love that idea. And I do think that it's important to understand how things are going to play out. 
There are a few kinds of social understanding. You want to be aware of your client's mission. You want to make sure that your recommendations are consistent with your client's mission. Also, that your recommendations are consistent with your client's identity, not just this is what we do, but this is really who we are. You need to be aware of the business context, sort of the competitive environment and the regulatory environment that they're working in, as well as the social context. And that can be outside of the organization, but even more often within the organization, your recommendations will affect relationships within the client's organization. And you're going to try to be aware of those as much as you can to make it so that your recommendations can be realized the way they need to be. So in sum, data science is goal focused. And when you're focusing on that goal for your client, you need to give specific next steps that are based on your analysis and justifiable from the data. And in doing so, be aware of the social, political and economic context that gives you the best opportunity of getting something really useful out of your analysis. When you're working in data science and trying to communicate your results, presentation graphics can be an enormously helpful tool. Think of it this way, you are trying to paint a picture for the benefit of your client. Now when you're working with graphics, there can be a couple of different goals. It depends on what kind of graphics you're working with. There's the general category of exploratory graphics. These are ones that you are using as the analyst. And for exploratory graphics, you need speed and responsiveness. And so you get very simple graphics, this is a base histogram in R, and they can get a little more sophisticated. And this is done in ggplot2, and then you can break it down into a couple of histograms, or you can make it a different way, or make them see through, or split them apart into small multiples.
But in each case, this is done for the benefit of you as the analyst, understanding the data. These are quick, they're effective. Now, they're not very well labeled, and they're usually for your insight, and then you do other things as a result of that. On the other hand, presentation graphics, which are for the benefit of your client, those need clarity, and they need a narrative flow. Now, let me talk about each of those characteristics very briefly. Clarity versus distraction. There are things that can go wrong in graphics. Number one is colors, colors can actually be a problem. Also, three dimensional or false third dimensions are nearly always a distraction. One that gets a little touchy for some people is interaction. We think of interactive graphics as really cool, great things to have. But you run the risk of people getting distracted by the interaction and starting to play around with it. They're like, Oh, I press here, it does that. And that distracts from the message. So actually, it may be important to not have interaction. And then the same thing is true of animation. Flat static graphics can often be more informative because they have fewer distractions in them. Let me give you a quick example of how not to do things. Now, this is a chart that I made, I made it in Excel, and I did it based on some of the mistakes I've seen in graphics submitted to me when I teach. And I guarantee you everything in here I have seen in real life, just not necessarily combined all at once. Let's zoom in on this a little bit so we can see the full badness of this graphic. And let's see what's going on here. We've got a scale here that starts at 8% and goes to 28%, and it's tiny, it doesn't even cover the range of the data. We've got this bizarre picture on the wall, we have no axis lines on the walls. We come down here, the labels for educational levels are in alphabetical order instead of the more logical order of higher levels of education.
Then we've got the data represented as cones, which are difficult to read and compare. And it's only made worse by the colors and the textures. You know, if you want to take an extreme, this one for grad degrees doesn't even make it to the floor value of 8%. And this one for high school grad is cut off at the top at 28%. This by the way is a picture of a sheep, and people do this kind of stuff and it drives me crazy. If you want to see a better chart with the exact same data, this is it right here. It's a straight bar chart, it's flat, it's as simple, it's as clean as possible. And this is better in many ways. Most important here is that it communicates clearly, there are no distractions. It's a logical flow. This is going to get the point across so much faster. And I can give you another example of it. Here's a chart I showed previously about salaries for incomes. I have a list here, I've got data scientists in it. If I want to draw attention to it, I have the option of like putting a circle around it. And I can put a number next to it to explain it. That's one way to make it easy to see what's going on. But you don't even have to get fancy. You know, I just got out a pen and a Post-it note and I drew a bar chart of some real data about life expectancy. This tells the story as well, that there is something terribly amiss in Sierra Leone. But now let's talk about creating narrative flow in your presentation graphics. To do this, I'm going to pull some charts from my most cited academic paper, which is called A Third Voice: A Review of Empirical Research on the Psychological Outcomes of Restorative Justice. Think of that as mediation for juvenile crimes, mostly juvenile. And this paper is interesting because really it's about 14 bar charts with just enough text to hold them together. And you can see there's a flow. The charts are very simple. This is judgments about whether the criminal justice system was fair.
The two bars on the left are victims, the two bars on the right are offenders. And for each group on the left are people who participated in restorative justice or victim offender mediation or mediation for crimes. And for each set on the right are people who went through standard criminal procedures. It says court but it usually means plea bargaining. Anyhow, it's really easy to see that in both cases, restorative justice bar is higher. People were more likely to say it was fair. They also felt that they had an opportunity to tell their story. That's one reason they might think it's fair. They also felt the offender was held accountable more often. In fact, if you go to court on the offenders, that lines below 50%. And that's the offenders themselves making the judgment. Then you can go to forgiveness and apologies. And again, this is actually a simple thing to code. And you can see there's an enormous difference. In fact, one of the reasons there's such a big difference is because in standard court proceedings, the offender very rarely meets the victim. Now, it also turns out that I need to qualify this a little bit because a bunch of the studies included drunk driving with no injuries or accidents. When we take them out, we see a huge change. And then we can go to whether a person satisfied with the outcome. Again, we see an advantage for restorative justice, whether the victim is still upset about the crime. Now the bars are a little different. And whether they're afraid of revictimization, that's over a two to one difference. And then finally, recidivism for offenders are reoffending. And you see a big difference there. And so what I have here is a bunch of charts that are very, very simple to read. And they kind of flow in how they're giving the overall impression and then detailing it a little bit more. There's nothing fancy here. There's nothing interactive. There's nothing animated. There's nothing kind of flowing in 17 different directions. 
It's easy, but it follows a story and it tells a narrative about the data. And that should be your major goal with presentation graphics. In sum, the graphics you use for presenting are not the same as the graphics you use for exploring, they have different needs and different goals. But no matter what you're doing, be clear in your graphics and be focused in what you're trying to tell. And above all, create a strong narrative that gives a different level of perspective and answers questions as you go, to anticipate a client's questions and to give them the most reliable, solid information and the greatest confidence in your analysis. The final element of data science and communicating that I wanted to talk about is reproducible research. And you can think of it as this idea, you want to be able to play that song again. And the reason for that is data science projects are rarely one and done, rather they tend to be incremental, they tend to be cumulative, and they tend to adapt to the circumstances that they're working in. So one of the important things here, and probably if you want to summarize it very briefly, is this: show your work. There are a few reasons for this. You may have to revise your research at a later date, your own analyses. You may be doing another project and you want to borrow something from previous studies. More likely you'll have to hand it off to somebody else at a future point and they're going to have to be able to understand what you did. And then there's a very significant issue in both scientific and economic research of accountability. You have to be able to show that you did things in a responsible way and that your conclusions are justified. That's for clients, funding agencies, regulators, academic reviewers, any number of people. Now, you may be familiar with the concept of open data, but you may be less familiar with the concept of open data science. And that's more than open data.
So for instance, I'll just let you know that there is something called the Open Data Science Conference, at ODSC.com. And it meets three times a year in different places. And this is entirely, of course, devoted to open data science, using open data, but also making the methods transparent to people around them. One thing that can make this really simple is something called the Open Science Framework, which is at OSF.io. It's a way of sharing your data and your research, with an annotation of how you got through the whole thing, with other people. It makes the research transparent, which is what we need. One of my professional organizations, the Association for Psychological Science, has a major initiative on this called open practices, where they are strongly encouraging people to share their data as much as is ethically permissible and to absolutely share their methods before they even conduct a study. That's a way of getting rigorous intellectual honesty and accountability. Now, another step in all of this is to archive your data and make that information available, put it on the shelf. And what you want to do here is you want to archive all of your data sets, both the totally raw data set before you did anything with it and every step in the process until your final clean data set. Along with that, you want to archive all of the code that you used to process and analyze the data. If you use a programming language like R or Python, that's really simple. If you use a program like SPSS, you need to save the syntax files and then it can be done that way. And again, no matter what, make sure to comment liberally and explain yourself. Now, part of that is you need to explain your process, you know, because you're not just this lone person sitting on the sofa working by yourself. You're working with other people. And you need to explain why you did it the way that you did.
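The archiving advice above can be sketched in a few lines of Python using only the standard library. Everything specific here is an assumption made for illustration: the function names, the file names `01_raw.csv`, `02_clean.csv` and `manifest.json`, and the cleaning rule of dropping rows with an empty field. The point is only that each stage gets saved as plain CSV, next to a short note on what was done.

```python
import csv
import json
import platform
from pathlib import Path


def write_csv(path, header, rows):
    """Write one stage of the data as a plain, non-proprietary CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def archive_stages(archive_dir, header, raw_rows):
    """Save the raw data, the cleaned data, and a note describing the steps."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # Stage 1: the totally raw data, before anything is done to it.
    write_csv(archive_dir / "01_raw.csv", header, raw_rows)

    # Stage 2: cleaned data. Hypothetical rule: drop rows with an empty field.
    clean_rows = [
        row for row in raw_rows
        if all(str(cell).strip() for cell in row)
    ]
    write_csv(archive_dir / "02_clean.csv", header, clean_rows)

    # A manifest that explains the choices, so someone else can follow along.
    manifest = {
        "python_version": platform.python_version(),
        "stages": ["01_raw.csv", "02_clean.csv"],
        "cleaning_rule": "dropped rows containing an empty field",
    }
    (archive_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )
    return clean_rows
```

Calling `archive_stages("archive", ["id", "score"], rows)` with your own rows would leave three files in the `archive` folder, and the comments plus the manifest do the job of explaining yourself.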
You need to explain the choices, the consequences of those choices, the times that you had to backtrack and try it over again. This all also works into the principle of future proofing your work. You want to do a few things here. Number one, the data. You want to store the data in non proprietary formats like a CSV, or comma separated values, file, because anything can read CSV files. If you stored it in the proprietary SPSS .sav format, you might be in a lot of trouble when somebody tries to use it later and they can't open it. Also, there's storage. You want to place all of your files in a secure, accessible location like GitHub, it's probably one of the best choices. And then the code. You may want to use a dependency management package like Packrat for R or a virtual environment for Python as a way of making sure that, for the packages that you use, there are always versions that work, because sometimes things get updated and it gets broken. This is a way of making sure that the system that you have will always work. Overall, you can think of it this way: you want to explain yourself, and a neat way to do that is to put your narrative in a notebook. Now you can have a physical lab book, but you can also do digital books. A really common one, especially if you're using Python, is Jupyter, with a Y there in the middle. The Jupyter notebooks are interactive notebooks. So here's a screenshot of a very simple one I made in Python. And you have titles, you have text, you have the graphics. If you're working in R, you can do this with something called R Markdown, which works in the same way. You do it in RStudio, you use Markdown, and you can annotate the whole thing. Get more information about that at rmarkdown.rstudio.com. And so for instance, here's an R analysis I did.
And as you see, the code is on the left and the Markdown version is on the right. What's neat about this is that this little bit of code here, this title and this text and this little bit of R code, then is displayed as this formatted heading, as this formatted text, and this turns into the entire R output right there. It's a great way to do things. And then if you do R Markdown, you actually have the option of uploading the document into something called RPubs. And that's an online document that can be made accessible to anybody. Here's the same document. And if you want to go see it, you can go to this address, it's kind of long. So I'm going to let you write that one down yourself. But in sum, here's what we have. You want to do your work and archive the information in a way that supports collaboration. Explain your choices, say what you did, show how you did it. This allows you to future proof your work so it will work in other situations and for other people. And as much as possible, no matter how you do it, make sure to share your narrative so people understand your process. And they can see that your conclusions are justifiable, strong and reliable. Now something that I've mentioned several times when talking about data science, and I'll do it again in this conclusion, is that it's important to give people next steps. So I'm going to do that for you right now. If you're wondering what to do after having watched this very general overview course, I can give you a few ideas. Number one, maybe you want to start trying to do some coding in R or Python, we have courses for those. You might want to try doing some data visualization, one of the most important things that you can do. You may want to brush up on statistics and maybe some math that goes along with it. And you may want to try your hand at machine learning. All of these will get you up and rolling in the practice of data science.
You can also try looking at data sourcing, finding the information that you're going to work with. But no matter what happens, try to keep it in context. So for instance, data science can be applied to marketing and sports and health and education and the arts and really a huge number of other things. And we will have courses here at datalab.cc that talk about all of those. You may also want to start getting involved in the community of data science. One of the best conferences that you can go to is O'Reilly Strata, which meets several times a year around the globe. There's also Predictive Analytics World, again, several times a year around the world. Then there are much smaller conferences. I love Tapestry, or tapestryconference.com, which is about storytelling in data science, and Extract, a one day conference about data stories that's put on by import.io, one of the great data sourcing applications that's available for scraping web data. If you want to start working with actual data, a great choice is to go to Kaggle.com. They sponsor data science competitions, which actually have cash rewards. But there are also wonderful data sets you can work with there to find out how they work and compare your results to those of other people. And once you're feeling comfortable with that, you may actually try turning around and doing some service. DataKind.org is the premier organization for data science as humanitarian service. They do major projects around the world. I love their examples. There are other things you can do. There's an annual event called Do Good Data. And then datalab.cc will be sponsoring, twice a year, data charrettes, which are opportunities for people in the Utah area to work with local nonprofits on their data. But above all of this, I want you to remember this one thing. Data science is fundamentally democratic. It's something that everybody needs to learn to do in some way, shape or form.
The ability to work with data is a fundamental ability, and everybody would be better off by learning to work with data intelligently and sensitively. Or to put it another way, data science needs you. Thanks so much for joining me for this introductory course. I hope it's been good. And I look forward to seeing you in the other courses here at data lab dot CC. Welcome to data sourcing. I'm Barton Poulsen. And in this course, we're going to talk about data opus, or that's Latin for data needed. The idea here is that no data, no data science. And that is what we're going to do. It's a sad thing. So instead of leaving it that we're going to use this course to talk about methods for measuring and evaluating data, and methods for accessing existing data, and even methods for creating new custom data. Take those together. And it's a happy situation. At the same time, we'll do all of this still at an accessible, conceptual and non technical level, because the technical hands on stuff will happen in later other courses. But for now, let's talk data. For data sourcing, the first thing we want to talk about is a measurement. And within that category, we're going to talk about metrics. The idea here is that you actually need to know what your target is, if you want to have a chance to hit it. There's a few particular reasons for this. First off, data science is action oriented. The goal is to do something as opposed to simply understand something, which I say as an academic practitioner. Also, your goal needs to be explicit. And that's important because the goals can guide your effort. So you want to say exactly what you're trying to accomplish. So you know, when you get there. Also, goals exist for the benefit of the client, they can prevent frustration, they know what you're working on, they know what you have to do to get there. 
And finally, the goals and the metrics exist for the benefit of the analyst, because they help you use your time well. You know when you're done, you know when you can move ahead with something. And that makes everything a little more efficient and a little more productive. Now, when we talk about this, the first thing you want to do is try to define success in your particular project or domain. It depends on where you are. In commerce, that can include things like sales or click through rates or new customers. In education, it can include scores on tests, it can include graduation rates or retention. In government, it can include things like housing and jobs. In research, it can include the ability to serve the people that you're trying to better understand. So whatever domain you're in, there will be different standards for success. And you're going to need to know what applies in your domain. Next are specific metrics or ways of measuring. Now, again, there are a few different categories here. There are business metrics, there are key performance indicators or KPIs. There are SMART goals, that's an acronym. And there's also the issue of having multiple goals. I'll talk about each of those for just a second now. First off, let's talk about business metrics. If you're in the commercial world, there are some common ways of measuring success. A very obvious one is sales revenue. Are you making more money? Are you moving the merchandise? Are you getting sales? Also, there's the issue of leads generated, new customers or new potential customers, because that then in turn is associated with future sales. There's also the issue of customer value or lifetime customer value. So you may have a small number of customers, but they all have a lot of revenue. And you can use that to really predict the overall profitability of your current system. And then there's churn rate, which has to do with, you know, losing and gaining new customers and having a lot of turnover.
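Just to show how simple some of these business metrics are to compute, here's a small sketch in Python. The formulas are common textbook definitions that I'm assuming here, they are not spelled out in this course, and organizations often define churn and growth in their own ways. The function names and the example figures are made up.

```python
def churn_rate(customers_at_start, customers_lost):
    """Share of the starting customers who left during the period.

    Assumed definition: customers lost divided by customers at the start.
    """
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


def revenue_growth(previous_revenue, current_revenue):
    """Relative change in sales revenue from one period to the next."""
    if previous_revenue <= 0:
        raise ValueError("previous_revenue must be positive")
    return (current_revenue - previous_revenue) / previous_revenue


# Hypothetical figures, for illustration only.
example_churn = churn_rate(customers_at_start=400, customers_lost=30)
example_growth = revenue_growth(previous_revenue=200_000, current_revenue=230_000)
```

The arithmetic is the easy part. The harder part, as with everything else here, is agreeing up front on which definition your client actually uses.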
So any of these are potential ways of defining success and measuring it. These are potential metrics. There are others, but these are some really common ones. Now, I mentioned earlier something called a key performance indicator, or KPI. The description of KPIs I'm using comes from David Parmenter, and he's got a few ways of describing them. He says a key performance indicator for business, number one, should be non-financial. So not just the bottom line, but something else that might be associated with it or that measures the overall productivity of the organization. They should be timely, for instance, weekly, daily or even constantly gathered information. They should have a CEO focus. So the senior management team are the ones who generally make the decisions that affect how the organization acts on the KPIs. They should be simple, so everybody in the organization knows what they are and knows what to do about them. They should be team based, so teams can take joint responsibility for meeting each one of the KPIs. They should have significant impact. What that really means is they should affect more than one important outcome. So you can improve profitability and market reach, or improve manufacturing time and have fewer defects. And finally, an ideal KPI has a limited dark side. That means there are fewer possibilities for reinforcing the wrong behaviors and rewarding people for sort of exploiting the system. Next, there are SMART goals, where SMART stands for specific, measurable, assignable to a particular person, realistic, meaning you can actually do it with the resources you have at hand, and time bound, so you know when it can get done. So whenever you form a goal, you should try to assess it on each of these criteria. And that's a way of saying that this is a good goal to be used as a metric for the success of our organization. Now, the trick, however, is when you have multiple goals, multiple possible endpoints.
And the reason that's difficult is because, well, it's easy to focus on one goal. If you're just trying to maximize revenue, or if you're just trying to maximize, you know, graduation rate, there's a lot of things you can do. It becomes more difficult when you have to focus on many things simultaneously, especially because some of these goals may conflict. The things that you do to maximize one may impair the other. And so when that happens, you actually need to start engaging in a deliberate process of optimization. And there are ways you can do this. If you have enough data, you can do a mathematical optimization to find the ideal balance of efforts to pursue one goal and the other goal at the same time. Now, this is a very general summary. And let me finish with this. In sum, metrics or methods for measuring can help awareness of how well your organization is functioning and how well you're reaching your goals. There are many different methods available for defining success and measuring progress towards those things. The trick, however, comes when you have to balance efforts to reach multiple goals simultaneously, which can bring in the need for things like optimization.
When talking about data sourcing and measurement, one very important issue has to do with the accuracy of your measurements. The idea here is that you don't want to have to throw away all your ideas, you don't want to waste effort. One way of doing this in a very quantitative fashion is to make a classification table. So what that looks like is this. You talk about, for instance, positive results and negative results. And in fact, let's start by looking at the top here. The middle two columns here talk about whether an event is present: whether your house is on fire, whether a sale occurs, whether you've got a tax evader, whatever. So that's whether a particular thing is actually happening or not.
On the left here is whether the test or the indicator suggests that a thing is or is not happening. And then you have these combinations: true positives, where the test says it's happening and it really is, and false positives, where the test says it's happening, but it's not. And then below that, true negatives, where the test says it isn't happening and that's correct. And then false negatives, where the test says there's nothing going on, but there is in fact the event occurring. And then you start to get the column totals, the total number of events present or absent, and the row totals that talk about the test results. Now, from this table, what you get is four kinds of accuracy, or really four different ways of quantifying accuracy using different standards. And they go by these names: sensitivity, specificity, positive predictive value, and negative predictive value. I'll show you very briefly how each of them works. Sensitivity can be expressed this way: if there's a fire, does the alarm ring? You want that to happen. And so that's a matter of looking at the true positives and dividing that by the total number of fires, that is, the total number of events present. So the test positive means there's an alarm, and the event present means there's a fire. You always want to have an alarm when there's a fire. Specificity, on the other hand, is sort of the flip side of this. If there isn't a fire, does the alarm stay quiet? This is where you're looking at the ratio of true negatives to total absent events, where there's no fire and the alarm's not ringing. And that's what you want. Now, those are looking at columns. You can also go sideways across rows. So the first one there is positive predictive value, often just abbreviated as PPV. And we flip around the order a little bit. This one says, if the alarm rings, was there a fire?
So now you're looking at the true positives and dividing them by the total number of positives. The total number of positives is any time the alarm rings. True positives are the times it rang because there was a fire. And negative predictive value, or NPV, says, if the alarm doesn't ring, does that in fact mean that there is no fire? Well, here you're looking at true negatives and dividing them by total negatives, the times that it doesn't ring. And again, you want to maximize that, so the true negatives account for all of the negatives, the same way you want the true positives to account for all of the positives, and so on. Now, you can put numbers on all of these, going from 0% to 100%. And the idea is to maximize each one as much as you can. So in sum, from these tables we get four kinds of accuracy, and there's a different focus for each one. But the same overall goal: you want to identify the true positives and true negatives, and avoid the false positives and false negatives. And this is one way of putting numbers on, an index really, on the accuracy of your measurement.
Now data sourcing may seem like a very quantitative topic, especially when we're talking about measurement. But I want to mention one important thing here. And that is the social context of measurement. The idea here really is that people are people, and they all have their own goals, and they're going their own ways. And we all have our own thoughts and feelings that don't always coincide with each other. And this can affect measurement. And so for instance, when you're trying to define your goals, and you're trying to maximize them, you want to look at things like, for instance, the business model. An organization's business model, the way they conduct their business, the way they make their money, is tied to its identity and its reason to be. And if you make a recommendation that's contrary to their business model, that can actually be perceived as a threat to their core identity.
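As a brief aside on the classification table described above, here is a minimal Python sketch of the four kinds of accuracy. The function name `accuracy_measures` and the counts are made up for illustration; they are not from the course. It also assumes none of the row or column totals is zero.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Compute the four accuracy measures from a 2x2 classification table.

    tp: alarm rang and there was a fire   (true positives)
    fp: alarm rang but there was no fire  (false positives)
    fn: alarm stayed quiet during a fire  (false negatives)
    tn: alarm stayed quiet, no fire       (true negatives)
    """
    return {
        # Column-wise measures: condition on what really happened.
        "sensitivity": tp / (tp + fn),  # of all fires, share with an alarm
        "specificity": tn / (tn + fp),  # of all non-fires, share with no alarm
        # Row-wise measures: condition on what the test said.
        "ppv": tp / (tp + fp),  # of all alarms, share with a real fire
        "npv": tn / (tn + fn),  # of all quiet periods, share with no fire
    }


# Hypothetical counts for 1,000 checks of a fire alarm (made up for illustration).
measures = accuracy_measures(tp=45, fp=10, fn=5, tn=940)
for name, value in measures.items():
    print(f"{name}: {value:.1%}")
```

Notice that the same table can score well on one measure and less well on another, which is why it helps to look at all four rather than a single accuracy number.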
People tend to get freaked out when their core identity feels threatened. Also restrictions. So for instance, there may be laws, policies, and common practices, both organizationally and culturally, that may limit the ways that goals can be met. Now, most of these make a lot of sense. So the idea is you can't just do anything you want, you need to have these constraints. And when you make your recommendations, maybe you'll work creatively within them, as long as you're still behaving legally and ethically. But you do need to be aware of these constraints. Next is the environment. And the idea here is that competition occurs between organizations: company A here is trying to reach a goal, but they're competing with company B over there. But probably even more significantly, there is competition within the organization. This is really a recognition of office politics, and that when you as a consultant make a recommendation based on your analysis, you need to understand you're kind of dropping a little football into the office, and things are going to further one person's career, maybe to the detriment of another. And in order for your recommendations to have the maximum effectiveness, they need to play out well in the office. That's something that you need to be aware of as you're making your recommendations. Finally, there's the issue of manipulation. A sad truism about people is that any reward system, any reward system at all, will be exploited. And people will generally game the system. This happens especially when you have a strong cutoff: you need to get at least 80% or you get fired. And people will do anything to make their numbers appear to be 80%. This happens an awful lot when you look at executive compensation systems, and it happens a lot when you have very high stakes school testing. It happens in an enormous number of situations. And so you need to be aware of the risk of exploitation and gaming. Now, don't think then that all is lost, don't give up.
You can still do really wonderful assessment, you can get good metrics. Just be aware of these particular issues and be sensitive to them, both as you conduct your research and as you make your recommendations. So in sum, social factors affect goals, and they affect the way that you meet those goals. There are limits and consequences both on how you reach the goals and on really what the goals should be. And when you're giving advice on how to reach those goals, please be sensitive to how things play out with metrics and how people will adapt their behavior to meet the goals. That way, you can make something that's more likely to be implemented the way you meant, and more likely to predict accurately what can happen with your goals.
When it comes to data sourcing, obviously, the most important thing is to get data. And the easiest way to do that, at least in theory, is to use existing data. Think of it as going to the bookshelf and getting the data that you have right there at hand. Now, there are a few different ways to do this. You can get in-house data, you can get open data, and you can get third-party data. Another nice way to think of that is proprietary, public and purchased data, the three P's I've heard it called. Let's talk about each of these a little bit more. So in-house data, that's stuff that's already in your organization. What's nice about that is it can be really fast and easy. It's right there. And the format may be appropriate for the kind of software and the computer that you're using. If you're fortunate, there's good documentation, although sometimes when it's in-house, people just kind of throw it together. So you have to watch out for that. And there's the issue of quality control. Now, this is true with any kind of data, but you need to pay attention with in-house data, because you don't necessarily know the circumstances under which people gathered the data and how much attention they were paying to something.
There's also an issue of restrictions. There may be some data that, while it's in-house, you may not be allowed to use, or you may not be able to publish the results or share the results with other people. So these are things that you need to think about when you're going to use in-house data, in terms of how you can use it to facilitate your data science projects. Specifically, there are a few pros and cons. In-house data is potentially quick, easy, free, hopefully standardized, and maybe even the original team that conducted the study is still there. And you might have identifiers in the data, which make it easier for you to do an individual-level analysis. On the con side, however, the in-house data simply may not exist, maybe it's just not there, or the documentation may be inadequate. And of course, the quality may be uncertain. That's always true, but it may be something you have to pay more attention to when you're using in-house data. Now, another choice is open data, like going to the library and getting something. This is prepared data that's freely available, and it consists of things like government data and corporate data and scientific data from a number of sources. Let me show you some of my favorite open data sources, just so you know where they are and that they exist. Probably the best one is data.gov here in the US. That is, as it says right here, the home of the US government's open data. Or you may have a state-level one. For instance, I'm in Utah, and we have data.utah.gov, also a great source of more regional information. If you're in Europe, you have open-data.europa.eu, the European Union Open Data Portal. And then there are major nonprofit organizations. So the UN has unicef.org/statistics for their statistical and monitoring data. The World Health Organization has the Global Health Observatory at who.int/gho.
And then there are private organizations that work in the public interest, such as the Pew Research Center, which shares a lot of its data sets, and the New York Times, which makes it possible to use APIs to access a huge amount of data on things they've published over a huge time span. And then two of the mother lodes. There's Google, which at google.com/publicdata has public data, which is a wonderful thing. And then Amazon, at aws.amazon.com/datasets, has gargantuan data sets. So if you needed a data set that was like five terabytes in size, this is the place that you would go to get it. Now, there are some pros and cons to using this kind of open data. First is that you can get very valuable data sets that maybe cost millions of dollars to gather and to process. And you can get a very wide range of topics and times and groups of people and so on. And often the data is very well formatted and well documented. There are, however, a few cons. Sometimes there are biased samples. Say, for instance, you only get people who have internet access, and that can mean, you know, not everybody. Sometimes the meaning of the data is not clear, or it may not mean exactly what you wanted it to. A potential problem is that sometimes you may need to share your analysis. And if you're doing proprietary research, well, it's going to have to be open research instead. And so that can create a cramp with some of your clients. And then finally, there are issues with privacy and confidentiality. And in public data, that usually means that the identifiers are not there, and you're going to have to work at a larger aggregate level of measurement. Another option is to use data from a third party. These go by the name data as a service, or DaaS. You can also call them data brokers. And the thing about data brokers is they can give you an enormous amount of data on many different topics. Plus, they can save you some time and effort by actually doing some of the processing for you.
And that can include things like consumer behaviors and preferences. They can get contact information, they can do marketing, identity, and finances. There's a lot of things. There are a number of data brokers around. Here's a few of them. Acxiom is probably the biggest one in terms of marketing data. There's also Nielsen, which provides data primarily for media consumption. And there's another organization, DataSift, that's a smaller, newer one. And there's a pretty wide range of choices. But these are some of the big ones. Now the thing about using data brokers is there are some pros and there are some cons. The pros are, first, that it can save you a lot of time and effort. It can also give you individual-level data, which can be hard to get from open data. Open data is usually at the community level. They can give you information about specific consumers. They can even give you summaries and inferences about things like credit scores and marital status, possibly even whether a person gambles or smokes. Now the cons are these. Number one, it can be really expensive. I mean, this is a huge service, it provides a lot of benefit and is priced accordingly. Also, you still need to validate it. You still need to double check that it means what you think it means and that it works with what you want. And probably a real sticking point here is that the use of third-party data is distasteful to many people. And so you have to be aware of that as you're making your choices. So in sum, as far as sourcing existing data goes, obviously data science needs data. And there are the three P's of data sources: proprietary, public, and purchased. But no matter what source you use, you need to pay attention to quality and to the meaning and the usability of the data to help you along in your own projects.
When it comes to data sourcing, a really good way of getting data is to use what are called APIs.
Now I like to think of these as the digital version of Prufrock's mermaids. If you're familiar with The Love Song of J. Alfred Prufrock by T. S. Eliot, he says, "I have heard the mermaids singing, each to each." That's T. S. Eliot. And I like to adapt that to say, "APIs have heard apps singing, each to each," and that's by me. Now, more specifically, when we talk about an API, what we're talking about is something called an application programming interface. And this is something that allows programs to talk to each other. Its most important use in terms of data science is that it allows you to get web data. It allows your program to go directly to the web on its own, grab the data, and bring it back in, almost as though it were local data. And that's a really wonderful thing. Now, the most common version of APIs for data science are called REST APIs. That stands for representational state transfer, which is the software architectural style of the World Wide Web. And it allows you to access data on web pages via HTTP, that's the hypertext transfer protocol that, you know, runs the web as we know it. And when you download the data, you usually get it in JSON format. That stands for JavaScript Object Notation. The nice thing about that is that it's human readable, but it's even better for machines. Then you can take that information and you can send it directly to other programs. And the nice thing about REST APIs is that they're what's called language agnostic, meaning any programming language can call a REST API, can get data from the web, and can do whatever it needs to with it. Now, there are a few kinds of APIs that are really common. The first is what are called social APIs. These are ways of interfacing with social networks. So for instance, the most common is Facebook. There's also Twitter, Google Talk has been a big one, and Foursquare as well. And then SoundCloud. These are on lists of the most popular ones.
And then there are also what are called visual APIs, which are for getting visual data. So for instance, Google Maps is the most common, but there's also the YouTube API, which lets you access YouTube from a particular website, or AccuWeather, which is for getting weather information, Pinterest for photos, and Flickr for photos as well. So these are some really common APIs, and you can program your computer to pull in data from any of these services and sites and integrate it into your own website or, here, into your own data analysis. Now there are a few different ways you can do this. You can program it in R, the statistical programming language. You can do it in Python. You can even do it in the very basic bash command line interface. And there's a ton of other applications. Basically, anything can access an API one way or another. Now, I'd like to show you how this works in R. So I'm going to open up a script in RStudio, and then I'm going to use it to get some very basic information from a web page. Let me go to RStudio and show you how this works. I've opened up a script in RStudio that allows me to do some data sourcing here. Now I'm just going to use a package called jsonlite. I'm going to load that one up. And then I'm going to go to a couple of websites. I'm going to be getting historical data from Formula One car races, and I'm going to be getting it from ergast.com. Now if we go to this page right here, I can just go straight to my browser right now. And this is what it looks like. It gives you the API documentation. So what you're doing for an API is you're just entering a web address, and that web address includes the information that you want. I'll go back to R here for a second. And if I want to get information about 1957 races in JSON format, I go to this address. I can skip over to that for a second. And what you see is it's kind of a big long mess here, but it is all labeled, and it's clear to the computer what's going on here.
I'll go back to R. And so what I'm going to do is I'm going to save that URL into an object here in R, and then I'm going to use the command fromJSON to read that URL and save it into R, which it has now done. And I'm going to zoom in on that so you can see what's happened. I've got this sort of mess of text. This is actually a list object in R. And then I'm going to get just the structure of that object. So I'm going to do this one right here. And you can see that it's a list, and it gives you the names of all the variables within each one of the lists. And what I'm going to do is I'm going to convert that list to a data frame. I went through the list and found exactly where the information I wanted was located, and you have to use this big long statement here. That'll give me the names of the drivers. Let me zoom in on that again. There they are. And then I'm going to get just the column names for that bit of the data frame. And so what I have here is six different variables. And then what I'm going to do is I'm going to pick just the first five cases, and I'm going to select some variables and put them in a different order. And when I do that, this is what I get. I'll zoom in on that again. And the first five people listed in this data set that I pulled in from 1957 are Juan Fangio, which makes sense, one of the greatest drivers ever, and other people who competed in that year. And so by using this API call in R, a very simple thing to do, I was able to pull data off that web page in a structured format and do a very simple analysis with it. And let's sum up what we've learned from all this. First off, APIs make it really easy to work with web data. They structure it, they call it for you, and then they feed it straight into the programs for you to analyze. And they're one of the best ways of getting data and getting started in data science.
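The demonstration above was done in R. As a rough parallel, here is a minimal Python sketch of the same idea: take a JSON response, dig into the nested structure to find the list of drivers, and keep a few columns. To keep it runnable offline, it parses a small hand-written sample instead of calling a live API. The nesting (`MRData`, `DriverTable`, `Drivers`) and the field names are assumptions for this sketch, and the helper `drivers_table` is hypothetical.

```python
import json

# A tiny hand-written stand-in for an API response. The structure and the
# field names are assumptions for illustration, not copied from real output.
SAMPLE_RESPONSE = """
{
  "MRData": {
    "DriverTable": {
      "season": "1957",
      "Drivers": [
        {"driverId": "fangio", "givenName": "Juan",
         "familyName": "Fangio", "nationality": "Argentine"},
        {"driverId": "moss", "givenName": "Stirling",
         "familyName": "Moss", "nationality": "British"},
        {"driverId": "hawthorn", "givenName": "Mike",
         "familyName": "Hawthorn", "nationality": "British"}
      ]
    }
  }
}
"""


def drivers_table(response_text, columns):
    """Parse JSON text and return one row per driver with the chosen columns."""
    data = json.loads(response_text)
    # Walk down the nested dictionaries to the list of driver records.
    drivers = data["MRData"]["DriverTable"]["Drivers"]
    return [[driver.get(col) for col in columns] for driver in drivers]


# Select some variables, put them in a different order, and show the first five.
rows = drivers_table(SAMPLE_RESPONSE, ["familyName", "givenName", "nationality"])
for row in rows[:5]:
    print(row)

# With a live API, the text could instead come from something like
# urllib.request.urlopen(url).read().decode("utf-8"), where url is the
# address given in the API's documentation.
```

The part that takes some poking around, in either language, is finding where in the nested list the records you want actually live.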
When you're looking for data, another great way of getting data is through scraping, and what that means is pulling information from web pages. I like to think of it as when data is hiding in the open. It's there, you can see it, but there's not an easy, immediate way to get that data. Now, when you're dealing with scraping, you can get data in several different formats. You can get HTML text from web pages. You can get HTML tables, the rows and columns that appear on web pages. You can scrape data from PDFs, and you can scrape data from all sorts of media like images and video and audio. Now, we'll make one very important qualification before we say anything else. Pay attention to copyright and privacy. Just because something is on the web doesn't mean you're allowed to pull it out. Information gets copyrighted. And so when I use examples here, I make sure that this is stuff that's publicly available. And you should do the same when you're doing your own analyses. Now, if you want to scrape data, there are a couple of ways to do it. Number one is to use apps that are developed for this. So for instance, import.io is one of my favorites. It's both a web page, that's its address, and a downloadable app. There's also ScraperWiki, there's an application called Tabula, and you can even do scraping in Google Sheets, which I'll demonstrate in a second, and in Excel. Or if you don't want to use an app, or if you want to do something that apps don't really let you do, you can code your scraper. You can do it directly in R or Python, or bash, or even Java or PHP. Now, what you're going to do is you're going to be looking for information on the web page. If you're looking for HTML text, what you're going to do is pull structured text from web pages, similar to how a reader view works in a browser. It uses HTML tags on the web page to identify what's the important information. So that's things like body, and h1 for header one, and p for paragraph, in the angle brackets.
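To make the idea of using HTML tags to pick out the important text a little more concrete, here is a minimal Python sketch built on the standard library's `html.parser` module. The class name `TagTextExtractor` and the sample page are made up for illustration, and the sketch assumes the wanted tags are not nested inside each other.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text found inside a chosen set of tags, such as h1 and p."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.current = None   # the wanted tag we are currently inside, if any
        self.results = []     # list of (tag, text) pairs in page order

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted_tags:
            self.current = tag
            self.results.append((tag, ""))

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        # Keep text only while inside a wanted tag; everything else is skipped.
        if self.current is not None:
            tag, text = self.results[-1]
            self.results[-1] = (tag, text + data)


# A made-up page standing in for something downloaded from the web.
PAGE = """<html><body>
<h1>Sample heading</h1>
<p>First paragraph with <b>bold</b> text.</p>
<div>Navigation links and other clutter</div>
<p>Second paragraph.</p>
</body></html>"""

extractor = TagTextExtractor(["h1", "p"])
extractor.feed(PAGE)
for tag, text in extractor.results:
    print(tag, "->", text)
```

The heading and paragraphs are kept and the clutter in the `div` is dropped, which is roughly what a reader view does.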
You can also get information from HTML tables, although this is a physical table of rows and columns I'm showing you. This also uses HTML tags, like table, and tr for table row, and td for table data, that's a cell. The trick is when you're doing this, you need the table number, and sometimes you just have to find that through trial and error. Let me give you an example of how this works. Let's take a look at this Wikipedia page on the Iron Chef America competition. I'm going to go to the web right now and show you that one. So here we are in Wikipedia, Iron Chef America. And if you scroll down a little bit, you see we've got a whole bunch of text here, we've got our table of contents. And then we come down here, and we have a table that lists the winners, the statistics for the winners. And let's say we want to pull that from this web page into another program for us to analyze. Well, there's an extremely easy way to do this with Google Sheets. All we need to do is open up a Google Sheet, and then in cell A1 of that Google Sheet, we paste in this formula. It's IMPORTHTML, then you give the web page, then you say that you're importing a table (you have to put that stuff in quotes), and then the index number for the table. I had to poke around a little bit to figure out that this one was table number two. So let me go to Google Sheets and show you how this works. Here I have a Google Sheet, and right now it's got nothing in it. But watch this. If I come here to this cell and I simply paste in that information, all this stuff sort of magically propagates into the sheet. It makes it extremely easy to deal with. And now I can, for instance, save this as a CSV file, put it in another program, lots of options. And so this is one way that I'm scraping the data from a web page, because I didn't use an API, I just used a very simple one-line command in Google Sheets to get the information. Now that was an HTML table. You can also scrape data from PDFs.
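Going back to the HTML table example for a moment: if you would rather code the scraper than use a spreadsheet formula, the same idea of picking a table by its index number can be sketched in Python with the standard library. The class `TableExtractor`, the helper `read_table`, and the sample page with its chef names and numbers are all made up for illustration. The sketch assumes simple, non-nested tables and counts tables starting from 1.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every table on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> found
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = ""

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(self._cell.strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data


def read_table(html, index):
    """Return table number `index` (counting from 1) as a list of rows."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[index - 1]


# A made-up page with two tables; the second one holds the data we want.
PAGE = """
<table><tr><td>Site menu</td></tr></table>
<table>
  <tr><th>Chef</th><th>Wins</th></tr>
  <tr><td>Chef A</td><td>10</td></tr>
  <tr><td>Chef B</td><td>7</td></tr>
</table>
"""

rows = read_table(PAGE, 2)
for row in rows:
    print(row)
```

As with the spreadsheet version, finding the right index is often a matter of trial and error, because pages tend to use tables for layout as well as for data.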
With PDFs, you have to be aware of whether it's a native PDF, I call that a text PDF, or a scanned or image PDF. And what scraping does with native PDFs is it looks for text elements. Again, those are like code that indicates this is text. And you can deal with raster images, that's pixel images, or vector images, which draw the lines, and that's what makes them infinitely scalable in many situations. And then in PDFs, you can deal with tabular data, but you probably have to use a specialized program like ScraperWiki or Tabula in order to get that. And then finally, media like images and video and audio. Getting images is easy, you can download them in a lot of different ways. And then if you want to read data from them, say for instance you have a heat map of a country, you can do that, but you'll probably have to write a program that loops through the image pixel by pixel to read the data and then encode it numerically into your statistical program. Now that's my very brief summary. And let's summarize that. First off, if the data you're trying to get at doesn't have an existing API, you can try scraping. And you can use specialized apps for scraping, or you can write code in a language like R or Python. But no matter what you do, be sensitive to issues of copyright and privacy, so you don't get yourself in hot water. Instead, you make an analysis that can be of great use to you or to your client.
The next step in data sourcing is making data. And specifically, we're talking about getting new data. I like to think of this as getting hands on: you're getting data de novo, new data. So you can't find the data that you need for your analysis? Well, one simple solution is do it yourself. And we're going to talk about a few general strategies used for doing that. Now, these strategies vary on a few dimensions. First off is the role. Are you passive and simply observing stuff that's happening already?
Or are you active, where you play a role in creating the situation to get the data? And then there's the Q/Q question. And that is, are you going to get quantitative or numerical data? Or are you going to get qualitative data, which usually means text, paragraphs, sentences, as well as things like photos and videos and audio? And also, how are you going to get the data? Do you want to get it online? Or do you want to get it in person? Now, there are other choices besides these, but these are some of the big delineators of the different methods. When you look at those, you get a few possible options. Number one is interviews, and I'll say more about those. Another one is surveys. A third one is card sorting. And the fourth one is experiments, although I actually want to split experiments into two kinds of categories. The first one is laboratory experiments. And that's in-person projects, where you shape the information or an experience for the participants as a way of seeing how that involvement changes their reactions. It doesn't necessarily mean that you're a participant, but you create the situation. And then there's also A/B testing. This is automated online testing of two or more variations on a web page. It's a very, very simple kind of experimentation that's actually very useful for optimizing websites. So in sum, from this very short introduction: make sure you can get exactly what you need, get the data you need to answer your question. And if you can't find it somewhere, then make it. And as always, you have many possible methods, each of which has its own strengths and its own compromises. And we'll talk about each of those in the following sections.
The first method of data sourcing where you're making new data that I want to talk about is interviews. And that's not because it's the most common, but because it's the one you would do for the most basic problem.
Now, basically, an interview is nothing more than a conversation with another person or a group of people. And the fundamental question is, why do interviews as opposed to doing a survey or something else? Well, there are a few good reasons to do that. Number one, you're working with a new topic, and you don't know what people's responses will be or how they'll react. And so you need something very open ended. Number two, you're working with a new audience, and you don't know how they will react in particular to what it is you're trying to do. And number three, something's going on with the current situation, it's not working anymore. And you need to find out what's going on, and you need to find ways to improve. The open-ended information you get from interviews, where you get past your existing categories and boundaries, can be one of the most useful ways of getting that data. If you want to put it another way, you want to do interviews when you don't want to constrain responses. Now, when it comes to interviews, you have one very basic choice. And that's whether you do a structured interview. With a structured interview, you have a predetermined set of questions, and everyone gets the same questions in the same order. It gives a lot of consistency, even though the responses are open ended. And then you can also have what's called an unstructured interview. And this is a whole lot more like a conversation between you as the interviewer and the person you're talking to, where your questions arise in response to their answers. Consequently, an unstructured interview can be different for each person that you talk to. Also, interviews are usually done in person, but not surprisingly, they can be done over the phone or, often, online. Now, a couple of things to keep in mind about interviews. Number one is time. Interviews can range from just a few minutes to several hours per person. Second is training. Interviewing is a special skill that usually requires specific training.
Now, asking the questions is not necessarily the hard part. The really tricky part is the analysis. The hardest part of interviews by far is analyzing the answers for themes and finding ways of extracting the new categories and the dimensions that you need for your further research. The beautiful thing about interviews is that they allow you to learn things that you never expected. So in sum, interviews are best for new situations or new audiences. On the other hand, they can be time consuming. And they also require special training, both to conduct the interview and, even more, to analyze the highly qualitative data that you get from them. An interesting topic in data sourcing when you're making data is card sorting. Now, this isn't something that comes up very often in academic research, but in web research, this can be a really important method. Think of it like building a model of a molecule, except here you're trying to build a mental model, or a model of people's mental structures. Put more specifically, how do people organize information intuitively? And also, how does that relate to the things that you're doing online? Now, the basic procedure goes like this. You take a bunch of little topics, and you write each one on a separate card. You can do this physically with, like, three by five cards, or there are a lot of programs that allow you to do a digital version of it. Then what you do is you give this information to a group of respondents, and the people sort those cards, so they put similar topics with each other, different topics over here, and so on. And then you take that information, and from that you're able to calculate what's called dissimilarity data. Think of it as the distance or the difference between various topics. And that gives you the raw data to analyze how things are structured. Now, there are two very general kinds of card sorting tasks. There's generative and there's evaluative. 
A generative card sorting task is one in which respondents create their own sets, their own piles of cards, using any number of groupings they like. And this might be used, for instance, to design a website. If people are going to be looking for one kind of information next to another one, then you want to put those together on the website so they know where to expect it. On the other hand, if you've already created a website, then you can do an evaluative card sort. This is where you have a fixed number or fixed names of categories, like, for instance, the way you've set up your menus already. And then what you do is you see if people naturally put the cards into these various categories that you've created. That's a way of verifying that your hierarchical structure makes sense to people. Now, whichever method you do, generative or evaluative, what you end up with when you do a card sort is an interesting kind of visualization. It's called a dendrogram, which actually means a tree or branching diagram. And what we have here is actually 150 data points. If you're familiar with Fisher's Iris data, that's what's going on here. It starts from one giant group on the left and then splits it into pieces and pieces and pieces until you end up with lots of different, well, actually individual-level observations at the end. But you can cut things off into two or three groups, or wherever it's most useful for you, as a way of visualizing the entire collection of similarity or dissimilarity between the individual pieces of information that you had people sort. Now, I'll just mention very quickly, if you want to do digital card sorting, which makes your life infinitely easier because keeping track of physical cards is really hard, you can use something like Optimal Workshop or UserZoom or UX Suite. These are some of the most common choices. Now, let's just sum up what we've learned about card sorting in this extremely brief overview. 
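Before that summary, here is one way the dissimilarity calculation from a card sort might be sketched in Python. This is only an illustration under assumptions: the card names and the three respondents' piles are made up, and dissimilarity is defined here as the share of respondents who put two cards in different piles, which is one common choice but not necessarily what any particular card sorting program does.

```python
from itertools import combinations

# Hypothetical generative card sorts: each respondent groups the same
# four cards into piles of their own choosing (all names are made up).
sorts = [
    [{"pricing", "plans"}, {"contact", "support"}],
    [{"pricing", "plans", "contact"}, {"support"}],
    [{"pricing", "plans"}, {"contact"}, {"support"}],
]

def dissimilarity(sorts):
    """Share of respondents who put each pair of cards in different piles."""
    cards = sorted(set().union(*(pile for s in sorts for pile in s)))
    result = {}
    for a, b in combinations(cards, 2):
        # Count the respondents who placed both cards in the same pile.
        together = sum(any(a in pile and b in pile for pile in s) for s in sorts)
        result[(a, b)] = 1 - together / len(sorts)
    return result

d = dissimilarity(sorts)
# List the pairs from most similar (0.0) to most different (1.0).
for pair, value in sorted(d.items(), key=lambda kv: kv[1]):
    print(pair, round(value, 2))
```

A table of pairwise values like this is the kind of raw data that a hierarchical clustering routine would turn into a dendrogram.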
Number one, card sorting allows you to see the intuitive organization of information in a hierarchical format. You can do it with physical cards, or you also have digital choices for doing the same thing. And when you're done, you actually get this hierarchical or branched visualization of how the pieces of information are structured and related to each other. When you're doing your data sourcing and you're making data, sometimes you can't get what you want through the easy ways, and you've got to take the hard way. And you can do what I'm calling laboratory experiments. Now, of course, when I mention laboratory experiments, people start to think of stuff like, you know, Dr. Frankenstein in his lab, but lab experiments are less like this. And in fact, they're a little more like this. Nearly every experiment I have done in my career has been a paper and pencil one with people in a well-lighted room. And it's not been the threatening kind. Now, the reason you do a lab experiment is because you want to determine cause and effect. And this is the single most theoretically viable way of getting that information. Now, what makes an experiment an experiment is the fact that researchers play active roles in experiments with manipulations. Now, people get a little freaked out when they hear manipulations. They think that you're coercing people and messing with their minds. All that means is you are manipulating the situation. You're causing something to be different for one group of people or one situation than another. It's a benign thing, but it allows you to see how people react to those different variations. Now, if you're going to do an experiment, you're going to want to have focused research. It's usually done to test one thing or one variation at a time. And it's usually hypothesis driven. You usually don't do an experiment until you've done enough background research to say, I expect people to react this way to this situation and this way to the other. 
A key component of all of this is that experiments almost always have random assignment. So regardless of how you got your sample, when they're in your study, you randomly assign them to one condition or another. And what that does is it balances out the preexisting differences between groups. And that's a great way of taking care of confounds and artifacts, the things that are unintentionally associated with differences between groups and that provide alternate explanations for your data. If you've done good random assignment and you have a large enough group of people, then those confounds and artifacts are basically minimized. Now, some places where you're likely to see laboratory experiments in this version are, for instance, eye tracking and web design. That's where you have to bring people in front of a computer and you stick a thing there that sees where they're looking. That's how we know, for instance, that people don't really look at ads on the side of web pages. Another very common place is research in medicine and education, and in my field, psychology. And in all of these, what you find is that experimental research is considered the gold standard for reliable, valid information about cause and effect. On the other hand, while it's a wonderful thing to have, it does come at a cost. Here's how that works. Number one, experimentation requires extensive specialized training. It's not a simple thing to pick up. Two, experiments are often very time consuming and labor intensive. I've known some that take hours per person. And number three, experiments can be very expensive. So what that all means is you want to make sure that you've done enough background research, and you need to have a situation where it's sufficiently important to get really reliable cause and effect information to justify these costs for experimentation. In sum, laboratory experimentation is generally considered the best method for assessing causality. 
That's because it allows you to control for confounds through randomization. On the other hand, it can be difficult to do. So be careful and thoughtful when considering whether you need to do an experiment and how to actually go about doing it. There's one final procedure I want to talk about in terms of data sourcing and making new data. It's a form of experimentation, and it's simply called A/B testing. And it's extremely common in the web world. So for instance, I just barely grabbed a screenshot of amazon.com's homepage. And you've got these various elements on the homepage. And I just noticed, by the way, when I did this, that this woman is actually an animated GIF, so she moves around. That was kind of weird, never seen that before. But the thing about this is this entire layout, how things are organized and how they're on there, will have been determined by variations on A/B testing by Amazon. Here's how it works. For your web page, you pick one element, like, what's the headline, or what are the colors, or what's the organization, or how do you word something? And you create multiple versions, maybe just two, version A and version B, which is why it's called A/B testing. And then when people visit your web page, you randomly assign those visitors to one version or another. You have software that does that for you automatically. And then you compare the response rates on some response. I'll show you those in a second. And then once you have enough data, you implement the best version, you sort of set that one solid, and then you go on to something else. Now, in terms of response rates, there are a lot of different outcomes you can look at. You can look at how long a person's on a page, you can actually do mouse tracking if you want to, you can look at click-throughs, and you can also look at shopping cart value and/or abandonment. A lot of possible outcomes. 
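The procedure just described (randomly assign visitors, then compare response rates) can be sketched roughly like this in Python. Treat it as a minimal illustration under assumptions: the visitor and click counts are invented, the function names are hypothetical, and the comparison uses a standard two-proportion z-test, which is one reasonable choice and not necessarily what any given A/B testing package does internally.

```python
import math
import random

def assign_version(rng):
    """Randomly assign a visitor to version A or B with equal probability."""
    return rng.choice(["A", "B"])

def two_proportion_z_test(clicks_a, visitors_a, clicks_b, visitors_b):
    """Return (z, two-sided p-value) for the difference in click-through rates."""
    p_a = clicks_a / visitors_a
    p_b = clicks_b / visitors_b
    # Pooled rate under the null hypothesis that both versions perform the same.
    pooled = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Random assignment of a few simulated visitors (seeded so it is repeatable).
rng = random.Random(0)
versions = [assign_version(rng) for _ in range(1000)]

# Made-up click counts, for illustration only.
z, p = two_proportion_z_test(clicks_a=120, visitors_a=2400,
                             clicks_b=156, visitors_b=2400)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The same comparison would work for any yes/no outcome, such as whether a visitor abandoned a shopping cart; outcomes like time on page or cart value would need a test for averages instead.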
All of these outcomes contribute, through A/B testing, to the general concept of website optimization: making your website as effective as it can possibly be. Now, the idea also is that this is something you're going to do a lot. You can perform A/B tests continually. In fact, I've seen one person say that what A/B testing really stands for is always be testing. Kind of cute, but it does give you the idea that improvement is a constant process. Now, if you want some software to do A/B testing, two of the most common choices are Optimizely and VWO, which stands for Visual Website Optimizer. Many others are available, but these are especially common. And when you get the data, you're going to use statistical hypothesis testing to compare the differences, or really, the software does it for you automatically. But you may want to adjust the parameters, because most software packages cut off testing a little too soon, and the information is not quite as reliable as it should be. But in sum, here's what we can say about A/B testing. It is a version of website experimentation done online, which makes it really easy to get a lot of data very quickly. It allows you to optimize the design of your website for whatever outcome is important to you. And it can be done as a series of continual assessments, testing and development, to make sure that you're accomplishing what you want to as effectively as possible for as many people as possible. The next logical step in data sourcing and making data is surveys. Now think of this: if you want to know something, just ask. That's the easy way. And you want to do a survey under certain situations. The real question is, do you know your topic and your audience well enough to anticipate their answers, to know the range of their answers and the dimensions and the categories that are going to be important? If you do, then a survey might be a good approach. 
Now, just as there were a few dimensions for interviews, there are a few dimensions for surveys. You can do what's called a closed-ended survey. That's also called forced choice. It's where you give people just particular options, like a multiple choice. You can have an open-ended survey, where you have the same questions for everybody, but you allow them to write in a free-form response. You can do surveys in person. And you can also do them online or over the mail or phone or however. And now it's very common to use software when doing surveys. Some really common applications for online surveys are SurveyMonkey and Qualtrics, or at the very simple end, there's Google Forms, and at the simple and pretty end, there's Typeform. There are a lot more choices, but these are some of the major players in how you can get data from online participants in survey format. Now, the nice thing about surveys is, you know, they're really easy to do, they're very easy to set up, and they're really easy to send out to large groups of people. You can get tons of data really fast. On the other hand, the same way that they're easy to do, they're also really easy to do badly. The problem is that the questions you ask can be ambiguous, they can be double barreled, they can be loaded. And the response scales can be confusing. So if you say, I never think this particular way, and a person puts strongly disagree, they may not know exactly what you're trying to get at. So you have to take special effort to make sure that the meaning is clear and unambiguous, and that the rating scale, the way that people respond, is very clear, and they know where their answer falls. Which gets us into one of the things about people behaving badly. And that is, beware the push poll. Now, especially during election time, like we're in right now, a push poll is something that sounds like a survey. 
But really what it is, is a very biased attempt to get data, just fodder for social media campaigns, or, I'm going to make a chart that says that 98% of people agree with me. A push poll is one that's so biased, there's really only one way to answer the questions. This is considered extremely irresponsible and unethical from a research point of view. Just hang up on them. Now, aside from that egregious violation of research ethics, you do need to do other things, like watch out for bias in the question wording, in the response options, and also in the sample selection, because any one of those can push your responses off one way or another without you really being aware that it's happening. So in sum, let's say this about surveys. You can get lots of data quickly. On the other hand, it requires familiarity with the possible answers in your audience, so you know sort of what to expect. And no matter what you do, you need to watch for bias to make sure that your answers are going to be representative of the group that you're really concerned about understanding. The very last thing I want to talk about in terms of data sourcing is the next steps. And probably the most important thing is, you know, don't just sit there. I want you to go and see what you already have, and try to explore some open data sources. And if it helps, check with a few data vendors. And if those don't give you what you need to do your project, then consider making new data. Again, the idea here is get what you need and get going. Thanks for joining me, and good luck on your own projects. Welcome to coding and data science. I'm Bart Polson. And what we're going to do in this series of videos is we're going to take a little look at the tools of data science. So I'm inviting you to know your tools, but probably even more important than that is to know their proper place. 
Now, I mention that because a lot of the time when people talk about data tools, they talk about them as though they were the same thing as data science, as though they were the same set. But I think if you look at it for just a second, that's not really the case. Data tools are simply one element of data science, because data science is made up of a lot more than the tools that you use. It includes things like business knowledge, it includes the meaning making and interpretation, and it includes social factors. And so there's much more than just the tools involved. That being said, you will need at least a few tools. And so we're going to talk about some of the things that you can use in data science, if they work well for you. In terms of getting started, here are the basic things. Number one is spreadsheets. It's the universal data tool, and I'll talk about how they play an important role in data science. Number two is a visualization program called Tableau. There's Tableau Public, which is free, and there's Tableau Desktop, and there's also something called Tableau Server. But Tableau is a fabulous program for data visualization. And I'm convinced that for most people it provides the great majority of what they need. And then, while it's not a tool, I do need to talk about the formats used in web data, because you have to be able to navigate that when doing a lot of data science work. Then we can talk about some of the essential tools for data science. Those include the programming language R, which is specifically for data. There's the general purpose programming language Python, which has been well adapted to data. And there's the database language SQL, or sequel, for Structured Query Language. Then if you want to go beyond that, there are some other things that you can do. There are the general purpose programming languages C, C++ and Java, which are very frequently used to form the foundation of data science, and sort of high-level production code is going to rely on those as well. 
There's the command line interface language Bash, which is very common as a very quick tool for manipulating data. And then there's the sort of wild card, supercharged regular expressions, or regex. We'll talk about all of these in separate courses. But as you consider all the tools that you can use, don't forget the 80/20 rule, also known as the Pareto principle. And the idea here is that you're going to get a lot of bang for your buck out of a small number of things. And I'm going to show you a little sample graph here. Imagine that you have 10 different tools, and we'll call them A through J. A does a lot for you, B does a little bit less, and it kind of tapers down until you've got a bunch of tools that each do just a little bit of the stuff that you need. Now, instead of looking at the individual effectiveness, look at the cumulative effectiveness. How much are you able to accomplish with a combination of tools? Well, the first one's right here at 60%, where tool A started. Then you add on the 20% from B and it goes up, and then you add on C and D, and you add on smaller and smaller pieces. And by the time you get to the end, you've got 100% of effectiveness from your 10 tools combined. The important thing about this is you only have to go to the second tool. That's two out of 10, so that's B, that's 20% of your tools. And in this made-up example, you've got 80% of your output. So 80% of the output from 20% of the tools. That's a fictional example of the Pareto principle, but I find in real life it tends to work something approximately like that. And so you don't necessarily have to learn everything, and you don't have to learn how to do everything in everything. Instead, you want to focus on the tools that will be most productive, and specifically most productive for you. So in sum, let's say these three things. Number one, coding, or simply the ability to manipulate data with programs and computers, is important. 
But data science is much greater than the collection of tools that's used in it. And then finally, as you're trying to decide what tools to use and what you need to learn and how to work, remember the 80/20 rule. You're going to get a lot of bang from a small set of tools. So focus on the things that are going to be most useful for you in conducting your own data science projects. As we begin our discussion of coding and data science, I actually want to begin with something that's not coding. I want to talk about applications, or programs that are already created, that allow you to manipulate data. And we're going to begin with the most basic of these: spreadsheets. We're going to do the rows and columns and cells of Excel. And the reason for this is you need spreadsheets. Now, you may be saying to yourself, no, no, no, not me, because you know what, I'm fancy. I'm working on my big set of servers, I've got fancy things going on. But you know what, you too, fancy people, you need spreadsheets as well. Most importantly, spreadsheets can be the right tool for data science in a lot of circumstances. There are a few reasons for that. Number one, spreadsheets, they're everywhere, they're ubiquitous, they're installed on a billion machines around the world. And everybody uses them. There are probably more data sets in spreadsheets than anything else. And so it's a very common format. Importantly, it's probably your client's format. A lot of your clients are going to be using spreadsheets for their own data. I've worked with billion-dollar companies that keep all of their data in spreadsheets. And so when you're working with them, you need to know how to manipulate that and how to work with it. Also, regardless of what you're doing, spreadsheets, or specifically CSV, comma-separated value files, are sort of the lingua franca, the universal interchange format for data transfer, to allow you to take it from one program to another. 
And then truthfully, in a lot of situations, they're really easy to use. And if you want a second opinion on this, let's take a look at this ranking. There's a survey of data mining experts, the KDnuggets data mining poll. And these are the tools they most use in their own work. And look at this: lowly Excel is fifth on the list. And in fact, what's interesting about it is that it's above Hadoop and Spark, two of the major big data fancy tools. And so Excel really does have pride of place in a toolkit for data analysts. Now, since we're going to go into sort of the low-tech end of things, let's talk about some of the things that you can do with a spreadsheet. Number one, they're really good for data browsing. You actually get to see all the data in front of you, which isn't true if you're doing something like R or Python. They're really good for sorting data, sort by this column, then this column, then this column. They're really good for rearranging columns and cells and moving things around. They're good for finding and replacing, and seeing what happens, so you know that it worked right. Some more uses: they're really good for formatting, especially conditional formatting. They're good for transposing data, switching the rows and the columns. They make that really easy. They're good for tracking changes. Now, it's true, if you're a big fancy data scientist, you're probably using GitHub, but for everybody else in the world, spreadsheets and the tracking changes feature are a wonderful way to do it. You can make pivot tables, which allow you to explore the data in a very hands-on way, in a very intuitive way. And they're also really good for arranging the output for consumption. Now, when you're working with spreadsheets, however, there's one thing you need to be aware of. They're really flexible, but that flexibility can be a problem. And that is, when you're working in data science, you specifically want to be concerned about something called tidy data. 
That's a term I borrowed from Hadley Wickham, a very well-known developer in the R world. Tidy data is for transferring data and making it work well. There are a few rules here that undo some of the flexibility inherent in spreadsheets. Number one, what you want to do is have a column be exactly the same thing as a variable. Columns, variables, they are the same thing. And then rows are exactly the same thing as cases. And then you have one sheet per file, and then you have one level of measurement, say individual, then organization, then state, per file. Again, this is undoing some of the flexibility that's inherent in spreadsheets, but it makes it really easy to move the data from one program to another. Let me show you how all this works. You can try this in Excel. If you've downloaded the files for this course, we simply want to open up this spreadsheet. Let me go to Excel and show you how it works. So when you open up the spreadsheet, what you get is totally fictional data here that I made up, but it's showing sales over time of several products at two locations, like if you're selling stuff at a baseball field. And this is the way spreadsheets often appear. We've got blank rows and columns, we've got stuff arranged in a way that makes it easy for the person to process it, and we've got totals here, with formulas putting them all together. And that's fine. That works well for the person who made it. And then that's for one month. And then we have another month right here, and we have another month right here. And then we combine them all for the first quarter of 2014. We've got some headers here, we've got some conditional formatting and changes. And if we come to the bottom, we've got a very busy line graphic that eventually loads. It's not a good graphic, by the way, but it's similar to what you will often find. 
So this is the stuff that, well, it may be useful for the client's own personal use, but you know, you can't feed this into R or Python. It'll just choke, and it won't know what to do with it. And so you need to go through a process of tidying up the data. And what this involves is undoing some of this stuff. So for instance, here's data that is almost tidy. Here we have a single column for the date, a single column for the day, and a column for the site, so we have two locations, A and B. And then we have six columns for the six different things that are sold and how many were sold on each day. Now, in certain situations, you would want the data laid out exactly like this. If you're doing, for instance, a time series, you'll do something vaguely similar to this. But for truly tidy stuff, we're going to collapse it even further. Let me come here to the tidy data. And now what I've done is I've created a new column that says what the item being sold is. And so, by the way, what this means is that we've got a really long data set now. It's got over 1000 rows. Come back up to the top here. But what that shows you is that now it's in a format that's really easy to import from one program to another. That makes it tidy. And you can re-manipulate it however you want once you get it into each of those. So let's sum up our little presentation here in a few lines. Number one, no matter who you are, no matter what you're doing in data science, you need spreadsheets. And the reason for that is that spreadsheets are often the right tool for data science. Keep one thing in mind, though, and that is, as you're moving back and forth from one language to another, tidy data, or well-formatted data, is going to be important for exporting data into your analytical program or language of choice. As we move through coding and data science, and specifically the applications that can be used, there's one that stands out for me more than almost anything else. And that's Tableau and Tableau Public. 
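Going back to the tidying step for a moment, the collapse from the almost-tidy layout (one column per item) to the long, tidy layout (one row per date, site, and item) might look something like this in Python. This is only a sketch under assumptions: the two sample rows, the item names, and the column names `item` and `sales` are made up for illustration and are not taken from the course spreadsheet.

```python
# Hypothetical "almost tidy" rows: one column per item sold.
# The item names and numbers are invented for illustration.
wide_rows = [
    {"date": "2014-01-01", "day": "Wed", "site": "A", "hot_dogs": 12, "sodas": 30},
    {"date": "2014-01-01", "day": "Wed", "site": "B", "hot_dogs": 7, "sodas": 18},
]

# Columns that identify a case and should be repeated on every long row.
ID_COLUMNS = ("date", "day", "site")

def to_long(rows, id_columns=ID_COLUMNS):
    """Collapse the item columns into an 'item' column and a 'sales' column."""
    long_rows = []
    for row in rows:
        ids = {name: row[name] for name in id_columns}
        for column, value in row.items():
            if column not in id_columns:
                # One output row per (date, day, site, item) combination.
                long_rows.append({**ids, "item": column, "sales": value})
    return long_rows

tidy = to_long(wide_rows)
for row in tidy:
    print(row)
```

Two wide rows with two item columns each become four long rows, which is why the tidy version of a real spreadsheet can run to over a thousand rows.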
Now, if you're not familiar with these, these are visualization programs. The idea here is that when you have data, the most important thing you can do is to first look and see what you have and work with it from there. And in fact, I'm convinced that for many organizations, Tableau might be all that they really need. It will give them the level of insight that they need to work constructively with data. So let's take a quick look by going to tableau.com. Now, there are a few different versions of Tableau. Right here, we have Tableau Desktop and Tableau Server. And these are the paid versions of Tableau. They actually cost a lot of money, unless you work for a nonprofit organization, in which case you can get them for free, which is a beautiful thing. What we're usually looking for, however, is not this paid version. We're looking for something called Tableau Public. And if you come in here and go to products, we've got these three paid ones, and over here is Tableau Public. When we click on that, it brings us to this page. It's public.tableau.com. And this is the one that has what we want. It's a free version of Tableau with one major caveat: you don't save files locally to your computer, which is why I didn't give you a file to open. Instead, it saves them to the web in a public form. So if you're willing to trade privacy, you can get an immensely powerful application for data visualization. That's a catch for a lot of people, which is why people are willing to pay a lot of money for the desktop version. And again, if you work for a nonprofit, you can get the desktop version for free. But I'm going to show you how things work in Tableau Public, so that's something that you can work with personally. The first thing you want to do is you want to download it. And so you put in your email address, you download, and it's going to know what you're on. It's a pretty big download. And once it's downloaded, you can install and open up the application. 
And here I am in Tableau Public. Right here, this is the blank version. By the way, you also need to create an account with Tableau in order to save your stuff online and to see it. We'll show you what that looks like. But you're presented with a blank thing right here. And the first thing you need to do is you need to bring in some data. I'm going to bring in an Excel file. Now, if you've downloaded the files for the course, you'll see that there's this one right here, DSO three, two, two Tableau public dot xlsx. It's an Excel file. And in fact, it's the one that I used in talking about spreadsheets in the first video in this course. I'm going to select that one, and I'm going to open it. Now, a lot of programs don't like bringing in Excel, because it's got all the worksheets and all the weirdness in it. This one works better with it. But what I'm going to do is I'm going to take the tidy data. By the way, you see that it put them in alphabetical order here. I'm going to take tidy data, and I'm just going to drag it over to let it know that it's the one that I want. And now what it does is it shows me a version of the data set, along with things that you can do here. You can rename it, you can create bins or groups, there are a lot of things that you can do here. I'm going to do something very, very quick with this particular one. Now, I've got the data set right here. What I'm going to do now is I'm going to go to a worksheet. That's where you actually create things, so cancel that and go to worksheet one. Okay, this is a drag and drop interface. And so what we're going to do is we're going to pull in the bits and pieces of information we want to make graphics. There's immense flexibility here. I'm going to show you two very basic ones. I'm going to look at the sales of my fictional ballpark items. So I'm going to grab sales right here, and I'm going to put that as the field that we're going to measure. Okay, and you see it put it down right here. 
And this is our total sales. We're going to break it down by item and by time. So let me take item right here, and you can drag it over here, or I can put it right up here into rows. Those will be my rows. And that's how many we've sold in total of each of the items. Fine, that's really easy. And then let's take date, and we'll put that here in columns to spread it across. Now, by default, it's doing it by year. I don't want to do that, because I only have three months of data. And so what I can do is I can click right here, and I can choose a different timeframe. I can go to quarter, but that's not going to help, because I only have one quarter's worth of data. That's three months. I'm going to come down to week. Actually, let me go to day. If I do day, you see it gets enormously complicated. So that's no good. So I'm going to back up to week. And I've got a lot of numbers there, but what I want is a graph. And so to get that, I'm going to come over here and click on this and tell it that I want a graph. And so we're seeing the information, except it lost item. So I'm going to bring in item, and I'm going to put it back up into this graph to say this is a row for the data. And now I've got rows for sales by week for each of my items. That's great. I want to break it down one more by putting in the site, the place where it was sold. So I'm going to grab that, and I'm going to put it right over here. And now you see I've got it broken down by the item that is sold and the different sites. And I'm going to color the sites, and all I've got to do for that is grab site and drag it onto color. Now I've got two different colors for my sites. And this makes it a lot easier to tell what's going on. And in fact, there's some other cool stuff you can do. One of the things I can do is come over here to analytics, and I can tell it, for instance, to put an average line through everything. So I'll just drag this over here. See, now we have the average for each line. That's good. 
And I can even do forecasting. Let me get a little bit of a forecast right here. I'll drag this on. And if you go over here, I can get this out of the way for a second. Now I have a forecast for the next few weeks. And that's a really convenient, quick and easy thing. And again, for some organizations, that might be all that they really need. And so what I'm showing you here is the absolute basic operation of Tableau, which allows you to do an incredible range of visualizations and manipulate the data and create interactive dashboards. There's so much to it. We'll show that in another course. But for right now, I want to show you one last thing about Tableau public. And that is saving the files. So now when I come here and save it, it's going to ask me to sign into Tableau public. Now I sign in and ask me how I want to save this same name as the video. There we go. And I'm going to hit save. And then that opens up a web browser. And since I'm already logged into my account, see, here's my account, my profile. Here's the page that I created. And it's got everything I need there. I'm going to edit just a few details. I'm going to say, for instance, I'm going to leave its name like that I could put more of a description in there if I wanted. I can allow people to download the workbook and its data. I'm going to leave that there so you can download it if you need to. If I had more than one tab, I would do this thing that says show the different sheets as tabs, hit save. And there's my data set. And also, it's published online and people can now find it. And so what you have here is an incredible tool for creating interactive visualizations. You can create them with drop down menus and you can rearrange things and you can make an entire dashboard. It's a fabulous way of presenting information. And as I said before, I think that for some organizations, this may be as much as they need to get really good useful information out of their data. 
And so I strongly recommend that you take some time to explore Tableau, either the paid desktop version or the public version, and see what you can do to get some really compelling and insightful visualizations out of your work in data science.
For many people, their first experience of coding and data science is with the application SPSS. Now, when I think of SPSS, the first thing that comes to my mind is sort of life in the ivory tower, though this looks more like, you know, Harry Potter. But if you think about it, the package name SPSS comes from Statistical Package for the Social Sciences, although if you ask IBM about it now, they'll act like it doesn't stand for anything. It has its background in social science research, which is generally academic, and truthfully, I'm a social psychologist, and that's where I first learned how to use SPSS. But let's take a quick look at their webpage, ibm.com/spss. If you type that in, that's just an alias that'll take you to IBM's main webpage. Now, IBM didn't create SPSS, but they bought it around version 16. It was very briefly known as PASW, Predictive Analytics Software, but that only lasted briefly. Now it's back to SPSS, which is where it's been for a long time. SPSS is a desktop program. It's pretty big. It does a lot of things. It's very powerful. It's used in a lot of academic research. It's also used in a lot of business consulting, management, and even some medical research. And the thing about SPSS is it looks like a spreadsheet, but it has dropdown menus to make your life a little bit easier compared to some of the programming languages that you can use. Now, you can get a free temporary version. If you're a student, you can get a cheap version. Otherwise, SPSS costs a lot of money. But if you have it one way or the other, when you open it up, this is what it's going to look like. I'm showing SPSS version 22. Now it's currently on 24. 
And the thing about SPSS versioning is in any other software package, these would be point updates. So I sort of feel like we should be on 17.3 as opposed to 22 or 24 because the variations are so small that anything you learn from the earlier ones is going to work in the later ones. And there's a lot of backwards and forwards compatibility. So I'd almost say that this one, the version you have practically doesn't matter. You get this little welcome splash screen. And if you don't want to see it anymore, you can get rid of it. I'm just going to hit cancel here. And this is our main interface looks a lot like a spreadsheet. The difference is you have a separate pane for looking at variable information. And then you have separate windows for output and then an optional one for something called syntax. But let me show you how this works by first opening up a data set. SPSS has a lot of sample data sets in them, but they're not easy to get to and they're really well hidden. On my Mac, for instance, let me go to where they are. In my Mac, I go to the finder, I have to go Mac to applications to the folder IBM to SPSS to statistics to 22 the version number to samples. Then I have to say I want the ones that are in English. And then it brings them up. The dot SAV files are the actual data files. There are different kinds in here. So dot SPS is a different kind of file. And then we have a different one about planning analysis. So there are versions of it. I'm going to open up a file here called market values dot SAV, the small data set in SPSS format. And if you don't have that, you can open up something else. It really doesn't matter for now. By the way, in case you haven't noticed SPSS tends to be really, really slow when it opens. It also, despite being a version 24, tends to be kind of buggy and crashes. And so when you work with SPSS, you want to get in the habit of saving your work constantly, and also being patient when it's time to open the program. 
So here's a data set that just shows addresses, house values, and square footage for some houses. I don't even know if this is real information; it looks artificial to me. But SPSS lets you do point and click analysis, which is unusual for a lot of things. So I'm going to come up here and say, for instance, make a graph. I'm actually going to use what's called a legacy dialog to get a histogram of house prices. So I simply click value, put that right there, put a normal curve on top of it, and hit OK. Then it's going to open up a new window, and it opened up a microscopic version of it here, so I'm going to make that bigger. This is the output window. It's a separate window, and it has a navigation pane here on the side. It tells me where the data came from, and it saves the command here. And then, you know, there's my default histogram. We see most of the houses were right around $125,000, and then they went up to at least $400,000. I have a mean of $256,000, a standard deviation of about $80,000, and there are 94 houses in the dataset. Fine, that's great. The other thing I can do is run some analyses, so let me go back to the data just for a moment. For instance, I can come here to Analyze and I can do Descriptives. Actually, excuse me, I'm going to do one here called Explore. I'll take the purchase price and put it right here, and I'm going to get a whole bunch of stuff just by default. I'm going to hit OK. And it goes back to the output window. Once again, it made it tiny. And so now you see that beneath my chart I have a table, I've got a bunch of information, a stem and leaf plot, and I've got a box plot too, a great way of checking for outliers. And so this is a really convenient way to save things. You can export this information as images. 
You can export the entire file as HTML. You can do it as a PDF or a PowerPoint. There are a lot of options here, and you can customize everything that's on here. Now, I just want to show you one more thing that makes your life so much easier in SPSS. You see right here that it's putting down these commands. It's actually saying graph and then histogram normal equals value. And then down here, we've got this little command right here. Most people don't know how to save their work in SPSS, and they kind of just have to do it over again every time. But there's a very simple way to do this. What I'm going to do is open up something called a syntax file. I'm going to go to New, Syntax. This is just a blank window that's a programming window. It's for saving code. And let me go back to the analysis I did a moment ago. I'll go back to Analyze. I can still get at it right here, under Descriptives and Explore, and my information is still there. What happens here is that even though I set it up with dropdown menus and point and click, if I do this thing, Paste, then it takes the code that creates that command and saves it to the syntax window. And this is just a text file. It saves it as .sps, but it's a text file that can be opened in anything. What's beautiful about this is it's really easy to copy and paste. You can even take this into something like Word and do find and replace on it. And it's really easy to replicate the analysis. So for me, SPSS is a good program, but until you use syntax, you don't know the true power of it, and it makes your life so much easier as a way of operating it. Anyhow, this is my extremely brief introduction to SPSS. All I want to say is that it's a very common program. It kind of looks like a spreadsheet, but it gives you a lot more power and options, and you can use both dropdown menus and text-based syntax commands to automate your work and make it easier to replicate it in the future. 
I want to take a look at one more application for coding and data science that's called JASP. This is a new application, not very familiar to a lot of people and still in beta, but with amazing promise. You can basically think of it as a free version of SPSS. And you know what? We love free. But JASP is not just free. It's also open source, it's intuitive, and it makes analyses replicable. It even includes Bayesian approaches. So take that all together and, you know, we're pretty happy and we're jumping for joy. Before we move on, you just may be asking yourself, JASP, what is that? Well, the creators emphatically deny that it stands for "just another statistics program." Be that as it may, we'll just go ahead and call it JASP and use it very happily. You can get to it by going to jasp-stats.org. Let's take a look at that right now. JASP is a new program. They call it a low-fat alternative to SPSS, but it is a really wonderful way of doing statistics. You're going to want to download it; just find your platform. It even comes in a Linux version, which is beautiful. And again, it's in beta, so stay posted. Things are updating regularly. If you're on a Mac, you're going to need to use XQuartz, but that's an easy thing to install, and it makes a lot of things work better. And it's a wonderful way to do analyses. When you open up JASP, it's going to look like this. It's a pretty blank interface, but it's really easy to get going with it. For instance, you can come over here to File and you can even choose some example data sets. Here's one called Big Five. That's personality factors. And you've got data here that's really easy to work with. Let me scroll this over here for a moment. So there are our five variables. Let's do some quick analyses with these. Say, for instance, we want to get descriptives. We can pick a few variables. 
Now, if you're familiar with SPSS, the layout feels very much the same and the output looks a lot the same. All I have to do is select what I want and it immediately pops up over here. Then I can choose additional statistics. I can get quartiles. I can get the median. And you can choose plots. Let's get some plots. All I do is click on it and they show up. That's a really beautiful thing. And you can modify these things a little bit. So for instance, I can take the plots and let's see if I can drag that down. If I make it small enough, I can see the five plots. Well, I went a little too far on that one. Anyhow, you can do a lot of things here. I can hide this, I can collapse that, and I can go on and do other analyses. Now, what's really neat, though, is what happens when you navigate away from it. I just clicked in the blank area of the results pane, and we're back to the data here. But if I click on one of these tables, like this one right here, it immediately brings up the commands that produced it, and I can modify it some more if I want. Say I want skewness and kurtosis. Boom, they're in there. It's an amazing thing. Then I can come back out here, click away from that, come down to the plots, and expand those. If I click on that, it brings up the commands that made them. It's an amazingly easy and intuitive way to do things. Now, there's another really nice thing about JASP, and that is that you can share the information online really well through a service called OSF. That stands for the Open Science Framework, and its web address is osf.io. So let's take a quick look at what that's like. Here's the Open Science Framework website. It's a wonderful service. It's free, and it's designed to support open, transparent, accessible, accountable, collaborative research. I really can't say enough nice things about it. What's neat about this is that once you sign up for OSF, you can create your own area. And I've got one of my own. 
I'll go to that right now. So for instance, here's the data lab page in Open Science Framework. And what I've done is I created a version of this JASP analysis and I've saved it here. In fact, let's open up my JASP analysis in JASP. And then I'll show you what it looks like in OSF. So let's first go back to JASP. And when we're here, we can come over to File and click Computer. And I just saved this file to the desktop. Look on desktop. And you should have been able to download this with all the other files. DSO324JASP. I'm going to double click on that to open it. And now it's going to open up a new window. And you see I was working with the same data set, but I did a lot more analyses. I've got these graphs. I have correlation scatter plots. Come down here. I did a linear regression. And we just click on that and you can see the commands that produced it as well as the options. Didn't do anything special for that, but I did do some confidence intervals and specified that. And it's really a great way to work with all this. I'll click back in an empty area and you can see the commands go away. And so I've got my output here in JASP. When I saved it, though, I had the option of saving it to OSF. In fact, if you go to this web page, osf.io slash 3T2JG, you'll actually be able to go to a page where you can see and download the analysis that I conducted. Let's take a look. This is that page. There's the address I just barely gave you. And what you see here is the same analysis that I conducted. It's all right here. So if you're collaborating with people or if you want to show things to people, this is a wonderful way to do it. Everything's right there. Now this is a static image. But up at the top, people have the option of downloading the original file and working with it on their own. So in case you can't tell, I'm really enthusiastic about JASP and about its potential, still in beta, still growing rapidly. 
I see it really as an open source, free, and collaborative replacement for SPSS. And I think it's going to make data science work so much easier for so many people. I strongly recommend you give JASP a close look.
Let's finish up our discussion of coding and data science, the applications part of it, by just briefly looking at some other software choices. I'll have to admit, it gets kind of overwhelming because there are just so many choices. Now, this is in addition to the spreadsheets and Tableau and SPSS and JASP that we've already talked about. I mean, there's so much more than that. I'm going to give you a range of things that I'm aware of. I'm sure I've left out some important ones or things that other people like really well, but these are some common choices and some less common but interesting ones. Number one, in terms of things that I did not mention, is SAS. SAS is an extremely common analytical program, very powerful, used for a lot of things. It's actually the first program that I learned. On the other hand, it can be kind of hard to use and it can be expensive. But there are a couple of interesting alternatives. SAS also has something called the SAS University Edition. If you're a student, this is free. It's slightly reduced in what it does, and it runs in a virtual machine, which makes it an enormous download. But it's free, and it's a good way to learn SAS if that's something you want to do. SAS also makes a program that I would really love were it not so extraordinarily expensive, and that is called JMP. It's visualization software. Think a little bit of Tableau and how we saw you work with it visually. In this one you can drag things around. It's a really wonderful program; I personally find it prohibitively expensive. Another very common choice among working analysts is Stata. And some people use Minitab. Now, for mathematical people, there's MATLAB. And then of course there's Mathematica itself. 
But that's really more a language than a program. On the other hand, Wolfram, the company that makes Mathematica, is also the one that gives us Wolfram Alpha. Most people don't think of this as a stats application because you can run it on your iPhone. But Wolfram Alpha is, in fact, incredibly capable. Especially if you pay for the pro account, you can do amazing things in it, including analyses, regression models, and visualizations. It's worth taking a closer look at it, also because it actually provides a lot of the data that you need. So Wolfram Alpha is an interesting one. Next, there are several applications that are more specifically geared towards data mining. You don't want to do your regular, you know, little t-tests and stuff on these. There's RapidMiner, there's KNIME, and there's Orange. Those are all really nice to use because they are visual environments where you drag nodes onto a screen, connect them with lines, and see how things run through. All three of them are free or have free versions, and all three of them work in pretty similar manners. There's also BigML, which is for machine learning. This one is unusual because it's browser based; it runs on their servers. There's a free version, although you can't download a whole lot. It doesn't cost a lot to use BigML, and it's actually a very friendly, very accessible program. Then, in terms of programs you can actually install for free on your own computer, there's one called SOFA Statistics. That means Statistics Open For All. It's kind of a cheesy title, but it's a good program. And then one with a web page straight out of 1990 is PAST 3. This is paleontological software. On the other hand, it does do very general stuff. It runs on many platforms, it's really powerful, and it's free, but it is relatively unknown. And then, speaking of relatively unknown, one that's near and dear to my heart is a web application called StatCrunch. It costs money, but it costs like six or 12 bucks a year. 
It's really cheap and it's very good, especially for basic statistics and for learning. I used it in some of the classes that I was teaching. And then, if you're deeply wedded to Excel and you just can't stand to leave that environment, you can purchase add-ins like XLSTAT, which give you a lot of statistical functions within the Excel environment itself. That's a lot of choices. And the most important thing here is don't get overwhelmed. There are a lot of choices, but you don't have to try all of them. Really, the important question is what works best for you and the projects that you're working on. There are a few things you might want to consider in that regard. First off is functionality. Does it actually do what you want? Does it even run on your machine? You don't need everything that a program can do. I mean, think about all the stuff that Excel can do. People probably use 1 to 5% of what's available. Then there's also ease of use. Some of these programs are a lot easier to use than the others, and I personally find that I like the ones that are easy to use. You might say, no, I need to program because I need to do custom stuff. But I'm willing to bet that 95% of what people do does not require anything custom. Also, there's the existence of a community. Constantly, when you're working, you come across problems you don't know how to solve. Being able to simply get online and do a search for an answer, and having enough of a community that there are people who have put answers up and discussed these things, is wonderful. Some of these programs have very substantial communities. For some of them it's practically non-existent. You get to decide how important that is to you. And then finally, of course, there's the issue of cost. Many of these programs I mentioned are free. Some of them are very cheap. Some of them run on sort of a freemium model, and some of them are outrageously expensive. 
So you don't buy them unless somebody else is paying for it. These are some of the things that you want to keep in mind when you're trying to look at various programs. Also, let's mention this: don't forget the 80/20 rule. You're going to be able to do most of the stuff that you need to do with only a small number of tools. One or two, maybe three, will probably be all that you ever need. So you don't need to explore the range of every possible tool. Find something that does what you need, find something you're comfortable with, and really try to extract as much value as you can out of that. So, in sum, from our discussion of available applications for coding and data science: first, remember that applications are tools. They don't drive you; you use them, and your goals are what drive the choice of your applications and the way that you use them. And the single most important thing to remember is that what works well for somebody else may not work well for you. If you're not comfortable with it, or if it doesn't fit the questions you address, then it's more important to think about what works for you and the projects that you're working on as you make your own choices of tools for working in data science.
When you're coding in data science, one of the most important things you can do is be able to work with web data. And if you work with web data, you're going to be working with HTML. Now, in case you're not familiar with it, HTML is what makes the World Wide Web go round. What it stands for is HyperText Markup Language. And if you've never dealt with web pages before, here's a little secret: web pages are just text. A web page is just a text document, but it uses tags to define the structure of the document. A web browser knows what those tags are and displays them in the right way. So for instance, some of the tags look like this. They're in angle brackets: you have an angle bracket and then a beginning tag, so <body>, then you have the body, the main part of your text. 
And then you have angle brackets with a forward slash, </body>, to let the computer know that you're done with that part. You also have <p> and </p> for paragraphs. <h1> is for a level-one heading, and you put the heading text in between the opening and closing tags. <td> is for table data, or the cell in a table, and you mark it off the same way. If you want to see what it looks like, just go to this document, DS03_3_1_HTML.txt. I'm going to go to that one right now. Now, depending on which text editor you open this up in, it may actually give you the web preview. I've opened it up in TextMate, so it's actually showing the text the way I typed it. I typed this manually; I just typed it all in there. I have <html> to say what the document is. I have an empty head section, but that sort of needs to be there. Here I say what the body is. Then I have some text, and <li> for list items. I have headings. This is for a link to a web page. Then I have a small table. And if you want to see what this looks like when it's actually displayed as a web page, just go up here to Window and Show Web Preview. This is the same document, but now it's in a browser. And that's how you make a web page. Now, I know this is very fundamental stuff. But the reason it's important is that if you're going to be extracting data from the web, you have to understand how that information is encoded on the web, and it's going to be in HTML most of the time for a regular web page. I will mention that there's another thing called CSS, which stands for Cascading Style Sheets, and web pages use CSS to define the appearance of a document. HTML is theoretically there to give the content, and CSS gives the appearance. I'm not going to worry about that right now because we're really interested in the content. And now you have the key to being able to read web pages and pull data from web pages for your data science projects. So, in sum: first, the web runs on HTML; that's what makes the web pages that are there. 
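As a rough illustration of why the tags matter when you pull data from a page, here is a minimal Python sketch that uses the standard library's html.parser module. The page text, the CellCollector class name, and the sales figures are all made up for this example; they are not taken from the course file. The sketch walks through the tags and keeps only the text found inside td (table data) cells, grouped by table row.

```python
from html.parser import HTMLParser

# A made-up page for illustration only (not the course's example file).
PAGE = """<html>
<head></head>
<body>
<h1>Ballpark Sales</h1>
<p>Two items sold this week.</p>
<ul>
<li>Hot dogs</li>
<li>Peanuts</li>
</ul>
<a href="https://example.com">A link</a>
<table>
<tr><td>Hot dogs</td><td>120</td></tr>
<tr><td>Peanuts</td><td>85</td></tr>
</table>
</body>
</html>"""


class CellCollector(HTMLParser):
    """Hypothetical helper: collect the text of every <td>, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one inner list per table row
        self.in_cell = False  # True while we are between <td> and </td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # a new table row begins
        elif tag == "td":
            self.in_cell = True
            self.rows[-1].append("")  # a new, still empty, cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False      # the closing </td> ends the cell

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data  # keep only text that sits inside a cell


parser = CellCollector()
parser.feed(PAGE)
parser.close()
rows = parser.rows
print(rows)
```

The opening tag switches collection on and the closing tag (the one with the forward slash) switches it off, which is the same begin and end structure described above. Real pages are messier than this one, so treat it as a starting point rather than a general scraper.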
HTML defines the page structure and the content that's in the page. And you need to learn how to navigate the tags and the structure in order to get data from web pages for your data science projects.
The next step in coding and data science when you're working with web data is to understand a little bit about XML. I like to think of this as the part of web data that follows the imperative "Data, define thyself." XML stands for eXtensible Markup Language, and what XML is, is semi-structured data. What that means is that tags define the data, so a computer knows what a particular piece of information is. But unlike HTML, the tags are free to be defined any way you want. So you have this enormous flexibility in there, but you're still able to specify things so the computer can read them. Now, there are a few places where you're going to see XML files. Number one is in web data. HTML defines the structure of a web page, but if they're feeding data into it, then that will often come in the form of an XML file. Interestingly, with Microsoft Office files, if you have .docx or .xlsx, the x part at the end stands for a version of XML that's used to create these documents. If you use iTunes, the library information that has all of your artists and your genres and your ratings and stuff is all stored in an XML file. And then finally, data files that go with particular programs can often be saved as XML as a way of representing the structure of the data to the program. XML tags use opening and closing angle brackets just like HTML did. Again, the major difference is that you're free to define the tags however you want. So for instance, thinking about iTunes, you can define a tag that's genre: you have the angle brackets and genre, <genre>, to begin that information, and then you have the angle brackets with the forward slash, </genre>, to let it know you're done with that piece of information. Or you can do it for composer, or you can do it for rating, or you can do it for comments. 
And you can create any tags you want, and you put the information in between the opening and closing tags. Now let's take an example of how this works. I'm going to show you a quick data set that comes from the web. It's at ergast.com/api. This is a website that stores information about Formula One automobile racing. Let's go to this web page and take a quick look at what it's like. So here we are at ergast.com, and it's the API for Formula One. What I'm bringing up is the results of the 1957 season in Formula One racing. Here you can see who the competitors were in each race, how they finished, and so on. So this is a data set that's being displayed in a web page. If you want to see what it looks like in XML, all you have to do is add .xml onto the end of the address. I've done that already, so I'm just going to go to that one. You see it's only this little bit that I've added, .xml. Now, it looks exactly the same, because the web page is structuring XML data by default. But if you want to see what it looks like in its raw format, just right-click on the web page and go to View Page Source. At least that's how it works in Chrome. And this is the structured XML page. You can see we've got tags here. It says race name, circuit name, location. Obviously these are not standard HTML tags; they're defined for the purposes of this particular data set. But we begin with one, we have circuit name right there, and then we close it using the forward slash right there. And so this is structured data. The computer knows how to read it, which is exactly why it can display it this way by default. It's a really good way of displaying data, and it's a good way to know how to pull data from the web. You can actually use what's called an API, an application programming interface, to access this XML data. It pulls the data in along with its structure, which makes working with it really easy. 
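To make the idea of self-defined tags a little more concrete, here is a small Python sketch that reads XML with the standard library's xml.etree.ElementTree module. The XML text below is invented for illustration: the tag names only loosely imitate the race name and circuit name tags mentioned above, and the values are placeholders rather than real results. A real feed may also use XML namespaces, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Invented sample data; tag names and values are placeholders, not real results.
XML_TEXT = """<RaceTable season="1957">
  <Race round="1">
    <RaceName>Example Grand Prix A</RaceName>
    <CircuitName>Example Circuit A</CircuitName>
  </Race>
  <Race round="2">
    <RaceName>Example Grand Prix B</RaceName>
    <CircuitName>Example Circuit B</CircuitName>
  </Race>
</RaceTable>"""

root = ET.fromstring(XML_TEXT)  # parse the text into a tree of elements

races = []
for race in root.findall("Race"):  # every <Race> directly under the root
    races.append({
        "round": race.get("round"),               # metadata stored in the tag
        "name": race.findtext("RaceName"),        # text between the tags
        "circuit": race.findtext("CircuitName"),
    })

for entry in races:
    print(entry["round"], entry["name"], entry["circuit"])
```

Because each piece of information is labeled by its own tag, the program can ask for it by name instead of guessing at its position, which is what makes structured data like this easy to work with.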
What's even more interesting is how easy it is to take XML data and convert it between different formats, because it's structured and the computer knows what you're dealing with. Example one: it's really easy to convert XML to CSV, or comma-separated values files, which is a spreadsheet format, because it knows exactly what the headings are and what piece of information goes in each column. Example two: it's really easy to convert HTML documents to XML, because you can think of HTML, with its restricted set of tags, as sort of a subset of the much freer XML. And three: you can convert CSV, your spreadsheet comma-separated values, to XML and vice versa. You can bounce them all back and forth because the structure is made clear to the programs that you're working with. So, in sum, here's what we can say. Number one, XML is semi-structured data. What that means is it has tags to tell the computer what each piece of information is, but you can make the tags whatever you want them to be. Second, XML is very common for web data. And third, it's really easy to translate between the formats XML, HTML, CSV, and so on, which gives you a lot of flexibility in manipulating data so you can get it into the format you need for your own analysis.
The last thing I want to mention about coding and data science and web data is something called JSON. I like to think of it as a version of "smaller is better." Now, what JSON stands for is JavaScript Object Notation, with JavaScript written as one word. Like XML, JSON is semi-structured data. That is, you have tags that define the data, so the computer knows what each piece of information is. And like XML, the tags can vary freely. So there's a lot in common between XML and JSON. XML is a markup language; that's what the ML stands for. And that gives meaning to the text; it lets the computer know what each piece of information is. 
Also, XML allows you to make comments in the document. And it allows you to put metadata in the tags; you can actually put some information there in the angle brackets to provide additional context. JSON, on the other hand, is specifically designed for data interchange, so it's got that special focus. And the structure of JSON corresponds with data structures. You know, it directly represents objects and arrays and numbers and strings and booleans, and that works really well with the programs that are used to analyze data. Also, JSON is typically shorter than XML because it does not require the closing tags. Now, there are ways to shorten XML too, but that's not typically how it's done. As a result of these differences, JSON is basically taking XML's place in web data. XML still exists and is still used for a lot of things, but JSON is slowly replacing it. We'll take a look at the comparison between the three by going back to the example we used for XML. This is data about Formula One car races in 1957 from ergast.com. You can just go to the first web page here, and then we'll navigate to the others from that. So this is the general page. This is what you get if you just type in the address without the .xml or .json or anything. It's a table of information about races in 1957. And we saw earlier that if you add just .xml to the end of this, it looks exactly the same. That's because this browser is displaying XML properly by default. But if you were to right-click on it and go to View Page Source, you would get this instead. And you can see the structure. This is still XML, so everything has an opening tag and a closing tag and some extra information in there. But if you type in .json, what you really get is this jumbled mess. Now, that's unfortunate, because there is a lot of structure to this. So what I'm going to do is copy all of this data. Then I'm going to go to a little web page. There are a lot of things you can do here. 
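The point about JSON lining up with a program's own data structures, and the job of tidying up a jumbled block of JSON, can both be sketched in a few lines of Python with the standard library's json module. The JSON text here is a made-up, shortened stand-in for the kind of thing a data service might return; the key names and values are assumptions for illustration only.

```python
import json

# Made-up, compact JSON text; key names and values are placeholders.
RAW = ('{"series":"f1","season":1957,"finished":true,'
       '"races":[{"round":1,"raceName":"Example Grand Prix A"},'
       '{"round":2,"raceName":"Example Grand Prix B"}]}')

data = json.loads(RAW)  # text -> Python dict, list, str, int, bool

# Each JSON type arrives as the matching Python type.
print(type(data).__name__)              # the outer object becomes a dict
print(type(data["races"]).__name__)     # the array becomes a list
print(type(data["season"]).__name__)    # the number becomes an int
print(type(data["finished"]).__name__)  # true becomes a bool

# Re-serialize with indentation so the hierarchy is easy to read.
pretty = json.dumps(data, indent=2)
print(pretty)

# Once loaded, the data can be used like any other Python structure.
race_names = [race["raceName"] for race in data["races"]]
print(race_names)
```

Nothing about the data changes between the compact and the indented versions; only the whitespace does, which is why the same text can look like a jumble in one view and a clear hierarchy in another.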
And the key phrase is "JSON pretty print". That means make it look structured, so it's easier to read. I just paste the data in there and hit "pretty print JSON". And now you can see the hierarchical structure of the data. The interesting thing is that JSON only has tags at the beginning. So it says "series" in quotes and then a colon. And then it gives the piece of information in quotes and a comma and it moves on to the next one. And this is a lot more similar to the way that data would be represented in something like R or Python. And so it's also more compact. Again, there are things you can do with XML. But this is one of the reasons that JSON is generally becoming preferred as a data carrier for websites. And as you might have guessed, it's really easy to convert between the formats. It's easy to convert between XML, JSON, CSV, etc. And so you can get a web page where you just paste one version in and you get the other version out. There are some differences, but for the vast majority of situations, they're just kind of interchangeable. So in sum, what do we get from this? Like XML, JSON is semi-structured data, where there are tags that say what the information is, but you can define the tags however you want. JSON is specifically designed for data interchange. And because it reflects the structure of the data in the programs, that makes it really easy to work with. And then also, because it's relatively compact, JSON is gradually replacing XML on the web as a container for data on web pages. If we're going to talk about coding and data science and the languages that are used, first and foremost is R. The reason for that is that, according to many standards, R is the language of data and data science. For example, take a look at this chart. This is a ranking, based on a survey of data mining experts, of the software that they use in doing their work. And R is right there at the top, R is first.
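Before going further with R, the JSON steps described above (pretty-printing a compact JSON string and converting its records to CSV) can also be done locally instead of on a web page. What follows is a minimal sketch using only Python's standard library. The record below is made up for illustration; the field names and values are assumptions and are not the actual race data from the talk.

```python
import csv
import io
import json

# Hypothetical compact JSON, standing in for the "jumbled mess" a browser shows.
raw = '{"series":"f1","season":"1957","races":[{"round":"1","name":"Example Grand Prix A"},{"round":"2","name":"Example Grand Prix B"}]}'

# Parse the text into Python objects (dicts, lists, strings).
data = json.loads(raw)

# Pretty-print: indentation makes the hierarchical structure visible.
pretty = json.dumps(data, indent=2)
print(pretty)

# Convert the list of race records to CSV, one row per race.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["round", "name"], lineterminator="\n")
writer.writeheader()
writer.writerows(data["races"])
csv_text = buffer.getvalue()
print(csv_text)
```

The parsed result is an ordinary dictionary containing a list, which is the sense in which JSON "corresponds with data structures" in the analysis program.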
And in fact, that's important because there's Python, which is usually taken hand in hand with R for data science. But R sees 50% more use than Python does, at least on this particular list. Now, there are a few reasons for that popularity. Number one, R is free and it's open source, both of which make things very easy. Second, R is specially developed for vector operations. That means it's able to go through an entire list of data without your having to write for loops. If you've ever had to write for loops, you know it would be kind of disastrous to have to do that for data analysis. Next, R has a fabulous community behind it. It's very easy to get help with R; if you Google a question, you're going to end up in a place where you can find good examples of what you need. And probably most importantly, R is very capable on its own, but there are 7,000 packages, actually many more than that, that add capabilities to R. Essentially, it can do anything. Now, when you're working with R, you actually have a choice of interfaces. That is, how do you actually do the coding and how do you get your results? R comes with its own IDE, or integrated development environment. You can use that. Or if you're on a Mac or Linux, you can actually run R in the terminal through the command line. If you've installed R, you just type R and it starts up. There's also a very popular development environment called RStudio. That's actually the one that I use, and I'll be using it for all my examples. Another, newer competitor is Jupyter, which is very commonly used for Python and is what I use for examples there. It works in a browser window even though it's locally installed. With RStudio and Jupyter, there are pluses and minuses to each one of them. I'll mention them as we get to them. But no matter what interface you use, R is a command line language; you are typing lines of code in order to give the commands. Some people get really scared about that.
But really, there are some advantages to that in terms of the replicability and really the accessibility, the transparency, of your commands. So for instance, here's a short example of some commands in R. You can enter them into what's called the console, one line at a time; that's called working interactively. Or you can save scripts and run bits and pieces of them selectively. That makes your life a lot easier. No matter how you do it, if you're familiar with programming in other languages, you may find that R is a little weird. It has an idiosyncratic model. It makes sense once you get used to it, but it is a different approach. And so it does take some adaptation if you're accustomed to programming in other languages. Now, once you do your programming, what you're going to get as output is graphs in a separate window, and text and numerical output in the console. And no matter what you get, you can save the output to files. That makes it portable; you can use it in other environments. But most importantly, I like to think of this as our box of chocolates, where you never know what you're going to get. The beauty of R is in the packages that are available to expand its capabilities. Now, there are two sources of packages for R. One goes by the name of CRAN, and that stands for the Comprehensive R Archive Network. That's at cran.rstudio.com. And what that does is it takes the 7,000 or so different packages that are available, and it organizes them into topics that they call task views. And for each package, if the authors have done their homework, you have data sets that come along with that package, and you have a manual in PDF format. And you can even have vignettes that run through examples of how to use it. Another interface is called Crantastic!, and the exclamation point is part of the title. That's at crantastic.org. And what this is is an alternative interface that links to CRAN.
So if you find something you like in Crantastic! and you click on the link, it's going to open in CRAN. But the nice thing about Crantastic! is that it shows the popularity of packages, and it also shows how recently they were updated. And that can be a nice way of knowing that you're getting sort of the latest and greatest. Now, from this very abstract presentation, we can say a few things about R. Number one, according to many, R is the language of data science. And it's a command line interface; you're typing lines of code. So that gives it a strength, and it is also a challenge for some people. But the beautiful thing is the thousands and thousands of packages of additional code and capability that are available for R, which make it possible to do nearly anything in this statistical programming language. When talking about coding in data science and the languages, along with R, we need to talk about Python. Now, Python, the snake, is a general-purpose programming language that can do it all. And that's its beauty. If we go back to this survey of the software used by data mining experts, you see that Python's there. It's number three on the list. And what's significant about that is that on this list, Python is the only general-purpose programming language; it's the only one that can theoretically be used to develop any kind of application that you want. That gives it some special power compared to all these others, most of which are very specific to data science work. So the nice things about Python are, number one, it's general purpose. It's also really easy to use. And if you have a Macintosh or a Linux computer, Python is built into it. Also, Python has a fabulous community around it, with hundreds of thousands of people involved. And Python has thousands of packages. Now, it actually has something like 70,000 or 80,000 packages, but in terms of ones that are specific to data, there are still thousands available. They give it some incredible capabilities.
Now, a couple of things to know about Python. First is about versions. There are two versions of Python that are in wide circulation. There's 2.x, so that means like 2.5 and 2.6. And there's 3.x, like 3.1 and 3.2. Version two and version three are similar, but they're not identical. And in fact, the problem is this: there are some compatibility issues where code that runs in one does not run in the other. Consequently, most people have to choose one or the other. And what this leads to is that many people still use 2.x. I have to admit that in the examples that I use, I'm using 2.x, because so many of the data science packages are developed with that in mind. Now, let me say a few things about interfaces for Python. First, Python does come with its own Integrated Development and Learning Environment. They call it IDLE. You can also run it from the terminal or command line interface, or from any IDE that you have. Now, a very common and very good choice is Jupyter. Jupyter is a browser-based framework for programming. It was originally called IPython, and that served as its initial version. So a lot of times when people talk about IPython, what they're really talking about is Python in Jupyter, and the two names are sometimes used interchangeably. One of the neat things is that there are two companies, Continuum and Enthought, both of which have made special distributions of Python with hundreds and hundreds of packages preconfigured to make it very easy to work with data. I personally prefer Continuum's Anaconda; it's the one that I use, and a lot of other people use it, but either one is going to work and it's going to get you up and running. And like I said with R, no matter what interface you use, all of them are command line; you're typing lines of code. Again, there are some tremendous strengths to that, but it can be intimidating to some people at first. In terms of the actual commands of Python, you have some examples here on the side.
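Since the on-screen examples are not reproduced in this transcript, here is a stand-in: a minimal sketch of the kind of short commands one might type at a Python prompt. It is not the speaker's example. The numbers are made up, and the code is written for Python 3 rather than the 2.x version mentioned above.

```python
# A few basic Python commands, typed one line at a time or saved as a script.
# The scores are invented purely for illustration.
scores = [72, 85, 90, 66, 78]

total = sum(scores)                 # add up the list
mean = total / len(scores)          # arithmetic mean
above_mean = [s for s in scores if s > mean]  # keep only values above the mean

print("n =", len(scores))
print("mean =", mean)
print("above mean:", above_mean)
```

Every step is plain text, which is what makes the commands easy to save, rerun, and share.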
The important thing to remember is that it's a text interface. On the other hand, Python is familiar to millions of coders because it's very often the first language that people learn for general-purpose programming. And there are a lot of very simple adaptations for data that make it very powerful for data science work. So let me say something else again: data science loves Jupyter. Jupyter is the browser-based framework. It's a local installation, but you access it through a web browser, and it makes it possible to really do some excellent work in data science. There are a few reasons for this. When you're working in Jupyter, you get text output, and you can use what's called Markdown as a way of formatting documents. You can get inline graphics, where the graphics just show up directly beneath the code that produced them. Also, it's really easy to organize, present and share analyses that are done in Jupyter, which makes it a strong contender among your choices for how you do data science programming. Another one of the beautiful things about Python, like R, is that there are thousands of packages available. In Python, there's one main repository. It goes by the name PyPI, which is for the Python Package Index. Right here, it says there are over 80,000 packages, and seven or eight thousand of those are for data-specific purposes. Now, some of the packages that you'll get to be very familiar with are NumPy and SciPy, which are for scientific computing in general. Matplotlib, and a development of it called Seaborn, are for data visualization and graphics. Pandas is the main package for doing statistical analysis. And for machine learning, almost nothing beats scikit-learn. And when I go through hands-on examples in Python, I will be using all of these as a way of demonstrating the power of the program for working with data. So in sum, we can say a few things. Number one, Python is a very popular language, familiar to millions of people, and that makes it a good choice.
Second, of all the languages we use for data science on a frequent basis, this is the only one that's general purpose, which means it can be used for a lot of things other than processing data. And it gets its power, like R does, from having thousands of contributed packages, which greatly expand its capabilities, especially in terms of doing data science work. As a choice for coding in data science, one of the languages that may not come immediately to mind when people think of data science is SQL, or "sequel". SQL is the language of databases. And we might think, why do we want to work in SQL? Well, to paraphrase the famous bank robber Willie Sutton, who, when asked why he robbed banks, apparently said, "Because that's where the money is": the reason we would work with SQL in data science is because that's where the data is. And so let's take another look at our ranking of software among data mining professionals. And there, SQL is third on the list. Also, on this list, it's the first database tool. Other tools in there, for instance, get much fancier, and they're much newer and shinier, but SQL has been around for a while and is very, very capable. Now, there are a few things to know about SQL. By the way, you'll notice I'm saying "sequel", even though it's spelled SQL, which stands for Structured Query Language. SQL is a language; it's not an application. There's not a program called SQL. It's a language that can be used in different applications. Primarily, SQL is designed for what are called relational databases. Those are special ways of storing structured data so that you can pull it in, you can put things together, you can join them in special ways, and you can get some summary statistics. And then what you usually do is export that data into your analytical application of choice. So the big term here is RDBMS. That stands for relational database management system. And that's where you will usually see SQL being used as a query language.
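To give a sense of the pull-join-summarize workflow just described, here is a minimal sketch. It uses SQLite through Python's built-in sqlite3 module, chosen only because it runs without a server; it is not one of the systems named in this talk. The table names, column names and values are all hypothetical.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two connected tables: each customer is stored once, orders refer to them by id.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 20.0), (2, 1, 35.0), (3, 2, 10.0)],
)

# Join the tables and compute summary statistics per customer.
query = """
SELECT c.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY total DESC
"""
rows = cur.execute(query).fetchall()
conn.close()

# The rows are now ordinary Python tuples, ready for further analysis elsewhere.
print(rows)
```

The final step mirrors the usual pattern: the database does the selecting and organizing, and the result is handed to another program for analysis.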
In terms of relational database management systems, there are a few very common choices. In the industrial world, where people have some money to spend, Oracle Database is a very common one, and so is Microsoft SQL Server. In the open source world, two very common choices are MySQL (even though we generally say "sequel", here you generally say "my S-Q-L") and PostgreSQL. These are both free, open source implementations of the language. They're sort of dialects of it that make it possible for you to work with databases and get your data out. The neat thing about them, no matter which one you use, is that databases minimize data redundancy by using connected tables. Each table has rows and columns, and the tables store different levels of abstraction or measurement, which means you only have to put a piece of information in one place, and then lots of other tables can refer to it. That makes it very easy to keep things organized and up to date. When you're looking at a way of working with a relational database management system, you get to choose, in part, between using a graphical user interface, or GUI, and a text-based interface. Among the GUIs, SQL Developer and SQL Server Management Studio are two very common choices. And there are a lot of other ones, like Toad, that are graphical interfaces for working with these databases. And there are also text-based interfaces. Really, any command line interface and any interactive development environment or programming tool is going to be able to do that. Now, you can think of yourself as being on the command deck of your ship and learn a few basic commands that are very important for working with SQL. There's just a handful of commands that can get you most of where you need to go. There is the SELECT command, where you're choosing the cases that you want to include; FROM says which tables you're going to be extracting them from.
WHERE is a way of specifying conditions, and then ORDER BY, obviously, is just a way of putting the results in order. This works because usually when you're in a SQL database, you're just pulling out the information. You want to select it, you want to organize it. And then what you're going to do is send the data to your program of choice for further analysis, like R or Python or whatever. So in sum, here's what we can say about SQL. Number one, as a language, it's generally associated with relational databases, which are very efficient and well-structured ways of storing data. Just a handful of basic commands can be extremely useful when working with databases. You don't have to be a super ninja expert. Really, a handful, five or ten commands, will probably get you everything you need out of a SQL database. And then once you get the data organized, it's typically exported to some other program for analysis. When you talk about coding in any field, one of the groups of languages that comes up most often is C, C++ and Java. Now, these are extremely powerful languages and very frequently used for professional, production-level coding. In data science, the place where you're going to see these languages most often is in the bedrock, the absolute fundamental layer that makes the rest of data science possible. So for instance, take C and C++. C is from the early 1970s, and C++ is from the 80s. They have extraordinarily wide usage. And their major advantage is that they're really, really fast. In fact, C is usually used as the benchmark for how fast a language is. They're also very, very stable, which makes them really well suited to production-level code and, for instance, server use. What's really neat is that in certain situations, if time is really important, if speed's important, then you can actually use C code in R or other statistical languages. Next is Java. Java is based on C++.
Its major contribution was WORA, or "write once, run anywhere": the idea that you are going to be able to develop code that is portable to different machines and different environments. And because of that, Java is actually the most popular computer programming language overall, across all tech situations. And the place where you would use these in data science is, like I said, when time is of the essence, when something has to be fast, it has to get the job accomplished quickly, and it has to not break. Then these are the ones that you're probably going to use. The people who are going to use them are primarily going to be engineers: the engineers and the software developers who deal with the inner workings of the algorithms in data science, or the back end of data science, the servers and the mainframes and the entire structure that makes analysis possible. As for analysts, people who are actually analyzing the data typically don't do hands-on work with the foundational elements. They don't usually touch C or C++; more of their work is on the front end, closer to the high-level languages like R or Python. In sum, C, C++ and Java form a foundational bedrock and the back end of data and data science. And they do this because they're very fast and they are very reliable. On the other hand, given their nature, that work is typically reserved for the engineers who are working with the equipment that runs in the back and makes the rest of the analysis possible. I want to finish our extremely brief discussion of coding in data science and the languages that can be used by mentioning one other, called bash. Bash really is a great example of old tools that have survived and are still being used actively and productively with new data. You can think of it this way: it's almost like typing on your typewriter. You're working at the command line; you're typing out code through a command line interface, or CLI.
Now, this method of interacting with computers practically goes back to the typewriter phase, because it predates monitors. Before you even had a monitor, you would type in the code and the computer would print the result out on a piece of paper. And the important thing to know about the command line is that it's simply a method of interacting. It's not a language, because lots of different languages can run at the command line. So for instance, it's important to talk about the concept of a shell. Now, in computer science, a shell is a language, or something that wraps around the computer like a shell, that serves as the interaction level for the user to get things done at the lower levels that aren't really human friendly. On Mac computers and Linux, the most common is bash, which is short for Bourne-again shell. On Windows computers, the most common version is PowerShell. But whatever you do, there actually are a lot of choices. There's the Bourne shell; there's the C shell, which is why I have a seashell right here; there's fish, for friendly interactive shell; and a whole bunch of other choices. But bash is the most common on Mac and Linux, and PowerShell is the most common on Windows, as a method of interacting with the computer at the command line level. Now, there are a few things you need to know about this. First, you have a prompt of some kind; in bash, it's a dollar sign. That just means type your command here. Then the other thing is that you type one line at a time. It's actually amazing how much you can get done with what's called a one-liner program, by sort of piping things together so that one command feeds into the next. You can run more complex commands if you use a script. That is, you call a text document that has a bunch of commands in it, and you can get much more elaborate analysis done. Now, we have our tools here. In bash, we talk about utilities. And what these are are specific programs that accomplish specific tasks.
Bash really thrives on "do one thing and do it very well". There are two general categories of utilities for bash. Number one is the built-ins. These are the ones that come installed with it, so you're able to use them at any time by simply calling them by name. Some of the most common ones are cat, which is for catenate, and that's to put information together. There's awk, which is its own interpreted programming language, but it's often used for text processing from the command line. By the way, the name awk comes from the initials of the people who created it. Then there's grep, which is for "global search with a regular expression and print". It's a way of searching for information. And then there's sed, which stands for stream editor, and its main use is to transform text. You can do an enormous amount with just these four utilities. A few more are head and tail, which display the first or last 10 lines of a document; sort and uniq, which sort and count the number of unique answers in a document; wc, which is for word count; and printf, which formats the output that you get in your console. And while you can get a huge amount of work done with just this small number of built-in utilities, there is also a wide range of installables, or other command line utilities that you can add to bash or to whatever shell you're using. So for instance, some really good ones that have been recently developed are jq, which is for pulling in JSON, or JavaScript Object Notation, data from the web. And then there's json2csv, which is a way of converting JSON to CSV format, which is what a lot of statistical programs are going to be happier with. There's Rio, which allows you to run a wide range of commands from the statistical programming language R at the command line as part of bash. And then there's BigMLer. This is a command line tool that allows you to access BigML's machine learning servers through the command line.
Normally, you do it through a web browser and it accesses their servers remotely. It's an amazingly useful program, and to be able to just pull it up when you're at the command line is an enormous benefit. What's interesting is that even though you already have all these utilities and can do amazing things with them, there is still active development of utilities for the command line. So let's say this in sum: despite being in one sense as old as the dinosaurs, the command line survives because it is extremely well evolved and well suited to its purpose of working with data. The utilities, both the built-ins and the installables, are fast and they are easy. Generally, they do one thing and they do it very, very well. And then, surprisingly, there is an enormous amount of very active development of command line utilities for these purposes, especially for data science. One critical task when you're coding in data science is to be able to find the things that you're looking for. And regex, which is short for regular expressions, is a wonderful way to do that. You can think of it as the supercharged method for finding needles in haystacks. Now, regex tends to look a little cryptic. So for instance, here's an example of something that's designed to determine whether something is a valid email address. It specifies what can go at the beginning, you have the at sign in the middle, then you've got a certain number of numbers and letters, and then you have to have a dot-something at the end. And so this is a special kind of code for indicating what can go where. Now, regular expressions, or regex, are really a form of pattern matching in text. It's a way of specifying exactly what needs to be where, what can vary, and how much it can vary. And you can write both very specific patterns, say, "I only want a one-letter variation here", and very general ones, like the email validator that I showed you.
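The email pattern shown on screen is not reproduced in this transcript, so the following is only a sketch of the idea: something before the at sign, letters and numbers after it, and a dot-something at the end. The pattern is a deliberately simplified assumption, not the speaker's expression and not a complete validator, and the sample strings are made up.

```python
import re

# Simplified, illustrative pattern (an assumption, not the one from the talk):
#   ^[A-Za-z0-9._%+-]+   one or more allowed characters at the beginning
#   @                    the at sign in the middle
#   [A-Za-z0-9.-]+       letters, numbers, dots or hyphens for the domain
#   \.[A-Za-z]{2,}$      an escaped literal dot, then at least two letters at the end
EMAIL = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def looks_like_email(text):
    """Return True if the whole string fits the simplified email pattern."""
    return EMAIL.match(text) is not None

candidates = ["ana@example.com", "no-at-sign.example.com", "ben@example"]
results = {c: looks_like_email(c) for c in candidates}

for address, ok in results.items():
    print(address, ok)
```

Real-world email validation is considerably messier than this, which is part of why such patterns look cryptic.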
And the idea here is that you can write this search pattern, your little wildcard thing, and you can find the data. And then once you identify those cases, you can export them into another program for analysis. So here's a short example of how it can work. What I've done is I've taken some text documents; they're actually the texts of Emma and Pygmalion, two books I got off of Project Gutenberg. And this is the command: grep ^l.ve *.txt. So what I'm looking for are lines in either of these books that start with l, then have one character that can be whatever, and then that's followed by ve. And then the *.txt means search all of the text files in that particular folder. And what it found is lines that began with love and lived and lovely, and so on. Now, in terms of the actual nuts and bolts of regular expressions, there are certain elements. There are literals, and those are things that mean exactly what they are. You type the letter L, you're looking for the letter L. There are also metacharacters, which specify, for instance, that something needs to go here; they're characters, but they really act as code that represents something else. There are also escape sequences, which are something that you use to say, well, normally this character is used as a placeholder, but I actually want to look for a literal period. Then you have the entire search expression that you create. And then you have the target string, the thing that it's searching through. So let me give a few very short examples. This is the caret. It's sometimes called a hat, or in French a circumflex. And what that means is you're looking for something that's at the beginning of the text that you're searching. So for example, you can have caret and capital M. That means you need something that begins with a capital M. So, for instance, the word Mac: true, it will find that. But if you have iMac, there's a capital M, but it's not the first thing.
So that'll be false. It won't find that. The dollar sign means you're looking for something that is at the end of the string. So for example, ing and then dollar sign will find the word fling, because it ends with ing. But it won't find the word flings, because it actually ends with an s. And then the dot, the period, simply means we're looking for one character and it can be anything. So for example, you can write a, t, period. That will find data, because it has an at and then one letter after it. But it won't find flat, because flat doesn't have anything after the at. And so these are extremely simple examples of how it can work. Obviously, it gets more complicated. And the real power comes when you start combining these bits and elements. Now, one interesting thing about this is you can actually treat it as a game. I love this website. It's called Regex Golf, and it's at regex.alf.nu. What it does is bring up lists of words in two columns, and your job is to write a regular expression at the top that matches all the words in the left column and none of the words in the right, using the fewest characters possible. You get a score. And it's a great way of learning how to do regular expressions and learning how to search in a way that's going to get you the data that you need for your projects. So in sum, regex, or regular expressions, help you find the right data for your project. They're very powerful and they're very flexible. Now, on the other hand, they are cryptic, at least when you first look at them. But at the same time, it's like a puzzle, and it can be a lot of fun if you practice it and you see how you can find what you need. I want to thank you for joining me in coding and data science. And we'll wrap up this course by talking about some of the specific next steps that you can take for working in data science. The idea here is that you want to get some tools and you want to start working with those tools.
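One low-stakes way to start practicing is to check the anchor and wildcard examples from the regular expressions discussion in code. This is a minimal sketch using Python's re module rather than grep. The word pairs come from the examples above, while the sample lines used for the ^l.ve pattern are made up and are not taken from the two books.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# (pattern, target string) pairs from the examples in the talk.
examples = [
    ("^M", "Mac"),      # caret: must be at the beginning
    ("^M", "iMac"),
    ("ing$", "fling"),  # dollar sign: must be at the end
    ("ing$", "flings"),
    ("at.", "data"),    # dot: exactly one character, any character
    ("at.", "flat"),
]
results = {(p, t): found(p, t) for p, t in examples}
for (p, t), hit in results.items():
    print(p, t, hit)

# The grep pattern ^l.ve applied to a few invented lines of text.
lines = ["love is patient", "lived there once", "alive and well", "leave it"]
matches = [ln for ln in lines if found("^l.ve", ln)]
print(matches)
```

Combining the pieces, as in ^l.ve, is where the patterns start to become useful for pulling out specific cases.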
Now, please keep in mind something that I've said at another time. Data tools and data science are related, and they're important, but don't make the mistake of thinking that knowing the tools is the same thing as actually conducting data science. That's not true. People sometimes get a little enthusiastic and they get a little carried away. What you need to remember is that the relationship really is this: data tools are an important part of data science, but data science itself is much bigger than just the tools. Now, speaking of tools, remember, there are a few kinds that you can use and that you might want to get some experience with. Number one, in terms of apps, or purpose-built applications, Excel and Tableau are really fundamental, both for getting the data from clients and for doing some basic data browsing. And Tableau is really wonderful for interactive data visualization. I strongly recommend that you get very comfortable with both of those. In terms of code, it's a good idea to learn either R or Python, or ideally to learn both, because you can use them hand in hand. In terms of utilities, it's a great idea to learn how to work with bash, the command line utility, and to use regular expressions, or regex. You can actually use regular expressions in lots and lots of programs, so they can have a very wide application. And then finally, data science requires some kind of domain expertise. You're going to need some sort of field experience, or an intimate understanding of a particular domain: the challenges that come up, what constitutes workable answers, and the kind of data that's available. Now, as you go through all of this, you don't need to build this monstrous list of things. Remember, you don't need everything. You don't need every tool, you don't need every function, you don't need every approach. Instead, remember, get what's best for your needs and for your style.
But no matter what you do, remember, tools are tools; they are a means to an end. Instead, you want to focus on the goal of your data science project, whatever it is. And I can tell you, really, the goal is meaning: extracting meaning out of your data to make informed choices. In fact, I'll say a little more: the goal is always meaning. And so with that, I strongly encourage you to get some tools, get started in data science, and start finding meaning in the data that's around you. Welcome to mathematics in data science. I'm Barton Poulson. And we're going to talk about how mathematics matters for data science. Now, you may be saying to yourself, why math? Computers can do it. I don't need to do it. And really, fundamentally, I don't need math. I'm just here to do my work. Well, I'm here to tell you, no, you need math. That is, if you want to be a data scientist, and I assume that you do. So we're going to talk about some of the basic elements of mathematics, really at a conceptual level, and how they apply to data science. There are a few ways that math really matters to data science. Number one, it allows you to know which procedures to use and why, so you can answer your questions in a way that's the most informative and most useful. Two, if you have a good understanding of math, then you know what to do when things don't work right, when you get impossible values or things won't compute. And that makes a huge difference. And then three, an interesting thing is that some mathematical procedures are easier and quicker to do by hand than by actually firing up the computer. And so for all three of these reasons, it's helpful to have at least a grounding in mathematics if you're going to do work in data science. Now, probably the most important thing to start with is algebra. And there are three kinds of algebra that we want to mention. The first is elementary algebra; that's the regular x plus y.
Then there's linear or matrix algebra, which looks more complex, but is conceptually simple and is used by computers to actually do the calculations. And then finally, I'm going to mention systems of linear equations, where you have multiple equations that you're trying to solve simultaneously. Now there's more math than just algebra. There are a few other things that I'm going to cover in this course: a little bit of calculus, a little bit of big O or order, which has to do with the speed or the complexity of operations, a little bit of probability theory, and a little bit of Bayes or Bayes' theorem, which is used for getting posterior probabilities and changes the way that you interpret the results of an analysis. And for the purposes of this course, I'm going to demonstrate the procedures by hand. Of course, you would use software to do this in the real world, but we're dealing with simple problems at conceptual levels. And really, the most important thing to remember is, even though a lot of people get put off by math, really, you can do it. And so in sum, let's say these three things about math. First off, you do need some math to do good data science. It helps you diagnose problems, and it helps you choose the right procedures. And interestingly, you can do a lot of it by hand, or you can use software to do the calculations as well. As we begin our discussion of the role of mathematics in data science, we'll of course begin with the foundational elements. And in data science, nothing is more foundational than elementary algebra. Now, I'd like to begin this with really just a little bit of history. In case you're not aware, the first book on algebra was written in 820 by Muhammad ibn Musa al-Khwarizmi. And it was called the Compendious Book on Calculation by Completion and Balancing. Actually, it was called this, which if you transliterate that comes out to this, but look at this word right here. That's al-jabr, where we get the word algebra, and it means restoration.
In any case, that's where it comes from. And for our concerns, there are several kinds of algebra that we're going to talk about. There's elementary algebra, there's linear algebra, and there are systems of linear equations. We'll talk about each of those in different videos. But to put it into context, let's take an example here of salaries. Now, this is actually based on real data from a survey of the salaries of people employed in data science. And to give a simple version of it, the salary was equal to a constant. That's sort of an average value that everybody started with. And to that you added years, and then you added some measure of bargaining skills and how many hours they worked per week. And that gave you a prediction. But because it wasn't exact, there's also some error to throw into it to get to the precise value that each person has. Now, if you want to abbreviate this, you can write it kind of like this: S = C + Y + B + H + E, although it's more common to write it symbolically like this. And let's go through this equation very quickly. The first thing we have is the outcome. We call that y, the variable y for person i, where i stands for each case in our observations. So here's outcome y for person i. This letter right here is a Greek beta, and it represents the intercept or the average. That's why it has a zero, because we don't multiply it times anything. But right next to it, we have the coefficient for variable one. So that's beta, which means a coefficient, sub one for the first variable. And then x one means variable one, and the i means it's the score on that variable for person i, whoever we're talking about. Then we do the same thing for variables two and three. And then at the end, we have a little epsilon here with an i for the error term for person i, which says how far off the prediction was from their actual score.
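Since the slide itself doesn't come through in a transcript, here is a reconstruction of the symbolic form from the spoken description. The exact layout on the original slide is an assumption; the symbols (y, beta, x, epsilon, and the subscript i) are the ones named in the walkthrough.

```latex
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i
```

In the salary example, x one would be years, x two would be bargaining skills, and x three would be hours per week, with beta zero playing the role of the constant and epsilon the error.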
Now I'm going to run through some of these procedures and we'll see how they can be applied to data science. But for right now, let's just say this in sum. First off, algebra is vital to data science. It allows you to combine multiple scores, get a single outcome, and do a lot of other manipulations. And really, the calculations are easy for one case at a time, especially when you're doing it by hand. The next step in mathematics for data science foundations is to look at linear algebra, an extension of elementary algebra. And depending on your background, you may know this by another name. And I like to think, welcome to the matrix, because it's also known as matrix algebra, because we're dealing with matrices. Now let's go back to an example I gave in the last video about salary, where salary is equal to a constant plus years plus bargaining plus hours plus error. Okay, that's a way to write it out in words. And if you want to put it in symbolic form, it's going to look like this. Now, before we get started with matrix algebra, we need to talk about a few new words. Maybe you're familiar with them already. The first is scalar, and this means a single number. And then a vector is a single row or a single column of numbers that can be treated as a collection. That usually means a variable. And then finally, a matrix consists of many rows and columns, sort of a big rectangle of numbers. The plural of that, by the way, is matrices. And the thing to remember is that machines love matrices. Now let's take a look at a very simple example of this. Here is a very basic representation of matrix algebra or linear algebra, where we're showing data on two people on four variables. So over here on the left, we have the outcomes for cases one and two, or people one and two. And you put them in the square brackets to indicate that it's a vector or a matrix. Here on the far left, it's a vector because it's a single column of values.
Next to that is a matrix that has, here on the top, the scores for case one, which I've written as x's: x one is for variable one, x two is for variable two, and the second subscript indicates that it's for person one. Below that are the scores for case two, the second person. And then over here, in another vertical column, are the regression coefficients. That's a beta there that we're using. And then finally, we've got a tiny little vector here at the end, which contains the error terms for cases one and two. Now, even though you would not do this by hand, it's kind of helpful to run through the procedure. So I'm going to show it to you by hand. And we're going to take two fictional people. This will be fictional person number one. We'll call her Sophie. We'll say that she's 28 years old, that she has good bargaining skills, a four on a scale of five, that she works 50 hours a week, and that her salary is $118,000. Our second fictional person, we'll call him Lars, and we'll say that he's 34 years old, that he has moderate bargaining skills, three out of five, that he works 35 hours per week, and that he has a salary of $84,000. And so if we're trying to look at salaries, we can go back to the matrix representation that we had here, with our variables indicated with their Latin and sometimes Greek symbols. And we're going to replace those variables with actual numbers, so we can get the salary for Sophie, our first person. So let me plug in the numbers here. And let's start with the result here. Sophie's salary is $118,000. And here's how these numbers all add up to get that. The first thing here is the intercept, and we just multiply that times one. So that's sort of the starting point. And then we get this number 10, which actually has to do with years over 18. She's 28, so that's 10 years over 18. We multiply each year by $1,395. Next is bargaining skills. She's got a four out of five. And for each step up, you get $5,900.
By the way, these are real coefficients from a survey of the salaries of data scientists. And then finally, hours per week. For each hour, you get $382. Now we can add those up and we can get a predicted value for her, but it's a little low. It's $30,000 low, which you may say, well, that's really messed up. Well, that's because there's like 40 variables in the equation, including that she might be the owner. And if she's the owner, yeah, she's going to make a lot more. And then we do a similar thing for the second case. But what's neat about matrix algebra or linear algebra is that you can use matrix notation. And this means the same stuff. And what we have here are these bolded variables that stand in for entire vectors or matrices. So for instance, this bold y stands for the vector of outcome scores. This bolded x is the entire matrix of values that each person has on each variable. This bolded beta is all of the regression coefficients. And then this bolded epsilon is the entire vector of error terms. And so it's a really super compact way of representing the entire collection of data and coefficients that you use in predicting values. So in sum, let's say this. First off, computers use matrices. They like to do linear algebra to solve problems. And it's conceptually simpler because you can put it all there in this tight formation. In fact, it's a very compact notation, and it allows you to manipulate entire collections of numbers pretty easily. And that's the major benefit of learning a little bit about linear or matrix algebra. Our next step in mathematics for data science foundations is systems of linear equations. And maybe you're familiar with this, but maybe you're not. And the idea here is there are times when you actually have many unknowns, and you're trying to solve for all of them simultaneously. And what makes this really tricky is that a lot of these are interlocked. Specifically, that means x depends on y, but at the same time, y depends on x.
What's funny about this is it's actually pretty easy to solve these by hand. And you can also use linear or matrix algebra to do it. So let's take a little example here of sales. Let's imagine that you've got a company and you've sold 1,000 iPhone cases, so the phones are not running around naked like in this picture, and that some of the cases sold for $20 and others sold for $5. You made a total of $5,900. And so the question is, how many were sold at each price? Now, hopefully you were keeping your records, but you can also calculate it from this little bit of information. And to show you, I'm going to do it by hand. Now, we're going to start with this. We know that sales at the two price points, x and y, add up to 1,000 total cases sold. And for revenue, we know that if you multiply a certain number times $20 and another number times $5, it all adds up to $5,900. Between the two of those, we can figure out the rest. Let's start with sales. Now, what I'm going to do is I'm going to try to isolate the values. And I'm going to do that by subtracting y from both sides. So I'm left with x is equal to 1,000 minus y. Normally you solve for y, but I solved for x, and you'll see why in just a second. Then we go to revenue. And we know from earlier that our sales at these two price points add up to $5,900 total. Well, what we're going to do is we're going to take this x that's right here and we're going to replace it with the expression that we just got, which is 1,000 minus y. Then we multiply that through and we get 20,000 minus 20y plus 5y equals 5,900. Well, we can combine these two because they're both y terms. So minus 20y plus 5y, and we get minus 15y. And then we subtract 20,000 from both sides. So there it is right there on the left, and that disappears. And then I get it over on the right side. And I do the math there and I get minus 14,100. Well, then I divide both sides by negative 15. And when we do that, we get y is equal to 940.
Okay, so that's one of our values for sales. So let's go back to sales. We have x plus y equals 1,000. We take the value that we just got, 940, and we stick that into the equation. And then we can solve for x. Just subtract 940 from each side. There we go. We get x is equal to 60. So let's put it all together, just to recap what happened. What this tells us is that 60 cases were sold at $20 each, and that 940 cases were sold at $5 each. Now what's interesting about this is you can also do this graphically. We're going to draw it. So I'm going to graph the two equations. Here are the original ones we had. This one gives sales, and this one gives price. The problem is these really aren't in the canonical form for creating graphs. That needs to be y equals something else. So we're going to solve both of these for y. We subtract x from both sides. There it is on the left, we subtract that, and then we have y is equal to minus x plus 1,000. That's something that we can graph. Then we do the same thing for price. Let's divide by five all the way through. That gets rid of that. And then we've got this 4x, and then let's subtract 4x from each side. And what we're left with is y is equal to minus 4x plus 1,180. That's also something that we can graph. So here's the first line. This indicates cases sold. It originally said x plus y equals 1,000, but we rearranged it to y is equal to minus x plus 1,000. And so that's the line we have here. And then we have another line, which indicates earnings. And this one was originally written as $20 times x plus $5 times y equals $5,900 total. We rearranged that to y equals minus 4x plus 1,180. That's the equation for the line. And then the solution is right here at the intersection. There's our intersection, and it's at 60 on the number of cases sold at $20 and 940 on the number of cases sold at $5. And that also represents the solution of these joint equations. And so it's a graphical way of solving a system of linear equations.
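The same system can also be handed to software. What follows is a minimal sketch in Python, not the method from the video: it uses Cramer's rule for a two-equation system instead of substitution, and the function name `solve_2x2` and its argument names are made up for illustration.

```python
def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 with Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        # Parallel or identical lines: no single intersection point.
        raise ValueError("The system has no unique solution.")
    x = (c1 * b2 - c2 * b1) / det
    y = (a1 * c2 - a2 * c1) / det
    return x, y


# Sales:   1*x +  1*y = 1000   (cases sold at each price)
# Revenue: 20*x + 5*y = 5900   (dollars earned)
cases_at_20, cases_at_5 = solve_2x2(1, 1, 1000, 20, 5, 5900)
print(cases_at_20, cases_at_5)
```

This should land on the same 60 and 940 as the hand calculation and the graph. For larger systems you would normally reach for a linear algebra library instead of writing the formulas out.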
So in sum, systems of linear equations allow us to balance several unknowns and find the unique solution. And in many cases, it's easy to solve by hand. And it's really easy with linear algebra when you use software to do it. As we continue our discussion of mathematics and data science and the foundational principles, the next thing we want to talk about is calculus. And I'm going to give a little more history right here. The reason I'm showing you pictures of stones is because the word calculus is Latin for stone, as in a stone used for tallying, where people would actually have a little bag of stones and they would move them and they would use it to count sheep or whatever. And the system of calculus was formalized in the 1600s, simultaneously and independently, by Isaac Newton and Gottfried Wilhelm Leibniz. And there are three reasons why calculus is important for data science. Number one, it's the basis of most of the procedures that we do. Things like least squares regression and probability distributions use calculus in getting those answers. The second one is if you're studying anything that changes over time. So if you're measuring quantities or rates that change over time, then you have to use calculus. And third, calculus is used in finding the maximum and the minimum of functions, especially when you're optimizing, which is something I'll show you separately. Also, it's important to keep in mind that there are two kinds of calculus. The first is differential calculus, which talks about rates of change at a specific time. It's also known as the calculus of change. The second kind of calculus is integral calculus. And this is where you're trying to calculate the quantity of something at a specific time, given the rate of change. And it's also known as the calculus of accumulation. So let's take a look at how this works. And we're going to focus on differential calculus.
So I'm going to graph an equation here. I'm going to do y is equal to x squared, a very simple one. But it's a curve, which makes it harder to calculate things like the slope. So let's take a point here. That's at minus two, that's my little red dot. We have x is equal to minus two. And because y is equal to x squared, if we want to get the y value, all we've got to do is take that negative two and square it, and that gives us four. So that's pretty easy. So the coordinates for that red point are minus two on x and plus four on y. Here's a harder question. What is the slope of the curve at that exact point? Well, it's actually a little tricky because the curve is always curvy and there's no flat part on it. But we can get the answer by getting the derivative of the function. Now, there are several different ways of writing this. I'm using the one that's easiest to type. The rule is that the derivative of x to the n is n times x to the n minus one. And let's start with the n here. That is the squared part. So we had x squared, and you see that same n turns into the two. And then we come over here and we put that same value, two, in right there. And we put the two in right here. And then we can do a little bit of subtraction. Two minus one is one, and truthfully you can just ignore that, and you get two x. That is the derivative. So what we have here is that the derivative of x squared is two x. That means the slope at any given point of the curve is two x. So let's go back to what we had a moment ago. Here's our curve. Here's our point at x equals minus two. And so the slope is equal to two x. Well, we put in the minus two and we multiply it, and we get minus four. So that is the slope at this exact point on the curve. Okay, what if we choose a different point? Let's say we come over here to x is equal to three. Well, the slope is equal to two x, so that's two times three is equal to six. Great. On the other hand, you might be saying to yourself, why do I care about this? There's a reason that this is important.
And what it is, is that you can use these procedures to optimize decisions. And if that seems a little too abstract to you, that means you can use them to make more money. And I'm going to demonstrate that in the next video. But for right now, in sum, let's say this. Calculus is vital to practical data science. It's the foundation of statistics, and it forms the core that's needed for doing optimization. In our discussion of mathematics and data science foundations, the last thing I want to talk about right here is calculus and how it relates to optimization. I'd like to think of this, in other words, as the place where math meets reality, or it meets Manhattan or something. Now, if you remember this graph I made in the last video, y is equal to x squared, that shows this curve here. And we have the derivative, so the slope can be given by 2x. And so when x is equal to three, the slope is equal to six. Fine. And this is where this comes into play. Calculus makes it possible to find values that maximize or minimize outcomes. And if you want to make something a little more concrete out of this, let's think of an example here. By the way, that's Cupid and Psyche. Let's talk about pricing for online dating. Let's assume you've created a dating service and you want to figure out how much you can charge for it to maximize your revenue. So let's get a few hypothetical parameters involved. First off, let's say that annual subscriptions currently cost $500 a year, and you can charge that for a dating service. And let's say you sell 180 new subscriptions every week. On the other hand, based on your previous experience manipulating prices, you have some data that suggests that for each $5 you discount from the price of $500, you will get three more sales. Also, because it's an online service, let's just make our lives a little simpler right now and assume that there is no increase in overhead. It's not really how it works, but we'll do it for now.
And I'm actually going to show you how to do all this by hand. Now let's go back to price first. We have this $500, which is the current annual subscription price. And you're going to subtract $5 for each unit of discount. That's what I'm calling D. So one discount is $5, two discounts is $10, and so on. And then we have a little bit of data about sales: that you're currently selling 180 new subscriptions per week, and that you will add three more for each unit of discount that you give. So what we're going to do here is we're going to find sales as a function of price. Now to do that, the first thing we have to do is get the y intercept. So we have price here: $500 is the current annual subscription price, minus $5 times D. And what we're going to do is we're going to get the y intercept by solving for when this equals zero. Okay, well, we take the 500 and we subtract that from both sides. And then we end up with minus 5D is equal to minus 500. Divide both sides by minus five, and we're left with D is equal to 100. That is, when D is equal to 100, the price, which is our x, is zero. And that tells us how we can get the y intercept. But to get that, we have to substitute this value into sales. So we take D is equal to 100, and the intercept is equal to 180 plus three times D. The 180 is the number of new subscriptions per week. And then we take the three, and we multiply that times our 100. So three times 100 is equal to 300. Add the 180 and the 300 together, and you get 480. And that is the y intercept in our equations. So when we've discounted the price all the way to zero, when the price is zero, then the expected sales is 480. Of course, that's not going to happen in reality. But it's necessary for finding the equation of the line. And so now let's get the slope. The slope is equal to the change in y on the y axis divided by the change in x. One way we can get this is by looking at sales. We get our 180 new subscriptions per week, plus three for each unit of discount.
And we take our information on price, $500 per year minus $5 for each unit of discount. And then we take these, the 3D and the minus 5D, and those will give us the slope. So it's plus three divided by minus five, and that's just minus 0.6. And so that is the slope of the line. Slope is equal to minus 0.6. And so what we have from this is sales as a function of price, where sales is equal to 480, because that's the y intercept, when x is equal to zero, when price is zero, minus 0.6 times price. So this isn't the final thing. Now what we have to do is we turn this into revenue. So there's another stage to this. Now, revenue is equal to sales times the price. You know, how many things did you sell, and how much did each one cost? Well, we can substitute in some information here. If we take sales, and we put it in as a function of price, because we just calculated that a moment ago, we get this. And then we do a little bit of multiplication, and then we get that revenue is equal to 480 times the price minus 0.6 times the square of the price. Okay, that's a lot of stuff going on there. What we're going to do now is we're going to get the derivative. That's the calculus that we talked about. Well, the derivative of 480 times the price, where price is sort of the x, is simply 480. And then the minus 0.6 times the square of the price. Well, that's very similar to the thing we did with the curve. And what we end up with is 0.6 times two is equal to 1.2, so that's minus 1.2 times the price. So the derivative of the original equation is 480 minus 1.2 times the price, and we can set that equal to zero and solve. And just in case you're wondering, why do we set it to zero? Because that is going to give us the place where y is at a maximum. Now, we had a minus on the squared term, so the shape of the curve is inverted. And we're trying to look for this value right here, when it's at the very tippy top of the curve, because that will indicate maximum revenue. Okay, so what we're going to do is we're going to solve for zero. Let's go back to our equation here.
We want to find out when that is equal to zero. Well, we subtract 480 from each side. There we go. And we divide by minus 1.2 on each side. And we get 400. This is our price for maximum revenue. So we've been charging $500 a year, but this says we'll have more total income if we charge $400 instead. And if you want to know what the sales volume is going to be for that, we go back to the sales equation. Well, you take the 480, which is the hypothetical y intercept when the price is zero, but then we put in our actual price of $400, multiply that by 0.6 and we get 240, do the subtraction, and we get 240 total. So that would be 240 new subscriptions per week. So let's compare this. The current revenue is 180 new subscriptions per week at $500 per year. And that means that our current revenue is $90,000 in new subscriptions each week. I know it sounds really good. But we can do better than that. Because the formula for maximum revenue is 240 times 400. When you multiply those, you get 96,000. And so the improvement is just the ratio of those two. 96,000 divided by 90,000 is equal to about 1.07. And what that means is a 7% increase, and anybody would be thrilled to get a 7% increase in their business simply by changing the price and increasing the overall revenue. So let's summarize what we found here. If you lower the price by 20%, going from $500 per year to $400 per year, assuming all of our other information is correct, then you can increase sales by 33%. That's more than the 20 that you gave up. And that increases total revenue by 7%. And so we can optimize the price to get the maximum total revenue. And it has to do with this little bit of calculus and the derivative of a function. So in sum, calculus can be used to find the minimum and the maximum of functions, including prices. It allows for optimization. And that in turn allows you to make better business decisions. Our next topic in mathematics and data principles is something called big O.
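Before getting into that, the pricing example can be double-checked in a few lines of Python. This is only a sketch using the figures from the example ($500 per year, 180 sales per week, three more sales per $5 discount); the variable and function names are made up for illustration, and the best price comes from setting the derivative to zero, exactly as in the hand calculation.

```python
# Figures taken from the dating-service example in the lecture.
CURRENT_PRICE = 500          # dollars per year
CURRENT_SALES = 180          # new subscriptions per week
SLOPE = 3 / -5               # change in sales per dollar of price: -0.6

# Sales line: sales = INTERCEPT + SLOPE * price, passing through (500, 180).
INTERCEPT = CURRENT_SALES - SLOPE * CURRENT_PRICE


def sales(price):
    """Expected new subscriptions per week at a given annual price."""
    return INTERCEPT + SLOPE * price


def revenue(price):
    """Revenue from one week's new subscriptions at a given annual price."""
    return sales(price) * price


# Derivative of revenue is INTERCEPT + 2 * SLOPE * price; set to zero and solve.
best_price = -INTERCEPT / (2 * SLOPE)

print(f"best price:      {best_price:.0f}")
print(f"sales at best:   {sales(best_price):.0f}")
print(f"revenue at best: {revenue(best_price):.0f}")
print(f"revenue now:     {revenue(CURRENT_PRICE):.0f}")
print(f"ratio:           {revenue(best_price) / revenue(CURRENT_PRICE):.3f}")
```

Trying a few nearby prices, say `revenue(390)` and `revenue(410)`, is a quick way to convince yourself that the curve really does peak at the price the derivative gives you.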
And if you're wondering what big O is all about, well, it is about time. Or you can think of it as, how long does it take to do a particular operation? It's the speed of the operation. If you want to be really precise, the growth rate of a function, how much more it requires as you add elements, is called its order. That's why it's big O. That's for order. And big O gives the rate at which things grow as the number of elements grows. And what's funny is there can be really surprising differences. Let me show you how it works with a few different kinds of growth rates, or big O. First off, there are the ones that I say are sort of just on the spot. You can get stuff done right away. The simplest one is O(1), and that is a constant order. And that's something that takes the same amount of time, no matter what. You can send out an email to 10,000 people, just hit one button, and it's done. Whatever the number of elements, the number of people, the number of operations, it just takes the same amount of time. Up from that is logarithmic, where you take the number of operations and you get the logarithm of that. And you see, it's increased, but it's really only a small increase, and it tapers off really quickly. So an example is finding an item in a sorted array. Not a big deal. Next one up from that, now, this looks like a big change, but in the grand scheme of things, it's not a big change. This is a linear function, where each operation takes the same unit of time. And so if you have 50 operations, it takes 50 units of time. If you're storing 50 things, it takes 50 units of space. So finding an item in an unsorted list is usually going to be linear time. Then we have the functions where I say, you know, you'd better just pack a lunch because it's going to take a little while. The best example of this is what's called log linear. That's where you take the number of items and you multiply that number times the log of the number of items.
And an example of this is something called a fast Fourier transform, which is used for dealing, for instance, with sound or anything that's over time. You can see it takes a lot longer. If you've got 30 elements, you're way up there at the top of this particular chart, at 100 units of time or 100 units of space, or however you want to put it. And it looks like a lot. But really, that's nothing compared to the next set, where I say, you know, you're just going to be camping out, you might as well go home. That includes something like the quadratic, where you square the number of elements. And see how that just kind of shoots straight up. That's quadratic growth. An example is multiplying two n-digit numbers. So if you're multiplying two numbers that each have 10 digits, it's going to take you that long. It's going to take a long time. Even more extreme is this one. This is the exponential, two raised to the power of the number of items you have. You'll see, by the way, the red line here doesn't even go to the top. That's because the graphing software that I'm using doesn't draw it when it goes above my upper limit there, so it kind of cuts it off. But this is a really demanding kind of thing. An example is finding an exact solution to what's called the traveling salesman problem using dynamic programming. That's an example of an exponential rate of growth. And then one more I want to mention, which is sort of catastrophic, is factorial. You take the number of elements and take the factorial of that, which is written with an exclamation point. And you see that one cuts off really soon because it basically goes straight up. If you have any number of elements of any size, it's going to be hugely demanding. And for instance, if you're familiar with the traveling salesman problem, trying to find a solution through brute force search just takes an extraordinary amount of time. And so you know, before something like that's done, you're probably just going to, you know, shut this down and wish you had never even started.
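To put rough numbers on those growth rates, here is a small Python sketch that tabulates each one for a few sizes. The labels and the choice of sizes are made up for illustration, and the natural logarithm is an assumption; a different base only changes the values by a constant factor, not the order.

```python
import math

# One illustrative cost function per growth rate discussed above.
growth_rates = {
    "O(1) constant": lambda n: 1,
    "O(log n) logarithmic": lambda n: math.log(n),
    "O(n) linear": lambda n: n,
    "O(n log n) log-linear": lambda n: n * math.log(n),
    "O(n^2) quadratic": lambda n: n ** 2,
    "O(2^n) exponential": lambda n: 2 ** n,
    "O(n!) factorial": lambda n: math.factorial(n),
}

sizes = (5, 10, 20)
print(f"{'growth rate':<24}" + "".join(f"{'n=' + str(n):>28}" for n in sizes))
for name, cost in growth_rates.items():
    # Each cell is the number of "units" of time or space at that size.
    print(f"{name:<24}" + "".join(f"{cost(n):>28,.0f}" for n in sizes))
```

Even at 20 elements the factorial row is already in the quintillions, which is why that line runs off the top of a chart almost immediately, while the constant and logarithmic rows barely move.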
The other thing to know about this is that not only do some things take longer than others, some of these methods, some functions, are more variable than others. So for instance, if you're working with data that you want to sort, there are different kinds of sorts or sorting methods. So for instance, there's something called an insertion sort. And what you find is that on its best day, it's linear, it's O of n. That's not bad. On the other hand, the average is quadratic. And that's a huge difference between the two. With selection sort, on the other hand, the best is quadratic and the average is quadratic. It's always consistent. So it's kind of funny. It takes a long time, but at least you know how long it's going to take, versus the variability of something like an insertion sort. So in sum, let me say a few things about big O. Number one, you need to know that certain functions or procedures vary in speed. And the same thing applies to making demands on a computer's memory or storage space or whatever. They vary in their demands. Also, some of them are inconsistent. Some of them are really efficient sometimes and really slow or really difficult at other times. Probably the most important thing here is to be aware of the demands of what you're doing. You can't, for instance, just run through every single possible solution, or, you know, your company will be dead before you get an answer. So be mindful of that, so you can use your time well and get the insight you need in the time that you need it. A really important element of the mathematics of data science, and one of its foundational principles, is probability. Now, one of the places where probability comes in intuitively for a lot of people is something like rolling dice or looking at sports outcomes. And really, the fundamental question, what are the odds of something, gets at the heart of probability. Now let's take a look at some of the basic principles. We've got our friend Albert Einstein here to explain things.
The principles of probability work this way. Probabilities range from zero to one. That's like a 0% to 100% chance. When you put P, that stands for probability, and then in parentheses here A, that means the probability of whatever is in parentheses. So P(A) means the probability of A, and then P(B) is the probability of B. When you take all of the probabilities together, you get what's called the probability space. And that's why we have S. And it all adds up to one, because you've now covered 100% of the possibilities. Also, you can talk about the complement. The tilde here is used to say that the probability of not A is equal to one minus the probability of A, because those have to add up to one. So let's take a look at something also about conditional probabilities, which is really important in statistics. A conditional probability is the probability of something if something else is true. You write it this way: the probability of A, and then that vertical line, which is called a pipe and is read as assuming that or given that, and then B. So you can read this as the probability of A given B. It's the probability of A occurring if B is true. And so you can say, for instance, what's the probability that something's orange? What's the probability that it's a carrot, given this picture? Now, the place where this becomes really important for a lot of people is the probabilities of type one and type two errors in hypothesis testing, which we'll mention at some other point. But I do want to say a few things about arithmetic with probabilities, because it doesn't always work the way that people think it will. Let's start by talking about adding probabilities. Let's say you have two events, A and B. And let's say you want to find the probability of either one of those events. So that's like adding the probabilities of the two events. Well, it's kind of easy. Take the probability of event A and you add the probability of event B.
However, you may have to subtract something. You may have to subtract this little piece, the probability of A and B together, because maybe there's some overlap between the two of them. On the other hand, if A and B are disjoint, which means they never occur together, then that piece is equal to zero. And then you subtract zero, and you get back to the sum of the original probabilities. But let's take a really easy example of this. I've created my super simple sample space. I have 10 shapes: I've got five squares on the top, five circles on the bottom, and a couple of red shapes on the right side. Let's say we want to find the probability of a square or a red shape. So we are adding the probabilities, but we have to adjust for the overlap between the two. Well, here are our squares on top; five out of the 10 are squares. And over here on the right, we have two red shapes, two out of 10. So let's go back to our formula here, and let's change it a little bit, changing the A and the B to S and R for square and red. Now we can start this way. Let's get the probability that something is a square. Well, we go back to our probability space, and you see we have five squares out of 10 shapes total. So we do five over 10, and that reduces to 0.5. Okay, next step: the probability of something red in our sample space. Well, we have 10 shapes total, and two of them on the far right are red. So that's two over 10, and when you do the division, you get 0.2. Now the trick is the overlap between these two categories. Do we have anything that is both square and red? Because we don't want to count that twice, so we have to subtract it. So let's go back to our sample space. We're looking for something that is square, and there are the squares on top. And there are the things that are red on the side. And you see they overlap, and this is our little overlapping red square. So there's one shape that meets both of those conditions, one out of 10. So we come back here, we do one out of 10, and that reduces to 0.1.
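The three pieces just read off the sample space can be combined and checked by brute-force counting. The sketch below is only an illustration: the list `shapes` and the helper `prob` are made-up names, and the layout (one red square and one red circle) is an assumption based on the description of two red shapes on the right side, one of which is a square.

```python
from fractions import Fraction

# Hypothetical encoding of the ten-shape sample space described above:
# five squares and five circles; two shapes are red, and one of the
# red shapes is a square (so the other is assumed to be a circle).
shapes = (
    [("square", "red")]
    + [("square", "other")] * 4
    + [("circle", "red")]
    + [("circle", "other")] * 4
)


def prob(event):
    """Probability of an event: matching shapes divided by all shapes."""
    return Fraction(sum(1 for s in shapes if event(s)), len(shapes))


p_square = prob(lambda s: s[0] == "square")                    # 5 out of 10
p_red = prob(lambda s: s[1] == "red")                          # 2 out of 10
p_both = prob(lambda s: s[0] == "square" and s[1] == "red")    # 1 out of 10

# Addition rule: P(S or R) = P(S) + P(R) - P(S and R)
p_either_formula = p_square + p_red - p_both

# Direct count of shapes that are square or red, for comparison
p_either_counted = prob(lambda s: s[0] == "square" or s[1] == "red")

print(p_either_formula, p_either_counted)
```

Using exact fractions rather than decimals keeps the comparison between the formula and the direct count free of rounding.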
And then we just do the addition and subtraction here: 0.5 plus 0.2 minus 0.1 gets us 0.6. And so what that means is there's a 60% chance of an object being square or red. And you can look at it right here; we've got six shapes outlined now. And so that's the visual interpretation that lines up with the mathematical one we just did. Now let's talk about multiplication for probabilities. The idea here is that you want to get what are called joint probabilities, or the probability of two things occurring together, simultaneously. And what you need to do here is multiply the probabilities. We can say the probability of A and B, because we're asking about A and B occurring together, a joint occurrence. And it's equal to the probability of A times the probability of B. That's easy. But you do have to expand it just a little bit, because the two events may be related to each other. And so you actually need to expand it to a conditional probability, the probability of B given A; again, that's the vertical pipe there. On the other hand, if A and B are independent, that is, if B is no more likely to occur when A happens, then it just reduces to the probability of B and you get your slightly simpler equation. But let's go and take a look at our sample space right here. So we've got our 10 shapes, five of each kind, and then two that are red. Originally we looked at the probability of a shape being square or red. Now we're going to look at the probability of it being square and red. Now, I know we can eyeball this one really easily, but let's run through the math. The first thing we need to do is get the ones that are square, and there are those five on the top, and the ones that are red, and there are those two on the right. In terms of the ones that are both square and red, obviously there's just this one red square at the top right. Let's do the numbers here.
We change our formula to use S and R for square and red, and we get the probability of square. Again, that's those five out of 10, so five out of 10 reduces to 0.5. And then we need the probability of red given that it's a square. So we only need to look at the squares here. There are the squares, five of them, and one of them is red. So that's one over five, which reduces to 0.2. You multiply those two numbers, 0.5 times 0.2, and what you get is 0.1, or a 10% chance; 10% of our total sample space is red squares. And you come back and you look at it and you say, yeah, there's one out of 10. So that just confirms what we were able to do intuitively. So that's our short presentation on probabilities, and in sum, what do we get out of that? Number one, probability is not always intuitive. Also, conditional probabilities can help in a lot of situations, but they may not work the way you expect them to. And really, the arithmetic of probability can surprise people. So pay attention when you're working with it, so you can get a more accurate conclusion in your own calculations.
Welcome to statistics and data science. I'm Barton Poulson. And what we're going to be doing in this course is talking about some of the ways that you can use statistics to see the unseen, to infer what's there even when most of it's hidden. Now, this shouldn't be surprising if you remember the data science Venn diagram that we talked about a while ago. We have math up here in the top right corner, but if you were to go to the original description of this Venn diagram, its full name was math and stats. And let me just mention something, in case it's not completely obvious, about why statistics matters to data science. The idea is this: counting is easy. It's easy to say how many times a word appears in a document. It's easy to say how many people voted for a particular candidate in one part of the country.
Counting is easy, but summarizing and generalizing, those things are hard. And part of the problem is there's no such thing as a definitive analysis. All analyses really depend on the purposes that you're dealing with. So as an example, let me give you a couple of pairs of words, and try to summarize the difference between them in just two or three words. I mean, in a word or two, how is a souffle different from a quiche? Or how is an aspen different from a pine tree? Or how is baseball different from cricket? And how are musicals different from opera? It really depends on who you're talking to, it depends on your goals, and it depends on the shared knowledge. And so there's not a single definitive answer. And then there's the matter of generalization. Think again about music: listen to three concerti by Antonio Vivaldi. Do you think you can safely and accurately describe all of his music? Now, I actually chose Vivaldi on purpose, because Igor Stravinsky said you could; he said Vivaldi didn't write 500 concertos, he wrote the same concerto 500 times. But take something more real world, like politics. If you talk to 400 registered voters in the US, can you then accurately predict the behavior of all of the voters? There are about 100 million voters in the US. And that's a matter of generalization. That's the sort of thing that we try to take care of with inferential statistics. Now, there are different methods that you can use in statistics, and all of them are designed to give you sort of a map, a description of the data you're working with. There are descriptive statistics. There are inferential statistics. There's the inferential procedure of hypothesis testing. And there's also estimation. I'll talk about each of those in more depth. There are a lot of choices that have to be made.
And some of the things I'm going to discuss in detail are, for instance, the choice of estimators, which is different from estimation; different measures of fit; and feature selection, for knowing which variables are the most important in predicting your outcome. Also common problems that arise when trying to model data, and the principles of model validation. But through all of this, the most important thing to remember is that analysis is functional; it's designed to serve a particular purpose. And there's a very wonderful quote within the statistics world that says all models are wrong. All statistical descriptions of reality are wrong because they're not exact depictions, they're summaries. But some are useful. And that's from George Box. And so really the point is that you're not trying to be totally, completely accurate, because in that case you just wouldn't do an analysis. The real question is, are you better off doing your analysis than not doing it? And truthfully, I bet you are. So in sum, we can say three things. Number one, you want to use statistics both to summarize your data and to generalize from one group to another if you can. On the other hand, there's no one true answer with data; you've got to be flexible in terms of what your goals are and the shared knowledge. And no matter what you're doing, the utility of your analysis should guide you in your decisions.
The first thing we want to cover in statistics and data science is the principle of exploring data, and this video is just designed to give an overview of exploration. So we like to think of this like the intrepid explorers: they're out there exploring and seeing what's in the world, and you can see what's in your data. More specifically, you want to see what your data set is like, and you want to see if your assumptions are met so you can do a valid analysis with your chosen procedure. And really, something that may seem very weird is that you want to listen to your data if something's not working out.
If it's not going the way you want, then you need to pay a little more attention, and exploratory data analysis is going to help you do that. Now, there are two general approaches to this. First off, there's graphical exploration, where you use graphs and pictures and visualizations to explore your data. The reason you want to do this is that graphics are very dense in information. They're also really good, in fact the best way, to get the overall impression of your data. Second to that, there's numerical exploration. I want to make it very clear that this is the second step: do the visualization first, then do the numerical part. Now, you want to do this because it can give greater precision. And this is also an opportunity to try variations on the data. You can actually do some transformations, move things around a little bit, try different methods, and see how that affects the results and see how it looks. So let's go first to the graphical part. There are very quick and simple plots that you can do. Those include things like bar charts and histograms and scatter plots, which are very easy to make and a very quick way of getting an understanding of the variables in your dataset. In terms of numerical analysis, again, after the graphical methods, you can do things like transform the data, that is, take something like the logarithm of your numbers; you can do empirical estimates of population parameters; and you can use robust methods. I'll talk about all of those at more length in later videos. But for right now, I can sum it up this way. The purpose of exploration is to help you get to know your data. Also, you want to explore your data thoroughly before you start modeling it, before you build statistical models. And all the way through, you want to make sure you listen carefully, so that you can find hidden or unexpected details and leads in your data.
As we move on in our discussion of statistics and exploring data, the single most important thing we can do is exploratory graphics.
In the words of the late, great Yankees catcher Yogi Berra, you can see a lot by just looking. That applies to data as much as it applies to baseball. Now, there are a few reasons you want to start with graphics. Number one is to actually get a feel for the data. I mean, how is it distributed? What's the shape? Are there strange things going on? Also, it allows you to check the assumptions and see how well your data match the requirements of the analytical procedures you hope to use. You can check for anomalies like outliers and unusual distributions and errors. And also you can get suggestions. If something unusual is happening in the data, that might be a clue that you need to pursue a different angle or do a deeper analysis. Now, we want to do graphics first for a couple of reasons. Number one is that they're very information dense, and fundamentally humans are visual. It's our single highest-bandwidth way of getting information. It's also the best way to check for shape and gaps and outliers. There are a few ways you can do this. The first is with programs that rely on code. So you can use the statistical programming language R or the general-purpose programming language Python, and you can actually do a huge amount in JavaScript, especially in D3.js. Or you can use apps that are specifically designed for exploratory analysis. That includes Tableau, both the desktop and the public versions, and Qlik, and even Excel is a good way to do this. And then finally, if you really want to, you can do this by hand. John Tukey, who's the father of exploratory data analysis, wrote his seminal book, a wonderful book, where it's all hand-drawn graphs, and actually it's a wonderful way to do it. But let's start the process for doing these graphics. We start with one variable, that is, univariate distributions. And so you're going to get something like this. The fundamental chart is the bar chart.
This is when you're dealing with categories, and you're simply counting how many cases there are in each category. The nice thing about bar charts is they're really easy to read. Put them in descending order, and maybe have them vertical, maybe have them horizontal; horizontal can be nice to make the labels a little easier to read. This one is about psychological profiles of the United States. This is real data, and we have the most states in the friendly and conventional group, a smaller number in temperamental and uninhibited, and the least common in the United States is relaxed and creative. Next, you can do a box plot, sometimes called a box and whiskers plot. This is for when you have a quantitative variable, something that's measured, where you can say how far apart scores are. A box plot shows quartile values. It also shows outliers. So for instance, this is Google searches for modern dance, and that's Utah at five standard deviations above the national average. That's where I'm from, and I'm glad to see that there. Also, it's a nice way to show many variables side by side if they're on approximately similar scales. Next, if you have quantitative variables, you're going to want to do a histogram; again, quantitative, so interval or ratio level or measured variables. These let you see the shape of a distribution and potentially compare many. So here are three histograms for Google searches on data science and entrepreneur and modern dance. And you can see that they're for the most part normally distributed, with a couple of outliers. Once you've done one variable, or the univariate analysis, you're going to want to do two variables at a time, that is, bivariate distributions or joint distributions. Now, one easy way to do this is with grouped plots. You can do grouped bar charts and box plots. What I have right here is grouped box plots. I have my three psychological regions of the United States, and I'm showing how they rank on openness.
That's a psychological characteristic. And what you can see is that the relaxed and creative are highest and the friendly and conventional tend to be the lowest. And that's kind of how that works. It's also a good way of seeing the association between a categorical variable, like psychological region of the United States, and a quantitative outcome, which is what we have here with openness. Next, you can also do a scatter plot. That's where you have two quantitative variables. And what you're looking for here is, is it a straight line? That is, is it linear? And do we have outliers? And also the strength of association: how closely do the dots all come to the regression line that we have here in the middle? And this is an interesting one for me, because we have openness across the bottom, so more open as you go to the right, and agreeableness going up the side. And what we see is there's a strong downhill association. The states in the United States that are the most open apparently are also the least agreeable. So we're going to have to do something about that. And then finally you want to go to many variables, that is, multivariate distributions. Now, one big question here is 3D or not 3D? Let me actually make an argument for not 3D. So what I have here is a 3D scatter plot of three variables about Google searches. Up the left, I have FIFA, which is for professional soccer. Down there on the bottom left, I have searches for NFL, and on the right I have searches for NBA. Now, I did this in R, and what's neat about this is you can click and drag and move it around. And you know, that's kind of fun. You can kind of spin it around, and it gets kind of nauseating as you look at it. And this particular version, where I'm using plotly in R, allows you to actually click on a point and see, let me see if I can get the floor in the right place. You can click on a point and see where it ranks on each of these characteristics. You can see, however, this thing is hard to control.
And once it stops moving, it's not much fun. And truthfully, most 3D plots I've worked with are just kind of nightmares. They seem like they're a good idea, but not really. So here's the deal. 3D graphics, like the one I just showed you, because they're actually being shown in 2D, have to be in motion for you to tell what's going on at all. And fundamentally, they're hard to read and confusing. Now it's true, they might be useful for finding clusters in three dimensions; we didn't see that in the data we had. But generally, I just avoid them like the plague. If what you want to do, however, is see the connection between several variables, you might want to use a matrix of plots. This is where you have, for instance, many quantitative variables, and you can use markers for group membership if you want. I find it to be much clearer than 3D. So here I have the relationship between four search terms: NBA, NFL, MLB for Major League Baseball, and FIFA. You can see the individual distributions, and you can see the scatter plots and get the correlations. Truthfully, this for me is a much easier kind of chart to read, and it gets the richness that we need from a multidimensional display. So the questions you're trying to answer overall are these. Number one, do you have what you need? Do you have the variables you need? Do you have the variability that you need? Are there clumps or gaps in the distributions? Are there exceptional cases, anomalies that are really far out from everybody else, or spikes in the scores? And of course, are there errors in the data? Were there mistakes in coding? Did people forget to answer questions? Are there impossible combinations? These kinds of things are easiest to see with a visualization that really just kind of puts it right there in front of you. And so in sum, I can say this about graphical exploration of data. It's a critical first step; this is basically where you always want to start.
And you want to use the quick and easy methods; again, bar charts and scatter plots are really easy to make and they're very easy to understand. And once you're done with the graphical exploration, then you can go to the second step, which is exploring the data through numbers.
The next step in statistics and exploring data is exploratory statistics, or numerical exploration of data. I like to think of this as going in order: first you do visualization, then you do the numerical part. And there are a couple of things to remember here. Number one is that you're still exploring the data; you're not modeling yet, but you are doing a quantitative exploration. This might be an opportunity to get empirical estimates of population parameters, as opposed to theoretically based ones. It's a good time to manipulate the data and explore the effects of manipulating the data, looking at subgroups, looking at transforming variables. Also, it's an opportunity to check the sensitivity of your results. Do you get the same general results if you test under different circumstances? So we're going to talk about things like robust statistics and resampling data and transforming data. We'll start with robust statistics. This, by the way, is Hercules, a robust mythical character. And the idea with robust statistics is that they are stable, that even when the data vary in sort of unpredictable ways, you still get the same general impression. This is a class of statistics, an entire category, that's less affected by outliers and by skewness and kurtosis and other abnormalities in the data. So let's take a quick look. This is a very skewed distribution I created. The median, which is the dark line there in the box, is right around one. And I'm going to look at two different kinds of robust statistics, the trimmed mean and the Winsorized mean.
With the trimmed mean, you take a certain percentage of the data from the top and the bottom, you just throw it away, and you compute the mean for the rest. With the Winsorized mean, you take those scores and, instead of throwing them away, you move them in to the highest non-outlier score. Now, the 0% version is exactly the same as the regular mean, and here it's 1.24. But as we trim off 5% or move in 5%, you can see that the mean shifts a little bit; then 10%, and it comes in a little bit more; then 25%. Now we're throwing away 50% of the data, 25% on the top and 25% on the bottom. And we get a mean here of 1.03, that's the trimmed mean, and a Winsorized mean of 1.07. When we trim 50% from each end, that actually means that we're leaving just the median. Only the middle score is left. Then we get 1.01. What's interesting is how close we get to that, even when we have 50% of the data left. And so that's an interesting example of how you can use robust statistics to explore data even when you have things like strong skewness. Next is the principle of resampling. That's like pulling marbles repeatedly out of a jar, counting the colors, putting them back in, and trying again. That's an empirical estimate of sampling variability. So sometimes you get 20% red marbles, sometimes you get 30, sometimes you get 22, and so on. There are several versions of this. They go by the names the jackknife and the bootstrap and the permutation. And the basic principle of resampling is also key to the process of cross validation. I'll have more to say about validation later. And then finally, there's transforming variables. Here are our caterpillars in the process of transforming into butterflies. The idea here is that you take a sort of difficult data set and you apply what's called a smooth function, one with no jumps in it, something that preserves the order and allows you to work on the full data set. So you can fix skewed data. And in a scatter plot, you might have a curved line, and you can fix that.
And probably the best way to look at this is with something called Tukey's ladder of powers. I mentioned John Tukey before, the father of exploratory data analysis; he talked a lot about transformations. This is his ladder, starting at the bottom with minus one over x squared, up to the top with x cubed. And here's how it works. This distribution over here is a symmetrical, normally distributed variable. As you start to move in one direction and you apply the transformation, take the square root, you see how it moves the distribution over to one end. Then the logarithm, and when you get to the end, you get this minus one over the square of the score, and that pushes it way, way, way over. If you go the other direction, for instance the square of the scores, it pushes it the other way, and then you cube it. And you see how it can move the distribution around in ways that allow you to actually undo the skewness and get back to a more symmetrical distribution. And so these are some of the approaches that you can use in the numerical exploration of data. In sum, let's say this: statistical or numerical exploration allows you to get multiple perspectives on your data. It also allows you to check the stability, to see how it works with outliers and skewness and mixed distributions and so on. And perhaps most importantly, it sets the stage for the statistical modeling of your data.
As a final step of statistics and exploring data, I'm going to talk about something that's not usually considered exploring but is basic: descriptive statistics. I like to think of it this way. You've got some data and you are trying to tell a story. More specifically, you're trying to tell your data's story. And with descriptive statistics, you can think of it as trying to use a little data to stand in for a lot of data, using a few numbers to stand in for a large collection of numbers.
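Looking back at the ladder of powers for a moment, its effect on skewness can be sketched numerically. This is only an illustration under stated assumptions: the data are made up (positive and positively skewed), `skewness` is a hand-written helper using the moment definition, and the minus-one-over-x rung is added from the standard form of the ladder rather than from the discussion above.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over (second)^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up, positively skewed data; all values are positive so that
# the square root, the logarithm, and the reciprocals are defined.
data = [1, 1, 2, 2, 3, 4, 6, 9, 15, 40]

# Rungs of the ladder, from the top (x cubed) to the bottom (-1/x^2).
ladder = {
    "x^3": lambda x: x ** 3,
    "x^2": lambda x: x ** 2,
    "x": lambda x: x,
    "sqrt(x)": math.sqrt,
    "log(x)": math.log,
    "-1/x": lambda x: -1 / x,
    "-1/x^2": lambda x: -1 / x ** 2,
}

# Skewness of the data after applying each rung's transformation.
skew_by_rung = {
    name: skewness([transform(v) for v in data])
    for name, transform in ladder.items()
}

for name, value in skew_by_rung.items():
    print(f"{name:>8}: skewness = {value:.2f}")
```

With data like these, stepping down the ladder pulls in the long right tail, and going far enough down overshoots into negative skew, which is why you try several rungs and look at the result each time.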
And this is consistent with the advice we get from good old Henry David Thoreau, who told us simplify, simplify. If you can tell your story with more carefully chosen and more informative data, go for it. So there are a few different procedures for doing this. Number one, you want to describe the center of your distribution of data. If you're going to pick a single number, use that. Two, if you can give a second number, give something about the spread or the dispersion or the variability. And three, it's also nice to be able to describe the shape of the distribution. Let me say more about each of these in turn. First, let's talk about center. We have the center of our rings here. Now, there are a few very common measures of center, or location, or central tendency of a distribution. There's the mode, and there's the median, and there's the mean. Now, there are many, many others, but those are the ones that are going to get you most of the way. Let's talk about the mode first. Now, I'm going to create a little data set here on a scale from one to 11, and I'm going to put in individual scores. There is a one, and another one, and another one, and another one. Then we have a two, and another two. Then we have a score way over at nine and another score over at 11. So we have eight scores, and this is the distribution. This is actually a histogram of the data set. The mode is the most commonly occurring score, or the most frequent score. Well, if you look at how tall each of these goes, we've got more ones than anything else. And so one is the mode, because it occurs four times and nothing else comes close to that. The median is a little different. The median is looking for the score that is at the center if you split the data into two equal groups. We have eight scores, so we want to get one group of four, that's down here, and then the other group of four is this really big one, because it ranges way out. And the median is going to be the place on the number line that splits those into two groups.
That's going to be right here at one and a half. Now, the mean's a little more complicated, even though people understand means in general. It's the first one we have here that actually has a formula, where M, for the mean, is equal to the sum of X, that's our scores on the variable, divided by N, the number of scores. You can also write it out with Greek notation if you want, like this, where a capital sigma is the summation sign: the sum of X divided by N. And with our little data set, that works out to one plus one plus one plus one plus two plus two plus nine plus 11. Add those all up and divide by eight, because that's how many scores there are. Well, that reduces to 28 divided by eight, which is equal to 3.5. If you go back to our little chart here, 3.5 is right over here. You'll notice there aren't any scores exactly right there. That's because the mean tends to get very distorted by outliers; it follows the extreme scores. But a really nice, and I'd say it's more than just a visual, analogy is that if this number line were a seesaw, then the mean is exactly where the balance point, or the fulcrum, would be for the two sides to be equal. People understand that if somebody weighs more, they've got to sit in closer to balance somebody who weighs less, who has to sit further out. And that's how the mean works. Now, let me give a little bit of the pros and cons of each of these. For the mode, it's really easy to do. You just count how common each score is. On the other hand, it may not be close to what appears to be the center of the data. The median splits the data into two same-size groups, with the same number of scores in each, and that's pretty easy to deal with. But unfortunately, it's hard to use that information in many statistics after that. And then finally, the mean, of these three, is the least intuitive. It's the most affected by outliers and skewness, and that may really be a strike against it. But it is, however, the most useful statistically.
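The three measures of center for the little eight-score data set can be checked with a short sketch, here using Python's standard `statistics` module. The variable names are just for illustration; the last line checks the seesaw idea, that the deviations above and below the mean cancel out.

```python
import statistics

# The eight scores from the example: four 1s, two 2s, a 9, and an 11.
scores = [1, 1, 1, 1, 2, 2, 9, 11]

mode = statistics.mode(scores)      # most frequent score
median = statistics.median(scores)  # midpoint between the 4th and 5th scores
mean = statistics.mean(scores)      # sum of the scores divided by their count

# Seesaw check: deviations from the mean balance out to zero.
balance = sum(x - mean for x in scores)

print(mode, median, mean, balance)
```

The mean of 3.5 sits in the gap between the twos and the nine, where there are no scores at all, because the two high scores pull it toward them.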
And so it's the one that gets used the most often. Next, there's the issue of spread; spread your tail feathers. And we have a few measures here that are very common also. There's the range. There are percentiles and the interquartile range. And there are the variance and the standard deviation. I'll talk about each of those. First, the range. The range is simply the maximum score minus the minimum score. And in our case, that's just 11 minus one, which is equal to 10. So we have a range of 10. Now I can show you that here on our chart. It's just that line there at the bottom from the 11 down to the one; that's a range of 10. The interquartile range, which is usually referred to simply as the IQR, is the distance between Q3, which is the third quartile score, and Q1, which is the first quartile score. If you're not familiar with quartiles, those are the same as the 75th percentile score and the 25th percentile score. Really what it is, is you're going to throw away some of the data. So let's go to our distribution here. The first thing we're going to do is throw away the two highest scores. There they are; they're grayed out now. And then we're going to throw away two of the lowest scores. They're grayed out too. And then we're going to get the range for the remaining ones. Now, this is complicated by the fact that I've got this big gap in between two and nine, and different methods of calculating quartiles do different things with that gap. So if you use a spreadsheet, it's actually going to do an interpolation process, and it'll give you a value of 3.75, I believe, for the third quartile, and then down to one for the first quartile. So it's not so intuitive with this graph, but that is how it usually works. If you want to write it out, you can do it like this. The interquartile range is equal to Q3 minus Q1. And in our particular case, that's 3.75 minus one. And then of course, it's equal to just 2.75. And there you have it.
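The range and the interquartile range for the same eight scores can be reproduced with a short sketch. One assumption to flag: the "inclusive" method of Python's `statistics.quantiles` is used here because it interpolates between data points and gives the 3.75 and 1 quoted above for this data set; whether a particular spreadsheet uses exactly this rule is not something the example establishes. The "exclusive" method is included only to show that a different rule handles the gap between two and nine differently.

```python
import statistics

# The eight scores from the example: four 1s, two 2s, a 9, and an 11.
scores = [1, 1, 1, 1, 2, 2, 9, 11]

# Range: maximum score minus minimum score.
value_range = max(scores) - min(scores)

# Quartiles with the interpolating "inclusive" method (assumed here to
# correspond to the spreadsheet-style calculation described above).
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

# A different quartile rule lands somewhere else inside the 2-to-9 gap.
q1_alt, _, q3_alt = statistics.quantiles(scores, n=4, method="exclusive")
iqr_alt = q3_alt - q1_alt

print(value_range, q1, q3, iqr)
print(q1_alt, q3_alt, iqr_alt)
```

Because the third quartile falls inside the empty stretch between two and nine, the answer depends heavily on the interpolation rule, so it is worth stating which method was used when reporting an IQR for small, gappy data.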
Now our final measure of spread or variability or dispersion is really two related measures, the variance and the standard deviation. These are a little harder to explain and a little harder to show. But the variance, which at least has the easier formula, is this: the variance is equal to the sum, that's the capital sigma, of X minus M, which is how far each individual score is from the mean, and then you take that deviation and you square it, you add up all the squared deviations, and then you divide by the number of scores. So the variance is the average squared deviation from the mean. I'll try to show you that graphically. So here's our data set, and there's our mean right there at three and a half. Let's go to one of these twos. We've got a deviation there of one and a half, and if we make a square that's one and a half points on each side, well, there it is. We can do a similar square for the other two. If we go down to a one, then it's going to be two and a half squared, and that's going to be that much bigger. And we can draw one of these squares for each of our eight points. The squares for the scores at nine and 11 are going to be huge and go off the page, so I'm not going to show them. But once you have all those squares, you add up the areas, divide by the number of scores, and you get the variance. So this is the formula for the variance. But now let me show the standard deviation, which is also a very common measure and is closely related to this. Specifically, it's just the square root of the variance. Now, there's a catch here. The formulas for the variance and the standard deviation are slightly different for populations and samples, in that they use different denominators. But they give similar answers, not identical, but similar; if the sample is reasonably large, say over 30 or 50, then it's really going to be just a negligible difference. So let's do a little pro and con of these three things. First, the range. It's very easy to do. It only uses two numbers, the high and the low.
But it's determined entirely by those two numbers. And if they're outliers, you've got a really bad situation. The interquartile range, or IQR, is really good for skewed data. And that's because it ignores extremes on either end. So that's nice. And the variance and the standard deviation, while they are the least intuitive and the most affected by outliers, are also generally the most useful, because they feed into so many other procedures that are used in data science. Finally, let's talk a little bit about the shape of a distribution. You can have symmetrical or skewed distributions, unimodal, uniform or U shaped. You can have outliers. There are a lot of variations. Let me show you a few of them. First off is a symmetrical distribution. Pretty easy, they're the same on the left and on the right. And this little pyramid shape is an example of a symmetrical distribution. There are also skewed distributions, where most of the scores are on one end and then they taper off. This right here is a positively skewed distribution, where most of the scores are at the low end and the outliers are on the high end. This is unimodal. It's our same pyramid shape. Unimodal means it has one mode, or really kind of one hump in the data. That's contrasted, for instance, with bimodal, where you have two modes, and that usually happens when you have two distributions that got mixed together. There are also uniform distributions, where every response is equally common. There are U shaped distributions, where people tend to pile up at one end or the other with a big dip in the middle. And so there are a lot of different variations, and you want to get the shape of the distribution to help you understand the numerical summaries, like the mean and the standard deviation, and put those into context. In sum, we can say this: when you use descriptive statistics, that allows you to be concise with your data. Tell the story and tell it succinctly.
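As a compact sketch of such a summary, the snippet below computes center and spread for a small data set. The eight scores are an assumption: hypothetical values chosen to be consistent with the figures quoted in this section (a mean of 3.5, deviations of 1.5 and 2.5 for the twos and ones, scores at 9 and 11). It puts the population variance (divide by N) next to the sample variance (divide by N minus 1), which is the denominator difference mentioned earlier.

```python
import math
import statistics

# Hypothetical scores (an assumption), consistent with the quoted figures.
scores = [1, 1, 1, 1, 2, 2, 9, 11]
n = len(scores)

# Center: the mean sits well above the median, a sign of positive skew.
mean = sum(scores) / n
median = statistics.median(scores)

# Each squared deviation is the area of one of the squares described above.
squared_deviations = [(x - mean) ** 2 for x in scores]
sum_of_squares = sum(squared_deviations)

# Population formulas divide by N; sample formulas divide by N - 1.
population_variance = sum_of_squares / n
sample_variance = sum_of_squares / (n - 1)

# The standard deviation is the square root of the variance.
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(mean, median, population_variance, sample_variance)
print(round(population_sd, 3), round(sample_sd, 3))
```

With only eight scores the two denominators give noticeably different answers; the gap shrinks as the sample grows.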
You want to focus on things like the center of the data, the spread of the data, and the shape of the data. And above all, watch out for anomalies, because they can exercise really undue influence on your interpretation. But this will help you better understand your data and prepare you for the steps that follow.
The next step in our discussion of statistics and inference is hypothesis testing, a very common procedure in some fields of research. I like to think of it as put your money where your mouth is and test your theory. Here's the Wright brothers out testing their plane. Now the basic idea behind hypothesis testing is this: you start with a question, and it's something like, what is the probability of X occurring by chance, if randomness or meaningless sampling variation is the only explanation? Well, the response is this: if the probability of that data arising by chance when nothing's happening is low, then you reject randomness as a likely explanation. Okay, there are a few things I can say about this. Number one, it's really common in scientific research. In the social sciences, for instance, it's used all the time. Number two, this kind of approach can be really helpful in medical diagnostics, where you're trying to make a yes or no decision: does a person have a particular disease? And three, really, anytime you're trying to make a go or no-go decision, which might be, for instance, a purchasing decision for a school district or implementing a particular law, you base it on the data and you have to make a yes or no call. Hypothesis testing might be helpful in those situations. Now, you have to have hypotheses to do hypothesis testing. You start with H sub zero, which is the shorthand version for the null hypothesis.
And what that is in lengthier terms is that there is no systematic effect between groups, there's no association between variables, and random sampling error is the only explanation for any observed differences that you see. And then contrast that with h sub a, which is the alternative hypothesis. And this really just says that there's a systematic effect that there is in fact a correlation between variables that there is in fact a difference between two groups that this variable does in fact predict the other one. Let's take a look at the simplest version of this statistically speaking. Now what I have here is a null distribution. This is a bell curve. It's actually the standard normal distribution, which shows these scores in relative frequency. And what you do with this is you mark off what are called regions of rejection. And so I've actually shaded off the highest two and a half percent of the distribution and the lowest two and a half percent. What's funny about this is even though I draw it to plus and minus three, it looks like it hit zero. It's actually infinite and asymptotic. But that's the highest and lowest two and a half percent collectively that leaves 95% in the middle. Now the idea is then you gather your data you calculate a score for your data and you see where it falls in this distribution. And I like to think of that as you have to go down one path or the other you have to make a decision. And you have to decide whether to retain your null hypothesis, maybe it is random or reject it and decide no, I don't think it's random. The trick is things can go wrong. You can get a false positive. This is when the sample shows some kind of statistical effect. But it's really randomness. And so for instance, this scatterplot I have right here, you can see a little downhill association here. But this is in fact drawn from data that has a true correlation of zero. And I just kind of randomly sampled from it until I got this, it took about 20 rounds. 
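A small simulation can make the false positive idea concrete. This is a sketch under stated assumptions, not the speaker's scatterplot demonstration: it swaps in a simpler test, a two-tailed z test on a sample mean drawn from a standard normal null distribution, with the same 2.5 percent rejection regions in each tail. The sample size, trial count and seed are arbitrary choices.

```python
import random
from statistics import NormalDist, fmean

random.seed(1)  # arbitrary seed, only for repeatability

ALPHA = 0.05      # total area of the two rejection regions
N = 25            # observations per sample (assumed)
TRIALS = 10_000   # number of samples drawn while the null is true

# Cutoff that leaves 2.5% in each tail of the standard normal curve.
critical_z = NormalDist().inv_cdf(1 - ALPHA / 2)


def z_for_null_sample(n):
    """Draw n scores with true mean 0 and SD 1, return the z score of their mean."""
    sample = [random.gauss(0, 1) for _ in range(n)]
    return fmean(sample) * n ** 0.5  # mean / (sigma / sqrt(n)), with sigma = 1


# Every rejection here is a false positive, because nothing is going on.
rejections = sum(abs(z_for_null_sample(N)) > critical_z for _ in range(TRIALS))
false_positive_rate = rejections / TRIALS

print(round(critical_z, 2), false_positive_rate)
```

The rejection rate should land near 5 percent, which works out to roughly one misleading sample in every twenty draws from pure noise.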
That sample looks negative, but there's really nothing happening. The trick about false positives is that they're conditional on rejecting the null. The only way you can get a false positive is if you actually conclude that there's a positive result. It goes by the highly descriptive name of a type one error. But you get to pick a value for it, and 0.05, or a 5% risk if you reject the null hypothesis, is the most common value. Then there's a false negative. This is when the data looks random, but in fact it is systematic, or there's a relationship. So for instance, this scatterplot looks like it's pretty much a zero relationship. But in fact, this came from two variables that were correlated at 0.25. That's a pretty strong association. Again, I randomly sampled from the data until I got a set that happened to look pretty flat. And a false negative is conditional on not rejecting the null. You can only get a false negative if you get a negative, if you say there's nothing there. It's also called a type two error. And this is a value that you have to calculate based on several elements of your testing framework. So it's something to be thoughtful about. Now, I do have to mention one thing, a big security notice: but wait. There are a few problems with hypothesis testing. Number one, it's really easy to misinterpret it. A lot of people say, well, if you get a statistically significant result, it means that it's something big and meaningful. And that's not true, because significance is confounded with sample size and a lot of other things that just don't really matter. Also, a lot of people take exception to the assumption of a null effect, or even a nil effect, that there's zero difference at all. And that, in certain situations, could be an absurd claim. So you've got to watch out for that. There's also bias from the use of a cutoff.
Anytime you have a cutoff, you're going to have problems where you have cases that, had they been just slightly higher or slightly lower, would have switched on the dichotomous outcome. So that is a problem. And then a lot of people say that it just answers the wrong question, because what it's telling you is: what's the probability of getting this data at random? That's not what most people care about. They want it the other way, which is why I mentioned Bayes' theorem previously. And I'll say more about that later. That being said, hypothesis testing is still very deeply ingrained and very useful for a lot of questions. And it's gotten us really far in a lot of domains. So in sum, let me say this: hypothesis testing is very common for yes or no outcomes. And it's the default in many fields. And I argue that it is still useful and informative despite many of the well substantiated critiques.
We'll continue in statistics and inference by discussing estimation. Now, as opposed to hypothesis testing, estimation is designed to actually give you a number, give you a value, not just a yes or no, go or no-go, but an estimate for a parameter that you're trying to get. I like to think of it as sort of a new angle, looking at something from a different way. And the most common approach to this is confidence intervals. Now the important thing to remember is that this is still an inferential procedure. You're still using sample data and trying to make conclusions about a larger group or population. The difference here is that instead of coming up with a yes or no, you focus on likely values for the population value. Most versions of estimation are closely related to hypothesis testing, sometimes seen as the flip side of the coin. And we'll see how that works in later videos. Now I like to think of this as an ability to estimate any sample statistic. And there are a few different versions. We have parametric versions of estimation and bootstrap versions. I've got the boots here.
And the bootstrap is where you just kind of randomly sample from the data in an effort to get an idea of the variability. You can also have what are called central versus noncentral confidence intervals in estimation, but we're not going to deal with those. Now there are three general steps to this. First, you need to choose a confidence level. Well, you can't have zero, it has to be more than zero, and it can't be 100%. Choose something in between; 95% is the most common. And what it does is it gives you a range of numbers, a high and a low. And the higher your level of confidence, the more confident you want to be, the wider the range is going to be between your high and your low estimates. Now there's a fundamental tradeoff in what's happening here. It's the tradeoff between accuracy, which means you're on target, or more specifically, that your interval contains the true population value, and the idea is that leads you to the correct inference, and what's called precision in this context. Precision means a narrow interval, a small range of likely values. And what's important to emphasize is that this is independent of accuracy. You can have one without the other, or neither, or both. In fact, let me show you how this works. What I have here is a little hypothetical situation. I've got a variable that goes from maybe, you know, 10 to 90. And I've drawn a thick black line at 50. If you think of this in terms of percentages and political polls, it makes a very big difference if you're on the left or the right of 50%. And then I've drawn a dotted vertical line at 55 to say that that's our theoretical true population value. Then what I have here is a distribution that shows possible values based on our sample data. And what you get here is not accurate, because it's centered on the wrong thing. It's actually centered on 45 as opposed to 55.
And it's not precise, because it's spread way out from maybe 10 up to almost 80. So in this situation, the data is really no help at all. Now here's another one. This is accurate because it's centered on the true value. That's nice. But it's still really spread out. And you see that about 40% of the values are going to be on the other side of 50%. It might lead you to reach the wrong conclusion. So that's a problem. Now here's the nightmare situation. This is when you have a very, very precise estimate, but it's not accurate. It's wrong. And this leads you to a very false sense of security and understanding of what's going on. And you're going to totally blow it all the time. The ideal situation is this. You have an accurate estimate, where the distribution of sample values is really close to the true population value. And it's precise. It's really tightly knit. And you can see that about 95% of it is on the correct side of 50. And that's good. If you want to see all four of them here at once, we have the precise two on the bottom, the imprecise ones on the top, the accurate ones on the right, and the inaccurate ones on the left. And so that's a way of comparing them. But no matter what you do, you have to interpret a confidence interval. Now the statistically accurate way, which has very little interpretation, is this. You would say the 95% confidence interval for the mean is 5.8 to 7.2. Okay, so that's just kind of taking the output from your computer and sticking it into sentence form. The colloquial interpretation of this goes like this: there's a 95% chance that the population mean is between 5.8 and 7.2. Well, in most statistical procedures, specifically frequentist as opposed to Bayesian, you can't say that, because it implies that the population mean shifts. That's not usually how people see it. Instead, a better interpretation is this: 95% of confidence intervals for randomly selected samples will contain the population mean.
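That repeated-sampling interpretation can be checked with a short simulation. This is a minimal sketch under assumptions: the population mean of 55 comes from the example above, while the standard deviation, sample size, trial count and seed are hypothetical. For simplicity it treats the population standard deviation as known and uses a normal (z) interval; real analyses usually estimate the standard deviation and use a t interval.

```python
import random
from statistics import NormalDist, fmean

random.seed(2)  # arbitrary seed, only for repeatability

TRUE_MEAN = 55    # true population value from the example
SIGMA = 10        # assumed known population standard deviation
N = 30            # observations per sample (assumed)
TRIALS = 10_000   # number of samples, one interval each
LEVEL = 0.95      # confidence level

# Half-width of the interval: critical z times the standard error.
z = NormalDist().inv_cdf(0.5 + LEVEL / 2)
half_width = z * SIGMA / N ** 0.5


def interval_covers_truth():
    """Draw one sample and report whether its interval contains TRUE_MEAN."""
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    m = fmean(sample)
    return m - half_width <= TRUE_MEAN <= m + half_width


coverage = sum(interval_covers_truth() for _ in range(TRIALS)) / TRIALS
print(round(half_width, 2), coverage)
```

The population mean never moves; what varies from sample to sample is the interval, and about 95 percent of those intervals should capture the fixed value.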
Now I can show you this really easily with a little demonstration. This is where I randomly generated data from a population with a mean of 55. And I got 20 different samples. And I got the confidence interval for each sample, and I've charted the high and the low. And the question is, did it include the true population value? And you can see that of these 20, 19 of them included it. Some of them barely made it. If you look at sample number one on the far left, it barely made it. Sample number eight doesn't look like it made it, but it did. Sample 20 on the far right barely made it on the other end. Only one of them missed it completely. That's sample number two, which is shown in red on the left. Now, it's not always just one out of 20. I actually had to run the simulation about eight times, because it gave me either zero or three or one or two. And I had to run it until I got exactly what I was looking for here. But this is what you would expect on average. So let's say a few things about this. There are some things that affect the width of a confidence interval. The first is the confidence level, or CL. Higher confidence levels create wider intervals. The more certain you have to be, the bigger the range you're going to give to cover your bases. Second, the standard deviation, where larger standard deviations create wider intervals. If the thing that you're studying is inherently really variable, then of course your estimate of the range is going to be more variable as well. And then finally there's the N, or the sample size. This one goes the other way. Larger sample sizes create narrower intervals. The more observations you have, the more precise and the more reliable things tend to be. I can show you each of these things graphically. Here we have a bunch of confidence intervals where I'm simply changing the confidence level from 0.50 at the left side up through 0.999. And you can see it gets much bigger as we increase. Next one is standard deviations.
As the sample standard deviation increases from one to 16, you can see that the interval gets a lot bigger. And then we have sample size going from just two up to 512. I'm doubling it at each point. And you can see how the interval gets more and more and more precise as we go through. And so let's say this to sum up our discussion of estimation. Confidence intervals, which are the most common version of estimation, focus on the population parameter. And the variation in the data is explicitly included in that estimation. Also, you can argue that they're more informative, because not only do they tell you whether the population value is likely, but they give you a sense of the variability of the data itself. And that's one reason that people argue that confidence intervals should nearly always be included in any statistical analysis.
As we continue our discussion of statistics and data science, we need to talk about some of the choices that you have to make, some of the tradeoffs, and some of the effects that these things have. We'll begin by talking about estimators, that is, different methods for estimating parameters. I like to think of this as: what kind of measuring stick or standard are you going to be using? Now we'll begin with the most common. This is called OLS, which is short for ordinary least squares. This is a very common approach. It's used in a lot of statistics, and it's based on what's called the sum of squared errors. And it's characterized by an acronym, BLUE, which stands for best linear unbiased estimator. Let me show you how that works. Let's take a scatterplot here of an association between two variables. This is actually the speed of a car and the distance to stop, from about the 1920s, I think. We have a scatterplot here and we can draw a straight regression line through it. Now the line that I've used is in fact the best linear unbiased estimate. But the way that we can tell that is by getting what are called the residuals.
You take each data point and draw a perfectly vertical line up or down to the regression line, because the regression line predicts what the value would be for that value on the x axis. Those are the residuals. Each of those individual vertical lines is a residual. You square those and you add them up. And this regression line, the gray angled line here, will have the smallest sum of squared residuals of any possible straight line that you can run through the data. Now another approach is ML, which stands for maximum likelihood. And this is when you choose parameters that make the observed data most likely. It sounds kind of weird, but I can demonstrate it. And it's based on a kind of local search. It doesn't always find the best. I like to think of it like a person here with binoculars looking around them, trying hard to find something, but you could theoretically miss something. Let me give a very simple example of how this works. Let's assume that we're trying to find parameters that maximize the likelihood of this dotted vertical line here at 55. And I've got three possibilities. I've got my red distribution, which is off to the left, the blue, which is a little more centered, and the green, which is far to the right. And these are all identical except that they have different means. And by changing the means, you see that the one that is highest where the dotted line is, is the blue one. And so if the only thing we're doing is changing the mean and we're looking at these three distributions, then the blue one is the one that has the maximum likelihood for this particular parameter. On the other hand, we could give them all the same mean, right around 50, and vary their standard deviations instead. And so they spread out different amounts. In this case, the red distribution is highest at the dotted vertical line. And so it has the maximum value. Or if you want to, you can vary both the mean and the standard deviation simultaneously. And here the green gets a slight advantage.
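The comparison of candidate distributions can be sketched numerically. The means and standard deviations below are assumptions: hypothetical values picked to mimic the red, blue and green curves described, since the talk does not state the actual parameters. The likelihood of the observed value 55 under each candidate is simply the height of that curve at 55.

```python
from statistics import NormalDist

OBSERVED = 55  # the dotted vertical line


def most_likely(candidates):
    """Return the name of the candidate with the highest density at OBSERVED."""
    likelihoods = {name: dist.pdf(OBSERVED) for name, dist in candidates.items()}
    return max(likelihoods, key=likelihoods.get)


# Scenario 1: same spread, different means (hypothetical values).
vary_means = {
    "red": NormalDist(mu=40, sigma=10),
    "blue": NormalDist(mu=50, sigma=10),
    "green": NormalDist(mu=75, sigma=10),
}

# Scenario 2: same mean of 50, different spreads (hypothetical values).
vary_spreads = {
    "red": NormalDist(mu=50, sigma=5),
    "blue": NormalDist(mu=50, sigma=10),
    "green": NormalDist(mu=50, sigma=20),
}

best_by_mean = most_likely(vary_means)
best_by_spread = most_likely(vary_spreads)
print(best_by_mean, best_by_spread)
```

Picking the best of three fixed candidates is only a stand-in for the real procedure, which searches over a continuous range of parameter values.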
Now, this is really a caricature of the process, because obviously you would just want to center it right there on the 55 and be done with it. The thing is, when you have many variables in your data set, it's a very complex process of choosing values that can maximize the association between all of them. But you get a feel for how it works with this. The third approach that's pretty common is something called MAP, for maximum a posteriori. This is a Bayesian approach to parameter estimation. And what it does is it adds the prior distribution, and then it goes through sort of an anchoring and adjusting process. What happens, by the way, is that stronger prior estimates exert more influence on the estimate. And that might mean, for instance, a larger sample or more extreme values. Those have a greater influence on the posterior estimate of the parameters. Now, what's interesting is that these three methods all connect with each other. Let me show you exactly how they connect. Ordinary least squares, OLS, is equivalent to maximum likelihood when it has normally distributed error terms. And maximum likelihood, ML, is equivalent to maximum a posteriori, or MAP, with a uniform prior distribution. To put it another way, ordinary least squares, or OLS, is a special case of maximum likelihood. And then maximum likelihood, or ML, is a special case of maximum a posteriori. And just in case you like it, we can put it in set notation. OLS is a subset of ML, which is a subset of MAP. And so there are connections between these three methods of estimating population parameters. Let me just sum it up briefly this way. The standards that you use, OLS, ML, MAP, affect your choices and the ways that you determine what parameters best estimate what's happening in your data. Several methods exist. And there are obviously more than what I showed you right here. But many are closely related, and under certain circumstances they're all identical.
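The link between OLS and maximum likelihood can be illustrated with a small sketch. The data points are an assumption: hypothetical speed and stopping-distance pairs, not the data set shown in the talk, and the fixed error standard deviation is likewise arbitrary. The code fits the least squares line in closed form and then checks that nudging the line away from it both raises the sum of squared residuals and lowers the log likelihood under normally distributed errors.

```python
import math

# Hypothetical (speed, stopping distance) pairs, for illustration only.
xs = [4, 7, 8, 9, 10, 11, 12, 13, 14, 15]
ys = [2, 4, 16, 10, 18, 17, 24, 34, 26, 20]
n = len(xs)

# Closed-form ordinary least squares for a straight line.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x


def sse(a, b):
    """Sum of squared residuals for the line y = a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def log_likelihood(a, b, sigma=5.0):
    """Log likelihood of the data if errors are normal with SD sigma (assumed)."""
    return -n / 2 * math.log(2 * math.pi * sigma ** 2) - sse(a, b) / (2 * sigma ** 2)


# Lines slightly shifted or tilted away from the OLS solution.
nearby = [(intercept + da, slope + db)
          for da in (-1, 0, 1) for db in (-0.2, 0, 0.2)
          if (da, db) != (0, 0)]

ols_has_smallest_sse = all(sse(a, b) > sse(intercept, slope) for a, b in nearby)
ols_has_highest_likelihood = all(
    log_likelihood(a, b) < log_likelihood(intercept, slope) for a, b in nearby
)
print(ols_has_smallest_sse, ols_has_highest_likelihood)
```

With normal errors the log likelihood is a constant minus a multiple of the sum of squared residuals, so minimizing one is the same as maximizing the other. Adding a flat prior would not change which line wins, which is the sense in which ML is a special case of MAP.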
And so it comes down to exactly what your purposes are and what you think is going to work best with the data that you have to give you the insight that you need in your own project.
The next step we want to consider in statistics and data science and the choices that we have to make has to do with measures of fit, or the correspondence between the data that you have and the model that you create. Now it turns out there are a lot of different ways to measure this. And one big question is: how close is close enough? Or how can you see the difference between the model and reality? Well, there are a few really common approaches to this. The first one is called R squared. It's got a longer name, the coefficient of determination. There's a variation, adjusted R squared, which takes into consideration the number of variables. Then there's minus two LL, which is based on the likelihood ratio, and a couple of variations, the Akaike information criterion, or AIC, and the Bayesian information criterion, or BIC. And then there's also chi squared. Now that's actually a Greek chi there. It looks like an X, but it's a chi. And that's chi squared. And so let's talk about each of these in turn. First off is R squared. This is the squared multiple correlation, or the coefficient of determination. And what it does is it compares the variance of Y. So if you have an outcome variable, it looks at the total variance of that and compares it to the residuals on Y after you've made your prediction. The scores on R squared range from zero to one, and higher is better. The next is minus two log likelihood. That's based on the likelihood ratio. And what this does is it compares the fit of nested models. We have a subset, then a larger set, then a larger set overall. This approach is used a lot in logistic regression, when you have a binary outcome. And in general, smaller values are considered a better fit.
Now, as I mentioned, there are some variations on the minus two log likelihood. I like to think of variations of chocolate. There's the Akaike information criterion, the AIC, and the Bayesian information criterion, the BIC. And what both of these do is they adjust for the number of predictors. Because obviously, if you have a huge number of predictors, you're going to get a really good fit. But you're probably going to have what's called overfitting, where your model is tailored too specifically to the data you currently have and doesn't generalize well. These both attempt to reduce the effect of overfitting. And then there's chi squared. Again, it's actually a lowercase Greek chi that looks like an X. And chi squared is used for examining the deviations between two data sets, specifically between the observed data set and the expected values, or the model you create, where you say we expect this many frequencies in each category. Now, I'll just mention that, like going to the store, there are a lot of other choices, but these are some of the most common standards, particularly the R squared. And I just want to say, in sum, there are many different ways to assess the fit, the correspondence between a model and your data. And the choices affect the model. You know, especially, are you going to penalize for throwing in too many variables relative to your number of cases? Are you dealing with a quantitative or a binary outcome? Those things all matter. And so the most important thing, as always, my standing advice, is keep your goals in mind and choose a method that seems to fit best with your analytical strategy and the insight you're trying to get from your data.
Statistics and data science offers a lot of different choices. One of the most important is going to be feature selection, or the choice of variables to include in your model. It's sort of like confronting this enormous range of information and trying to choose what matters most, trying to get the needle out of the haystack.
The goal of feature selection is to select the best features or variables, get rid of uninformative and noisy variables, and simplify the statistical model that you're creating, because that helps avoid overfitting, or getting a model that works too well with the current data and works less well with other data. The major problem here is multicollinearity, a very long word that has to do with the relationship between the predictors in the model. I'm going to show it to you graphically here. Imagine, for instance, we've got a big circle here to represent the variability in our outcome variable. We're trying to predict it. And we've got a few predictors. So we've got predictor number one over here, and you see it's got a lot of overlap. That's nice. Then we've got predictor number two here. It also has some overlap with the outcome, but it also overlaps with predictor one. And then finally down here we've got predictor three, which overlaps with both of them. And the problem arises from the overlap among the predictors where they also overlap the outcome variable. Now there are a few ways of dealing with this. Some of these are pretty common. So for instance, there is the practice of looking at probability values in regression equations. There are standardized coefficients, and there are variations on sequential regression. There are also newer procedures for dealing with the disentanglement of the association between the predictors. There's something called commonality analysis. There's dominance analysis. And there are relative importance weights. Of course, there are many other choices in both the common and the newer, but these are a few that are worth taking a special look at. First is p values, or probability values. This is the simplest method, because most statistical packages will calculate probability values for each predictor, and they'll put little asterisks next to them.
And so what you're doing is you're looking at the p values, the probabilities for each predictor, or more often the asterisks next to them, which sometimes gives it the name of star search. You're just kind of cruising through a large output of data and just looking for the stars or asterisks. This is fundamentally a problematic approach for a lot of reasons. The problem here is that you're looking at each predictor individually, and it inflates false positives. Say you have 20 variables, and each is entered and tested with an alpha, or false positive rate, of 5%. You end up with nearly a 65% chance of at least one false positive in there. It's also distorted by sample size, because with a large enough sample, anything becomes statistically significant. And so relying on p values can be a seriously problematic approach. A slightly better approach is to use betas, or standardized regression coefficients. And this is where you put all the variables on the same scale, usually centered at zero and then scaled either to run from minus one to plus one or to have a standard deviation of one. The trick is, though, that they're still in the context of each other, and you can't really separate them, because those coefficients are only valid when you take that group of predictors as a whole. So one way to try to get around that is to do what they call stepwise procedures, where you look at the variables in sequence. There are several versions of sequential regression that allow you to do that. You can put the variables into groups or blocks, enter them in blocks, and look at how the equation changes overall. You can examine the change in fit at each step. The problem with a stepwise procedure like this is that it dramatically increases the risk of overfitting, which again is a bad thing if you want to generalize from your data. And so to deal with this, there's a whole collection of newer methods. A few of them include commonality analysis, which provides separate estimates for the unique and shared contributions of each variable.
Well, that's a neat statistical trick, but the problem is it just moves the problem of disentanglement to the analyst. So you're really not better off than you were, as far as I can tell. There's dominance analysis, which compares every possible subset of predictors. Again, it sounds really good, but you have the problem known as the combinatorial explosion. If you have 50 variables that you could use, and there are some data sets that have millions of variables, then with 50 variables you have over one quadrillion possible combinations. You're not going to finish that in your lifetime. And it's also really hard to get things like standard errors and perform inferential statistics with this kind of model. Then there's also something that's even more recent than these others, called relative importance weights. And what this does is it creates a set of predictors that are orthogonal, or uncorrelated with each other, basing them off of the originals. And then it can predict the outcome without the multicollinearity, because these new predictors are uncorrelated. It then rescales the coefficients back to the original variables. That's the back transform. And then from that, it assigns relative importance, or a percentage of explanatory power, to each predictor variable. Now, despite this very different approach, it tends to have results that resemble dominance analysis. It's actually really easy to do. There are websites where you just plug in your information and it does it for you. And so that's yet another way of dealing with the problem of multicollinearity and trying to disentangle the contribution of different variables. In sum, let's say this: what you're trying to do here is choose the most useful variables to include in your model. Make it simpler. Be parsimonious. Also, reduce the noise and distractions in your data.
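Two of the figures quoted in this discussion of feature selection can be checked with quick arithmetic. This sketch assumes the 20 significance tests are independent of one another, which the talk does not state, and it counts every possible subset of 50 predictors, including the empty one.

```python
ALPHA = 0.05   # false positive rate for each individual test
TESTS = 20     # number of predictors tested one at a time

# Chance that none of the tests gives a false positive, then its complement.
chance_of_no_false_positive = (1 - ALPHA) ** TESTS
chance_of_at_least_one = 1 - chance_of_no_false_positive

# Each of 50 predictors is either in or out of a model: 2 ** 50 subsets.
PREDICTORS = 50
possible_subsets = 2 ** PREDICTORS

print(round(chance_of_at_least_one, 4))
print(f"{possible_subsets:,}")
```

The first figure comes out at about 64 percent, in line with the "nearly 65%" quoted, and the second is a little over one quadrillion.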
And in choosing your variables, you're always going to have to confront the ever present problem of multicollinearity, or the association between the predictors in your model, and there are several different ways of dealing with that.
As we continue our discussion of statistics and the choices that are made, one important consideration is model validation. And the idea here is, as you're doing your analysis, are you on target? More specifically, the model that you create through regression, or whatever you do, fits the sample data beautifully. You've optimized it there. But will it work well with other data? Fundamentally, this is the question of generalizability, also sometimes called scalability, because you're trying to apply it in other situations. And you don't want to get too specific or it won't work in other situations. Now, there are a few general ways of dealing with this and trying to get some sort of generalizability. Number one is Bayes, a Bayesian approach. Then there's replication. Then there's something called holdout validation. And then there's cross validation. I'll discuss each of these very briefly in conceptual terms. The first one is Bayes. And the idea here is you want to get what are called posterior probabilities. Most analyses give you a probability value for the data given the hypothesis. So you have to start with an assumption about the hypothesis. But instead, it's possible to flip that around by combining it with special kinds of data to get the probability of the hypothesis given the data. And that is the purpose of Bayes' theorem, which I've talked about elsewhere. Another way of finding out how well things are going to work is through replication. That is, do the study again. It's considered the gold standard in many different fields. The question is whether you need an exact replication or a conceptual one that is similar in certain respects. You can argue it both ways.
When you do a replication, one thing you want to do is actually combine the results. And what's interesting is the first study can serve as the Bayesian prior probability for the second study. So you can actually use meta-analysis or Bayesian methods for combining the data from the two of them.
Then there's holdout validation. This is where you build your statistical model on one part of the data and you test it on another. I like to think of it as the eggs in separate baskets. The trick is that you need a large sample in order to have enough to do these two steps separately. On the other hand, it's also used very often in data science competitions as a way of having a sort of gold standard for assessing the validity of a model.
Finally, I'll just mention one more. That's cross-validation. This is when you use the same data for both training and for testing or validating. There are several different versions of it. And the idea is you're not using all the data at once, but you're kind of cycling through and weaving the results together. There's leave-one-out, where you leave out one case at a time, also called LOO, L-O-O. There's leave-p-out, where you leave out a certain number at each point. There's k-fold, where you split the data into, say, 10 groups, and you leave out one and you develop the model on the other nine, and then you cycle through. And there's repeated random subsampling, where you use a random process at each point. Any of those can be used to develop the model on one part of the data and test it on another and then cycle through to see how well it holds up under different circumstances.
And so in sum, I can say this about validation. You want to make your analysis count by testing how well your model holds up from the data you developed it on to other situations, because that's really what you're trying to accomplish.
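The k-fold cycle described above (split the data into 10 groups, develop the model on nine, test on the one left out, then rotate) can be sketched in plain Python. The helper names, the made-up data, and the choice of a one-predictor straight-line model are all assumptions for illustration; in practice you would probably reach for a library routine.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle case indices 0..n-1 and deal them into k roughly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def cross_validate(xs, ys, k=10, seed=0):
    """Return one held-out mean squared error per fold."""
    errors = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        # Develop the model on everything except the held-out fold.
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Test it only on the cases it has not seen.
        mse = sum((ys[i] - (intercept + slope * xs[i])) ** 2 for i in fold) / len(fold)
        errors.append(mse)
    return errors

# ASSUMPTION: made-up data, a straight line plus random noise.
rng = random.Random(42)
xs = [float(i) for i in range(50)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 3) for x in xs]

fold_errors = cross_validate(xs, ys, k=10)
average_error = sum(fold_errors) / len(fold_errors)
print(f"average held-out error across {len(fold_errors)} folds: {average_error:.2f}")
```

Setting `k` equal to the number of cases turns this same loop into leave-one-out, since each fold then holds exactly one case.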
Validation allows you to check the validity of your analysis and your reasoning, and it allows you to build confidence in the utility of your results.
To finish up our discussion on statistics and data science and the choices that are involved, I want to mention something that really isn't a choice, but more an attitude. It's DIY, for do it yourself. The idea here is, you know, really, you just need to get started. Remember, data is democratic. It's there for everyone. Everybody has data. Everybody works with data, either explicitly or implicitly. So data is democratic. So is data science. And really, my overall message is you can do it.
You know, a lot of people think you have to be at this totally cutting-edge, virtual-reality sort of thing. And it's true, there's a lot of active development going on in data science. There's always new stuff. The trick, however, is that the software you can use to implement those things often lags. It'll show up first in programs like R and Python. But as far as showing up in a point-and-click program, that could be years. What's funny, though, is that often these cutting-edge developments don't really make much of a difference in the results or the interpretation. They may in certain edge cases, but usually not a huge difference. And so I'm just going to say: analyst beware. You don't necessarily have to use them, and it's pretty easy to do them wrong. And so you don't have to wait for the cutting edge.
Now, that being said, I do want you to pay attention to what you're doing. A couple of things I've said repeatedly: know your goal. Why are you doing this study? Why are you analyzing data? What are you hoping to get out of it? Try to match your methods to your goal. Be goal directed. Focus on the usability. Will you get something out of this that people can actually do something with? And then, as I've mentioned several times with the Bayesian thing, don't get confused with probabilities.
Remember that priors and posteriors are different things, just so you can interpret things accurately.
Now, I want to mention something that's really important to me personally, and that is beware the trolls. You will encounter critics, people who are very vocal and who can be harsh and grumpy and really just intimidating. And they can really make you feel like you shouldn't do stuff because you're going to do it wrong. But the important thing to remember is the critics can be wrong. Yes, you'll make mistakes. Everybody does. You know, I can't tell you how many times I have to write my code more than once to get it to do what I want it to do. But in analysis, nothing is completely wasted if you pay close attention. I've mentioned this before. Everything signifies, or in other words, everything has meaning. The trick is that the meaning might not be what you expected it to be. So you're going to have to listen carefully. And I just want to reemphasize: all data has value. So make sure you're listening carefully.
In sum, let's say this: no analysis is perfect. The real question is not whether your analysis is perfect, but whether you can add value. And I'm sure that you can. And fundamentally, data is democratic. So I'm going to finish with one more picture here. And that is: just jump right in and get started. You'll be glad you did.
To wrap up our course, statistics and data science, I want to give you a short conclusion and some next steps. Mostly I want to share a little piece of advice I learned from a professional saxophonist, Kirk Whalum, and he says there's always something to work on. There's always something you can do to try things differently, to get better. It works when practicing music. It also works when you're dealing with data. Now, there are additional courses here at datalab.cc that you might want to look at. There are conceptual courses, additional high-level overviews on things like machine learning, data visualization and other topics.
And I encourage you to take a look at those as well to round out your general understanding of the field. There are also, however, many practical courses. These are hands-on tutorials on the statistical procedures I've covered, and you learn how to do them in R and Python and SPSS and other programs. But whatever you're doing, keep this other little piece of advice from writers in mind. And that is: write what you know. And I'm going to say it this way: explore and analyze and delve into what you know. Remember, when we talked about data science and the Venn diagram, we talked about the coding and the stats, but don't forget this part here on the bottom. Domain expertise is just as important to good data science as the ability to work with computer coding and the ability to work with the numbers and quantitative skills. But all through it, remember this: you don't have to know everything. Your work doesn't have to be perfect. The most important thing is just get started. You'll be glad you did. Thanks for joining me and good luck.