Back here in data science, we're going to continue our attempt to define data science by looking at something that's really well known in the field: the data science Venn diagram. If you want to, you can think of this in terms of "what are the ingredients of data science?" We're going to first say thanks to Drew Conway, the guy who came up with this, and if you want to see the original article you can go to this address.

What Drew said is that data science is made of three things, and we can put them as overlapping circles because it's the intersection that's important. Here on the top left is coding, or computer programming, or as he calls it, hacking. On the top right is stats, or stats and mathematics, or quantitative abilities in general. And on the bottom is domain expertise, or intimate familiarity with a particular field of practice: business, or health, or education, or something like that. And the intersection here in the middle, that is data science. So it's the combination of coding, and statistics and math, and domain knowledge.

Now let's say a little more about coding. The reason coding is important is that it helps you gather and prepare the data. A lot of the data comes from novel sources, it's not necessarily ready for you to gather, and it can be in very unusual formats. So coding is important because it can require some real creativity to get the data from these sources and put it into your analysis.

Now, a few kinds of coding that are important. For instance, there's statistical coding. A couple of major languages in this are R and Python, two open-source, free programming languages. R is specifically for data; Python is general purpose but well adapted to data. The ability to work with databases is important too. The most common language there is SQL, usually pronounced "sequel," which stands for Structured Query Language, because that's where the data is. Also, there's the command line interface, or if you're on a Mac, people just call it the terminal. The most common
language there is Bash, which actually stands for Bourne-again shell. And then searching is important, and regex, or regular expressions. While there's not a huge amount to learn there, it's a small little field, it's sort of like super-powered wildcard searching that makes it possible for you to both find the data and reformat it in ways that are going to be helpful for your analysis.

Now let's say a few things about the math. You're going to need things like a little bit of probability, some algebra, and of course regression, a very common statistical procedure. Those things are important, and the reason you need the math is that it's going to help you choose the appropriate procedures to answer the question with the data that you have. Probably even more importantly, it's going to help you diagnose problems when things don't go as expected. And given that you're trying to do new things with new data in new ways, you're probably going to come across problems. So the ability to understand the mechanics of what's going on is going to give you a big advantage.

The third element of the data science Venn diagram is some sort of domain expertise. Think of it as expertise in the field that you're in. Business settings are common. You need to know about the goals of that field, the methods that are used, and the constraints that people come across. It's important because whatever your results are, you need to be able to implement them well. Data science is very practical and it's designed to accomplish something, and your familiarity with a particular field of practice is going to make it that much easier and more impactful when you implement the results of your analysis.

Now let's go back to our Venn diagram here just for a moment. Because this is a Venn diagram, we also have these intersections of two circles at a time. At the top is machine learning. At the bottom right is traditional research. And on the bottom left is what Drew Conway called the danger zone. Let me talk about each of
these. First off, machine learning, or ML. Now, you think about machine learning, and the idea here is that it represents coding, or statistical programming, and mathematics without any real domain expertise. Sometimes these are referred to as black box models. You kind of throw data in, and you don't even necessarily have to know what it means or what language it's in, and it'll just kind of crunch through it all and give you some regularities. That can be very helpful, but machine learning is considered slightly different from data science because it doesn't involve the particular applications in a specific domain.

Also, there's traditional research. This is where you have math or statistics and you have domain knowledge, often very intensive domain knowledge, but without the coding or programming. Now, you can get away with that because the data that you use in traditional research is highly structured. It comes in rows and columns, it's typically complete, and it's typically ready for analysis. That doesn't mean your life is easy, because now you have to expend an enormous amount of effort in the method, in designing the project, and in the interpretation of the data. So it's still very heavy intellectual, cognitive work, but it comes in a different place.

And then finally, there's what Conway called the danger zone. That's the intersection of coding and domain knowledge, but without math or statistics. Now, he says it's unlikely to happen, and that's probably true. On the other hand, I can think of some common examples: what are called word counts, where you take a large document or series of documents and you count how often each word appears in there. That can actually tell you some important things. And also drawing maps and showing how things change across place and maybe across time. You don't necessarily have to have the math, but it can be very insightful and helpful.

So let's think about a couple of backgrounds where people come from here. First is coding. You can have people who are coders who
can do math, stats, and business. So you get the three things, and this is probably the most common; most people come from a programming background. On the other hand, there's also stats, or statistics, and you can get statisticians who can code and who also can do business. That's less common, but it does happen. And finally, there are people who come into data science from a particular domain. These are, for instance, business people who can code and do numbers, and they're the least common. But all of these are important to data science.

And so, in sum, here's what we can take away. First, several fields make up data science. Second, diverse skills and backgrounds are important, and they're needed in data science. And third, there are many roles involved, because there are a lot of different things that need to happen. We'll say more about that in our next movie.
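
As a small illustration of the word counts mentioned earlier as a danger-zone example, here is a minimal sketch in Python. It is not from the talk itself: the function name `word_counts`, the sample sentence, and the choice of what counts as a "word" (runs of letters, optionally with one internal apostrophe) are all assumptions made for the example. It also shows a regular expression doing the kind of pattern-based searching described in the coding section.

```python
import re
from collections import Counter


def word_counts(text):
    """Count how often each word appears in text, ignoring case.

    Hypothetical helper for illustration only. A "word" here is assumed
    to be a run of letters a-z, optionally followed by an apostrophe and
    more letters (so "it's" stays one word).
    """
    # Lowercase first so "Data" and "data" are counted together.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words)


# Made-up sample text, not real data.
sample = "Data science is coding, and math, and domain knowledge. Data is everywhere."
counts = word_counts(sample)

# Show the three most frequent words with their counts.
print(counts.most_common(3))
```

Notice that nothing here involves statistics: it is just coding applied to text, which is why a count like this can be informative without any of the math.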