When it comes to data sourcing, obviously the most important thing is to get data. But the easiest way to do that, at least in theory, is to use existing data. Think of it as going to the bookshelf and getting the data that you have right there at hand. Now, there are a few different ways to do this. You can get in-house data, you can get open data, and you can get third-party data. Another nice way to think of that is proprietary, public, and purchased data: the three P's, I've heard it called. Let's talk about each of these a little bit more.

So, in-house data: that's stuff that's already in your organization. What's nice about that is it can be really fast and easy. It's right there. And the format may be appropriate for the kind of software and the computer that you're using. If you're fortunate, there's good documentation, although sometimes when it's in house, people just kind of throw it together, so you have to watch out for that. And there's the issue of quality control. Now, this is true with any kind of data, but you need to pay attention with in-house data because you don't necessarily know the circumstances under which people gathered the data or how much attention they were paying. There's also the issue of restrictions. There may be some data that, while it's in house, you may not be allowed to use, or you may not be able to publish or share the results with other people. So these are things you need to think about when you're going to use in-house data to facilitate your data science projects.

Specifically, there are a few pros and cons. In-house data is potentially quick, easy, and free; hopefully it's standardized; and maybe even the original team that conducted the study is still there. And you might have identifiers in the data, which make it easier for you to do an individual-level analysis. On the con side, however, the in-house data simply may not exist.
Maybe it's just not there, or the documentation may be inadequate. And, of course, the quality may be uncertain. That's always true, but it may be something you have to pay more attention to when you're using in-house data.

Now, another choice is open data, which is like going to the library and getting something. This is prepared data that's freely available, and it consists of things like government data, corporate data, and scientific data from a number of sources. Let me show you some of my favorite open data sources, just so you know where they are and that they exist. Probably the best one is data.gov here in the US. That is, as it says right here, the home of the US government's open data. Or you may have a state-level one. For instance, I'm in Utah, and we have data.utah.gov, also a great source of more regional information. If you're in Europe, you have open-data.europa.eu, the European Union Open Data Portal. And then there are major nonprofit organizations. So the UN has unicef.org/statistics for their statistical and monitoring data. The World Health Organization has the Global Health Observatory at who.int/gho. And then there are private organizations that work in the public interest, such as the Pew Research Center, which shares a lot of its data sets, and the New York Times, which makes it possible to use APIs to access a huge amount of data on things they've published over a huge time span. And then two of the mother lodes: there's Google, which at google.com has public data, which is a wonderful thing. And then Amazon, at aws.amazon.com/datasets, has gargantuan data sets. So if you needed a data set that was, like, five terabytes in size, this is the place you would go to get it.

Now, there are some pros and cons to using this kind of open data. First is that you can get very valuable data sets that maybe cost millions of dollars to gather and process. And you can get a very wide range of topics and times and groups of people and so on.
And often the data is very well formatted and well documented. There are, however, a few cons. Sometimes there are biased samples. Say, for instance, you only get people who have internet access, and that can mean, you know, not everybody. Sometimes the meaning of the data is not clear, or it may not mean exactly what you want it to. A potential problem is that sometimes you may need to share your analysis, and if you're doing proprietary research, well, it's going to have to be open research instead. And so that can create a cramp with some of your clients. And then finally, there are issues with privacy and confidentiality. In public data, that usually means that the identifiers are not there and you're going to have to work at a larger, aggregate level of measurement.

Another option is to use data from a third party. These go by the name data as a service, or DaaS; you can also call them data brokers. And the thing about data brokers is they can give you an enormous amount of data on many different topics. Plus, they can save you some time and effort by actually doing some of the processing for you. And that can include things like consumer behaviors and preferences, contact information, marketing, identity, and finances. There are a lot of things. There are a number of data brokers around. Here are a few of them. Acxiom is probably the biggest one in terms of marketing data. There's also Nielsen, which provides data primarily on media consumption. And there's another organization, DataSift, that's a smaller, newer one. There's a pretty wide range of choices, but these are some of the big ones.

Now, with data brokers, there are some pros and some cons. The pros are, first, that it can save you a lot of time and effort. It can also give you individual-level data, which can be hard to get from open data. Open data is usually at the community level. They can give you information about specific consumers.
They can even give you summaries and inferences about things like credit scores and marital status, possibly even whether a person gambles or smokes. Now, the cons are these. Number one, it can be really expensive. I mean, this is a huge service. It provides a lot of benefit and is priced accordingly. Also, you still need to validate it. You still need to double-check that it means what you think it means and that it works with what you want. And probably a real sticking point here is that the use of third-party data is distasteful to many people. And so you have to be aware of that as you're making your choices.

So in sum, as far as sourcing existing data goes: obviously, data science needs data. And there are the three P's of data sources: proprietary, public, and purchased. But no matter what source you use, you need to pay attention to the quality, the meaning, and the usability of the data to help you along in your own projects.
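
To make that last point a little more concrete, here is a minimal sketch of the kind of first-pass quality check you might run on any existing data set, whether it's proprietary, public, or purchased. This is not from the lecture itself. The function name `quality_report`, the column names, and the sample rows are all hypothetical, and a real check would go further, for instance into value ranges, units, and what each column actually means.

```python
import csv
import io
from collections import Counter


def quality_report(rows, id_field):
    """Count rows, blank values per column, and repeated identifiers."""
    rows = list(rows)
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if column is None:
                # csv.DictReader stores surplus fields under the key None; skip them.
                continue
            if value is None or value.strip() == "":
                missing[column] += 1
    # Identifiers that appear more than once may signal duplicated records.
    id_counts = Counter(row[id_field] for row in rows if row.get(id_field))
    duplicate_ids = sorted(key for key, count in id_counts.items() if count > 1)
    return {
        "row_count": len(rows),
        "missing_by_column": dict(missing),
        "duplicate_ids": duplicate_ids,
    }


# Hypothetical sample standing in for a file from any of the three sources.
SAMPLE_CSV = """customer_id,age,state
101,34,UT
102,,UT
103,51,
101,34,UT
"""

report = quality_report(csv.DictReader(io.StringIO(SAMPLE_CSV)), id_field="customer_id")
print(report)
```

With a real file, you would pass `csv.DictReader(open(path, newline=""))` instead of the in-memory sample. A check like this only tells you about completeness and duplication; the questions about meaning and usability still need the documentation, or the people who gathered the data.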