Data science has a lot of wonderful things about it, but it's important to consider some ethical issues. I'll specifically call this "do no harm" in your data science projects, and for that we can say thanks to Hippocrates, the guy who gave us the Hippocratic oath of do no harm. Let's talk briefly about some of the important ethical issues that come up in data science.

Number one is privacy. Data tells you a lot about people, and you need to be concerned about confidentiality if you have private information about them: their names, their Social Security numbers, their addresses, their credit scores, their health. That's private, that's confidential, and you shouldn't share that information unless they specifically gave you permission. One of the reasons this presents a special challenge in data science is that, as we'll see later, a lot of the sources used in data science were not intended for sharing. If you scrape data from a website or from PDFs, you need to make sure that it's okay to do that, because the data was originally created without the intention of sharing. So privacy is something that really falls upon the analyst to handle properly.

Next is anonymity. One of the interesting things we find is that it's really not hard to identify people in data. If you have a little bit of GPS data and you know where a person was at four different points in time, you have about a 95% chance of knowing exactly who they are. Or look at HIPAA, the Health Insurance Portability and Accountability Act. Before HIPAA it was really easy to identify people from medical records; since then it has become much more difficult to identify people uniquely, and that's important for people's well-being. And then there's proprietary data: if you're working for a client, a company, and they give you their own data, that data may have identifiers. You may know who the people are, and they're not anonymous anymore.
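To make the anonymity point concrete, here is a minimal sketch of one common first step, pseudonymization: replacing direct identifiers with salted, non-reversible tokens before data is shared. The field names, record, and secret key are all hypothetical, and this step alone does not protect against re-identification from quasi-identifiers like the GPS traces mentioned above.

```python
# Minimal pseudonymization sketch (hypothetical fields and key).
# A keyed hash maps each identifier to a stable token that cannot be
# reversed without the secret key, which must be stored separately
# from the shared data.
import hmac
import hashlib

SECRET_KEY = b"replace-with-a-securely-stored-secret"  # assumption: never shipped with the data

def pseudonymize(value: str) -> str:
    """Map an identifier to a stable, non-reversible token."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"name": "Jane Doe", "ssn": "123-45-6789", "score": 0.87}
shared = {**record,
          "name": pseudonymize(record["name"]),
          "ssn": pseudonymize(record["ssn"])}
```

Because the same input always maps to the same token, analyses that need to link records for one person still work, while the shared table no longer contains the raw identifiers.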
So anonymity may or may not be there, despite major efforts to make data anonymous. But the primary thing is that even if you do know who people are, you still maintain the privacy and confidentiality of the data.

Next there's the issue of copyright, where people try to lock down information. Just because something is on the web doesn't mean that you're allowed to use it. Scraping data from websites is a very common and useful way of getting data for projects: you can get data from web pages, from PDFs, from images, from audio, from a huge number of things. But again, the assumption that because it's on the web it's okay to use is not true. You always need to check copyright and make sure that it's acceptable for you to access that particular data.

Next, in our very ominous picture, is data security. The idea here is that when you go through all the effort to gather data, clean it up, and prepare it for analysis, you've created something that's very valuable to a lot of people, and you have to be concerned about hackers trying to come in and steal the data, especially if the data is not anonymous and has identifiers in it. So there's an additional burden placed on the analyst to ensure, to the best of their ability, that the data is safe and cannot be broken into and stolen. That can include very simple things, like making sure a person who was on the project, but no longer is, can't take the data out on a flash drive. There are a lot of possibilities.
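One mechanical piece of the "make sure it's okay to scrape" point above is checking a site's robots.txt. This is only a courtesy and technical signal, not a substitute for reading a site's terms of service or checking copyright. The URL and rules below are hypothetical, and the rules are fed to the parser directly so the sketch runs offline.

```python
# Sketch: checking robots.txt rules before scraping (hypothetical site).
# Python's standard library ships a parser for this format.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/public/page"))   # allowed
print(rp.can_fetch("*", "https://example.com/private/data"))  # disallowed
```

In a real project you would call `rp.set_url(...)` and `rp.read()` against the live site instead of parsing a hard-coded string, and you would still verify the legal side separately.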
It's tricky, but it's something that you have to consider thoroughly.

Now, two other things come up in terms of ethics but don't usually get addressed in these conversations. Number one is potential bias. The idea here is that the algorithms or formulas used in data science are only as neutral and bias-free as the rules and the data that they're given. If you have rules that address something associated with, for instance, gender or age or race or economic standing, factors that, say, under Title IX you're not supposed to use, you might unintentionally be building those into the system without being aware of it. And an algorithm has this sheen of objectivity: people can place confidence in it without realizing that it's replicating some of the prejudices that happen in real life.

Another issue is overconfidence. The idea here is that analyses are limited simplifications; they have to be, that's just what they are. Because of this, you still need humans in the loop to help interpret and apply the results. The problem is when people run an algorithm, get a number to, say, 10 decimal places, and treat it as absolute, unshakable truth written in stone, when in fact, if the data were biased going in, if the algorithms were incomplete, if the sampling was not representative, you can have enormous problems and go down the wrong path with too much confidence in your own analyses. So once again, humility is in order when doing data science work.

In sum, data science has enormous potential, but its projects also involve significant risks. Part of the problem is that analyses can't be neutral: you have to look at how the algorithms are associated with the preferences, prejudices, and biases of the people who made them. What that means is that, no matter what, good judgment is always vital to the quality and success of a data science project.
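The overconfidence point above can be sketched numerically: a number printed to 10 decimal places looks exquisitely precise, but recomputing the same estimate on a second random sample of the same population moves it well before the tenth decimal. All numbers here are simulated, not real data.

```python
# Sketch: spurious precision. Two samples from the same population
# give point estimates that disagree long before the 10th decimal,
# even though each prints with 10 decimals of apparent precision.
import random

random.seed(1)
POPULATION_MEAN = 50.0

def sample_mean(n: int = 100) -> float:
    # draw n noisy observations around the true mean
    return sum(random.gauss(POPULATION_MEAN, 10) for _ in range(n)) / n

m1 = sample_mean()
m2 = sample_mean()
print(f"{m1:.10f}")  # looks exact...
print(f"{m2:.10f}")  # ...but a fresh sample lands somewhere else entirely
```

The honest summary of such an estimate is the point value plus its uncertainty (a standard error or interval), not the raw decimals the computer happens to print.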