Okay, it looks like everybody's got a seat at this point, so let's get started. I want to let you know that I'll be putting the code that goes along with this presentation on GitHub, so if any of you are new to data analysis and want some code to get started with, it will be available there.

My talk is a combination of why you should do exploratory analysis and how to do exploratory analysis. I'm not going to cover everything, because that could take weeks, but I'll show a couple of graphs to get everyone started and thinking about why we do exploratory analysis, why it's important, and what happens if you don't do it.

The biggest problem I have seen with machine learning is not that we don't have enough algorithms or techniques. It's that people don't want to start with the basics; they want to jump right to the exciting stuff ("Oh, let's throw a neural net at it, that'll fix it!") rather than just looking at the basics of the problem. It's kind of like eating your vegetables: people generally aren't super excited about vegetables, but they help make you stronger and healthier, and exploratory analysis can help make your data healthier.

So, some examples of why you want to do exploratory analysis, or even just think about what you're doing with your data. Take this article as an example: they were trying to identify criminals based on facial recognition. Not a specific criminal, just "criminals" in general. That doesn't really sound like a great idea if you think about it, especially when you realize they weren't even comparing pictures from the same source. They were comparing mugshots to photos they found on the internet of smiling people. I think most of you are familiar enough with data science to realize why that's a problem, but it's easy to get so involved in what you're doing that you don't stop and think, "Is what I'm about to do actually a good idea?"

Another example is this one. I'm not as sure about it, but I think it's interesting. Some diseases have hyphens in their names, and it turns out that having a bunch of hyphens in the title of your article has a negative correlation with how many citations your paper gets. This could be a real effect, or it could just be that longer titles generally have more hyphens and also belong to more specialized articles, which get cited less because they address a niche within medical research. So maybe, maybe not. There are a lot of interesting correlations out there, and I think this one needs to be looked into more, but it's an example of why we need exploratory analysis.
We need to figure out what's actually there and what might not be important. The good news is that there are all sorts of Python packages out there to help you with your data, and I'm going to go through a couple of my favorites and show you some examples of what you can do with them.

I'm not going to cover much about feature extraction, just because it's very specific to the kind of data you have. For my dissertation research I used inertial sensors, so I dealt a lot with time series data, which is completely different from language processing. For how to extract features from your data, look within whatever area of research you're in, but also at other areas that might have similar data. When I did my research, I used a lot of papers from the aviation industry, because it turns out it's a lot more important to keep a plane upright while it's flying through the air than it is to know the exact position of grandma's ankle while she's walking around the house. So there has been a lot more research for aviation applications.
Had I looked only in medical journals, I would have missed a lot of valuable information about how to deal with the data from these sensors. So keep in mind that you don't have to stick to exactly the application you're working on.

Another thing to keep in mind with feature extraction is categorical data. There are really two different kinds: ordinal data, which has an order, and nominal data, which doesn't. I've seen these mixed up, with people trying to turn nominal data directly into numerical data. That can work if you look at your data first and find some correlation that justifies putting it in a specific order; there are always exceptions, and it's specific to the data you have. But in general, if you haven't explored your data, you probably don't want to take something like someone's favorite color and arbitrarily say blue is zero and red is one without looking first.

One more thing I want to mention is binning, because I've seen that done a variety of different ways. When something is not continuously distributed, it's probably worth looking at your data before you make your bins.

I'm going to use the Titanic data set from Kaggle as an example, just because there are already so many options out there for code to analyze it. Once you get familiar with that particular data set, looking at the different ways it's been re-analyzed is a good way to get started with visualization. I think that's how I decided to start looking into different kinds of visualization.
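As a minimal sketch of that nominal-versus-ordinal distinction (the little DataFrame here is made up for illustration), one-hot encode nominal values with pandas rather than inventing an order, and map ordinal values to integers only when the order is real:

```python
import pandas as pd

# Made-up example data: "color" is nominal (no order),
# "ticket_class" is ordinal (3rd < 2nd < 1st).
df = pd.DataFrame({
    "color": ["blue", "red", "red", "green"],
    "ticket_class": ["3rd", "1st", "2nd", "3rd"],
})

# Nominal: one-hot encode instead of arbitrarily assigning 0/1/2.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal: map to integers in an order the data actually supports.
class_order = {"3rd": 0, "2nd": 1, "1st": 2}
df["class_code"] = df["ticket_class"].map(class_order)
```

The one-hot columns make no ordering claim at all, while `class_code` encodes an order that genuinely exists in the data.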
I'm going to use that same example here, except for a couple of graphs that don't lend themselves well to the Titanic data.

The first thing I usually do is just call describe, which works on your numerical columns. It's a good way to get started: you find out how many entries there are and get some standard summary information. Very easy to do, and worthwhile.

The next thing is pandas-profiling. If you haven't heard of pandas-profiling, it's really great; please use it. It gives you so much information that it can be overwhelming if you're not used to doing data analysis, but it's a great place to get started. You get a bunch of graphs with the distributions of your features, and it gives you warnings; I've put an example up on the screen of what happens when you run pandas-profiling on the Titanic data set. It won't necessarily tell you everything you need to know about your data, though, so it shouldn't be the only thing you do. It just gives you a direction for what you might want to explore further.

The con of pandas-profiling is that it outputs HTML. I normally use Spyder, and Spyder is not super happy about HTML files; saving the report to a file is pretty much your only option there. With Jupyter notebooks it's a lot easier. It's definitely worth using; I'll even open up Jupyter, even though it's not what I usually use, just to look at something in pandas-profiling before I get started.

Once you've used pandas-profiling to get an idea of what might be interesting, if you have time series data, it's good to start with just a simple graph like this one. This graph is from the analysis I did as part of the data science boot camp I attended. I realized partway through my boot camp project that I was classifying data that had come from two different wards of a hospital, and I'm pretty sure they had two different kinds of sensors on the patients in the two wards. That was a problem, because I was classifying based on which sensor was being used instead of whether the patient was actually in the cardiac ward. It was unfortunate for my project, but at least I didn't walk away thinking I had something super exciting and useful when I was really just classifying an artifact of the different machines used to collect the data.

I'm telling all of you my very embarrassing story so you know that everyone has these stories; if they say they don't, they're lying. If I had started with some simple graphs of the data from the two groups I was trying to classify, I could have saved myself a lot of time. It would have been early enough in the boot camp to pick a different project, and I wouldn't have to tell potential employers, "Here's this analysis I did; I think it might actually be a giant steaming pile, but it's what I did, because I figured it out too late to start a new project."

So, heatmaps. Everyone loves heatmaps. I like using them to show data to people who aren't necessarily data scientists: they can be a good way to see what might be interesting and correlated, and they're easy for someone less technical to understand. I'm a fan of heatmaps. I realize not everyone is, but I find value in them. A heatmap is also part of the pandas-profiling output, so you get one without doing any extra analysis.
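Those first-look steps, describe and the correlation matrix behind a heatmap, can be sketched like this. The tiny DataFrame is a made-up stand-in for the Titanic data, and the plotting and profiling lines are left as comments since they assume seaborn and the profiling package (now published as ydata-profiling) are installed:

```python
import pandas as pd

# Made-up stand-in for the Titanic data.
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, None, 2.0],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05, 21.08],
    "survived": [0, 1, 1, 1, 0, 1],
})

summary = df.describe()               # count, mean, std, min, quartiles, max
n_ages = summary.loc["count", "age"]  # NaNs are excluded from the count

# The numbers behind a correlation heatmap; numeric_only skips
# any string columns so corr() doesn't raise.
corr = df.corr(numeric_only=True)

# To actually draw the heatmap (assuming seaborn is available):
# import seaborn as sns
# sns.heatmap(corr, annot=True, cmap="coolwarm")

# For the full profiling report:
# from ydata_profiling import ProfileReport
# ProfileReport(df).to_file("report.html")
```

Note how `describe` already surfaces the missing age value via the count row, before any plotting at all.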
Another thing I like is violin plots, especially if you have a binary classification, because you can put your two classes on either side and see whether they're similarly distributed. This example is from the Titanic data set, and you can see that survival definitely depends on age, which makes sense: they put the children on the lifeboats first. Especially the male children, it looks like.

If you look back at the heatmap, there's not much correlation between survival and age overall. But once you break it out into different age ranges instead of treating age as a single number, you can see that you might want to make your bins according to where those probabilities change. That's what I meant about looking at your data before you decide how to bin it: if you had just taken the youngest and the oldest and divided the range into a fixed number of equal groups, you might not have gotten the best classification out of the probabilities that are in there.
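A minimal sketch of that kind of data-driven binning with pandas (the toy numbers below are invented so that young passengers survive at a much higher rate):

```python
import pandas as pd

# Invented Titanic-like data: children survive far more often.
df = pd.DataFrame({
    "age":      [2, 4, 6, 8, 30, 35, 40, 45, 70, 75],
    "survived": [1, 1, 1, 1, 0,  1,  0,  0,  0,  0],
})

# Bin edges chosen from where the survival probability changes
# (e.g. read off a violin plot), not by slicing the range evenly.
bins = [0, 12, 60, 100]
labels = ["child", "adult", "senior"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)

# Survival rate per bin.
rate = df.groupby("age_group", observed=True)["survived"].mean()
```

With these made-up numbers, `rate` shows the child group surviving at a far higher rate than the others, which is exactly the structure that equal-width bins could smear out.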
So I prefer to look at the data before just randomly assigning it to bins.

Another thing I like is pair plots. Pair plots are great because you can see how two variables are related. This isn't the best example of a pair plot, but sometimes you can see that all it really takes is two variables to separate out your classes, if you're lucky. That happened with the pair plot for my dissertation research, but unfortunately that one was really large, and I wanted to stay consistent and show you the same data set. Note that if you don't see a separation on a pair plot, it doesn't mean you won't be able to classify your data. It's just that if you do see it, that's good news: your life just got easier, because you know those two variables can definitely show you something.

Okay, here is a little more of me showing you just how terrible my idea at the data science boot camp was, because if I can't be a good example, at least I can be a terrible warning for all of you. This drift feature was the most important one when I used a tree-based classifier, and it came from the midline correction. I should have put these two slides next to each other. When I did the midline correction, I took out that orange line and got the green one. The standard deviation of that orange line, the thing I removed to try to make the two data sets the same, was the most important feature I found for my project. That's because one of the data sets had already been normalized with that midline correction and the other had not. It was a frustrating experience, but at least I found it before I went any further, so I can't really complain. And I did learn something from the boot camp: now I know the importance of doing much more EDA, much earlier, before I have to commit to an idea.

So yeah, that was my boot camp experience. If you're thinking about going to a data science boot camp, they might tell you not to pick a project before you get there. Definitely find some data you're interested in and look into it a little first; try to avoid doing what I did and realizing your project is a steaming pile when it's too far in to switch. That's just my rant on data science boot camps.

Okay, this was another thing I did at the boot camp, because I needed to get results that didn't look too terrible: I threw a bunch of classifiers at my data, because I was running out of ideas. I'm not saying it's a bad idea to try a bunch of different classifiers, but at least look at your data first, so you don't end up in the situation I was in. In this case, a bunch of them were overfitting, and I should have used more data, but it was a learning experience. Here's another graph of me telling you how terrible my idea was: really high training accuracy, really low test accuracy. I was overfitting, and it was bad. But I was at least able to look at false positives versus false negatives. Since this was a medical application, it's a little different from security.
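A hedged sketch of that "try several classifiers, but watch the train/test gap" step, using scikit-learn on synthetic data (the models and data here are stand-ins, not my actual project code). It also pulls out recall, since for medical data the false negatives are what you really care about:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score

# Synthetic binary-classification data as a stand-in.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for name, clf in [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("knn", KNeighborsClassifier()),
]:
    clf.fit(X_tr, y_tr)
    results[name] = {
        # A large gap between these two numbers flags overfitting.
        "train_acc": clf.score(X_tr, y_tr),
        "test_acc": clf.score(X_te, y_te),
        # Recall = fraction of true positives caught; low recall
        # means missed events, the worst case in medicine.
        "recall": recall_score(y_te, clf.predict(X_te)),
    }
```

An unpruned decision tree will typically hit perfect training accuracy here, which is exactly the overfitting signature to look for.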
In medicine, we really, really don't want to miss anything. Saying that someone who's having a cardiac event is fine is real bad; it's better to accidentally flag someone who might be having a cardiac event when they aren't. So the trade-off is a little different from security.

Does anyone have any questions? Because I could just keep going, but okay.

Yeah, so for that, there are a bunch of techniques you can use. You could do dimensionality reduction, combining those features and seeing which features end up combined. I was just showing some ways to get started; looking at two variables together is definitely not the only thing you want to do. If two variables are all you need to start separating your classes, that's really good, but that's not how it usually works. Sorry, did I answer your question? There are a bunch of techniques for that: principal component analysis, lasso... sorry, I'm drawing a blank on the rest, but there are definitely ways to do it. And again, one option is running your data through a tree-based classifier. That might not be what you end up using in the end, but it can show you whether there are multiple variables that are important.
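As a sketch of that tree-based option (scikit-learn's random forest on synthetic data; none of this is the actual project code), the fitted model exposes one importance score per feature, so you can see when several variables matter rather than just one or two:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 2 genuinely informative features out of 6.
X, y = make_classification(n_samples=400, n_features=6,
                           n_informative=2, n_redundant=0, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One non-negative score per feature, summing to 1; a flat-ish
# profile means there is no single standout predictor.
importances = forest.feature_importances_
```

Sorting these scores (or bar-plotting them) is a quick way to visualize whether one, two, or many features carry the signal.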
So for this one, it showed there wasn't just one or two standout variables doing all the predicting. If you want to visualize that, looking at the feature importances can be an option. Oh, he was asking about cases where more than two features are important. That was because I'd shown the pair plot, which only compares two features at a time, but a lot of the time it's a combination of more than two features that ends up predicting whatever it is you're classifying. There are other techniques beyond what I showed; this was more about starting to think about your data.

And please, please think about what you're doing, and whether there was bias in your data before it was even collected, like the example from my data science boot camp project, where it turned out I was just checking which kind of sensor had been used to collect the data, versus what I was actually trying to do. There's so much biased data out there, and it's easy not to really think about where the data came from and whether the whole thing is a good idea. Like that criminal facial recognition article. Sorry, I keep coming back to it, but it's just so bad. If they had spent a little more time thinking about "Why am I trying to separate criminals from non-criminals?", maybe they wouldn't have put it out there, and now everyone's seen it and sort of makes fun of it.
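Coming back to the pair plot for a moment: a minimal way to get one without seaborn is pandas' built-in scatter_matrix (this assumes matplotlib is installed; the numbers are made up):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import pandas as pd
from pandas.plotting import scatter_matrix

# Made-up numeric columns standing in for Titanic features.
df = pd.DataFrame({
    "age":   [22, 38, 26, 35, 54, 2, 27, 14],
    "fare":  [7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07],
    "sibsp": [1, 1, 0, 1, 0, 3, 0, 1],
})

# One scatter plot for every pair of columns, histograms on the diagonal.
axes = scatter_matrix(df, figsize=(6, 6))
# seaborn.pairplot(df, hue="survived") gives the same idea with the
# classes colored on each panel, assuming seaborn is available.
```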
Oh, yes. There are a lot of people I've seen skip EDA, either because they don't know how to do it or because they get really excited about their project and want to go straight to classification. I think EDA is so easy that there's not really any excuse to skip it. You're never going to look back at a project and say, "I really wish I hadn't spent that hour or two looking into my data to make sure it's what I think it is." But there are plenty of times, like my boot camp project, where I've looked back and thought, "I really wish I had spent more time looking into that before I committed to it." That's sort of why I'm now very passionate about EDA: I don't want any more of those experiences, and I want to help prevent others from having them, because it was fairly unpleasant for me. I mean, I'm also happy I discovered it during the boot camp, and not after trying to submit it for a grant and telling even more people about it. But yeah, it would have been better to find it earlier, for sure.

Okay, and your question. To be honest, in most of my experience with medical data, there's never really an issue of having too much data, because there's so much regulation around medical data, and rightly so; we don't want medical records just being freely available on the internet. In medical research it's more a problem of needing to get more data. If you have tons and tons of data, then, number one, I'm jealous. I want your data. I want that problem.
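When you do have that much data, one way to keep EDA fast (a sketch with pandas and NumPy on synthetic data) is to profile a sample and sanity-check that it resembles the full data set:

```python
import numpy as np
import pandas as pd

# Synthetic "large" data set standing in for the real thing.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age":   rng.normal(45, 12, size=100_000),
    "score": rng.uniform(0, 1, size=100_000),
})

# Profile a sample instead of the whole frame...
sample = df.sample(n=5_000, random_state=0)

# ...but first check it is representative: here, a crude test that
# each sample mean sits within a tenth of a standard deviation of
# the full-data mean. A KS test would be a more formal comparison.
for col in df.columns:
    assert abs(sample[col].mean() - df[col].mean()) < 0.1 * df[col].std()
```

The representativeness check is the important part; the sample is only useful if it passes it.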
But you can also always take a subset of your data to do exploratory analysis on, and then do a couple of comparisons to make sure the subset you took is representative of the entire data set. If you run something like pandas-profiling on a very large data set, it's going to take quite a while, and some things, like the correlation coefficients, can be slow too. So take a subset, do a couple of basic comparisons to make sure it's similar to the whole, and then run the subset through your more extensive EDA. Taking that step of asking "is this subset comparable to the whole data set?" is valuable.

Any other questions? Sparsely populated features? For that, probably the best way to deal with sparsely populated features is to look into how language processing is done, because that's a common problem in NLP: they vectorize strings, for example one-hot encoding each word in a sentence, so there's a lot of very sparse data. I'm not super familiar with NLP, so I don't want to tell you "definitely do this and that," but even if you're not dealing with language data, I'd look at some of the processes used for NLP. One of the things used there is combining some of the fields so that you don't have as many sparsely populated dimensions, and there are a bunch of different ways to do that. It's going to be specific to your data.

Well, I guess that's all the questions. Thank you for coming to my talk. If I have prevented even one person from making the same mistake I did, then I am very happy, because it's really frustrating to get far into a project and then realize you missed a fundamental flaw in your data.