So I'm Corey Chivers, I'm a PhD student down the road at McGill, where I study computational ecology — and if you know what that means, come tell me at beers afterwards, because I'm dying to know. But I'm not going to talk about that today. I'm going to talk about what I like to do in my spare time, evenings and weekends, which is competitive data science, using R and Python. I tried to come up with a pretty grand title — my alternatives were things like "From the Depths of the Ocean to the Expanse of the Cosmos: Competitive Data Science." The original title said competitive data mining, I think, but it's not really data mining; I was going to change it to machine learning, but it's not really machine learning specifically either, or statistics. So I used that wonderful catch-all title we all love now: data science.

So what's competitive data science? There's a company called Kaggle in the Valley, and according to their website they're making data science a sport. They've been around for a couple of years and they've been getting tons of press — they've been in the New York Times, and I could have pulled up tons of articles about them. What they do is host data challenges online, and anybody can take their best shot. They post the data, they post a well-defined performance metric for your solution to a given problem, and everybody takes a stab at it. Whoever gets the best score wins. It turns out doing these things is kind of like crack — it's really, really addictive — and so I've sunk a lot of hours into these contests.

I'll start with a bit of the toolkit you'd use if you wanted to get into this competitive data science world. Not surprisingly, given the title, I use R a lot in my academic work — ecologists are huge on R — but Python is increasingly being used for numerical analysis too. Some would argue it has been for a long time, but people in ecology are starting to recognize it as well. That's a simplified view of the toolkit, of course — it's much bigger than that. You've got all kinds of awesome Python modules (statsmodels was on the slide too, I think, but it must have gotten erased), and you'll use all kinds of tools beyond R and Python: Octave, the open-source version of MATLAB; C or C++ if you really need to do computationally intensive simulations — though not so much in the competitive data science world, where you're given a very well-structured problem with very clean data. In data science more broadly, your toolkit also includes things for wrangling and munging data, like sed, grep, and awk, which are favorite go-tos. And of course you can't talk about data science without putting up the friendly elephant and mentioning Hadoop. None of the data you'll encounter on Kaggle is at a scale where you'd need Hadoop, but I put it there anyway for completeness.

So let's launch right into one of the challenges. Cornell University, in collaboration with some other scientists, has dispatched a big network of buoys in high-traffic ocean areas — the routes of ships. The idea is to limit collisions between whales and those ships, so they want to be able to identify when a whale is in the vicinity of one of these buoys, which carry underwater microphones. So the problem is one of signal processing on that audio, and then it's a classification problem: is there a whale, or is there not a whale? I thought I would just show some of the highlights of how you run through
this kind of analysis. So: we're going to find the whales, because they're in your oceans.

One of the important modules is scikits.audiolab — that's going to let us read in these audio files. They're little two-second clips from the underwater microphones, and they mostly just sound like nothing; the whale call is very subtle, but it's in there, and you can hear it in some of them. They provide you with something like 30,000 of these two-second files, so we don't want to just listen to them — we want an algorithm to classify them. Then, of course, you can't go to bed feeling good about yourself until you've done some fast Fourier transforms, so we bring in fftpack from SciPy. And one more module is just so we can get some motivation to get started — I had a cute little picture of a whale there, telling us to take a breath. Oh no, I'm not online, so just imagine it... oh, there we go, brilliant, thank you. That's very good advice: before you start any of these data problems, take a deep breath.

Then we just load in the labels. It's a supervised learning problem, meaning you have some examples that are labeled — there either is a whale or there isn't — and you're meant to create an algorithm that learns from those training cases and makes a prediction when it encounters a novel case. So we read in the labels — that takes a moment, because it's 30,000 long — and then have a look at what the data looks like. I mentioned it's audio, but raw audio is really difficult to do data analysis on, so we need to massage it into some other form. That's what we're going to use the Fourier transform for: it gives us a spectrogram of these audio files.

We'll read in ten of them just to have a look. This is called a spectrogram: time is on the x-axis, and the y-axis is a power spectrum across frequencies — frequency goes up along that axis, and you get a heat map of the power at the different frequencies. It turns a piece of audio data into what is essentially an image, and then we can use all the techniques we have for image recognition to answer this problem. I've also plotted whether there was a whale in each clip or not. Most of them just look like what I described — garbled underwater noises; it's like something you're supposed to listen to as you're going to sleep, it's really soothing. Not present, not present... there's one. So, can anyone see the whale call in there? Yeah — obviously, it just sticks right out at you, right?

So instead of looking at them one at a time, I compiled all the positives, so we could start to get a visual idea of what a whale call looks like. There they are — these are just the positive cases, the ones that have a whale call in them. I don't know if you can see it, but it's this little hook here: it goes up in frequency over a little under a second, out of the two seconds total. That's what they look like. So that gets us somewhere, and seeing that was actually pretty useful.
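As a rough sketch of that spectrogram step — a short-time FFT power spectrum over sliding windows — something like the following works; the window size, hop length, sample rate, and the synthetic two-second clip are all my own stand-ins, not the contest's audio files:

```python
import numpy as np
from scipy import fftpack

def spectrogram(signal, n_fft=256, hop=128):
    """Short-time FFT power spectrum: rows are time frames, columns frequencies."""
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        window = signal[start:start + n_fft] * np.hanning(n_fft)
        spectrum = fftpack.fft(window)[: n_fft // 2]   # keep positive frequencies
        frames.append(np.abs(spectrum) ** 2)           # power at each frequency
    return np.array(frames)

# Synthetic stand-in for a two-second clip sampled at 2 kHz:
# a 150 Hz tone buried in noise.
fs = 2000
rng = np.random.default_rng(0)
t = np.linspace(0, 2, 2 * fs, endpoint=False)
clip = np.sin(2 * np.pi * 150 * t) + 0.3 * rng.normal(size=t.size)

spec = spectrogram(clip)
print(spec.shape)  # (time frames, frequency bins)
```

Plotting `spec.T` as a heat map (time across, frequency up) gives exactly the kind of image described above, with the tone showing up as a bright horizontal band.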
Actually doing this challenge, I realized that it's all concentrated in that one region, so I trimmed everything else out of the spectrograms and focused only on it — all the other stuff is noise and isn't really going to give us any information. It's like looking for a face: you try to eliminate everything that isn't where the face is likely to be.

Then let's go ahead and fit a model to it. scikit-learn has all the standard classifier algorithms; here we're just going to import support vector machines. Maybe you've heard of some of these — support vector machines, artificial neural networks — they're ways to learn from labeled data. So we import that, and then I've taken the training data that I cleaned up a bit, put it all together into one flat data file, and we read that in. Hopefully I don't have to retrain this, because it's kind of slow, but it's really straightforward to train one of these classifiers: you define it, you give it your input — the pictures of the whale calls — and the labels, ones and zeros for whether the whale is there or not, and it goes ahead and fits a support vector machine on that.

Then we validate. I trained on just a subsample of the data — the first five thousand — to make it doable, and now we see how well we can predict another thousand completely novel cases, ones the model hasn't seen, and predict those as probabilities.
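A minimal sketch of that train-then-validate loop with scikit-learn — the data here is a synthetic stand-in for the flattened spectrogram crops, and the feature counts and split sizes are invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical stand-in for flattened spectrogram crops:
# 600 "clips" with 10 features each; labels driven by the first two features.
X = rng.normal(size=(600, 10))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=600) > 0).astype(int)

# Fit on the first 500 clips; probability=True enables predict_proba.
clf = SVC(kernel="linear", probability=True)
clf.fit(X[:500], y[:500])

# Predict the 100 held-out clips as probabilities of "whale present" ...
probs = clf.predict_proba(X[500:])[:, 1]

# ... and score them with the contest's metric, the area under the ROC curve.
auc = roc_auc_score(y[500:], probs)
print(round(auc, 3))
```

Passing `probability=True` when constructing the `SVC` is what makes `predict_proba` available; without it you only get hard class predictions and decision-function scores.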
In this contest, the performance metric is one that's fairly common for binary classification models: the area under the receiver operating characteristic curve, the AUC. It basically gives you an idea of your relative rates of false positives and false negatives. You want it to be as close to one as possible; an AUC of 0.5 means you're doing no better than random. So let's see how we do — somewhere between 0.5 and 1. Ah, I would have had to retrain for that, unfortunately; let's skip it for now and maybe come back to it.

The punch line is this: Cornell's state of the art, at the time they posted the challenge, was an AUC of 0.74 or 0.75 or so — that's how well they were able to classify these audio files. Within a couple of days of the challenge going up — this is what I was about to show — I was able to get into the 0.75 region by fitting on even a subsample of the data, so at the state-of-the-art level with this very simple approach. But within a couple of days there were other people with models at AUCs of 0.99. It was ridiculous: all these Poindexters at Cornell couldn't come close to all the other Poindexters who participated in the Kaggle competition. So that was cool — it would have been nice to see it run, but alas. It was a neat one, and I had fun playing with it; scikit-learn has all the tools — I also tried an artificial neural network and some of the other built-in classifier models.

But there are also contests up there that don't conform to this really standard case of classification, or k-nearest neighbors, or anything like that. And that was this next contest: the objective was to map the dark matter in the universe. You know, nothing ambitious. So I thought I would give it a try, and it turned out the winner
would have got twenty thousand dollars. Participating in these contests is kind of like crack — but if you won one of these, you could actually buy a lot of actual crack, if you so desired.

So, dark matter is kind of cool. Dark matter is dark, so we can't see it, but what we can see is the effect it has on the very fabric of spacetime itself, which is just awesome. While you don't see the dark matter, you can see the effect it has on what's behind it: there's a phenomenon called gravitational lensing, and it distorts the background galaxies we look at through our telescopes.

What you get from this contest is a bunch of skies, coded as galaxy locations. They've abstracted away all the idiosyncrasies of the different galaxies down to just their ellipticities — their configuration in the sky, how they're oriented and how elongated they are. You get all that data, plus some cases where the location of the dark matter is known by other means. It's kind of hard to see anything in the raw data, so here I've used R to take it and actually plot it, with the dark matter included. This is one of the training skies, and there are two dark matter halos here. In the absence of any foreground dark matter, you'd expect the configuration of these galaxies to be random — oriented every which way. But the lensing effect of the foreground dark matter slightly skews what we see as the ellipticity of the background galaxies. It's actually really hard to see in this case — those are pretty edge cases — but you can sort of start to see it here: it's skewing away from the otherwise random configuration of background galaxies. That's the information on which you're meant to — not knowing where the halos are — build an algorithm that guesses where they are. There's another sky where the effect is also not that strong, but you can kind of see it.

This is a really cool problem, because it doesn't conform at all to the standard setup — you can't just throw it into a support vector machine, la-di-da, away you go. You actually have to build a model of how you think the system works, of the process. A little bit of Wikipedia-ing, and I discovered that one of the proposed distributions these lenses follow is the Einasto profile. It's a functional form that induces this skew on the galaxies surrounding the dark matter, so I built up a model to simulate how that would affect the galaxies as the mass of the foreground dark matter increases. On the left you can start to see the skew building, and now you've got the full lensing effect surrounding a very massive dark matter halo.

So then the idea is: I have this model that simulates the effect, so the solution to locating the dark matter is simply to fit this functional form to the skies in the test set. That's what I did, and this is a heat map of the locations of better fit of the tangential ellipticity component — if you imagine a line going out from a point, these galaxies are all aligned along the tangent to it, and this is a map of that for one of the training skies. There's where my algorithm picks the halo, and there's where it actually is. And there's another one.
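My actual approach fit the Einasto functional form, but the core idea — score candidate halo positions by how strongly the surrounding galaxies align tangentially, then take the peak of that map — can be sketched much more simply. Everything here (sky size, shear strength, noise levels) is invented for illustration:

```python
import numpy as np

def tangential_ellipticity(gx, gy, e1, e2, hx, hy):
    # Ellipticity component tangential to the line from a candidate halo
    # at (hx, hy) out to each galaxy at (gx, gy).
    phi = np.arctan2(gy - hy, gx - hx)
    return -(e1 * np.cos(2 * phi) + e2 * np.sin(2 * phi))

def locate_halo(gx, gy, e1, e2, grid=50, size=4200.0):
    # Grid search: a halo tangentially aligns nearby galaxies, so pick the
    # candidate position where the mean tangential ellipticity peaks.
    best, best_pos = -np.inf, (0.0, 0.0)
    for hx in np.linspace(0, size, grid):
        for hy in np.linspace(0, size, grid):
            score = tangential_ellipticity(gx, gy, e1, e2, hx, hy).mean()
            if score > best:
                best, best_pos = score, (hx, hy)
    return best_pos

# Synthetic single-halo sky: a halo at (2000, 3000) shears 300 galaxies
# tangentially, plus noise on each ellipticity component.
rng = np.random.default_rng(1)
gx, gy = rng.uniform(0, 4200, 300), rng.uniform(0, 4200, 300)
phi = np.arctan2(gy - 3000, gx - 2000)
e1 = -0.3 * np.cos(2 * phi) + 0.05 * rng.normal(size=300)
e2 = -0.3 * np.sin(2 * phi) + 0.05 * rng.normal(size=300)

found = locate_halo(gx, gy, e1, e2)
print(found)
```

The heat map described above is essentially this score evaluated over the whole grid; the grid's argmax is the guessed halo location.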
It does pretty kick-ass. I was really excited when I got this working — but it turns out that's the one-halo-sky situation, and skies like the ones in the examples often have more than one halo. Those lenses interact in complex ways, and it becomes very difficult to fit them simultaneously. So I'm showing you the really good results; with more than one halo it's much harder. But try I did, and in the end my last entry came in a little bit late, but I'm actually kind of proud of this: I ended up in the top 10. So I didn't bring home the 20 grand, but it was cool, and it was super fun to do.

I just want to end with a quote from Tukey: the best thing about being a statistician is that you get to play in everyone's backyard. Like I said, I'm a computational ecologist — so what am I doing in the cosmos? The point is that you can apply your quantitative reasoning to a lot of different fields. Of course, nobody says "statistician" anymore, right? We're all data scientists — so that's the updated quote from Tukey.

And finally, I'd like to rep the Montreal R User Group, which I co-run with a colleague of mine at McGill. We have a meetup group that's kind of similar to the project nights of Montreal Python, but we look at R. If you're interested in checking that stuff out, come on out and see us.