Hello. So I'm Josh Bloom. Hi. Welcome to UC Berkeley, welcome to Doe Library, welcome to the Berkeley Institute for Data Science, and welcome to day two of Astro Hack Week. I hope you had a wonderful day yesterday. Today we're going to talk about machine learning, and I want to try to put machine learning into the context of an inference space. It's certainly not rigorous, but hopefully it helps you level-set a bit relative to the types of things you were hearing yesterday, and gives you a sense of how I want you to start thinking about bringing machine learning into your daily work.

What I should say from the outset is that I think about machine learning as just another tool in your toolkit for doing inference. This is not "now that I know machine learning, I'm not going to do Bayesian statistics," or "I'm going to stop learning physics because machine learning is so awesomely data-driven that I don't need anything else anymore." There's a time and a place for applying different tools, and what I'll teach you today is at least some of the Pythonic ways of approaching astronomy questions around inference using machine learning.

But let's unpack this a little bit. I've got three axes. I've drawn them as orthogonal to each other, but for those reading ahead, you can probably see that they're not really that orthogonal. One is the statistical axis, going left and right: Bayesian on one side and frequentist on the other. While I didn't see all the lectures yesterday, my prior knowledge of the two speakers suggests that you probably got beaten with the Bayesian stick pretty hard. As it turns out, most of the machine learning tools we're going to wind up using are much more on the frequentist side of things. And practitioners of these various statistical frameworks for thinking about data and inference will often refer to the Bayesian/frequentist divide as something like the wave-particle duality: it's really both, and depending on the situation it's more helpful to think about data and inference at one place or another along that axis.

The other axis, up and down, is much more about whether you bring physics to the table or not. One end is what you might call theory- or hypothesis-driven, where you've started from first principles and derived the model that should lead to some set of observations; that's one hat you might want to put on. The other hat, at the opposite end of the spectrum, is one where you say: I don't know anything about the system that I'm studying and would like to make inferences on, so I'm just going to throw the data into some framework and hope that I get answers out to the types of questions that, because you've taken this hack week seminar, you've asked correctly in a well-posed way. Although oftentimes you'll find that people are challenged in that respect.

And then in and out of the page is the computer science view of this world, which is: are you doing the computation on a small enough amount of data that it can fit into RAM?
Because if you can do that, then you get access to lots of really interesting algorithms that don't need to be all that well parallelized, versus ones that are potentially out-of-core, where you're taking chunks of the data, doing inference on a chunk at a time, and over time building up a better and better inference about the questions you're asking.

So this is the space you should be thinking about. Many of you have been doing inference already, and you can probably localize the types of problems and approaches you've taken in this space. Again, these axes are not necessarily orthogonal to each other; I just wanted to show the stats view, the computer science view, and the physics view together. So, yes, that's true: it's like the fundamental plane of galaxies or the HR diagram. Just because you can draw temperature and luminosity axes doesn't mean objects will populate everything equally, so there are definitely places of higher density versus lower density. That's a great exercise, for us or for you, to come up with some interesting problems that live in those different quadrants. I've got some thoughts on that, but I don't want to spend too much time. You're absolutely right, though, and in some sense that emphasizes one of the critical points of this whole morning: you're going to choose your inference framework depending on the types of questions you're asking and the type of data you have access to. If you're at the Facebook level, you have access to of order several billion images, and if you're trying to get insight into images and you can afford to get lots of labels, that is, answers, for your previous images or videos, then you can bring together a framework that learns a lot from data, where you don't have to know much about, say, object detection; you can basically throw it all through some system that figures that out, like a deep learning network. Whereas in astronomy, if you're working right at the edge of signal-to-noise, you'll often wind up not using machine learning, because there you really benefit from pulling the signal out of the noise by bringing in all your prior beliefs, both from a statistical perspective and from a physical perspective.

So, a quick overview of what I'm hoping to cover today, and there may be a little more we can get into as well. What is machine learning? We'll talk about that in all its gory detail and cut as quickly as we can to the chase, trying to do this in the most Pythonic way we can. We'll go into two types of approaches and questions you'd ask of the data. One is regression: getting numerical values out of your data, essentially from a prediction perspective. We'll do a bit of a run-through of an existing notebook I've created; by the way, that notebook is up on the GitHub for Astro Hack Week, so if you want to follow along, feel free. And then classification, which is saying: now I'm not trying to get an inference at the numerical level, I'm trying to understand which class an object belongs to. And then we'll talk about how you actually improve your model. The first parts will just introduce the topic and give a little theoretical motivation, but mostly I want to get into what one needs to do as a practitioner.
And then we'll start thinking about how to get these into production environments.

Most of what I'll be covering is what's called supervised learning. This is where you have a set of labels, or answers, on an existing set of data, and now you have a new corpus of data, presumably taken, obtained, and reduced in the same way as your original data, and you want to ask the same sorts of questions of it. So if you had a galaxy sample of, say, a thousand galaxies that all had spectra, and you had your own classification scheme for what types of galaxies those were, and now you wanted to apply that to all the spectra from Sloan, that would be an example of a classification problem, but also a supervised one. I'll make the distinction between unsupervised and supervised as I go along.

All right, so what is machine learning? The short answer is that it's an offspring of statistics and computer science. The long answer is that it's a set of models which aim to learn something about a data set and apply that knowledge to new data. There are lots of different ways to unpack that, and lots of different ways to try to localize machine learning as something that has boundaries within the broader context of artificial intelligence. What's happened over the last couple of years is an almost Cambrian explosion of different algorithms and approaches, and, commensurate with that, lots of books about each one of them and lots of tweets and blog posts about how awesome somebody's algorithm is relative to other people's. One of the main difficulties of not being a practitioner in machine learning and not having come up through the ranks of learning it from first principles, and I'd put most of us in that category, is that it can often be very noisy and very difficult to figure out what's hype and what's actually useful. The short answer on that point is: don't believe the hype, but you probably should believe it at some level, because there are a lot of really powerful tools out there. A big job you're going to wind up having is cutting through that, trying to understand which tools are useful and practical and which are highly specialized and really only needed in very specific cases. It's the difference between a generic hammer you'd buy at Home Depot and some very specialized tweezers that can only open up a Nexus 6 or something. Very different things: one of them you absolutely need to have when you need it, but the other is much more practical and useful in your daily life.

So let me step through the various components of what machine learning is, and then I'll break that out at an even higher level into different types. It's using labels from training data to classify new objects. I've just given you a light curve: is this a supernova or a nova? Domain experts can look at those and say, I know the answer. But if you want to do this at scale, without any people in the loop, that might be a good thing to use machine learning for. I have a bunch of images: what's the galaxy type in each of these images, et cetera. It's also learning the relationship between explanatory features and response variables to predict new data. So what does that mean? There are a couple of terms in here that we need to get straight. Features are the variables.
They're the X variable from what we saw yesterday. This is the input data. Oftentimes we think of X in an abstract way: X could be an image, X could be a set of images taken over time, X could be metadata derived from those images. In the context of astronomy it would be, say, photometry on a single object as a function of time, but it can also be the weather conditions at all the telescopes where those observations happened. This is all the data that could potentially be brought to bear on predicting the response variable, or what are sometimes called labels. That's the Y vector from yesterday.

Another utility of machine learning is its ability to discover natural clustering in data. Obviously, if you plot up lots of two-dimensional data, your eye is going to wind up picking out clusters. You could draw circles around those and write papers about the objects in each circle, and that may actually get accepted to journals, and that's completely appropriate. But there's a more robust way to come up with the notion of clustering, even in a two-dimensional space. There are obviously some parameters that define what it means to be a cluster, and that's something you'd have to decide based on your problem. But where machine learning tools become really useful in this context is in their ability to find clusters in very high-dimensional spaces. If you have not two dimensions, or three, or eight, but 10,000 dimensions that you're looking at, and you're trying to find clusters that actually have meaning, you can use a bunch of different tools for that.

Similar to that idea is the idea of taking a very high-dimensional set of data and reducing its dimensions down to just a few that are actually informative. A classic example of that in astronomy would be taking a bunch of galaxy spectra. While every single flux value at every single wavelength has its own meaning, and you can interpret it in a physical way, there's only a small set of representative spectra of galaxies, and every galaxy in some sense can be represented by them. If you've done principal component analysis on spectra, you've seen this: if you take lots and lots of Sloan spectra at the same redshift, you can pick out just a few representative samples, and then an admixture of those, even in a linear sense, builds up most of the signal you see in new galaxies. That would be an example of taking a very high-dimensional space, in this case flux as a function of wavelength, potentially thousands of measurements, and condensing it down to essentially a few weights on eigenspectra. I should say that this particular example, PCA, is where you start blurring the line between what is a machine learning technique and what's not; PCA is in the gray area there. Oftentimes PCA, and this is a bit of an aside, will be used for doing dimensionality reduction on very large data sets with a large number of dimensions, and then the output of that dimensionality reduction is thrown into a machine learning framework. So in the context of, say, galaxy spectra, you might take the weights on the top 50 principal components needed to reproduce the spectrum you saw and throw those into a more traditional machine learning classifier.
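To make that concrete, here is a minimal sketch of the idea in scikit-learn; the spectra array and its dimensions are hypothetical stand-ins, since the lecture doesn't point at a specific data set for this example.

```python
# Dimensionality reduction with PCA, as described above.
# `spectra` is an assumed (n_galaxies, n_wavelengths) array of fluxes on a
# common wavelength grid; here it is just random numbers for illustration.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
spectra = rng.normal(size=(1000, 4000))      # stand-in for real galaxy spectra

pca = PCA(n_components=50)                   # keep the top 50 eigenspectra
weights = pca.fit_transform(spectra)         # (1000, 50) weights on the eigenspectra

# These 50 numbers per galaxy could then be fed to a downstream classifier
# in place of the original 4000 flux measurements.
print(weights.shape)
```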
And last, because you can do clustering and you can find things that are like other things, the flip side of that is also true: you can find things that you haven't seen before. Anomaly detection is a very important thing in astronomy.

So those are the different components of what makes up machine learning. There's another way I'd like you to start thinking about what machine learning is and isn't, and that's in the types of questions you wind up asking of the data. I'll call one type "type one," for lack of a better term, and that might be called exploratory. The other, which I'll call "type two," might be predictive and prescriptive. Why am I making that distinction? Exploratory is, say, anomaly detection: that's a great example of what I'd call a type-one activity with machine learning. You have a large amount of data, you want to find something you've never seen before, and then you look at it and say, wow, that's interesting, I should get more spectra of that object, and eventually that could lead to a paper. It could also be exploratory just visually: a way of taking a very large number of dimensions and bringing it down to a smaller number, so that you can visualize and play with the data and get a better sense of the low-dimensional structure that's effectively embedded in the larger number of dimensions.

I don't think of exploratory machine learning, in the context of a lot of what's out there in the wild, as all that useful, because it often doesn't give you a tremendous amount of insight about what you need to do next. In the context of Facebook, the reason they're building new frameworks for doing inference on videos and images is not because they want to learn whether, when somebody's happy, they're wearing this color or that color, and then write a paper that gets accepted in some journal. They're doing it because they want to make predictions about what they can actually get out of that image, and ultimately they want to monetize the results. And it's prescriptive in the sense that the best, or I'd say the leading, types of machine learning are ones that say: not only do I understand what happened in this data you've just acquired, I'm now going to make suggestions about what new data needs to be obtained in the future to reach some objective. While all of machine learning is an optimization at some level, and as long as you can couch your framework around optimization when you include data you're doing some form of machine learning, I think type two is very, very interesting. In the company I started, called Wise.io, we're basically working on the more prescriptive kind: we're looking at interactions between people, making predictions about what they're saying, and then making suggestions about what could happen next. What's nice about that type is that it creates nice feedback loops: if somebody takes an action that you didn't suggest, you get counterfactuals to your models, and then the models themselves can improve over time.
I actually think that while this is very exciting, and there are lots of classes of machine learning frameworks being developed there, a lot of what we do in astronomy is much more on the exploratory side. Going back to the example of classifying large numbers of galaxies: that is basically saying, I'm going to build up a catalog of galaxies of this type, I've got a couple of exemplars from that space, and I want to find more of those things. So you go and find more of those things, and then it's really up to you to decide what to do with that. What you need to do next is not traditionally baked into the optimization technique. You're using it as a starting point to say, I just found some interesting data that I didn't know existed, and now, because I understand the physics of what I'm looking at, I'm going to go off and do something interesting with it. Is the distinction clear? And by the way, it's obviously not a hard-and-fast, black-and-white set of rules; I just wanted you to start seeing the two differences there.

Any other questions about the overall statements I've made about what machine learning is? Yeah. So the question was: couldn't you say that detecting low-dimensional structure in high-dimensional data is all of these things? At some level it's a good point: you've got a function on your input data and you want to predict something else. And this is a generic statement about making models on data to predict outputs; it could be that I take a very high-dimensional data set and my output is three numbers, and that's my way of reducing it down to something I could potentially visualize and play with. But it's not the same as saying which class something belongs to, right? Just because I can take a hundred-dimensional data set and show that it's got some actual structure in a two-dimensional projection doesn't mean I know what class of galaxy lives where along those two dimensions. So you're right that machine learning is a summarization of the data, a compression in some sense, to get out an answer that you care about, but the two are not identical. Any other questions?

Okay, so we're going to start looking at the landscape of machine learning capabilities within Python, and for those that haven't already, you can do a conda update scikit-learn, as you see up there. If you're using a different package manager or distribution, you can do a pip install scikit-learn. I believe the most current version is 0.17.1; it's probably fine if you have 0.17 and beyond. Some of the things in the notebook might break if you're at 0.15 or 0.16. So I'll give you a second to do that, but while you're getting your environment up to speed, what I will say is that there's a large number of packages, even just in the Python sphere, that do some parts of machine learning. There's mlpy; Orange, which has been around for quite a long time and is much more visual; Keras, which is a wrapper around a couple of other machine learning frameworks, in particular deep learning frameworks, and gives you a very high-level view of deep learning.
nolearn is sort of a competitor, at some level, with Keras, and then there are lower-level things like TensorFlow that give you more direct access to some deep learning frameworks, and then astroML; there's a whole bunch of these things out there. But by far, I think it's fair to say, scikit-learn is the de facto starting point for doing any machine learning in Python, and that's for a number of reasons. One, it's very well maintained: it's got something like 20,000 commits at this point and hundreds of committers, and they have a very well documented and very consistent API. So you can write an entire pipeline against one type of machine learning model and then essentially rip that out and stick a new one in, once you realize you want to try something else, without breaking a whole bunch of the rest of your code. Any questions about machine learning frameworks in Python? Anyone beg to differ?

So this is a fairly complicated slide from the scikit-learn folks, what they call the cheat sheet for doing machine learning. It asks you questions about what type of data you have and what kind of approach you're going to take, and then shows what's available within the scikit-learn world. The first one is: do you have more than 50 samples? If not, go get more data; stop. That's not completely fair, because of course you can do inference on just 50 data points, or 50 instances. Are you predicting a category, and do you know what the labels are? If you're predicting a category, you're going to do some sort of classification. If you're predicting a response variable, a quantity, you're probably going to do some type of regression. Do you have high-dimensional data, et cetera, et cetera. I'm not going to flow through all of this, but what I will say is that once you've arrived in each one of these different boxes, what scikit-learn gives you is a whole bunch of choices of different algorithms you can use.

One of my fairly strong statements, which I'll be happy to be challenged on, is that there are only a few algorithms out there now that are so battle-tested and so useful that you probably don't need to learn, nor try, all these different ones once you've arrived in one of these boxes. That is to say: if you've got low-dimensional data and not a lot of it, some form of a linear model is probably fine and appropriate for your data. If you've got a medium amount of data, say at the gigabyte or even the terabyte level, something like a group of decision trees, something called a random forest, will almost always get you to the best answer possible. And if you've got much more data than that, typically you'll wind up using some type of deep learning to get access to the best results. It should also be clear, because all of you do computation in one way or another, that the best answer, the one that essentially maximizes the objective function, say the best accuracy in a galaxy-star-quasar separator, isn't always the best one to use. So does anyone want to come up with some arguments for why the most accurate algorithm might not be the one you'd want to use? David: it might be outrageously computationally expensive. And Dave talked yesterday about the tradeoff you have to make as scientists between what's the right thing to do and what's the tractable thing to do, because you need to publish.
So if it takes the age of the universe to get 2% better accuracy on a star-galaxy separator, and you have to burn all of your campus's resources to do it, you're probably not going to do that. So computational efficiency is an absolutely important one. What else? I'm going to save these. So, why not use the most accurate? We said computational efficiency. Ah, good: overfitting. Given a sufficiently large number of tuning parameters, I can fit an arbitrarily large amount of data, but that doesn't mean the model has predictive power on the data I haven't seen yet. That's a great thing to want to protect against. Yep. So: accuracy might not be the right metric. Indeed, your goal could be to minimize the number of false positives. A great example of that: if you've got a drone flying around shooting down other drones, or an airplane shooting down other planes autonomously, you probably don't want any false positives, where you start shooting down your own people's drones. That's an example where accuracy is important but not the most important thing; there you'd want a model that drives false positives down to zero. In the context of false negatives: if you're building a classifier on images as they stream off a telescope, and you want to find things in the sky that are new and novel, you want to minimize your false negatives, because you don't want to miss anything new and novel. But in that case you also don't want to say that everything is interesting. The two extremes, in the context of, say, a transient classifier running in real time, would be "everything in the sky is interesting right now" on one side and "nothing in the sky is interesting right now" on the other, and both of those are horrible systems in production.

What other reasons might you not want to use the most accurate model? Is the most accurate the most informative? That is, if I have a classifier which just beats all the other classifiers, and then I look inside of it and say, I have no idea how it's combining the data, and yes, I can see the math, but I don't get any insight out of it, for astronomers that could be a pretty big deal. Unless you're working in a regime where the proof is in the pudding, where I don't care how you got the answer, just show me that in a calibrated sense you get essentially the best, most accurate curve or the perfect metric, and very often you're not going to be in that space, because you have to write a paper eventually about the things you've built, and somebody's going to ask, well, why did you get to where you got? So the most accurate models may not be all that informative. All of these are reasons you'd want to take a bit of a pause when you meet somebody on an airplane and they say, I work at startup X or large corporation Y in the Valley and we have the most accurate classifier on something. You'd say, well, that's great, but it's like when you talk to an astronomer in a field you don't know and you say, what about magnetic fields, and they go, aha, magnetic fields. So these are the things you want to keep in mind when you're actually working with real frameworks.
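To make the metrics point concrete, here is a minimal sketch of looking beyond raw accuracy with scikit-learn; the labels and the "classifier" here are synthetic stand-ins, not anything from the lecture's notebook.

```python
# Accuracy versus false-positive / false-negative oriented metrics.
# 1 = "interesting transient", 0 = "bogus"; predictions are faked for illustration.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, confusion_matrix)

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=200)
y_pred = np.where(rng.rand(200) < 0.9, y_true, 1 - y_true)  # an imperfect fake classifier

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))   # hurt by false positives
print("recall   :", recall_score(y_true, y_pred))      # hurt by false negatives (missed transients)
print(confusion_matrix(y_true, y_pred))
```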
Recognize also that overfitting isn't just something that can happen when you work with one model: you could wind up doing a sort of meta-overfitting, where you choose the model that happens to look best across multiple different types of models, because you've effectively gone through a multiple-trials problem and glommed onto something that's best only within the noise.

Did you have a question? Yeah, so the question is about the non-academic world, industry, for the type-two, predictive sort: how much do they care about being able to unpack and understand what their models are producing? I think the answer varies. My guess is that in the ad world, the ad-placement world, it doesn't matter, because there, again, the proof is in the pudding: if I get epsilon better accuracy than somebody else, that means millions or billions of dollars, and boy, I can throw the whole kitchen sink at that, because I can afford the computation to do it. Again, in the context of understanding whether you're happy or not in order to suggest a different ad to you, I suspect there are people within an organization like Facebook who do want to understand what their models are doing, what type of data they need to acquire, what's informative and what's not, for the purposes of making a better prediction, but the end goal is not actually to understand why. In the context of, say, financial models, say a risk-assessment model, there are plenty of reasons to care, the most important being regulatory: when you deny a loan to somebody algorithmically, you have to say why you denied the loan. Just like with your FICO score, which is a terrible model of your ability to repay, say, a ten-dollar loan from your friend in two days, but you can figure out exactly why you got the score you got; there's a regulatory requirement around that. There are regulatory requirements around making some types of inference models essentially completely transparent about how they got to their answer. I believe the EU has just passed a law that's starting to make some of those required to be transparent, so the public can actually know why they got the answers they got. I think the informative component of machine learning models is vastly understudied in academic circles: most people are focused entirely on whether their method is more accurate than somebody else's on the same data, or whether it scales better by a couple of different metrics, and very little of it has to do with interpretability. I think that's a fantastically interesting open subject.

Okay, so supervised learning: we're going to use a set of training pairs, an x vector and a y outcome, to predict new y outcomes when we see new x's. In the context of regression, that means predicting a continuous variable, which will be our y, from an input set of features. Lots and lots of approaches exist within Python and scikit-learn, from linear regression to very nonlinear models.

All right, so why don't we jump into the notebook now; let me mirror my display and see if we can get this to work. If you go into the Astro Hack Week 2016 repository on GitHub and click on the day two machine learning notebook, you should see the latest version. I'm going to make this a little bit bigger, and we're going to do some astronomy examples. I adapted a previous notebook from Python 2 to Python 3, but
I tried to make it backward compatible. Just a show of hands: who's using Python 3? Okay, so mostly the people in the back. And who's using Python 2? Mostly the people in the front. That's interesting; I just did some clustering in my head, which could be wrong. That's actually, by the way, one of the problems I generally have with unsupervised problems: how well you did tends to be very subjective, whereas with supervised problems you can define a metric that you can test yourself against. Okay, and now you can change your result, and you can raise your hand twice because you might be using both kernels. Oh, it's more right than it is wrong. Okay, good, but how good is my pattern matching? All right.

So we're going to work some examples using Sloan Digital Sky Survey data; again, they should work in both Python 2 and 3. I basically wanted to grab photometry, corrected for extinction, for a thousand quasars that have known redshifts. As you can all probably do SQL in your head, you can see where this is going and what this query gets: we're going to pull over a thousand quasars from Sloan, and you should be able to do that locally if you just run and execute this cell. I'm running it on demohub just because I don't want to use resources on my laptop. If you run this, you should get the data. What does this data actually look like? It's probably appropriate to look at it. So here we have the object ID, RA, Dec, dereddened u-band photometry, dereddened g-band, et cetera, et cetera; we've got a redshift in here somewhere that I pulled over, one of these columns is the redshift; we've got RA, Dec, et cetera. Okay, so that's the data. We now pull over some other stuff you need, so pandas, seaborn, matplotlib, stuff to actually do plotting, and then pull over scikit-learn. I don't think I need to run this, but I'll just try it.

Does everyone know what demohub is? It's a site built and maintained by the Jupyter collaboration where one can have not ephemeral notebooks, which you can get from try.jupyter.org, but an actually dedicated computation space. You can do a git clone of a repository there, pull in all of your stuff, and just have it there, and in principle those servers will run forever. That's obviously not literally true, because it's maintained by a third party; I think it lives on top of Rackspace. But the whole machinery for building these things is now out and fairly robust, and obviously open source, so for those of you who are teaching, it's a nice thing to be able to build a space in the cloud where you and your students can all work together. This notebook I could obviously be running on my own machine if I'd like, but I'm running it from the cloud, and that'll probably come back to haunt me because I just extolled its virtues.

So what are we going to do? We're going to import numpy, matplotlib, pandas, seaborn, et cetera, and we'll take a look at that data, basically doing a head on the thousand quasars. You see all the stuff I got out of the query; oh, here's the spectroscopic redshift, it's spec_z, and here's the class. Now, we're not just going to do machine learning on it. One of the things you have to know is that you have to look at your data before you start throwing it into these frameworks.
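As a minimal sketch of that loading-and-inspecting step, assuming the SDSS query result has already been saved locally as a CSV (the file name and column names are stand-ins for whatever the notebook actually retrieves):

```python
# Load the dereddened quasar photometry and look at it before modeling.
import pandas as pd

df = pd.read_csv("sdss_quasar_photometry.csv")   # assumed local copy of the query result

print(df.shape)          # how many objects and columns did we get?
print(df.columns)        # objID, ra, dec, dereddened magnitudes, spec_z, ...
print(df.head())         # always eyeball the first few rows
```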
Just because these frameworks are nice and easy to use doesn't mean you're allowed to get away with not actually looking at the data. So let's look at the data. What are we going to do? We're going to pull only some of those columns out, because we're going to do a regression problem and, later on, a classification problem. I'm going to drop things like right ascension and declination, because I basically don't want to care where you are on the sky if I want to decide whether you're a quasar or a galaxy or a star, for instance. This is a great example of bringing prior knowledge to the problem. It should be the case that if you built a nice classifier or regressor that included right ascension and declination in the data set, you would find that those features, those values of x in the vector, have no informative value and no meaning. But because they're in there, and because they might not get distinguished from something else that's much more informative, we have to allow ourselves, putting on our domain-driven hat, to use domain knowledge, just to make our models more understandable, more tractable, and actually meaningful.

So we pull out the object ID, we pull out the spectroscopic redshift, we pull out the different colors, et cetera. Then we make "features," which is just a copy of that whole data frame, and we get the redshift: we're going to try to build a predictor of redshift using just photometric data. And we delete the answer out of our feature set, because it would be pretty bad if we had the label in the thing we're using to predict our y values. Then let's see what the result of that is. This is our features, and now we've got a smaller number: one, two, three, four, five, six, seven, eight, nine. Some of these features, by the way, are the difference between aperture photometry and Petrosian photometry, which could be informative about the size and shape of the object at some level. I'm not taking all the other metrics that are obviously also available; I'm just trying to do something simple here. So we've reduced this down to a nine-dimensional input feature vector, and for those used to pandas, I've created an index on the object ID.

Now let's plot the histogram of the output variable, and what you see right away is that there are very few quasars beyond a redshift of two in the Sloan catalog, or at least among the ones I picked out; there's a pretty big cut around two. So you can imagine that when I create a regressor, if I don't give it any other information, it's going to try to do a very good job around the places where there's lots of data, and it might not know that what I really care about is finding high-redshift quasars. So again, this is how you'd go about constructing the model; you'd have to think about that.
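A minimal sketch of that feature-construction step, again with assumed column names standing in for the ones in the real table:

```python
# Build the feature matrix and the regression target from the DataFrame.
import matplotlib.pyplot as plt

features = df.drop(["ra", "dec", "spec_z"], axis=1)   # drop sky position and the answer
features = features.set_index("objID")                # index on the object ID
y = df.set_index("objID")["spec_z"]                   # response variable: spectroscopic redshift

print(features.shape)      # nine photometric features per quasar

# Look at the output distribution before fitting anything.
y.hist(bins=50)
plt.xlabel("spectroscopic redshift")
plt.show()
```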
Now let's plot the data against itself, pairwise, every feature against every other feature, to see what looks informative, and we'll colorize it by redshift. This takes a little while to produce, so give it a second. Sorry to Stéfan, who's sitting over there, because I'm using jet and not your colormap. There we go. Okay, so there it is, and it's going to be very hard to see. Down the diagonal we've got histograms, for example of the dereddened r band, and you can see that there is heavy correlation between the difference between the two photometry methods in the u band and the same differences in the g, i, and z bands; those ought to be correlated with each other, and indeed they are. You also see something kind of crazy here: the dereddened r-band magnitude looks like it's mostly around zero, but some values are around minus ten thousand. That's like somebody setting off a supernova in your eye or something; it obviously isn't valid data, and we've just realized there's something crazy with the dereddened r-band magnitude. Some of these objects actually are not detected in the r band, and we can decide what we want to do with those.

Again, here we're not even doing any machine learning yet; we're doing what's called feature engineering. We're going through a process of understanding our data, and we've articulated the question we're going to ask: can we predict redshift from just photometry on quasars? What we'll now do is say, look, I'm just not going to care about all the ones that don't have r-band magnitudes. Maybe those would actually be really easy ones to find redshifts for, but for the sake of this demo, if the color is sensible and it's actually a sensible magnitude, that is, it's not this minus 9999 sentinel, we keep it; otherwise we get rid of it. Now we can look at that matrix again and make sure we get what look like pretty sensible answers, and once we're done with that, we'll save the data to a CSV file so we can use it later on and bring it back in. This also takes a little while; I guess these computational clusters we're running on are not all that beefy, maybe I should have used my laptop, but we'll give it another second. There we go. It may be very hard to see from the back of the room, but you can start to see in some of this data, let me see if I can zoom in even further, no, that doesn't help, that there are different colors: there's some green and there's some blue, and it starts to look like there may be some separability there. That should give you a little bit of hope that you may actually be able to figure out the different redshifts in this high-dimensional space, but it's very clear that we don't get perfect separability of high redshift versus low redshift in just two axes. So this is actually a pretty hard problem, and those who have worked on things like this know it's very hard to do. All right, but we'll declare, for the purposes of this demo, that we're done with the preprocessing steps. I'd still call this part of the machine learning pipeline we'd have to build if we were doing this for real, but now we're ready to do some basic model fitting.
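Before moving on, here is a minimal sketch of that cleanup-and-save step; the exact sentinel value and column names are assumptions for illustration:

```python
# Keep only objects with a sensible dereddened r-band magnitude,
# then save the cleaned table for later reuse.
good = (features["dered_r"] > 0) & (features["dered_r"] < 30)   # drops the ~ -9999 non-detections
features, y = features[good], y[good]

features.to_csv("qso_features_cleaned.csv")
print(len(features), "objects survive the cut")
```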
To do basic model fitting, we need to create a training set and what's called a testing set. We hold back some data on the side, take our model, predict against data whose answers we already know, and then see how well we did. So we create our X vector and our y vector: X is the features and y is the answers. We've got 9,988 examples of data that meet our filter criterion, in nine dimensions, and an output of the same size, so we're off to the races. Just for the purposes of this demo, we're going to choose half of the data for training and half for testing.

What did I just do, other than create a train set and a test set? What did I potentially introduce in terms of biases into the thing we're going to build? Yeah, that's right: it's not a random selection of the data points. I assume that what came back from Sloan was not ordered by magnitude or by location on the sky, which could have some very subtle impact, whether you're in the supergalactic plane or not, for example. We just introduced a whole bunch of potential biases in the way I created this train and test set. A better way to do it, or at least another way, would be to create a random test set; we'll see that scikit-learn has a whole set of methods that allow us to implicitly create train and test sets as we build up the model, without having to do what we're doing right now. So it's really important to bear in mind the assumptions you're making about the data as you build these models.

So we can build a linear regressor on this. Typically you instantiate one of these models, call it clf, and you can do tab completion to see all the different things that can be done with this model fitter. The first one, obviously, is to run fit; once it's been fit, we can get the parameters of the fit, et cetera, and we can apply it, if we want, to new data. So let's fit the data. All the learners will have a .fit method associated with them, so you can swap these different learners in and out and still apply them to the same format of data. We train on the X data given the y outputs, and we get a result. Then we predict on our test data and see how well we did: we get the mean squared error relative to our test data. Visually, we can see what that looks like, and you see that this is pretty crappy. Yet at some level it's actually pretty good, because it did the best job it could, for a linear regressor on nine dimensions, at fitting most of the data, but you see it was a massive underfit of the high-redshift results. Here's how well we would do if we just took the average of the training data: a mean squared error of 0.65, and if we'd just chosen the average, we'd have gotten about the same result. So this is not a very good regressor. You can also see from scikit-learn that we don't have to build up the notion of mean squared error ourselves; they have all these different scoring functions.

Before I jump into other types of models, I'll jump back into the lecture notes and introduce, I won't call it theoretical motivation, but at least visual motivation for these other kinds of classifiers and regressors. But anyway, I wanted you to see at the very beginning here that we've got a reasonable end-to-end flow: starting with raw data, building a model, and then using that model on new data. It just turns out to be a really crappy model.
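Here is a minimal sketch of that baseline fit, using a random split rather than the naive first-half/second-half split discussed above (the variable names follow the earlier sketches and are assumptions):

```python
# Baseline: linear regression for photometric redshift, with a random train/test split.
from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer releases
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(
    features.values, y.values, test_size=0.5, random_state=42)

clf = LinearRegression()
clf.fit(X_train, y_train)                    # learns one weight per feature plus an intercept
y_pred = clf.predict(X_test)

print("linear regression MSE:", mean_squared_error(y_test, y_pred))
print("predict-the-mean MSE :",
      mean_squared_error(y_test, [y_train.mean()] * len(y_test)))   # the do-nothing baseline
```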
Yes, a question: what does the fit actually do when we do the matching? Give me one second, let me just put something up here. So the question was: what are we actually doing when we do the modeling, what's happening under the hood when you run .fit? The machinery of scikit-learn goes through your input data and your outputs and comes up with a linear model that predicts those outputs. However the linear model is created and run within that framework (there's logistic regression, there are all these other ways of doing the modeling), it's essentially coming up with weights on the data: what are the nine numbers I need to multiply by to get out another number? It comes up with those and stores the results, so that when you apply the model to new data you get out what the predicted result should be, and because you know the answers on the test data, you can compare them directly.

What I want to show you now is this thing called k nearest neighbors, which is represented right here. The idea is: I have a new data point, and this is showing you a data point in two-dimensional space (I'm not showing you what the axes are), and I want to know what its value, its result, is. In a classification sense, you might say: if I have a parameter, which is called a hyperparameter, of k equal to three, I'll just take the three points around me, and if I know the two classes are class blue and class red, then I can say, probably, with 66 percent probability, this green dot should belong to the red-triangle class. If k is equal to five, you might say it belongs to the blue class. So this is a tuning parameter of a model like this. In the context of regression, what this is doing is saying: let me take the y-value outputs of those three things inside that first box and, say, average them together, or take their median, and you can define exactly how it does the combination, and that's your answer. This works in basically any number of dimensions, but as you can imagine, you get into very weird places when you're in a thousand-dimensional space, where really nothing is near you; almost by definition, nothing will be exactly near you unless you have a repeat of an instance from your training set, and so the notion of what that distance means becomes really strange.

One of the things that's really weird about k nearest neighbors and linear regressors and lots of the machine learning models out there is that they assume something implicit about the data. Does anyone want to guess what that is, based on some of the words I just used? I've got a nine-dimensional space in the problem I just showed. What are the units of all nine of those dimensions? They happen to be magnitudes. If we go back to the notebook, I've got magnitudes, but look at this: most of these are color differences, so their values are pretty close to zero, but I've got one which is the apparent magnitude of the object, and those numbers are nowhere near zero. When you build a linear regressor, there's no difference between this value and this value here, and when you're doing k nearest neighbors and trying to find the distance between you and the nearest object, the probability of having a difference of 0.4 in u minus g is much different than the probability of having
a difference of 0.4 in the dereddened r-band magnitude. All this is saying is that many of the classifiers and regressors we use have the implicit assumption that the data across all the dimensions are of roughly equal importance and share an equal notion of distance in a Euclidean sense. So when we build a distance metric, implicitly, in some of these machine-learned classifiers and regressors, we're implicitly assuming some metric in that space.

What would happen if I added another column here that was something qualitative, like "near a galaxy" or "not near a galaxy," or "near the galactic plane" or "not near the galactic plane"? How would a linear regressor deal with that? One possibility, of course, because now you're dealing with categorical features, is that you could turn those into zeros and ones: if you have a galaxy near you we call that zero, and if you don't we call that one. Then you have binary variables as some of your input and continuous variables as the rest. Many of the original classifiers used in machine learning don't do all that well in that context.

Let me now do a k nearest neighbors regression and see if we actually get an improvement in our model. We already saw the result of linear regression, and we saw that the mean squared error was not much better than what we'd get from just choosing the mean. Here we're going to deal with this metric-space issue and do something that tries to put all nine dimensions on the same footing: we're going to scale them, all to approximately the same distribution, so they'll have, I believe, mean zero and standard deviation one, and the scaler remembers what it did to the input data before we run the k nearest neighbors regressor. I'm going to use k nearest neighbors with k of 10, so I take the nearest 10 objects in my nine-dimensional space, and we'll use whatever the defaults are from scikit-learn. We fit, and you see the fit was pretty quick; it assumed a Minkowski metric, which is fine because I already did the pre-scaling. Then we see what our mean squared error is, and it's much, much improved: now the mean squared error is 0.23 as opposed to 0.6-something. You see that I still have the same sort of structure as before, in that I'm missing some of the high-redshift stuff. Effectively, these high-redshift things at a redshift of six were off by about four, so the model put them at around two, near the mean; it assumed all the high-redshift stuff was basically at redshifts of around two. But I got a result that's much better than what I had before.

I do have a hyperparameter, though. Is that result better because I chose k equals 10? What about k equals 5? That improves things a little bit. Let's try k equals 1, so we take literally the nearest neighbor, and you see the result is much worse; you can start to see some structure in the data here, which is a little bit scary, I know. Let's try 20; the results are all about consistent with each other. So here's an example where we had to do some preprocessing.
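A minimal sketch of that scaled k-nearest-neighbors regressor, scanning over a few values of k (the variable names follow the earlier sketches):

```python
# Scale the nine features to zero mean and unit variance, then try kNN regression.
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

scaler = StandardScaler().fit(X_train)                  # remembers the training-set scaling
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

for k in (1, 5, 10, 20):
    knn = KNeighborsRegressor(n_neighbors=k)            # default Minkowski (Euclidean) metric
    knn.fit(X_train_s, y_train)
    mse = mean_squared_error(y_test, knn.predict(X_test_s))
    print("k = %2d   MSE = %.3f" % (k, mse))
```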
What would happen if we didn't do that preprocessing step? We could try it. We've already preprocessed X, and I don't want to go back up and reprocess all of that, but it'd be worth trying for yourself what happens if you don't do the scaling. Okay, so here's a pretty good example. I played around with this for a while and found that k nearest neighbors with k of five gave me approximately the highest-accuracy answer. Any questions about k nearest neighbors? I'm going to jump to another type of classifier and regressor called random forest in just a second, but if there are no questions, I'll jump over. By the way, when you split up the space with k equals one, and this is again in two dimensions with three classes, red, green, and blue, this is the space you get if you sample densely across the learned space. With a different k you get a very different sort of classifier: for the same point (just randomly choose a place in your head) you often get a different answer depending on what k value you use. There's a nice thing on the web, by the way, that I wanted to show you; I don't know, these are just support vector machines, all right, I'll come back to this, it's support vector machines and random forests, and I'll be able to show you those visually and let you interact with them. Let me show you now a little bit about decision trees.

Yes? What's my favorite clustering algorithm? I often don't do clustering. I feel like the most interesting man in the world: I don't always do clustering, but when I do, I drink Dos Equis and I use k-means. Those tend not to do great in very high dimensions, though; there's something called DBSCAN which tends to be quite good. Oftentimes when I'm doing clustering it's because I really just want to see low-dimensional embeddings, so I'll often use some manifold tricks, which I'll show you towards the end of the lecture, to do that. Most of the time, again, because I'm not generally in the business of doing lots of unsupervised problems, I'll try to couch the problem in a supervised way, and then I don't need a clustering algorithm at all. Questions? Uh-huh.

Yeah, so the question is about periodic features: right ascension values being very close together because you wrap around the celestial sphere, for instance. That's a great example of where domain knowledge needs to come in. If you don't care about location on the sky, throw it out of your feature set; but if you do, because it's going to be informative, put it in terms that you believe will be most useful for the output you care about. If you want to know, for instance, whether this thing you just found is an asteroid, then rather than giving the model RA and Dec you probably want to give it ecliptic latitude and longitude, or ecliptic latitude only. That would be an example of taking that data and changing it to get closer to what you think is going to be informative. The chances of building an amazing classifier that learns, on its own, how to transform RA and Dec into ecliptic or galactic latitude are more or less zero; you're not going to build something that infers the math behind that. If you already know the math, by all means you should be entitled to apply it in what you would generally call the featurization step.
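As a small illustration of that kind of featurization, here is a sketch of turning RA and Dec into galactic latitude with astropy; the use of astropy and the column names are my assumptions, since the lecture doesn't prescribe a specific tool for this step.

```python
# Domain-driven featurization: replace raw RA/Dec with a coordinate that is
# actually informative for the question (galactic latitude here; ecliptic
# latitude would be the analogous choice for asteroids).
import astropy.units as u
from astropy.coordinates import SkyCoord

coords = SkyCoord(ra=df["ra"].values * u.deg, dec=df["dec"].values * u.deg, frame="icrs")
df["gal_b"] = coords.galactic.b.deg    # could then be carried into the feature set
```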
As most people who do machine learning in both industry and academia know, when you're applying it to real data and asking real questions of it, you're spending almost all your time in featurization. That munging we just did, which I breezed through really quickly and which was just really simple filtering and so on, is the kind of thing you're going to want to spend a whole bunch of time on, because it's the place where you're not only introducing more informative features, it's also the place where you start protecting yourself against bias.

There's another question over there. Yeah, so the question was: we just scaled the data as a preprocessing step before doing the k nearest neighbors; how much do the parameters of that scaling impact the results? The short answer is I don't know; we just have to try it. If we were going to build a real paper or a real result around this, we would spend a lot of time trying to understand how sensitive our results are to the different scaling techniques. It may be that we want our results to be very sensitive to the scaling because we want to highly optimize on the data we have; the flip side is that maybe I want to build a star-galaxy separator using k nearest neighbors that doesn't just work on Sloan data but works on other types of data, in which case we'd try other types of data after we finish the whole exercise and convince ourselves we got the kinds of accuracies we were expecting. So we really don't know the impact of the choices we're making until we actually explore and try them. And this is the crux of why all of you would be better at building a star-galaxy-quasar separator than if I handed this to the most advanced data scientist at, pick your favorite company, Facebook. They might know a lot more machine learning than you, or they might not; they might know a whole lot about all the intricacies of preprocessing data; but because they don't understand the domain, they wouldn't do as well, and they'd potentially overfit without even knowing it.

Any other questions? So let me jump now to decision trees and random forests. I'm going to present this section mostly in the context of classification, but it also has implications for regression. Before I do that, let me show you a very classic data set that people use in machine learning. This is called the iris data set; it comes prepackaged with scikit-learn, and here you've got essentially three, sorry, four dimensions of measurements on these various flowers and three different classes of what type of iris each flower is. Has anyone not seen this before? Okay, I'll breeze through it, because only a few of you haven't seen it. This is the thing you will typically apply your classifier or regressor to in order to see how well it performs. Here you can visually see that I can make cuts on the data that allow me to separate out the different classes very easily. For instance, in sepal length and petal width, if I made a vertical cut in the top-right quadrant, around 0.5, I would be able to perfectly remove the whole red class, and then I'd be left with needing to make some other cuts in another dimension to give me the separation between the green and the blue. That's in a classification sense. If I now wanted to ask, in a regression sense, given the values of, say, the first three parameters, what's the fourth parameter, can you actually predict that? That's another question you might ask of the data.
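A minimal sketch of pulling up that data set:

```python
# The iris data set: 150 flowers, four measurements each, three classes.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape, iris.target.shape)   # (150, 4) features, (150,) labels
print(iris.feature_names)
print(iris.target_names)                    # setosa, versicolor, virginica
```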
But let me step back to decision trees and how a tree views that. This is a decision tree built on the iris dataset. All you're doing when you build a decision tree is building a series of branching points. When a new input comes in, at the very top level we ask: is the petal length greater than 2.45? If it is, go right; if not, go left — and on the left you land in what's called a terminal node. Do I have a pointer? Yes — so this up here is a terminal node. If not, go down here and ask another question: is the petal width greater than 1.65? If it is, go down to the right; if the petal length is less than 5.5, go down here; and so on, until eventually you reach another terminal node, which for the instance I was just given tells me it was of type Iris virginica. That's classification. In the context of regression the idea is the same, except when you get all the way down you read off a value from the training examples in that leaf.

How you actually build up these trees is where the really interesting math lies, but the way to think about it is that you're trying to create the most separability between the classes you know about — or, for a regression problem, to extract the most information — with each split. There is always a split at every node, on one of the variables in your feature set, and how those variables are chosen as you progress down the tree is where the interesting math comes in. You can see here that I can build a perfect classifier: if x2 is greater than one, go down this path; if not, I've isolated the green class — a perfect classifier for that class. Then, if x1 is greater than one, I split again and can classify between blue and red.

Exactly how you do that is beyond the scope of this tutorial, but there are a couple of approaches, and most of them come down to trying to gain the most information with each split. Typically, while building a tree, you'll randomly choose a feature to try from your feature set, then effectively slide a pointer across all the values you could split on, and stop the pointer where cutting further stops gaining you information. The information gain, or entropy, in the Shannon information-theoretic sense, is simply defined over the different classes — and you can define a similar quantity for regression. Depending on which feature gives the best gain, you say: OK, for this node I'm going to cut on that one. Then, as you go farther down the tree, you take all the training examples that made that cut and keep going, and you have a stopping criterion for deciding when you're actually done and have reached a terminal node.
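A minimal decision-tree sketch on the iris data, to make the splits and terminal nodes concrete (illustrative only; printing the tree with export_text requires a reasonably recent scikit-learn):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Each split is a threshold on a single feature; the leaves are the terminal nodes.
print(export_text(tree, feature_names=iris.feature_names))

# A new instance walks down the splits to a terminal node, and the class fractions
# in that node give a crude notion of probability.
new_flower = [[6.0, 3.0, 5.0, 2.0]]
print(tree.predict(new_flower), tree.predict_proba(new_flower))
```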
Oftentimes you'll stop only when you get down to a single instance of your training data, but that usually means lots and lots of cuts to isolate every single training example. So one of the parameters of these decision trees is the minimum number of objects you need in a node before you stop, or the maximum before you stop; you can also imagine wanting to stop when the tree reaches a fixed depth.

Now, it turns out trees are really good for classification and regression, and what I'll show you here is that, given a classifier, when I put down new data I can say exactly where it lands in this space. What's interesting is that if new data shows up in this region — given a new x1 and x2 — not only do I get a classification (if I'm in this region, the answer is green), I also get some notion of the probability of which class I belong to. Here I had one example of blue and a couple of examples of red; if I land somewhere in this region, as partitioned by my decision tree, I get some notion of probability, which can turn out to be very helpful and useful.

Now I can show you a demo. Here is a demo of decision trees — or really the result of building not just one tree but multiple trees. When you build large numbers of trees, that's called a random forest (it goes by some other names as well). If we have a class of red and a class of green in our training data, this is how the space is partitioned by the ensemble of trees; as I put down a new value, it tells me which region I belong to, and you can also see some notion of probability. Out here I'm not sure whether I belong to green or red — in this case you'll still get a probability out, but because you're so far away from the original data, this might be an example where you give pause and say: I got a probability, I got an answer, but I'm not sure it's right, because I'm far from my original data. If I now plop down another data point — say a red one — and build a new set of trees, averaging over a bunch of them, the partitioning of the space changes; put one right inside here and it changes again; and so on. If I shift-mouse-click to add a green data point, notice I'm not changing much except making this quadrant very strong for green; but if I put a green right in here, things get pretty ugly — I've just drawn a little red box around that one red point. And it's not a fixed number of parameters: if I change the parameters of my model I get a different space out. With few trees — in this case one tree — this is how the space gets split up; with lots of trees — here, 189 — it's split up like this. The number of trees can actually be part of your protection against overfitting: with lots and lots of trees you might think you could overfit because you've seen the data so many times, but random forests do a pretty good job of protecting you against that without introducing too many biases.
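As a small sketch of that last point, the class "probabilities" from a forest are just the fraction of trees voting for each class, so they behave differently as you add trees (illustrative code on the iris data, not the demo on screen):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
ambiguous_point = [[6.0, 2.8, 4.8, 1.7]]   # sits near a class boundary

for n_trees in (1, 10, 200):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    rf.fit(iris.data, iris.target)
    # Vote fractions across the trees are the forest's notion of class probability.
    print(n_trees, rf.predict_proba(ambiguous_point))
```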
OK, I can also change how deep the trees are allowed to go, and so on. Any questions about what I'm showing you here?

Yeah, good point — is there a notion of errors in the data? Yes: because there are astronomers in the room, we know our measurements aren't perfect, there's noise in the data, and we have biases and errors in our measurements. But no, in this context random forest doesn't take those into account explicitly. You could take them into account by making the error on, say, the r-band magnitude another feature, but what you're not doing is telling random forest — or any of these other classifiers, by the way — that feature two is the error on feature one. It would just, potentially, learn that when there's a large number in feature two it should discount the values in feature one. One of my big knocks against a lot of the academic work in machine learning is that very little of it contemplates uncertainty in the input data; likewise, very little of it contemplates the fact that you might have mislabeled data. In this example here, our original data-taking may tell us this value is red when the class is actually green, and we've just messed up our classifier locally around that point. So in a Bayesian sense, where you might have a parameter for the probability that each data point is incorrect, you'd want to take that into account in your classifier — and we don't. The only way to really do it is some notion of brute force, where you simulate what your datasets would look like under different assumptions: you jiggle each data point around. It's entirely reasonable to create lots and lots of instantiations of your data by resampling what you observed from what you believe your noise properties to be, and then use that effectively much larger dataset to build a potentially more robust classifier. But that's a bit of a cop-out; again, most of the machine learning you're seeing here, and that's done in the real world, is very frequentist, and we don't typically have a good way of incorporating the errors on the data.

Any other questions? So the question is: given that we're changing the whole model space as we change the hyperparameters, how do we choose the right hyperparameters, the most appropriate ones? That's what's called model selection, or hyperparameter optimization, and it's what we're going to do a little later. It's somewhat of a black art; there's machinery within scikit-learn to help you do it and run all of it, but in the back of your head you should always be asking, as you do these model selections and hyperparameter optimizations: am I just overfitting, because I happened to choose the setting that gives me the most accurate answer? We'll talk more about that — it's a good thing you're thinking about it. I want to be cognizant of the time — when should we stop? All right, why don't I finish up on regression, then we'll come back and do a classification example, and after that a little bit of anomaly detection.
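Before we jump back in, here is a rough sketch of the brute-force resampling idea from that question about measurement errors. X, X_err and y are hypothetical arrays of features, 1-sigma errors, and labels; this is one possible way to do it, not a recipe from the notebook.

```python
# Build many noisy realizations of the training set by resampling each feature
# within its reported measurement error, then train on the augmented set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(42)
n_realizations = 20

X_aug = np.vstack([X + rng.normal(0.0, X_err) for _ in range(n_realizations)])
y_aug = np.tile(y, n_realizations)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_aug, y_aug)
```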
All right, so let's do some random forest regression. You instantiate a random forest regressor; here we're going to build not one tree but 100 trees; we set the criterion used to decide the splits at each node — here just mean squared error — and the minimum number of samples in a leaf, which will be one. These are the parameters you're going to have to get to know if you're using random forests. I'm running that now — you see it takes a little longer than the other ones did — and it just finished, so we can plot up the results, and you see my mean squared error is now lower than all the other ones I had before.

It should be obvious that this is a highly nonlinear learner: I'm not partitioning the space with a bunch of hyperplanes, I'm making very nonlinear cuts, and when it comes to variable interdependency it's the job of a good learner to figure out the relationships between the features implicitly. But what should also be clear about random forest is that it lets you make cuts on the data without doing that preprocessing step, and for that reason alone I tend to stay away from the linear regressors or k-nearest neighbors, where you have to define the metric space ad hoc. Every time a tree makes a cut on the data — deciding to go left or right — it decides based on the units of just that one column, that one feature. If I'm cutting on dereddened r-band magnitude, I'm making a decision based on the dereddened r-band magnitude at that split point; I have no reference, at that split, to the values of the other features. Whereas when I make a decision about how many neighbors to go out to, I have to make that decision in a large-dimensional space, which means all the features have to be scaled correctly with respect to each other. One of the beautiful things about random forest is that it doesn't require you to do that, so it takes away one of those really ugly places where, as a data scientist or a physical scientist, you'd have to imbue a lot of knowledge to get it right in a meaningful way.

So now we have some questions about model selection — choosing the best model — and for that we have this notion of cross-validation. What I just did was build a very poor man's cross-validator: I took half the data, held it out of the model building, built a model, and applied it to the other half just to see how well I did. But you can imagine instead building a model on, say, 80% of the data and predicting on the other 20%, then moving your window to another 20%, building a model on the first 20% plus the last 60%, and applying it to that second batch — and so on, until you have a prediction for all of your data, where each model was trained on 80% of the data rather than half, so in principle your models should be better. As you tune your model, you can imagine making those windows smaller and smaller. This is what's called cross-validation.
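Roughly what that looks like in code — a sketch combining the regressor we just ran with shuffled k-fold cross-validation; X and z stand for the feature matrix and redshifts from the earlier munging, and the module paths assume a modern scikit-learn:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rf = RandomForestRegressor(n_estimators=100,    # 100 trees
                           min_samples_leaf=1,  # grow until single-instance leaves
                           n_jobs=-1, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(rf, X, z, cv=cv)       # R^2 score by default for regressors
print(scores.mean(), scores.std())
```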
Visually, here's what that looks like — this is five-fold cross-validation as depicted here. It's drawn for a two-class problem, but it works just as well for regression. You have a test set and a train set: you take the first 20%, hold it out, build a classifier on the remaining 80%, apply it to the held-out piece, and even though you're applying it to a small amount of data, you store the results of that application. When you're done with all the folds, you've built up predictions on data that were never part of the training set, and as long as the models are trained with the same hyperparameters across each of these so-called folds, you get a very accurate measure of what it would be like to have a model built on all the data and applied to a new set out here.

By the way, oftentimes what people do is not just create a test set and a train set — they also build what's called a validation set. In that case, say you have 120 examples: you take 20% and put it fully aside, and you say, I'm not going to do anything with that, other than, when I'm ready to write the paper, applying my best final model to it and quoting that as the result I expect. Then you can do all these clever things with train and test to get a better and better model. It's considered very good practice to hold out that extra data in a validation set and not use it at all to train your model, because as soon as you start building models and changing hyperparameters based on the results you got on the test set — even though those data are in principle held out — there's some implicit crosstalk between the train data and the test data.

There are lots of different ways to do these cross-validation holdouts, as they're called. One that's often useful for classification is something called stratified k-fold. Say you have a large number of classes — I've got 12 different Hubble types of galaxies I want to classify into — and one class is a small minority: maybe there are only 10 S0s in my training set but a thousand of some other type. You can imagine that a classifier built naively on that data, a bit like we saw before with the redshift regressor, would try to fit most of the data. The nice thing about stratified k-fold is that it holds out the appropriate amount of data for the given distribution of classes in your training set. Don't worry if that's a little beyond what you've thought about so far; I just wanted you to hear it and think about it.

All right, so we're going to build a little helper function that tells us our cross-validation score, and we'll do the cross-validation using k-fold — essentially hold out 20%, train on the rest, and predict. Oh wait — I forgot to run this cell to get an answer. There we go. We'll do k-fold cross-validation with 10 folds, using random shuffling. This gets to one of the earlier questions about the potential bias we introduced by holding out the first half of the data versus the second half: the shuffling handles that for us. Then we can do it on other classifiers — this is clf2, the random forest — and you see this takes a lot longer, because it's building a more intricate model; we'll see how well it does. Any questions while this finishes up? Is this running on a Raspberry Pi? Is Rackspace all Raspberry Pis? I guess not — well, we'll see.
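Here is a sketch of the two ideas just described — a fully held-out validation set, plus stratified folds that preserve class fractions (X and y assumed as before; not the notebook's exact code):

```python
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Put 20% completely aside; only touch it for the final quoted result.
X_work, X_valid, y_work, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Tune and compare models with stratified cross-validation on the rest.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X_work, y_work, cv=skf).mean())

# Only at paper-writing time: fit on the working set and score the validation set once.
print(clf.fit(X_work, y_work).score(X_valid, y_valid))
```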
So what is this doing? It's actually doing a three-fold cross-validation, which means it's building three random forest models, and you see our mean score ends up better. This, by the way, is a score, not the mean squared error, so this number should be larger.

All right, I'm going to stop here; we'll take about a ten-minute break, come back and do some classification, some hyperparameter optimization, some parallel computing, and then try a little anomaly detection as well. I'm happy to take questions during the break if people want to come up.

Let's start up again. I've been making reference to classification problems; before we jump in, let me formally describe what I mean: predicting the discrete class — are you in class A, B, C, or D through N — of an object from an input set of features. So instead of a continuous variable that we're trying to predict, we're deciding which class an object belongs to. Of course, you can turn a regression problem into a classification problem — if you're above a value of one you belong to this class, if you're below it you belong to that class — and vice versa, you could take a classification dataset and say, if you're in class A, then from a regression perspective I'll put you somewhere in the bucket between zero and one. But the closer you can get to defining your problem in a way that's consistent with the question you're asking and the data you have, obviously the better. Anyway, if I've got a bunch of instances — examples of x vectors paired with y labels, say 150 of them — and I want to predict three different classes, that's exactly what defines the iris dataset.

There are lots of different ways to do classification within scikit-learn: you can turn a logistic regressor into a classifier; there's the k-nearest neighbors classifier we've already seen a bit; LDA (latent Dirichlet allocation); naive Bayes; support vector machines; classification trees; random forests; and so on. There's a whole kitchen sink, just as there was on the regression side, for you to make use of. And there's also something quite useful called feature selection, which is an automated way of deciding which features are informative and then rebuilding models based on what's informative and what's not.

Because I haven't shown it before, let me quickly introduce yet another way of doing regression and classification: support vector machines. You typically think about these in the context of zero–one classification — do you belong to the red-box class or the blue-circle class? A support vector machine is basically a mathematical model for building what are called hyperplanes (we're showing this in two dimensions, so it looks like a line, but in multiple dimensions it's a hyperplane) that separate the data nicely. In the iris example there are obviously some nice places where you could separate the red from the green; the question is where, and how, you would draw those lines. I could draw a line that cuts right up close to the red and say that everything to the left of it, in data we haven't seen yet, is red class, and everything to the right is the other classes — but is that the optimal place to put it?
The whole idea of support vector machines is that you try to place the hyperplane in between the data so as to create the largest possible gap — the largest "support", as it were. So line A here is the optimal line separating these two datasets: even though line B also separates them perfectly, it passes very close to this point and very close to that one, so it has less support. The larger the support you can get, the better. This works really well in many cases, because if the data are nicely separable by hyperplanes, then something like a random forest is at a disadvantage: a forest — or even just a single decision tree — has to make cuts that are always perpendicular or parallel to one of your axes, so it would have to make a cut here, then on a subsequent split another cut in another dimension, just to build up what is, for an SVM, very easy math. And it is an optimization: you're trying to find the hyperplane that gives you the maximal separation, and if you had, say, a red value right here, the constraint becomes how many members of one class are allowed to sit on the wrong side of the line. Is that clear conceptually? These linear separating models have some very nice properties, and you can get very good classification results from them.

Here's an example, though — it might be hard to see — where the data are blue on the inside and red around the outside in these two dimensions, so you simply can't build a hyperplane that separates the two. What you can do instead — if you click on this link — is project your data into a higher dimension than you had before, and then you can create hyperplanes that do give you perfect separability. Here we take the two dimensions, multiply them together, and push that product into a third dimension; now you can create a hyperplane that optimally splits the blue from the red, and then project it back down into the original dimensions. It turns out that in a very large — in fact, infinite-dimensional — space you can build a perfect classifier even for highly nonlinear relationships in the data. The problem is that you end up with very large models, and you come back to this problem of the metric space you're doing all of this work in: if this direction is magnitudes and that direction is which telescope you used, what is the metric in that space? Again, I come back to things like random forest, because when you make your cuts you're making them in the metric space of the single thing you're cutting on, which tends to be much more practical. All right — any questions about support vector machines before we go back to classification?
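As a small illustrative sketch of the kernel idea (not from the notebook): a linear kernel builds maximum-margin hyperplanes in the original feature space, while an RBF kernel implicitly projects the data into a higher-dimensional space before separating it.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
for kernel in ("linear", "rbf"):
    svm = SVC(kernel=kernel, C=1.0)   # C controls how many points may sit on the wrong side
    svm.fit(iris.data, iris.target)
    print(kernel, svm.score(iris.data, iris.target))
```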
What I will say — it's not even anecdotal, I guess you could look it up — is that for lots and lots of different types of problems, both in astronomy and outside of it, support vector machines rarely do better than things like random forests in real-world competitions. They have other issues as well: their model sizes can be very big, which matters if you have to move them around in memory, use them often, or hand them to other people. The typical rule of thumb is that an SVM model is about a third the size of the original dataset, which can be quite large. Plenty of people use support vector machines, but that tends to be mostly on the academic side. The other major issue is that there's no natural way to get probabilities out. With k-nearest neighbors or random forest you can just look at all the effective votes and say: well, for this new instance I voted mostly for green, so that's the answer, but I had a couple of red votes too — and you can turn that into a realistic notion of probability. With support vector machines there's no natural probability, and many times we want some notion of "I belong to this class with this probability."

Yeah — again, I completely agree: it's beautiful at the mathematical level, it just isn't practical, for all the reasons we have up here. Some of these models can be very computationally expensive — not necessarily to build, but to move around, and with a nonlinear kernel even training can be pretty expensive. They're also not all that informative: you're making all these hyperplane splits in some transformed space, and it's not really clear what's going on. Then again, because we care about probabilities, support vector machines don't give you probabilities. For me the biggest knock is that our data are heterogeneous — the features have all sorts of different units — and we have missing data. You can try to coerce a real-world or astronomy dataset into something you can build a support vector machine against, but in the end you're forcing a round peg into a square hole, whereas other models let you handle things like missing data much more natively. The last issue is that, in the context of classification, support vector machines are inherently binary — zero or one — so to build a multi-class classifier you're effectively building "this class versus everything else" for each class and then choosing the one that gave the best answer. So it's not very natural for multi-class problems either. For those at home who didn't hear it, this is a tweetable moment from Dave — can I rephrase it a little? If it's tractable and it's mathematically beautiful, it's probably not that practical.

OK, so now we're going to build a star/galaxy/quasar separator. Star versus galaxy would be pretty easy if we had shape parameters, but again we're going to use the original data we had, which is just photometry, so this is a somewhat non-trivial problem. We're going to pull over a thousand quasars, a thousand stars, and a thousand galaxies, remove all the bad data, and take a look: here we've got our object IDs and our different colors, and here are our different classes. We've now got 3000 objects in what's called a balanced-label problem — one third, one third, one third — which means the classifier effectively has an equal shot at making predictions for each class. This obviously doesn't take into account that there are far more stars you can see in Sloan than quasars; again, we're not using our prior knowledge here.
That is, given a source in Sloan and no other priors, if I told you I'd found a source, the first thing you'd say is "it's probably a star," because there are more stars in Sloan than other types of objects. If I said it's not a star, you'd say "probably a galaxy"; not a galaxy, then it's a quasar. So without any other data beyond "it's not a star," you can pretty much figure out what something is. But here, given a new object, in a purely frequentist way without any priors, I want to know from its data what type of object it is. We build our y value — so now, instead of redshift, a continuous variable, we have a three-class problem — and we run random forest on it with 200 trees.

Interestingly, before we even look at the results, you can count up the number of times each feature is split on during this process, and that count is indicative of how important the feature is for the answer you care about. I didn't go into all the details of exactly how you choose which feature to split on at each node, but effectively there's a hyperparameter that says: choose a certain number of features to try every time you get to a node, figure out which of them is best at doing the split — this is what's called a greedy algorithm — split on that one, and next time randomly choose another set. The typical rule of thumb is to try a number of features equal to the square root of the total number of features, so with nine features, as we have, it will try three random features at any split point, find the optimal split across those three, pick the one with the best information gain, split there, and keep going. If you keep splitting on the same feature to improve the information at each node, it stands to reason that feature is more important — and there's a more rigorous way of keeping track of importance in a random forest model, which you can then plot.

Was there a question? Right — so the number of estimators, this "how many features do I try" (which is called mtry), how deep the trees are, how many instances should be in each terminal node: all of those are called the hyperparameters of random forest; support vector machines have a different set of hyperparameters, and so on. I'll show you how you do that selection, but effectively you have a bunch of rules of thumb for what the values should be, and then you look around those values — you do either a grid search or a random search over the hyperparameters, with the little bird, or devil, on your shoulder reminding you that in doing so you may be overfitting, so protect yourself. There are ways to try to do that, but logistically the answer is: you do a search over the hyperparameters.

OK, so the u−g color is pretty indicative of whether you're a star, a galaxy, or a quasar; the difference between your g-band magnitude and your aperture, or Petrosian, magnitude is the next most important; these other two matter as well; and it turns out your i−z color is hardly important at all.
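A sketch of the 200-tree classifier and importance ranking just described, plus the out-of-bag score discussed next; the feature matrix X (colors and magnitude differences), the integer labels y (0 = quasar, 1 = star, 2 = galaxy), and the list feature_names are assumed from the earlier munging and are not the notebook's exact variables.

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=200,      # 200 trees
                             max_features="sqrt",   # the sqrt-of-n-features rule of thumb (mtry)
                             oob_score=True,        # keep the out-of-bag score (explained below)
                             n_jobs=-1, random_state=0)
clf.fit(X, y)

# Relative importance of each feature, built up from how useful its splits were.
for name, imp in sorted(zip(feature_names, clf.feature_importances_),
                        key=lambda pair: -pair[1]):
    print("%-12s %.3f" % (name, imp))

print("out-of-bag score:", clf.oob_score_)
```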
We can also get what's called an out-of-bag error — or out-of-bag score — when we do random forest, so let's figure out what "out of bag" actually means. The way to think about it is that whenever you build a tree, the first thing you do is decide which instances are going to be part of it. You choose your instances randomly from the data, but with replacement: if you happen to draw instance two — which could be, say, a star — you put it back into the bag, and when you draw again you might get number two a second time. You keep drawing until you've chosen as many instances as you have (I'll say it in the row-and-column picture: instances are the rows, features are the columns, and you're drawing N rows). You can do the math and find that you typically leave behind roughly a third of your data each time you build a new tree, and that's what's called the out-of-bag data. The out-of-bag data is left behind — it's not part of building that tree — but you can take those unused instances, apply the tree to them, record the answers, and if you store up all those results you effectively get a train/test split for free. So the out-of-bag score is: for all the data that randomly wasn't used, what were my predictions, and how well did I do? Here it means I got about 95 percent of them right with this classifier.

We can do the same thing with support vector machines, and there are lots of different kernels you can use to build your SVM; we'll print out what those look like and what the results are. Any questions, while this runs, about the statements I just made about out-of-bag error? OK — so what am I doing here? Let's take a look: I'm creating a bunch of different support vector machine models with different kernels — linear or nonlinear, which effectively lets you morph the metric space you're using a bit — plus some other hyperparameters of that model. We're not going to go into what all those hyperparameters mean, but it's useful to look at them if you want to play with this.

I don't know why this is taking so long — it's taking too long; I'm just going to jump back to my laptop. Just out of curiosity, does anyone know how to figure out what kind of machine we're running on, on the Unix side? cat /proc/cpuinfo — ah, it looks like a lot of cores. All right, let's see if this finished. What the hell, it's still going. All right, why don't I talk about hyperparameter optimization while this runs, and if it takes another minute or two I'll just jump back to the local version on my machine. So this is where you'd actually do the hyperparameter optimization you were asking about — it really shouldn't take this long. Here we grab this thing called grid search, and we can create a dictionary with a set of parameters, so that instead of individually creating all of those support vector machines by hand, the parameters get fed in as we create the different models we want to loop over.
In this case we build a large grid over three different parameters: which kernel, whatever gamma means, and whatever C means. OK, I'm going to just kill this — oh, we got some of the images. All right, here are two of the support vector machines and the spaces they end up creating. You can see the linear kernel splitting up the space as best it can across the three classes; with the radial-basis-function kernel you get some nicer behavior here — again, we don't know whether we're overfitting on this one blue dot or not, but that shows you the different spaces these models build out.

All right, let's look at what we're doing here: GridSearchCV will run the fitter on this type of model, passing in the different parameters and the different combinations of those parameters, and we can run it in a parallel way, so hopefully this will be faster, and then figure out what the best score is. We're doing 168 different fits — that finished pretty quickly, so apparently I have a lot of cores but a terrible CPU or something — and I got my best answer out here. You can see it's actually comparable to the random forest, and here's the best model: gamma of 0.1, a radial-basis-function kernel, and the other parameters we searched over — I think we also looked at degree. What's nice is that it runs over multiple cores: there's something called joblib, which is what scikit-learn uses to push this out to as many cores as it can get, running all these embarrassingly parallel jobs for you. So we did 168 fits in seven seconds, over however many cores there were.

Now we can look at what's called the confusion matrix, which is a nice way of showing how we end up labeling correctly, or mislabeling, between classes 0, 1, and 2. I forgot what we called those — 0 is quasar, 1 is star, 2 is galaxy — and we can get an actual numpy array showing the values. This means that, having built the classifier on the training set and applied it to the testing set, we got 231 quasars right, 254 stars, and 255 galaxies, and these off-diagonal entries are where we had some confusion between the classes. If you get a perfect classifier — a confusion matrix with no power off the diagonal and all of it on the diagonal — you've overfit, I guarantee it, or you're using a trivial dataset you shouldn't even be playing with. If you don't see some off-diagonal power, it means something in your training set is leaking into your testing set, or one of your features knows about the answer in a way you don't want it to — like if I'd left in a label column with QSO, S, and G in it, we'd probably get a "perfect" classifier, because it would figure out that that column exactly predicts the thing I want. Another way to check whether your classifier is any good is to put the actual answers into your feature set, run your classifier, and the most important feature had better be the answer; if it's not, you have a bad classifier, or you've done something else horribly wrong. These are the kinds of things you have to start thinking about as you build these up.
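Here is a hedged sketch of the grid search and confusion matrix just shown; the parameter values are illustrative, and X_train, X_test, y_train, y_test are assumed to come from an earlier split rather than from the notebook verbatim.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

param_grid = {"kernel": ["linear", "rbf"],
              "C": [0.1, 1.0, 10.0],
              "gamma": [0.01, 0.1, 1.0]}

search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)  # joblib fans the fits out over cores
search.fit(X_train, y_train)
print(search.best_score_, search.best_params_)

# Rows are the true classes (0 = quasar, 1 = star, 2 = galaxy), columns the predictions;
# the off-diagonal entries are the confusion between classes.
print(confusion_matrix(y_test, search.best_estimator_.predict(X_test)))
```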
OK, so let's do the same thing — the question earlier was how many trees we used — but now loop over a whole bunch of different settings for the hyperparameters of this random forest: 108 different combinations of models, and hopefully this will also finish in a finite amount of time. OK, that was pretty good. So we got an answer that's a little bit worse than what we had before with the models I chose by hand. That doesn't mean this one is inferior; it just happens that, because there's a random component to the construction of the model, we got slightly worse results this time. I thought I saw a hand somewhere — OK. So here's our answer: max features of three, a minimum of two samples in a leaf node, and 200 estimators. We can get those parameters and save them for later: these are the best parameters of the random forest we built, and here are the best parameters of the support vector machine we built. I won't go into the details, but you can also set what your scoring function is, so when you're choosing which hyperparameters are best you don't have to use the default, which is probably mean squared error for regression and plain accuracy for classification. Any questions about that?

It should be obvious that grid search is neither RAM-optimized nor compute-optimized. If you look at the code doing the grid search, it's a bunch of nested for loops, and oftentimes it won't cache, say, a preprocessing step — so if you had a pipeline as part of your grid search, you might want to save that data and reuse it inside the loop. These searches are not always optimal; if you have a lot of compute power and a lot of time to sit around, that's fine. It turns out somebody wrote a paper fairly recently showing that if you just do a random search over your hyperparameter space, you get an answer that's pretty good and pretty close to the best one, because over multiple hyperparameters most models are only sensitive to a few of them; searching the entire grid over all the hyperparameters you think you care about wastes a lot of computation in a space that isn't actually all that useful. There's also a whole Bayesian formalism for deciding which hyperparameter values to try next given the results of all the previous runs, and it works irrespective of the model you're building.

I want to jump back into the notebook and go into a little more detail on the questions you'd be asking when doing hyperparameter optimization. How do I choose a model, once I've done all the featurization and preprocessing, is really the critical one. For k-nearest neighbors it's: what number of neighbors? For support vector machines: which kernel, and what bandwidth? For random forest: how many trees, and what mtry? For Gaussian processes it's a different set of questions again. I showed you this already, and then there are lots and lots of different metrics you might use for that optimization, and lots of ways to plot how well you did with the models you end up choosing.
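Going back to the random-search alternative mentioned a moment ago, here is a rough sketch of what that looks like for the random forest hyperparameters; the distributions are illustrative, and X_train, y_train are assumed from an earlier split.

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

param_dist = {"n_estimators": randint(50, 400),      # how many trees
              "max_features": randint(1, 9),         # mtry, for a 9-feature problem
              "min_samples_leaf": randint(1, 10)}    # size of the terminal nodes

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions=param_dist,
                            n_iter=20, cv=3, n_jobs=-1, random_state=0)
search.fit(X_train, y_train)
print(search.best_score_, search.best_params_)
```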
Obviously I won't go into all the details here — I'll post this notebook in the Astro Hack Week GitHub repo so you can look at it later. All of these things are available to you, and you may decide none of these metrics is the one you actually care about; as we'll talk about later, you might not want to optimize on something like accuracy but on something much more closely related to the problem you have at hand.

What I will say — and this is from a paper we wrote in an astronomy context — is that we looked at a bunch of image differences from the Palomar Transient Factory, with labels for whether each thing that looked like a new source in the difference image was a real source or not, a transient or not a transient. This is what we call the "real–bogus" problem. Given the same input featurized data — I think it was 42 dimensions in this case — what you see is that, in all the metrics for false positives and false negatives, random forest beat everything else, with each of the different models at its best tuning. I won't claim this is a universal curve you'll always see, but if you're asking which model to try, I'd often try random forest first, and maybe, if that doesn't give you the results you want, support vector machines; the other ones tend not to do as well. And again, we already talked about the question of pulling probabilities out. In this case we want to minimize the false positive rate and the false negative rate — the type I and type II errors — so you want to push that curve down toward the bottom left, and at all operating points we found random forest did better.

We're not going to do the breakout session, so I'll keep extolling the virtues of random forest and point out that a random forest is built into the Kinect. All that's running inside the Kinect hardware is a random forest that's been pre-trained on lots of — probably post-docs at Microsoft, maybe grad students, probably post-docs — who walked around and then labeled which part of the body each pixel belonged to. The random forest is figuring out, and in this case colorizing, which body part you are at the level of a given pixel: given the input images, at each pixel it asks, what body part are you? And because a random forest is effectively a bunch of if–then statements, it can run incredibly fast on the prediction side. So that's what they did: they took all this input data — left hand, right hand, shoulder, neck — built a strawman model of what that should look like to create some notion of tracking, and then built a classifier at the pixel level. Here are the post-docs doing the crazy things you do in front of a Kinect; they had something like a million training examples.

All right, let me just finish the hyperparameter optimization material and then we'll jump back and I'll show you some other types of models and some other examples of using machine learning in astronomy. So we run this without doing a full grid search, and we get an answer that's pretty comparable to what I had before.
We've been using this thing called joblib. Has anyone heard of Dask? OK, a few people — again, it's the people on the right, the Python 3 people. Interesting. The Dask folks have built essentially a distributed computation scheduler that lets you make use of multiple threads, multiple cores, and even multiple computers. It keeps track of which computations are going to be needed farther down the computation graph, keeps in memory the things it believes it will need later, and, once it's done with them, evicts them from memory — so it's a computationally and RAM-efficient way of doing computation. If you install dask and distributed, we can try to run this grid search using Dask, which does the distributed part in a completely different way. Here I'm pulling in a very "not safe for work" package — not in the traditional NSFW sense, but in the sense of not being ready for production — called dask-learn, which tries to use Dask to rewrite the scikit-learn fitters and optimizers. We'll pull that into our namespace and hope it works. Oh — no module named dask; I forgot to run the install, because I'm running this in the cloud, not on my machine. Installing dask, extracting packages... OK, this should work now. Yep, that worked.

So now we do the same exact thing, but instead of the grid search that comes with scikit-learn we use the Dask version of grid search. This one may actually take a little longer — we'll see — because it sets up a lot more infrastructure; but if we had many more parameters, or ran this on a very large cluster where, say, each model took five minutes to build, this would totally win. We'll let that run. I won't execute this cell here, which lets you build up a cluster and run over it — I'll come back to a different kind of cluster: clustering. So this took 31 seconds, and we got about the same answer as before; timing-wise, the other one took nine seconds, so scikit-learn is still much faster here, but there are workloads you can find on the web where dask-learn is actually much faster and more computationally efficient. What I didn't do is keep track of RAM usage; one of the conceits of Dask is that it's simply more efficient with RAM.

All right, I'm going to jump now to a couple of other things I wanted to show you. Who's heard of deep learning? OK, now almost everyone, not just the Python 3 people. The idea around deep learning — what people are very excited about — is that not only does it give very good, probably the most accurate, answers on a class of problems, typically around language understanding, voice-to-text, image processing, and video processing; the thing people really like is that you don't have to do a lot of featurization. You can throw raw data at a deep learning network, which, in the end, is just taking your raw data and deciding how to combine different pixels from different places in, say, an image, with weights it learns through a bunch of very clever techniques that allow it to learn in a finite amount of time.
It then takes the results of whatever it extracted from your raw data into what's called the next layer, where another set of weights is multiplied against that data to get to the layer after that. The idea, for image processing, is that you don't have to know what it means to be an edge in order to detect one, and you don't have to know what it means to be a cat to find cats in images. As long as you have an objective function on the other side — say you're trying to classify person versus cat, a two-class output — and you have enough data, you let the machine figure it out. So unlike what we did throughout this tutorial — take the raw data, do some feature engineering, throw some things out, build some new features, and then run the classifier — with deep learning the idea is that the feature engineering and the classification happen in the same process. It's very computationally expensive, although there have been a large number of advances in the field that make it at least tractable, and because much of it is matrix multiplication, these are things you can do very effectively on GPUs, or even FPGAs.

It's worth looking at this result here, where people built a network to do inference on galaxy images — essentially to classify galaxy images using raw Sloan data. You can see their input was RGB data, and they built up a network to get to the answer they cared about. I'm not going to go into what all the different types of layers are and how all the data gets combined, other than to say: yes, you give up on feature engineering, and that's a big plus-one for deep learning, but in the end there's a whole black art in how you build up your network. How did this person arrive at a network that looks like this? Everyone's network ends up different, and you get different results depending on your architecture and on all the hyperparameters of training that network. All of that you still have to do. So we've pushed aside featurization — which, as domain experts, I tend to like a lot, because it lets you commune with your data, ask questions you know are physically sound and relevant, and build features that are going to be informative before throwing them into a classifier — and instead you potentially throw in the whole kitchen sink and hope a good answer comes out. Now, there's nothing forbidding you from using deep learning networks on featurized data, but these networks tend to work well on two-dimensional data, where there's some notion of local connections between pixels. You certainly could build a deep network on an input vector like the ones we've been operating on, but typically it wants to work on two-dimensional images — or at least that's how it's been used to most effect recently.
Any questions about deep learning? Yes — and look, just as machine learning is not the answer to all your inference problems in astronomy, deep learning is not the answer to all your machine learning problems in astronomy. In fact, for the image problem where we looked at real–bogus, we did a traditional featurization step and then ran random forest; we also tried deep learning and got inferior answers. The reason is that while you can take networks that have been pre-trained on other images and apply them to your images, what's actually happening in many cases is that they've learned a lot about the details of the questions asked of the original image set, and we didn't have enough data to train what is effectively millions and millions of nodes in such a network, so we just got inferior answers. Here was a case where we needed to put something into production, working on real-world telescope data effectively in real time, and deep learning just didn't work. You could argue we're not deep learning experts and didn't know how to build the networks correctly, but we spent enough time on it — I think Danny banged his head on it plenty. It's not going to solve all of your problems.

Another approach to deep learning that I think is gaining more and more traction is this notion of autoencoding, for those who have heard of it. Instead of trying to predict an outcome with labels, you try to reconstruct the data you put in. You have a deep learning network shaped like this: you're compactifying your data, summarizing it more and more, so that by the time you get to the deep layers you have, in principle, learned concepts about your data; whereas at the early layers, especially for image processing, you're learning the kind of low-level features you'd apply to data if you were doing it from scratch — things like histograms of gradients and other low-level featurization you'd ordinarily do if you didn't know about deep learning. It's been shown that the layers closer to the raw data actually learn traditional filter-bank techniques intrinsically, which is pretty amazing, while farther down you get concepts, and deeper concepts, of what it means to be the data you've given it. If you then take that compact notion of your data and start expanding it back out — so the network doesn't keep getting smaller, it now gets bigger again — you end up trying to predict the data you started with. That's what's called autoencoding. What's nice about it is that you don't have to come at the problem with a classification task in mind: you can just say, I've got a whole bunch of data, a bunch of images, and I want a deep network that gives me back the data I started with; then I can take the stuff in the middle — the deep concepts — and in principle use those as features for another learner. So what we've been doing at my company, and what we've started thinking about in the astronomy context, is marrying deep learning with something like random forest, where the deep learner is not the classifier itself but just the thing that builds the features for us.
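A minimal sketch of that idea — an autoencoder whose bottleneck activations feed a random forest. This is not the speaker's pipeline; it assumes a Keras 2-style API is installed, and X is a hypothetical array of flattened images scaled to [0, 1] with labels y for the downstream classifier.

```python
from keras.layers import Input, Dense
from keras.models import Model
from sklearn.ensemble import RandomForestClassifier

n_inputs = X.shape[1]
inp = Input(shape=(n_inputs,))
code = Dense(32, activation="relu")(inp)           # compactify ...
code = Dense(8, activation="relu")(code)           # ... down to a few "concepts"
out = Dense(32, activation="relu")(code)
out = Dense(n_inputs, activation="sigmoid")(out)   # ... and expand back to the input size

autoencoder = Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=20, batch_size=128)   # the target is the input itself

# Use the bottleneck activations as features for a traditional learner.
encoder = Model(inp, code)
features = encoder.predict(X)
clf = RandomForestClassifier(n_estimators=200).fit(features, y)
```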
That's definitely something worth exploring — if somebody wanted to build a hack around it this afternoon, that would be pretty exciting, and I'm happy to talk with them about it. The other thing I should say, in the context of deep learning and more generally about machine learning, is that when you have more data you traditionally do better, even with the same learner and the same featurization process. This echoes something Peter Norvig at Google said a while back: more data beats clever algorithms, but better data beats more data. So get more data, but make sure it's pretty good. Oftentimes, once you've spent a bunch of time featurizing your data, extracting what you think is the most information, and throwing it into a good classifier, the only way to get a much better result is to get more data.

There was a question — good. So the question is: what's the rule of thumb for whether to use deep learning or something else? I don't have a great one, but I'll say a couple of things. If your data are images, or sonograms, spectrograms, things like that, that's a pretty interesting case to start thinking about deep learning, because those are the obvious places where it's doing really well. If you have lots of metadata and heterogeneous data — categorical features alongside numerical values, maybe even strings — that's a pretty terrible place for deep learning; the only way to use it is after doing a whole bunch of featurization, and once you've done the featurization you might as well just use a traditional learner. So if your data look like that, deep learning is a little scary. If you're looking at pixel data, where every pixel runs from 0 to 256, that's a good place to think about it. In terms of data volume: millions of images is clearly enough, and you can do well with tens of thousands. There's also a whole notion called transfer learning, which I alluded to, where you take a network that's been built on images taken off the web and apply the whole thing to your images; in principle the features and concepts it has effectively learned can be reused, and that lets you get away with smaller amounts of data. I think once you're at the millions-of-instances level you can credibly start thinking about deep learning. People on the web are maybe flaming me right now, saying no, there's an example where you can do it with ten instances — but at the 10-, 100-, or 1000-instance level you really should be using some of these other techniques.

Any other questions? I missed the beginning of your question — I mentioned the concept of weights, right. So, to step back: a big part of the work people do in machine learning at the academic level — and a lot of it has moved into R&D centers at companies — is figuring out optimal ways to do the optimization problem: how do I build up the network, or, in random forest, what's the optimal way to make the splits to get the answer I want, where "optimal" now includes not just accuracy but RAM efficiency, CPU cost, and so on. A lot of the work is on building up the networks and learning how to learn.
Any other questions? Sorry, I missed the beginning of your question; I mentioned the concept of what? Weights, yeah, right. So, the process of building up these networks. What should be clear, just to step back, is that a big part of the work people do in machine learning at the academic level, and a lot of that has moved into R&D centers at a set of companies, is figuring out optimal ways to do the optimization problem: basically, how do I build up the network, or, in random forests, what's the optimal way to make the splits to get the answer I want, where "optimal" now can mean not just accuracy but RAM efficiency, CPU cost, et cetera. So a lot of the work is on building up the networks and learning how to learn. Once the network has been built, it's just a matter of cranking data through, and one of the nice things about things like random forests and deep learning networks is that it's not that computationally expensive to take the data, throw it through, and get your answer out really quickly, without a whole lot of extra math. But the question about how the weights are created comes down to how you decide to build the network. Oftentimes the weights are randomly assigned at the beginning, and then you have this notion of what's called backpropagation, where you start from the answer and there's a mathematically practical way to update the values of the weights going backwards; there are also forward ways of doing it. Those are all techniques that people are working on to try to make this better.

All right, so we already talked about improving models. I guess what I'll do, let's see how we're doing on time, is show you a couple of examples of machine learning approaches to classification that we've done. I told you about the zero-one classifier for real-bogus; we've also done this on variable stars. So now we've worked in the time domain: we've taken photometry and other things and we wind up building a classifier off of that. Here we took what is very heterogeneous data: sometimes we've got photometry in R-band magnitude, sometimes it's regularly sampled data, oftentimes it's not regularly sampled in time. Instead of throwing in the raw data, and RA and Dec and all that stuff, we build a whole bunch of features off of it. You can build features in the frequency domain, you can build features with order statistics, et cetera: variability metrics, periodic metrics, shape metrics on the light curves, and then context metrics, like where this thing is in the sky and what's near it. And we were able to get some pretty interesting results out of that.

We wound up building, and this is still a work in progress with some people that are in the room, essentially a framework that allows any of you, and hopefully a lot of your colleagues, to get not just the pipeline that I showed you at the IPython notebook level, but even data handling, project handling, and reproducibility notions out of your data. This is a project we call cesium, and we just released the initial version of it. It provides a lot of the featurization capabilities around time-series data, plus access to scikit-learn modeling, feature selection, and so on, and even plotting, so that you can build up entire frameworks around the inference problems you might have in the time-domain context. This is different from scikit-learn in the sense that scikit-learn is a set of models: it focuses very much on the modeling part, not so much on the featurization part. We've spent a lot of time on the engineering components around featurization, and our idea is that we're going to be able to take this to other domains as well, like seismology and neuroscience, and take what is fundamentally a similar type of signal, just measurements as a function of time, to predict outcomes and use essentially our feature bank to get good answers. So I'm happy to talk with people about that if they want to start using it and playing with it; there's a way to install it and get it working on your laptop with Docker.
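For anyone who wants to try cesium, a minimal sketch of the featurize-then-model workflow might look like the following. The light curves are synthetic, the specific feature names are taken from memory of cesium's built-in feature list, and the return value is assumed to be a pandas DataFrame in recent versions, so check the cesium documentation before relying on this.

```python
import numpy as np
from cesium import featurize
from sklearn.ensemble import RandomForestClassifier

# fake irregularly sampled light curves: per-object arrays of times, mags, errors
rng = np.random.RandomState(0)
times = [np.sort(rng.uniform(0, 200, 60)) for _ in range(30)]
mags = [15 + 0.5 * np.sin(2 * np.pi * t / rng.uniform(1, 50)) + rng.normal(0, 0.05, 60)
        for t in times]
errs = [np.full(60, 0.05) for _ in range(30)]
labels = rng.choice(["rr_lyrae", "eclipsing_binary"], 30)

# a handful of cesium's built-in variability / periodic features
# (names assumed; recent cesium versions return a pandas DataFrame)
fset = featurize.featurize_time_series(
    times=times, values=mags, errors=errs,
    features_to_use=["amplitude", "std", "skew", "freq1_freq"])

# hand the feature table to scikit-learn exactly as before
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(fset.values, labels)
```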
We applied the classifier that we built on variable stars to something like 50,000 stars from ASAS, where we only had a training set of 810 over 25 different classes. So this is a hard problem, and our training-set size was really small. We try to get out of light curves like this the probability of, say, it being an RR Lyrae, and we try to do that as far back toward the raw data as we can. We got something like a 15 percent error rate, which is pretty good across these multiple different classes. We built a website called bigmac.info, if people want to check it out, which allows you to peruse the hierarchy of variable stars and then click into it. So here's "pulsating", and we'll go into the different types of RR Lyrae; we'll get a fundamental-mode or overtone RR Lyrae out of this, and these are the ones that were predicted, 405 of them out of the ASAS catalog. We show the probability of it belonging to that class, and then you see the full probability vector over classes that comes out, you see the raw data and the folded data, and then we made it social so that Facebook could buy us one day. But anyway, for those of you working in variable stars, the whole idea is that you've got to start getting used to probabilistic catalogs. When we say this belongs to this class or that class, the real answer is that it belongs to this class with this probability and to that class with that probability. Oftentimes as astronomers we want to take spectra of things that are of a given type, but intrinsically we know there's some chance they may not be part of that type. We're trying to formalize that in studies like this, which allow us to put these different objects into different buckets and do science across those buckets. I can go into the details of that if people want to ask me about it offline.

The last thing I'll say, and then I think I'll end, and I'll leave for you in the notebook some material on doing anomaly detection, is that we did a study with a student of mine named Adam Miller where we looked at Stripe 82 in Sloan, which was effectively a five-color study in time of all the stars that had shown some level of variability over several years. We then built a model on the variability metrics, using all the parameters I showed you before, to predict stellar quantities that you could previously only get out of spectra: temperature, surface gravity (log g), and metallicity. We got about 5,000 spectra or so; they existed in the catalog already. And we wound up showing that using time-domain parameters plus colors, we were able to get root-mean-squared errors comparable to what you would get out of low-resolution spectroscopy itself. Here we used random forest, we did lots and lots of featurization, and we tried to protect ourselves as much as we could against overfitting. This is just another example of using machine learning in a real-world context. What we wouldn't claim is that you could take our model, apply it to another time-domain survey, and get answers that are just as good; that's a whole separate problem. But in principle, if we got more Sloan data in other parts of the sky, without taking spectra, we could learn what the fundamental parameters of a star or stellar system are.
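A schematic version of that kind of study, with made-up numbers standing in for the real variability features, colors, and spectroscopic labels: a multi-output random forest regression, scored by a held-out root-mean-squared error per stellar parameter, which is the quantity you would compare against low-resolution spectroscopy.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# made-up feature matrix (variability metrics plus colors) for ~5,000 stars,
# and the spectroscopic targets we want to predict: Teff, log g, [Fe/H]
rng = np.random.RandomState(0)
X = rng.rand(5000, 30)
y = np.column_stack([rng.uniform(4000, 8000, 5000),   # Teff (K)
                     rng.uniform(0.0, 5.0, 5000),      # log g
                     rng.uniform(-2.5, 0.5, 5000)])    # [Fe/H]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# random forest regression handles all three outputs at once
reg = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)

# held-out RMSE per parameter
rmse = np.sqrt(((reg.predict(X_test) - y_test) ** 2).mean(axis=0))
print(dict(zip(["Teff", "log_g", "Fe_H"], np.round(rmse, 3))))
```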
All right, so I'm going to end there; we'll have time for a couple of questions. But let me just say again, to reiterate: think of machine learning as another set of tools in your toolbox. If you haven't been trained on it, and by the way I've just given you an opening to all of this if you haven't seen it before, so don't consider yourself well trained, there are lots of ways to hammer your thumb with your new tool. You may think you're building this amazing house, but in the end you've built complete crap; that happens a lot. Machine learning is fraught with places where you're introducing biases that you didn't know about. But it can be very powerful, and as long as you're asking a question that's appropriate to the data you either have or plan to get, where machine learning is the right type of tool, you're off to a good start. When in doubt, it's worth asking other people in your field, and in particular, for those whose home institutions have stats departments and people working in machine learning, it's not a bad idea to go to them and say, here's the type of problem I'm trying to solve. Give them a little five-minute tutorial on the kind of science you're trying to accomplish, and let them tell you whether they think it's appropriate for machine learning or not, because oftentimes it's not appropriate. So just bear that in mind.

I often come back to this nice quote from Jim Gray, who was at Berkeley and then at Microsoft, sort of the prototype of a modern data scientist. His quote was, "I love working with astronomers because their data is useless." And he meant it, because if you're at Microsoft and you're trying to build new algorithms against data, the data you typically have access to has personally identifiable information, and if it leaks out that you're using it, that's really bad. But he loved working with astronomers because, with our data, who cares? Oh wow, you showed me a galaxy I wasn't supposed to see; it's not the end of the world, you're not going to start a war. So statisticians and computer scientists actually really like working with astronomers: our data is pretty big, it has some interesting properties, noise aside (most of them don't like to think about that), and there are lots of questions to ask of that data where machine learning may actually be useful, so they can hone models, try new scaling curves, and build new benchmark data sets around it. But the flip side isn't always true. Just because you've got data and you're an astronomer and you've been at this hack day doesn't mean you should be trying out all these cool, fancy new approaches that you've been reading about at the blog-post level against everything. So just be careful, and if you're careful I think you'll go a long way. I'll end there, and I'm happy to take a few questions.

Yes? Yeah, so the question is, have I seen machine learning being used in places where it's completely inappropriate? Yes. I mean, oftentimes if it's truly inappropriate it won't make it all the way out to the public sphere, but there are lots of examples from people who reside in computer science and stats who have applied machine learning, using astronomy data, to questions that aren't all that interesting to astronomers. I'm not going to name names. But they have the other problem of not knowing what the right questions are to ask, and of not being able to evaluate independently, other than at the accuracy level, how well they're doing.
So if you build, say, a photometric redshift estimator on photometry and you only look at the R-band magnitude or something like that: even if you get an estimator that's pretty good, I mean, that's okay, but we know that for all intents and purposes, when you're doing photo-z's you're doing it so that you can do something in cosmology, and there you also need your uncertainties, and you're probably bringing lots of other data to bear. So looking at this one benchmark data set that has just that data available is cool because they used astronomy, but it's not all that useful for any of us. Any other questions? Dave.

We're obviously very interested, in my group, in probabilistic catalogs, and one issue is that if you want to combine data from different sources, you really want likelihood information to combine the data, not posterior information. Implicitly, random forest generates posterior probabilities that things are in different classes, so there's an implicit prior that's been applied. First of all, just as a general comment, it's very important that when we release probabilistic catalogs we also release the prior information, the implicit prior information. But can you say a few words about what the implicit priors are for the random forest classifier?

Yeah, so the one obvious implicit prior is that the distribution of labels in the data you haven't yet applied your model to is the same as the distribution in your training set. So if you're building a star-galaxy classifier: again, there are vastly more stars in the Sloan catalog than there are going to be quasars or galaxies, I think that's true, and if you now apply it to, let's say, a pure subset of galaxies, you're going to wind up getting answers that don't make sense, because I already knew they were all galaxies, so why am I sometimes getting the answer "star"? So that's there; the imbalance in the data set is, in principle, being learned by all these different classifiers. You can figure out ways to post hoc pull that assumption out without breaking the classifier itself, but that's an obvious one. The other obvious one is that you're assuming the new data you apply the classifier to isn't just drawn from the same universe, it's also acquired with the same noise properties, in the same part of the sky. So again, if I built a star-galaxy classifier and I just happened to train it in places where there are a bunch of stars, like in the galactic plane, and then I applied it blindly somewhere else in the sky where there are just more galaxies around, you're going to get wrong answers; or if I trained it not on the Sloan catalog but applied it to PTF, I'd get wrong answers. So there are the obvious mismatches between what you're actually looking at and how you obtained the data. There are probably other, more subtle priors that I'm not coming up with off the top of my head, but those are the obvious ones. Okay, thanks.
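As a footnote to that answer, here is one simple illustration of the implicit prior and of a common post-hoc correction. This is a generic reweighting trick, not necessarily what was done in the catalogs discussed above: divide the random forest's class probabilities by the training-set class frequencies and multiply in the prior you believe applies to the new field.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# imbalanced toy training set: far more "star" labels than "galaxy" labels
rng = np.random.RandomState(0)
X_train = rng.rand(1000, 5)
y_train = np.array(["star"] * 900 + ["galaxy"] * 100)
X_new = rng.rand(10, 5)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# the class frequencies in the training set act as the implicit prior
train_prior = np.array([np.mean(y_train == c) for c in clf.classes_])
proba = clf.predict_proba(X_new)

# post-hoc correction: divide out the training prior, multiply in the prior you
# believe holds for the field you are now looking at, then renormalize per object
new_prior = np.array([0.8, 0.2])  # e.g. a galaxy-rich, high-latitude field
adjusted = proba / train_prior * new_prior
adjusted /= adjusted.sum(axis=1, keepdims=True)
print(clf.classes_, adjusted[0])
```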