Okay, that's working. All right, hello. I see some faces I know here today, so that's good; I have some people in the audience who can clap for me even if I say something stupid.

So, the talk today, I expanded it to cover both the R and Python ecosystems. I was curious, and I don't want to pigeonhole people into R or Python, but could you raise your hand if R is your main language? Okay. And how about Python? About half and half, good. Any other languages you want to share? Julia? Okay. So this is for R and Python users, and I'm glad you're both here.

Kevin just said a little about who I am, and some of you already know me. Right now I work as a statistician and machine learning scientist at H2O.ai, a company in Mountain View, California, and the main thing we do there is produce open source machine learning software meant for big data and scalable machine learning. If you're familiar with scikit-learn in Python or the caret package in R, the idea is similar: produce a platform, really a library, that has a bunch of different machine learning algorithms. So instead of using one library with one set of conventions over here and another library for something else, you have all of them together, working the same way.

Kevin already said this as well: I have a PhD in biostatistics from here, Berkeley, I've worked as a data scientist at a few different startups, and I've written a few machine learning software packages, mostly in R, though I do both R and Python. As a statistician, I guess I'm more of an R-primary person.

Here's the outline. First I'll say a little more about what H2O is and who we are. Then I'll describe the point of the H2O platform and how it's different from scikit-learn or some of the R libraries. Then I'll talk about the R API and the Python API. If we have time, I'll say a little about a project called H2O Ensemble, which I work on, and which I started while doing my PhD here. That's actually how I got involved with H2O: I wanted to build a scalable ensemble machine learner, so I needed a platform that had a bunch of scalable machine learning algorithms behind the same interface. The only packages that existed for that at the time were R packages, and I made some effort to make them more scalable, but eventually opted for this route. Then I was going to go over either an R or a Python Jupyter notebook, because I have both with the same examples, but since the audience is so even, I don't know, maybe you guys can fight and then we'll decide which one. They're both on GitHub, so if you want to look over one while I'm going over the other, you can do that.

Okay, a little more about the company and what we do. It's been around since 2012, and I think the first time I used H2O was maybe 2013. My first experience with it was a big benchmarking project comparing the random forest implementations in scikit-learn and R against a proprietary one called WiseRF, from wise.io, Josh Bloom's company, which many of you may know.
I used to work there as well. The fourth implementation we benchmarked was H2O. WiseRF actually did the best, but H2O was not far behind, and both of them blew the native R and Python implementations out of the water.

So H2O is the name of the company, and it's also the name of the software; I'll say more about the software on the next slides. I also wanted to mention our Scientific Advisory Council. They're from Stanford, so that's a little controversial here, but if you're a machine learning person you can probably move beyond that, because Trevor Hastie and Rob Tibshirani are both very well known in the machine learning community, and they're more on the statistics side of things; Stephen Boyd is more on the CS side, but he's also a big optimization and machine learning guy. Anyway, these are the people we work with, and a lot of our algorithms are ones they came up with; they helped us figure out how to implement them in a distributed fashion.

Okay, so I think the right way to think about H2O is that it's a platform, not just a machine learning library. It's actually a set of machine learning libraries in different languages, but the core piece of software is written in Java, and all the other APIs communicate with that main Java library. It's not just machine learning, though; the reason I call it a platform is that it has a distributed data frame structure as well. The way the algorithms work is that H2O has something called an H2OFrame, which is a distributed data frame: different rows of a dataset live on different machines, and the algorithms work on chunks of the data at a time.

This is just a summary slide. One other thing I didn't mention is that it's a platform in the sense that Spark is a platform and Hadoop is a platform. When H2O was originally developed, only four years ago, Hadoop was really popular, and now Spark is probably much more popular. So originally it was meant to be a machine learning library for Hadoop, and now we've rewritten it so that it also works with Spark, with tight integration between the two systems. The idea is that you shouldn't have to move your data into some other place to do machine learning when it already lives in the Hadoop file system or in Spark. Moving data around is a big problem with big data, because it takes a long time, and writing back and forth to disk is slow.

Here's the full list of the APIs we have: R, Python, Scala. Java isn't listed up there, but the library itself is in Java. And REST with JSON is actually how the H2O Java library communicates with all the other APIs. Basically, when you type a command in R that says train this model on this dataset, it translates that into a REST command and sends it over TCP to the H2O cluster running in Java; the cluster does what it needs to do and then communicates the results back. The data is never sent back and forth between R or Python and the cluster; it just lives in the cluster running on Java. And because we have this REST API, we can also put a front end on it, so there's a web interface as well.
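To make the REST point concrete, here's a minimal sketch of talking to the cluster directly over HTTP. This isn't from the talk, and the exact endpoint and field names are an assumption (the H2O-3 REST API exposes cluster and frame endpoints under a versioned path such as /3/), but it shows that the language bindings are just thin clients:

```python
# Hypothetical sketch: poke the H2O REST API directly, the same way
# the R and Python packages do under the hood.
import requests

# /3/Cloud is assumed here to be the cluster-status endpoint.
resp = requests.get("http://localhost:54321/3/Cloud")
status = resp.json()

# Print a couple of (assumed) fields about the running cluster.
print(status.get("cloud_name"), status.get("cloud_size"))
```

The R and Python packages are doing essentially this on every call: serialize your command into a REST request, let the Java cluster do the work, and parse the JSON that comes back.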
That's also something worth noting: it's one of the only platforms that has a GUI but also has the more hardcore interfaces. A lot of tools are only a GUI, or only on the cloud, or something like that, so we have users at all different skill levels.

This is the philosophy of H2O: you shouldn't have to sample your data to do machine learning. One of the first approaches to machine learning on Hadoop was a project called Mahout, and the way that worked, for their random forest algorithm, for example, was that instead of training a tree on the full dataset, it would train on whatever data was local to each node. You didn't really have full access to the data in the way H2O does, so you were doing machine learning on big data, but you were effectively sampling, and the whole point of big data is to be able to use all of it. The other thing is that all the computation is in memory, so it's fast. A lot of machine learning libraries work that way, so that's not super unique, but it's all in memory. Another philosophy we have is covering the most cutting-edge algorithms and iterating on our software quickly. We do a release every two weeks or so, and usually there's a lot of new stuff, so it's a fast-moving target, which is actually one thing people sometimes don't like about it: things change so fast that if you blink, things might not be the same.

This is a list of the algorithms we have right now. There are some that are off this list, but we cover the main popular machine learning algorithms: Random Forest, GBM (Gradient Boosting Machine), GLM. We also have a multi-layer, feed-forward artificial neural network, which is one particular type of deep neural net. There's a whole bunch of different types of deep learning algorithms, and we have one implemented right now, but that's something we're going to expand. On that topic, all of our algorithms are written for CPUs, and a lot of the deep learning algorithms that exist right now, like CNNs (convolutional neural networks) or RNNs (recurrent neural networks), work better on GPUs. So before we expand our deep learning coverage, we'll be implementing some new functionality that will let you use H2O on GPUs, hopefully this summer. A lot of people ask us for particular deep neural nets, and usually we say: we have this one, and it works really well if your data looks like normal tabular data. Certain types of input data, like images or video, are good with CNNs or other specialized types of neural nets; if you have normal data, that's where we do well.

Okay, and now a little about the architecture of H2O. There are two concepts; I might skip the middle one. The H2O cluster is just a name for a Java process running on many nodes at once, so you should think of it as a process running on machines rather than as the physical cluster of machines itself.
It does run on top of a physical cluster, but the H2O cluster itself is just a concept. Then there's the H2OFrame, which, as I mentioned, is a distributed data frame, with the data spread across the nodes. The one requirement is that each node must be able to see the entire dataset, and what I mean by that is that there's something called a distributed key-value store, which is essentially a distributed hash table, and it keeps a record of where everything is; that's how data is accounted for in the H2O environment.

A popular way to use H2O is on Amazon EC2. That's kind of what it was designed for: cheap commodity clusters that you either already have or can get cheaply, as opposed to very expensive GPU clusters. If you're at Berkeley, you probably have access to a cluster here, so that could be another option, but we have a lot of instructions for using it on EC2 if that interests you. Here's a picture of what a distributed H2OFrame looks like and how you should picture it: each of these little sections labeled Java JVM, one heap, is a node, so this would be a four-node cluster, possibly with many cores on each node.

Here's the link to the GitHub repository; I've mentioned that this is all open source. If you're savvy about open source licenses: this is licensed as Apache 2.0, which is basically what businesses want to see. There are the GPL licenses, which follow the copyleft philosophy, and then there are MIT, BSD, and Apache, which are more liberal; this is the most liberal of those. There are actually other companies that take H2O and, well, steal isn't the right word, but take it, build a whole bunch of other stuff on top of it, and repackage it under some other name. I just saw one yesterday, from Sri Lanka or somewhere, that's like H2O but not H2O, Bizarro H2O, which is good, because that's the whole point, I guess.

Okay, so now let's talk about how you use this from R or Python, because that's generally how most of our users interact with H2O. It's meant to be accessible to people who are not well versed in distributed computing; the idea is that you shouldn't have to learn every single skill in the big data world to be able to do machine learning. It functions just like a normal R or Python library. I'll alternate between R and Python as we go through the next set of slides.

The R package basically just needs R and Java, it runs on all the major operating systems, and you can install it from CRAN, although we always recommend installing from our website because that always has the latest version. CRAN doesn't like it if you update your packages too frequently, so sometimes the CRAN version is a release behind our stable release. Of course, that probably doesn't matter to most people, so if you're comfortable installing from CRAN in RStudio, you can just go ahead and do that.
I kind of mentioned this before, but the whole point is that no computation is performed in R. You can munge your data in R, do whatever you need to do, and then push that data frame into H2O, but you can also skip that whole step and load from CSV, or wherever your data is, straight into the H2O cluster. You can then do a lot of the typical data frame operations in both R and Python. The idea is not that you do all the munging in R, load a giant data frame into R, and then do the machine learning; the whole idea is to extend the pipeline as far as possible, so you should be able to do most of your data science work inside H2O.

Basically, we've overloaded all the data frame functions for H2O. Anything you would typically do, subsetting rows, taking the mean of a column, whatever it is, you can do with the same syntax on the H2OFrame. The same goes for pandas in Python: if you have an H2OFrame in Python, all the pandas syntax should be pretty much covered, and if you run into something that's missing, just let us know and we'll add it, but I think we have most of the functionality covered.

If you go to this page on our website, it will always have the latest stable release, and this install.packages line is really the only one you need; the other lines just overtly install all the dependencies and remove the H2O package if it's already there. The Python module is the same kind of deal: it just needs Java and Python, both Python 2 and 3 are supported, there are a couple of Python module dependencies you'll need, and it works on all the operating systems. You can download it either from PyPI or from our website; that's the link for Python. I think this screenshot is a little old, so there are two more dependencies now: I think the futures module is on there from when we added Python 3 support, and another module called six.

Okay, so now back to R. The only extra task you have to do when using H2O is start up an H2O cluster. You can do that from R, or you can do it outside of R and then connect to it once you start R. You do all of that with the h2o.init command, and it has a lot of options. You can specify the maximum memory size; if you know Java, that's the -Xmx flag, which sets the max memory for your cluster. In this case I'm saying make an eight-gigabyte cluster, because I'm running this on my laptop, and nthreads = -1 means use all the cores. That's another thing we had to do because of CRAN: CRAN won't let you ship software that by default starts more than two cores, I don't know why, so by default it will only start a two-core cluster unless you set nthreads = -1. There are also two other arguments that might be relevant: an IP address, if you're not running this on your local machine, and a port. By default it will always start the H2O cluster on the machine where you started it from. This one is just running on my laptop: it says successfully connected to http://localhost:54321, so that's localhost and the default port, 54321.
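For reference, the Python call takes the same options. A minimal sketch, with parameter names as in the h2o Python module and the values just examples:

```python
import h2o

# Start (or connect to) an H2O cluster on this machine.
# max_mem_size maps to the JVM -Xmx setting; nthreads=-1 uses all cores.
h2o.init(ip="localhost", port=54321, max_mem_size="8G", nthreads=-1)
```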
Okay, I'm going to go over all these functions when we do the Jupyter notebook, but just to introduce you slowly, we'll put them on the slides first, so as we're going through the notebook you'll have seen them once before.

Here's the general way to load data in R. We've started our H2O cluster using h2o.init. This second function, h2o.importFile, is the one I mostly use: I'm developing the software, so most of the data I deal with day to day is example CSV files for testing things, not a terabyte of data, so usually I'm just loading from CSV. h2o.importFile will read your file into the H2O cluster in parallel, without doing anything in R; you're just typing commands in R. The other way, as I mentioned, is if you already have a data frame in R, and your cluster is big enough that processing that data frame in R is not a big deal, you can use the as.h2o function, which basically makes a copy of the data frame from R and pushes it into H2O. The copy will actually be much smaller in H2O because it's all compressed and very efficiently stored.

Here's a simple example of training a GBM, a gradient boosting machine. The basic H2O interface that all the different algorithm functions share has an x, a y, and a training_frame argument. Going back to the slide, the train object in R that we pass as training_frame is really just a pointer to where that frame exists in the H2O cluster; it's just metadata, really, but it's how R knows where things are and what they're called. y is the name of the response variable, whatever that is in your dataset, and x is a vector of your predictor names; you can specify these either by column name or by index. That's a little different from the standard R machine learning conventions, where x is usually the data frame itself and y might be just a vector.

Then there's h2o.predict, and this is the same across all the different methods: whether you trained a GBM model or a GLM model, there's always h2o.predict, and it will give you predictions. Although, I'll explain, that's not actually what I typically do. I don't usually get the predictions, bring them back into R, and look at them; I give the model a test frame and have H2O calculate all the performance metrics efficiently for me, so MSE, AUC, R squared, log loss, all the normal ones you'd think of. However, if you did want your own custom performance function, you could get the predictions, bring them into R, and calculate your custom metric with an R function.
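Here's roughly that same workflow in Python, as a hedged sketch; the file path and column names are placeholders, not from the talk:

```python
import h2o
import pandas as pd
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()

# Parallel import straight into the cluster; nothing is loaded client-side.
# The returned object is just a handle pointing at the frame's key.
train = h2o.import_file("path/to/train.csv")  # placeholder path

# Or push an existing client-side data frame into the cluster,
# the Python analog of R's as.h2o().
pdf = pd.DataFrame({"x1": [1, 2, 3], "y": [0, 1, 0]})
small = h2o.H2OFrame(pdf)

y = "response"                              # hypothetical response column
x = [c for c in train.columns if c != y]    # predictors, by name

model = H2OGradientBoostingEstimator(ntrees=100, seed=1)
model.train(x=x, y=y, training_frame=train)

preds = model.predict(train)  # predictions come back as an H2OFrame
```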
Okay, there's also some basic plotting we have, and I'll talk a little about the GUI as well. When you start an H2O cluster, it starts a web server, and if you browse to localhost, port 54321, it brings up this thing we call Flow, the browser GUI, where you can click to upload a dataset and click to train a model. Once you've trained models, it spits out a bunch of plots automatically. Scoring history is just your model being scored as you iterate through the training process; if you're training a GBM, for example, a natural way to step through training is by number of trees, so by default for GBM it will plot the number of trees against the MSE. You get all these plots for free if you go over to the GUI, so a lot of times I'll train a model in R and then go over to the GUI and look at all the plots, because they're already there automatically.

Okay, another thing you might want to do in a machine learning problem is a grid search, and this is the syntax for doing a grid search in R. There's a function called h2o.grid; you type in the name of the algorithm you'd like to use, this is a deep learning example, and then pass a list of hyperparameters. The first one here, hidden_opt, is a list of different hidden-layer architectures for the neural network: c(200, 200) means there's the input layer of the neural network, then 200 neurons, another 200 neurons, and then the output layer. So that's just another model parameter specific to deep learning. The L1 penalty is another thing you might want to grid over, and that's how you do that.

Okay, then Python; we'll quickly go through the same things. This is how you start up an H2O cluster in Python: import the h2o library and call h2o.init. By default Python will actually start up on all of your cores, since we don't have the CRAN restriction, so that's a little better. And sorry, this is what it looks like if you don't have a cluster running: it looks for one on localhost, and if it doesn't find one it starts its own; if you already have a cluster running, it just prints out some metadata about the cluster, like that.

Okay, so how many of you are familiar with scikit-learn? Okay, a lot of you. Our Python library is meant to be easy to use if you know scikit-learn; it has the same sort of Pythonic way of doing things. The first iteration of our Python API was essentially a clone of the R API, where you'd type h2o.gbm and so on, but we changed it because it made more sense to do it this way. Now you import whatever estimator you want, in this case the GBM estimator, and to train, you first set up the model by defining the different model parameters, so here's an example of some different parameters. When I run the command that says model = ..., that basically doesn't do anything; it just sets up the model. Then there's a .train method where you pass in the non-model parameters, the data parameters, and it builds the model.

One thing I kind of like, and sometimes people complain about this, which is interesting, is the progress bar, so you can see exactly how long your model is going to take to run. There are a lot of people who email and say, how do I get rid of the progress bar, I hate it. I like it because it gives you some sense of what you're getting yourself into. The model output is similar to the information that comes up when you go over to the GUI.
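In code, the two-step estimator pattern looks roughly like this; a sketch that assumes the x, y, train, and valid objects from the earlier example, with made-up parameter values:

```python
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Step 1: construct the estimator with the model parameters.
# Nothing runs yet; this just records the configuration.
model = H2OGradientBoostingEstimator(
    ntrees=500,
    max_depth=6,
    learn_rate=0.1,
    distribution="bernoulli",  # binary classification
    seed=1,
)

# Step 2: .train() takes the data parameters and actually builds the
# model on the cluster (this is where the progress bar shows up).
model.train(x=x, y=y, training_frame=train, validation_frame=valid)
```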
It will automatically calculate all these things you might want to know about your model, and you can also access them in the R and Python interfaces. If you just print the model, this actually goes all the way down, there's more output, but this is just a snapshot: these are the performance metrics it calculates for a binomial model, so MSE, R squared, log loss, AUC, and Gini. It will also calculate a threshold: for a binary classification model it gives you the probabilities, but then it will also choose the threshold between zero and one that maximizes your F1 score. That's the default, and you can change it. So when you train a model you get both the probability that the outcome is one and the predicted label based on this default threshold.

Okay, I think I have a little bit of time to mention this project, since there are some biostat people in the audience who might be familiar with it. H2O Ensemble is a project that implements the Super Learner algorithm, which pretty much anywhere outside of Berkeley is called stacking, so either stacking or super learning. It's basically an ensemble machine learning algorithm that takes different algorithms, like a GLM and a GBM and a random forest, and finds the optimal way to blend them together. It does that with a secondary learning step: you train all the base models, and then you train what's called a metalearner that figures out the optimal combination. I won't spend a lot of time explaining how that works, but there's a lot of information about it on my website; there'll be a link at the end of the talk if you want to learn more. Basically, H2O Ensemble is implemented for regression and binary classification, and the two things I'm working on right now are the Python version, since this is currently R-only, and support for multi-class classification. So it's essentially complete in the R version, and the Python version is a work in progress.

If you're curious why you'd use ensembles: usually you use them when you care about performance over everything else. If you want the best possible model you could possibly have, that's probably going to be an ensemble, and of course they take longer to train, because you have to train all of the models instead of just one. So that's the use case for ensembles. Here's a link to the package; it's in the H2O repository under the h2o-r folder, slash ensemble. And this is kind of the home page for the project; if you go there, there's a bunch of information below that will tell you what the Super Learner is, give you links to trainings and slides and things like that, and show you how to use the package.

Okay, then I guess I'll just talk a little about the interface for a second. This is basically the setup with a Super Learner: you have to specify which base learners you want in the ensemble, so your job as a user is to come up with a bunch of different models to throw in there, and you have to specify what the metalearner function is. These entries right here, you can see they're just strings, but what the strings represent are particular models: the first one is a random forest with a particular set of model parameters, maybe a random forest with 100 trees and a max depth of five or whatnot.
The deep learning one and deep learning two entries would be other versions of deep learning, maybe one with one type of activation function and one with a different type. The goal is to get a diverse set of base learners into your library, and that's typically how you're going to get the best performance. You don't want to just do a big grid search, find the best four models, and take those together; you want a diverse representation of different models with different strengths. Even GLM, which is typically never the best-performing model in the group, is often good to have in there. You might think, this one algorithm is so much worse than the others, I'm just going to take it out, and sometimes that actually hurts your performance. It depends on your data and whether or not GLM can see things the other algorithms can't see.

Then this is the interface. The first line looks exactly like all the other H2O algorithms, and the family argument just describes the task you're trying to do: binomial if you're doing binary classification, or gaussian for regression. And it has a predict function as well.

Okay, so now we're going to do a demo. Any votes on Python versus R? Yeah? Okay, I've made this a problem for myself, I guess. Wait, first let me tell you about the data, and I'll delay deciding for a second. The goal in this demo: it's an EEG dataset with a binary outcome, and the task is to predict whether your eyes are open or closed based on EEG signals from your brain, using this thing you can kind of see in the picture. It's hard to see, but it's a plastic spider-looking thing that goes on your head and measures your brain signals, with 14 EEG channels placed on the head. I didn't create this dataset myself; I downloaded it from the UCI Machine Learning Repository, so somebody else made it. And this slide is just here to remind me to mention Flow, the GUI, when I do the demo.

Here are the two links that will send you right to the location of the notebooks, which are on GitHub. If you want to follow along, you could just view them on the web, or if you want to actually run them yourself, I'll show you real quick: if you go to the H2O GitHub, it'll take you a second to set it up, but you would clone the h2o-3 repo. Each notebook is inside the different subfolders: h2o-py has all the Python material, h2o-r has all the R material, and there's a folder called demos inside each, and they're in there. So to run it on your machine, you'd first clone the repo. I know you guys are pretty much experts at Jupyter notebooks, but I'll just show you how this works: if you type in jupyter notebook, it brings up the browser in that directory. There's actually only one Jupyter R notebook, which I made, but there are a bunch of Python notebooks; I hope to make some more R versions. So anyway, that's how you get there.

Okay, I feel like I should overcompensate for my R tendencies, so maybe I'll do Python, and the R people can follow along; it's exactly the same notebook, just with different commands.
Okay, so let me actually make this a little bigger. Oh, this is the R one, sorry; actually, I think I already have the Python one up. Okay, there we go. I might be able to go between both. This first part just shows you how to install H2O, and I'm going to assume you've already done that, even though I'm sure none of you have, but if you'd like to follow along, you can; this is how to install. We've already shown how to start up the H2O cluster: import h2o and then h2o.init. I think it's already running, so we'll do that.

Okay, so then to download this data: it's not a huge file, so I think you should be able to do it over the wifi, though it might take a second; I also have it locally if there's a problem with the wifi. Okay, that worked. This demo just uses a CSV file; I mentioned you can use data that's in Hadoop or Spark or other places like S3, but CSVs are easy for a demo, so I'm going to do that. We call h2o.import_file, and then we have functions similar to pandas or numpy: data.shape gives you the dimensions of the frame, around 14,000 rows and 16 columns. Because I've already run this notebook before, it sometimes doesn't look like I'm doing anything, but this is just printing out the head again.

In this file, these columns are the 14 EEG locations. If there are any neuroscience people in here, anyone? These actually represent locations on your head. The eyeDetection column is just a zero or one, whether or not your eyes are closed. And then I added this column called split, because it makes it easy to test this dataset on a whole bunch of other platforms as well; I randomly partitioned the data into a training set, a validation set, and a test set. As I'm going through this I'll explain the difference between a validation set and a test set, because those words are sometimes used to mean different things, or the same thing, but they actually do mean different things. The way we'll subset this dataset is just by looking at that split column.

Okay, this next series of steps just shows you how to work with a data frame in H2O; again, it has all the Python pandas-style operations. data.columns will show you the column names. This is how you subset, can you guys see that? Maybe I should make it a little bigger. If you want to subset by columns, you make a list with the column names and do data, then bracket, columns, and if you want to look at the head, you just call head. This shows you those three columns; if you want one column, it's the same idea. I'm going to skip over the next couple of things so we don't run out of time, but just so you know, how to calculate the unique values in a column, how to look at the levels of a categorical column, all of that is here.
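As a hedged sketch of those frame operations, assuming a frame named data imported from the EEG CSV (AF3, F7, F3, and eyeDetection are column names in that dataset):

```python
# pandas-style operations on the H2OFrame
print(data.shape)    # (n_rows, n_cols)
print(data.columns)  # list of column names

# Subset several columns by name and look at the first rows.
channels = ["AF3", "F7", "F3"]
print(data[channels].head())

# Single column works the same way; unique values of the outcome.
print(data["eyeDetection"].unique())
```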
One thing I will show you, in both this notebook and the R notebook, is something you do have to do in H2O, for efficiency reasons. When H2O loads in the data, it looks at a little snapshot of the data and guesses what each column is: is it numeric, and if so what kind of numeric, or is it categorical, or something like that. Then it loads the whole thing in. In this dataset, the outcome is encoded as zeros and ones, so it will guess that the column is numeric, because it has to make the most conservative guess about what it is; unless you tell it explicitly that a column is supposed to be categorical, it will just call it numeric. A lot of the reason H2O is fast is that we use efficient representations of the data, so if something is just zeros and ones, you want to encode it in the smallest way possible. Calling .asfactor() on the column converts it into what's called an enum in Java, or a factor in R. What is it called in Python? Is there a name for a categorical variable in pandas, does anyone know? Okay, maybe there isn't a standard name, but the R users will know what a factor is: a factor is just a categorical variable.

Usually R people don't like factors, because you load in a data frame, and by default it's stringsAsFactors = TRUE, and you think you have all this text, and it's all factors and you don't want them. But in machine learning it actually is beneficial to know that something is a factor, a categorical variable, because of what the algorithms do with categoricals under the hood. Some algorithms have to expand a categorical column: if there are three categories in your factor column, it balloons out into two or three columns that are just binary indicators, was this category A, yes or no, was this category B, yes or no, et cetera. GLM and deep learning do that, and that's not just an H2O thing; that's how those algorithms have to work. Other algorithms, like GBM and random forest, can actually handle categoricals more efficiently, so they don't balloon them out like that. Handling categoricals is not usually done for you by machine learning packages; the name for this process is one-hot encoding, taking a categorical and expanding it into the binary representation, and a lot of packages make you do that yourself. So it's something to think about.

Let me go over to the R version and show you; going down, this is what we just saw, and this is where we are over here, just making sure that the outcome is a factor. Actually, if it's numeric, it will try to do regression instead of classification, so it's important to get this right. It's kind of there to force people to represent their data efficiently, but I do have some grievances about it, because a lot of other machine learning packages just do that for you and let you not think about it. We want you to think about it, but I don't know if I agree 100% with that approach, because a lot of people ask us: how did I get this regression model, I have a binary outcome. So it might be more of a user interface thing that we change in the future. Anyway, I've spent way too much time on that topic.
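In code, the conversion is one line; a sketch using the eyeDetection column from this dataset:

```python
# The 0/1 outcome is parsed as numeric, so convert it to a factor (enum)
# before training; otherwise H2O will fit a regression instead of a
# binary classification model.
data["eyeDetection"] = data["eyeDetection"].asfactor()

# Sanity checks: is it a factor now, and what are its levels?
print(data["eyeDetection"].isfactor())  # [True]
print(data["eyeDetection"].levels())    # [['0', '1']]
```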
So we'll go on and train a model. All of this is just showing you that things work in H2O exactly how they work in pandas. The thing I said I was going to explain is the difference between a validation set and a test set. A test set is something you only use once, to evaluate, or really to estimate, model performance. A validation set is used in a couple of different ways. One, it can be used to tune a model: if you have some parameter you need to choose, say the depth of the trees in a GBM, you can try a bunch of different depths, train a bunch of models, evaluate them all on the validation set, and choose the one that looks best. But then you throw away that validation set, because you've already learned from it. Just as you don't test your model on your training data, you don't test your model on your validation data either; you need a clean dataset, the test set, that you've never used before, to get an honest estimate of performance. Not everybody always does that. If you use a test set to choose your model parameters and then evaluate on it, you're essentially overfitting your model, and sometimes that's not the worst thing in the world if you're okay with it, but the proper way is to have a train, validation, and test set. There's also a whole other approach, cross-validation, which you can do as well; I'll show you that next. But let's train a model, finally.

Okay, I've run these. Actually, yeah, I need to specify what x is. I'm making a copy of the list train.columns, which holds the column names, and then dropping the last two entries, because we're not going to use those for prediction. So now x is just a list of the 14 EEG locations, and y we set in a cell up above; that's the eyeDetection column. This will train a model.

Okay, that's done, and now we can inspect the model. Before, I just gave you a screenshot of what this looks like, but now we'll go through it. The first thing reported is the training metrics, which I don't usually care about; I kind of skip over those and go down to, actually, let me take a look here. Oh yeah, I passed in a validation frame, so it also calculates validation metrics, and I might look at those and think, that's an okay estimate of model performance, because with GBM here I just trained one model with one set of parameters, so in this case the validation set is functioning like a test set: we haven't used it for anything other than calculating these validation metrics.
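Pieced together, that part of the demo looks roughly like this; a sketch assuming the split column and eyeDetection outcome described above (the split labels "train"/"valid"/"test" are an assumption about the demo file):

```python
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Row-subset by the pre-assigned split column.
train = data[data["split"] == "train", :]
valid = data[data["split"] == "valid", :]
test  = data[data["split"] == "test", :]

y = "eyeDetection"
x = [c for c in data.columns if c not in (y, "split")]  # the 14 EEG channels

model = H2OGradientBoostingEstimator(ntrees=100, seed=1)
model.train(x=x, y=y, training_frame=train, validation_frame=valid)

print(model)  # training and validation metrics
```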
However, GBM is one of the algorithms that, by default, won't use a validation set to automatically tune anything, unless you turn on something we have called early stopping. If you say, I want 3,000 trees, but stop when the error stops going down, then if you pass in a validation set, it will use it to decide when to stop, and maybe it stops at 570 trees because that's good enough. So a lot of times, if you turn on certain parameters like early stopping, the validation set gets used to train the model, and then it's no longer clean; you have to throw it away. And that's not H2O-specific, well, the early stopping feature is, but the concepts are not; it's just machine learning best practice.

It prints out a bunch of stuff, which you may or may not care about, and then we can calculate model performance on a test set. We can do two things: we can generate predictions on the test set and look at them, maybe do something with them, or we can just calculate the performance, which first gets the predictions and then computes all the performance metrics. That's usually the function I use: model.model_performance, where model is what we called our GBM model. It takes a new data frame, which is called test here, and returns this performance object. There are binomial, multinomial, and regression model metrics objects, and they give you different things; this was a binomial model, so it gives us the binomial metrics. If you want to pull out specific values, you use the accessor methods: .r2(), .auc(), .mse(), and so on.

Let's see how we are on time, okay, I'll try to hurry up. The other thing you might want to do is cross-validation. You can copy exactly what we had before and just add nfolds = 5, and what it does is train a regular model like before, but then also do five-fold cross-validation: it trains five more models, evaluates their performance, and gives you a cross-validated performance metric. This obviously takes a little longer, but it does the folds in parallel, so it's pretty fast. So that's done, and then you can access different things; this is how you get at the cross-validation metrics. By default it returns the training metrics, because it always calculates those, which is not usually what you want, but it's the default.

I showed you in R what the grid search looked like, and I won't run this because I want to leave time for questions, but this is basically how you set up a grid search in Python. Here's the number of trees we're going to grid over, which is actually a very silly thing to grid over, but it's here for demonstration purposes, plus max depth and maybe learn rate. There's a class called H2OGridSearch, which is similar to the R version, h2o.grid: you pass along the hyperparameters and say which estimator you want to grid over, the gradient boosting estimator, and it's that simple. It runs the grid, and since the point of a grid search is to try out a bunch of different sets of model parameters and pick the best ones, this is another use for the validation set rather than the test set; the validation set is good for grid search, basically. I'm not going to run this, but it gives you back all the models, and you can sort by a metric. Here I said sort by AUC, so this one has 0.96 AUC, and if we scroll down to the very bottom, there are some really bad versions, like 0.67. This is why it's good to do a grid search: you never really know beforehand what a good learning rate is, what a good max depth is; there are a lot of different parameters you can work with.

Then this is how to piece out what the best model is. In this AUC table that I already sorted above, which I called auc_table, you go into the model ID column and grab the first entry, index 0. This is where the key-value store I mentioned comes in: the data frames and the models are all stored by keys in the key-value store, so this is just the name of the model, which is a key, and to get that model from H2O and make an instance of it in Python or R, you use h2o.get_model. Then if you want to look at certain things, it's best_model.auc(), et cetera. So that's the rest of that. I'll just scroll through the R version and let you see that it looks very similar: the same thing we did, h2o.gbm with nfolds added, this is getting the AUC, this looks similar as well, h2o.grid; the printing looks a little different in a Jupyter notebook between R and Python, but it's the same stuff.
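Here's a compact sketch of those last two steps, cross-validation and a small GBM grid, with made-up grid values and the x, y, train, and valid objects from above:

```python
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

# Same model as before, plus five-fold cross-validation.
cv_model = H2OGradientBoostingEstimator(ntrees=100, nfolds=5, seed=1)
cv_model.train(x=x, y=y, training_frame=train)
print(cv_model.auc(xval=True))  # cross-validated AUC, not the training default

# A small grid over GBM hyperparameters.
hyper_params = {"max_depth": [3, 5, 9], "learn_rate": [0.01, 0.1]}
grid = H2OGridSearch(model=H2OGradientBoostingEstimator,
                     hyper_params=hyper_params)
grid.train(x=x, y=y, training_frame=train, validation_frame=valid)

# Sort the grid by AUC and pull the best model out of the key-value
# store by its model ID (a key).
sorted_grid = grid.get_grid(sort_by="auc", decreasing=True)
best_model = h2o.get_model(sorted_grid.model_ids[0])
print(best_model.auc(valid=True))
```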
Okay, let's see, I think I have one more slide. Oh yeah, these are just the links to all the different places where you can learn more about H2O. Oh, and I did mention I would show you Flow, so let me do one thing before I stop. Since H2O is already running, this is what Flow looks like. If I say get models, here are all the models we just trained; these are the default generated names, and you can name them things that are easier to read. If I click on one of them, it shows some basic output: the ROC curve, where you can change the threshold and it will recalculate the confusion matrix from there; the validation metrics, which are usually the ones you care about; variable importances for GBM and random forest, which show you how important each feature is; and some other things, like gains and lift tables, and a whole bunch of other things you can click through. Here's the data section, if you wanted to upload a new file, and the modeling section with all the models you can train from here. Then there are other things: you can download logs and such if anything breaks, and there's this thing called the water meter, which only works on Linux, that shows you all the cores being used, so if you have a huge cluster you can see when your cores are actually active. There are other Linux tools that will do this, but this one is built right into H2O. There's cluster status, and I think some quick links to documentation, et cetera.

People sometimes compare Flow to a Jupyter notebook because it looks kind of similar, but it's actually written in CoffeeScript, I don't know why, and you can't execute arbitrary R or Python commands in it, so that's how it's different from a Jupyter notebook. People have asked us whether we might merge this with Jupyter notebooks, or try to somehow integrate the two.
I don't know whether we're going to do that or not, but since this is a good place to ask people about Jupyter notebooks, if anyone wants to talk to me about that further, feel free. Flow is really meant for people who don't write code, but there are a lot of coders who use it and want to be able to interactively write the R code and have things come up, so we've talked about merging those together in some sense, but I don't think we have that planned.

So, this is where to learn more. We do a lot of talks and meetups, so we have a lot of recorded video presentations on all of our different algorithms, and we have a lot of applied talks: customers, or people just in the community, who want to talk about how they're using H2O and go to a meetup and talk about it. We're very involved in the community, the meetup community especially, so those are the links. And if you want to learn more about how we built things, and maybe I didn't mention this, I'm not sure: all of our algorithms are completely written from scratch, and they're distributed. That's part of what H2O is for, rewriting these algorithms from scratch so that they work in a distributed fashion. A lot of the details about how we did that are in these little booklets; at conferences we print them out and hand them to people, but the PDF versions can also be found at that link. They go into a lot of detail about how we did the distributed implementations, how our deep learning works, and so on.

Okay, and this is how you can get a hold of me if you want to talk about H2O, or machine learning, or anything else, really. And if you want to learn more about stacking, or super learning, I have some links on my website, which is still on stat.berkeley, because I'm not a grown-up, I don't have a real website yet; maybe I'll get one this year, seems like the right time. Okay, so I guess we should stop there, and then we'll do the questions.