But a better, more descriptive title might be this: I'm going to talk about Brushfire, which is a library I've been working on at Stripe for distributed, generic decision-tree learning, currently using Hadoop as the underlying platform. I haven't open-sourced it yet; I will over the next month or so.

Everyone knows that talks about Hadoop are really about counting things. Inevitably, tutorials about Hadoop are about counting things too. But I believe it's often more useful to think of them as being about adding things. So we're going to talk about counting, but we're really going to talk about adding, and in particular we're going to talk about trees: decision trees.

The first thing you need to know about a decision tree is that it's for making predictions. The basic interface to a tree in Brushfire, at least conceptually, is that you give it a vector of features: in this case a map from strings, which represent the names of the features (people also call these properties, attributes, whatever), to values. Generically, we'll just think of the values as being some type V. Based on the features you tell it about, you get back a prediction, which is of type T.

It will help to make this a little more concrete. Let's say we're trying to predict: do you like cookies? You might have a number of observations, each a set of features: here is something which is blue and furry and about yea high, and it likes cookies; here is something which is small and green, and it likes cookies; and so on. And you want a prediction back.

You might have a very simple tree like this to answer that question. This is primarily a talk about learning and building these trees, but first let's talk about how you evaluate observations using a tree once you have one. Say you have some observation, Cookie Monster, and you want to know: does it like cookies? You start at the root. Every interior node in the tree has some kind of predicate that it applies to the features. In this case, let's take the color feature and ask: is it blue? If it is blue, we walk the tree this way; if it isn't, we walk the tree that way. Cookie Monster is blue, so we follow that edge, and now we're in a leaf node. The predictions live in the leaves, so we look at the prediction in this particular leaf, and yes, Cookie Monster does like cookies. Of course, we could have gone down the other path, and that might not have been a leaf node; there might have been more to the tree. In that case we recurse: there's some other question we ask, say, do you wear stripes? If you do, go this way; if you don't, go that way. Again we come to new leaves, and those leaves have predictions of their own.
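To make that concrete, here is a minimal sketch of the idea in Scala. The types and names here are illustrative only, not Brushfire's actual API:

```scala
// A conceptual sketch: a tree turns a map of named feature values (V)
// into a prediction (T).
sealed trait Tree[V, T]
case class Leaf[V, T](prediction: T) extends Tree[V, T]
case class Node[V, T](
    feature: String,         // which feature this node inspects
    predicate: V => Boolean, // e.g. "is the color blue?"
    ifTrue: Tree[V, T],      // walk this way when the predicate holds
    ifFalse: Tree[V, T]) extends Tree[V, T]

object Tree {
  // Walk from the root down to a leaf, then return its prediction.
  def evaluate[V, T](tree: Tree[V, T], features: Map[String, V]): T =
    tree match {
      case Leaf(prediction) => prediction
      case Node(f, pred, ifTrue, ifFalse) =>
        if (pred(features(f))) evaluate(ifTrue, features)
        else evaluate(ifFalse, features)
    }
}
```

With a root node whose predicate is "is the color blue?", evaluating `Map("color" -> "blue", "furry" -> "yes")` walks down the blue branch and returns whatever prediction lives in that leaf.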
We said before that those predictions are of some generic type T. People also call the predictions targets, which is why I use T. Let's talk a little about what concrete types we might actually use there, because there are different kinds of predictions we might want to make. You might think about what's called a binary classifier, which is what we've been talking about: it tries to answer questions like, do you like cookies? You might also think about a regression, something that tries to answer questions like, how many cookies will you eat? In the binary classifier case you might have a Boolean as your prediction type; in a regression you might have an integer or a double or something like that. Sometimes you have what are called multi-class classifiers, where there are a number of different discrete answers, like, what is your favorite kind of cookie? There you might imagine having a string or something like that as your prediction type.

This is what I mean when I say that Brushfire is a generic decision-tree learning framework. Typically, if you look at the code that's out there for doing decision trees, there are a huge number of variations, a huge number of types of decision trees, and each implementation tends to pick one. You have questions like: do you split each node into just two children, or do you have multi-way splits? How many trees do you actually use to make your predictions, and how do you combine them? How do you allocate your training data to those trees to decide how to build them? And so on. Usually you pick one point in that space and say: this is my implementation, this is random forests, which implies a whole set of choices. The design goal with Brushfire, rather than making any particular choice, is to abstract out a generic way of thinking about this, so that each of these choices gets represented by some trait or type parameter that lets us build up whatever we want in this space of ways of doing decision trees.

Okay, so back to the target variable. We said you might build a binary classifier using a Boolean target, and that's actually a pretty naive binary classifier. You could certainly do that, but in general people don't just want back true or false; they want back a probability. How likely is it that Cookie Monster likes cookies? How likely is it that Kermit likes cookies? So then you might think: okay, we don't put a Boolean in there.
We put in a double, some value between zero and one. And that would be reasonable, but we have an additional design constraint on Brushfire, which is that we want it to be able to operate in a distributed way. That design constraint ends up putting an additional type constraint on T: we want our predictions to be something we can sum up, something where we can build a small piece on each of a number of nodes and then combine them. We use the Algebird library underneath (Scalaz also has this type class): there's a Monoid type class, which just means something that can be summed, something that is associative (and in practice commutative), has a zero, and has a plus operation. That lets us operate better in a distributed context. So our T needs to be a monoid.

In the case of a binary classifier, there's a very simple representation we can use: a map with a count for trues and a count for falses. That works, and it gives us probabilities out, because we can just look at the ratio, but it's also something that can be summed. So really you can think of this not as a single prediction, a single label, a single answer, but as a distribution of answers. For a binary classifier, the distribution is a very simple thing: it's just two numbers. In a regression, your distribution might be something more complicated.

This has some other nice properties. For example, if the predictions in the leaves can be summed up and we want to prune the tree, to get rid of some of the leaves, then we can just sum up the leaves, and now we have a prediction for the interior node that becomes our new leaf. Another way to think about this, by the way, is from a Bayesian perspective: we want predictions that are updatable. We want to be able to turn the crank and say, this was our old prediction, here is new information coming in, and update our prior to get a new probability distribution out. From that point of view, you might think of the simple map with a count of trues and a count of falses as representing something like a beta distribution. You might also want a Gaussian distribution, which is also summable, and so on.
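As a small sketch of that summable prediction type, using Algebird's map monoid; the `pTrue` helper here is my own illustration, not part of Brushfire:

```scala
import com.twitter.algebird.Monoid

object CookieCounts {
  // One observation that likes cookies, and one that doesn't:
  val obs1: Map[Boolean, Long] = Map(true -> 1L)
  val obs2: Map[Boolean, Long] = Map(false -> 1L)

  // Algebird's map monoid sums values key-by-key, so combining
  // observations is just monoid addition:
  val total: Map[Boolean, Long] = Monoid.plus(obs1, obs2)
  // => Map(true -> 1, false -> 1)

  // The summed counts give back a probability, not just a label:
  def pTrue(dist: Map[Boolean, Long]): Double = {
    val t = dist.getOrElse(true, 0L).toDouble
    t / (t + dist.getOrElse(false, 0L))
  }
}
```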
Okay, so I said we have this design constraint that it be distributed. Why is that important? One reason is that, because I already wanted it to be generic, it's very difficult to write a generic learner that is fast, that is well optimized, on a single node. You're paying a cost for the genericity; a lot of the reason people don't typically do this is that they want to pick exactly one approach and really optimize it. So if it's not going to be particularly fast on a single node, you want to be able to throw a lot of compute power at it, so that at least you're not waiting forever for your models. Another reason is simply one of scale: these models tend to be better if you can throw a large amount of data at them to infer the trees. And it's worth making the point that this isn't just about throwing in a large number of observations, although that's useful, but also about having a very large number of features in those observations. You don't just want a lot of rows, tens of millions or hundreds of millions of rows; you maybe want a million columns, and it's very difficult to do that kind of thing on a single node. The third reason is the "why do you rob banks? because that's where the money is" reason. Why do you want to distribute your decision-tree learner? Because the data is there: in practice, a lot of us have huge amounts of training data sitting in a Hadoop cluster somewhere, sitting on HDFS, and it's easier to bring the computation to the data than to bring the data to the computation.

So what does this look like concretely? This is a simple example of using Scalding, which is a framework for doing Hadoop work in Scala, to train one of these trees. Walking through it quickly: the first thing we do is read our training data and parse it into the instance objects that Brushfire expects, objects that represent the training data. The way you learn one of these trees is from a bunch of training data that has both the features and the prediction; you're trying to generalize from that so that you can make predictions later, when you have features and no prediction. So an instance has both the V and the T: the map of feature values and the target. We construct a trainer object with the training data, and we ask it to expand the tree.
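Concretely, an instance carries the feature map and the target, plus the id and timestamp fields that come up later for sampling and validation. Here is a sketch of that shape; the field names follow the talk, but the exact definition is illustrative rather than Brushfire's:

```scala
// Sketch of a training instance: features plus an observed target.
case class Instance[V, T](
  id: String,               // stable unique id (used later for sampling)
  timestamp: Long,          // used later for out-of-time validation
  features: Map[String, V], // the feature vector
  target: T)                // the observed label, as a tiny distribution

object ExampleInstance {
  // One labeled observation for the cookies question:
  val cookieMonster = Instance(
    id = "cookie-monster",
    timestamp = 1416182400L,
    features = Map("color" -> "blue", "furry" -> "yes"),
    target = Map(true -> 1L)) // a degenerate distribution: a single "true"
}
```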
I'll note that this is 11 passes through the data: we're trying to expand the tree 10 times, and it's 21 MapReduce steps. That's because this is a recursive process, and it's a recursive process where each pass through the entirety of the data expands the tree by one level. So let's say you start off with this simple tree: the recursive process takes each leaf in the tree and produces new children for each of those leaves, and you go from here to there in one pass, two MapReduce steps, one pass through the data, and from there to here, and so on.

We'll get to how that works in a moment, but first we need the base case: we need to be able to construct the root, and that's the simplest thing, so it's a nice place to start. Our training data has both the features, these maps from string to some value type, and the targets, and the target here is kind of a degenerate case of the distribution. The training data has a distribution of labels, but in the case of a binary classifier it's generally just a map from true to one, or from false to one. Now, that might not always be true: you might want to weight some observations more, and do that by having some different kind of distribution. But the point is that we want to be in this distribution space, even though each value really represents one observation. These only get interesting as you sum them up, and summing them up is exactly what we're going to do to produce our root node. The root node just contains the overall distribution for the entirety of the training set, so we want to sum these all up.

In the distributed, MapReduce context, what that looks like is this: we have a number of instances, and I'm using these dotted lines to show the separations between nodes; each group is on a different node of the cluster. From each instance we can extract this T, this distribution, and on each node, locally, we can sum them up to get a single subtotal for that node. Then we have the shuffle and reduce step, where we send each of the individual subtotals to a single reducer, and there we add them all up, and now we have our root node. Okay, not super exciting, not a very interesting tree, not a very useful thing to predict with yet, but a necessary starting point.
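In Scalding terms, that root step might look like the sketch below, reusing the illustrative Instance shape from above. TypedPipe's monoid-based sum does exactly this: partial sums map-side on each node, then a single reduce to combine the subtotals (this assumes Algebird's map semigroup is available implicitly, as it normally is):

```scala
import com.twitter.scalding.typed.TypedPipe

object RootStep {
  // Sum every observation's tiny distribution into one root distribution.
  def rootDistribution(
      instances: TypedPipe[Instance[String, Map[Boolean, Long]]]) =
    instances
      .map(_.target) // extract each observation's distribution
      .sum           // monoid addition: local subtotals, then one reduce
}
```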
Then what do we do with that? The next thing we need to do, and need to do recursively, is choose the best way to split from there: what's the best first decision to make? This is a greedy algorithm, so it can't look multiple steps ahead; it's always just going to choose what looks like the best split right here, the thing that's going to gain you the most information for a later prediction. That might be: we want to look at color, and ask whether the color is blue or not. It might be: we want to look at color, and ask whether the color is yellow or not. It might be that we don't want to look at color at all; we want to look at some completely different feature, like whether the height is greater than five or less than five, or whatever. So we need to look at all of the possibilities there, all of the reasonable possibilities, and then make a choice about which one is best.

I'm going to simplify this a little at first and assume we're only looking at a single feature. We're not going to try to decide between features; we're just going to decide, for a single feature, say color, what's the best question to ask about it, the best way to split on it. That simplifies our notion of the training data: now we don't have this map of features, we've just got a single value, which in this case we know is a string, and again we've still got the targets.

How do we decide what the best way to split is? We're going to need to assemble, all in one place, everything we know about this one feature. One way to think about that: I talked about the T type, the target type, as being a distribution, and now what we want to assemble is a joint distribution. We want to know the joint distribution of values and labels, or of values and predictions. So we need a type for that, and I'm going to call this type S. In this very simple case, where we have a small, discrete number of values, we can again use a very simple type to represent that joint distribution: just a map from value to prediction. What we want to do is assemble in one place a single map which has the distribution summed up for all of the observations that had, say, yellow, and the distribution summed up for all of the observations that had blue, and so on.

We're going to do that, again in this distributed context, by first creating these tiny joint distributions for a single observation each. We'll take blue, and we'll take our target for that observation, and we'll create a map from blue to that target, and so on; we do that individually for each row, and then we want to sum them all up. By the way, although I'm showing you these as maps, if you think in terms of linear algebra, you can also think of these as sparse matrices: we're building a sparse matrix for each observation and then summing up the sparse matrices.

So again, what does that look like in a distributed context? The first step looks almost exactly like what I showed before: for each instance we construct this S, this joint distribution, we sum them locally on each node to get the subtotal for that S, and then we sum those up by sending them to a single reducer, which now has, in memory, one object representing the joint distribution for that feature.

Again, I mentioned this is a very simple case where we have a small number of discrete values. We might have a large number of discrete values, in which case we might choose an approximate data structure, like a Count-Min Sketch, to represent the joint distribution. We might have a continuous variable, in which case, again, we probably need some kind of approximation or binning of that distribution. We use the Algebird library, which has a lot of these approximate data structures, things that can represent a large joint distribution in a fixed amount of space, and do it distributed. But for now it's simplest to think about the simple map case.

Now, once you have all of this information in one place, you need to decide: what are the possible ways we could split this tree with it? This is the responsibility of the splitter trait. What we're trying to produce is a bunch of candidate splits, and a split looks like this: you have the node you're trying to split, and you want to have a number of edges, and each of those edges has a predicate on the value type V (do I go down this edge or not? do I go down this path of the tree or not?) and then a prediction: if I go down this path, this is the prediction, learned from looking at the training data. Everything that did go down this path so far, if I sum up the predictions for all of them, this is what I get. So from a joint distribution I want to be able to produce some number of candidate splits. A splitter therefore needs to have a type S, which represents this joint distribution; it needs to be able to initialize that joint distribution from a single observation's value and target; it needs to have a semigroup, which is to say it needs to be able to sum these up; and then it needs to be able to split: given this joint distribution (and it also gets the distribution in the parent, which is sometimes useful), produce a whole bunch of candidate splits. It's worth pointing out, as a side note, that there's nothing enforcing that there be two paths.
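Here is a sketch of what such a splitter might look like, along with the simple binary splitter. The names and signatures are illustrative, not Brushfire's exact definitions, and since the talk doesn't show the real split logic, the split method below is a simplified stand-in:

```scala
import com.twitter.algebird.Semigroup

// Illustrative splitter trait: S is the joint-distribution type.
trait Splitter[V, T] {
  type S
  def create(value: V, target: T): S  // one observation's tiny joint dist
  def semigroup: Semigroup[S]         // lets S be summed distributively
  def split(parent: T, joint: S): Iterable[Split[V, T]]
}

// A candidate split: a predicate on V plus a learned prediction for
// each edge. Nothing here forces there to be exactly two edges.
case class Split[V, T](edges: Seq[(V => Boolean, T)])

// The simple binary splitter: its S is just Map[V, T], and create
// builds a single-entry map. Summing merges maps, adding the targets
// of colliding keys.
class BinarySplitter[V, T](implicit ts: Semigroup[T]) extends Splitter[V, T] {
  type S = Map[V, T]
  def create(value: V, target: T): S = Map(value -> target)
  def semigroup: Semigroup[S] = Semigroup.from { (a, b) =>
    b.foldLeft(a) { case (acc, (k, t)) =>
      acc.updated(k, acc.get(k).map(ts.plus(_, t)).getOrElse(t))
    }
  }
  // Simplified: one candidate per observed value v, with "== v" going
  // one way and the summed-up rest going the other.
  def split(parent: T, joint: S): Iterable[Split[V, T]] =
    joint.keys.flatMap { v =>
      (joint - v).values.reduceOption(ts.plus(_, _)).map { rest =>
        Split(Seq(((x: V) => x == v, joint(v)), ((x: V) => x != v, rest)))
      }
    }
}
```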
We can have a multi-way split here, if you write your splitter that way. The splitters currently written in Brushfire don't; they're all binary trees. But there's no particular restriction enforcing that. The particular binary splitter I've been showing you has as its type S just this map from V to T; its create is trivial, it just creates a single-entry map; its split is a little more complicated, and I'm not going to show it.

Okay, so your splitter, on the reduce node, takes this joint distribution and produces a bunch of splits, and now we need to pick one. That's the responsibility of a very simple trait, the evaluator trait, which just takes in a single split and produces a score; for simplicity, we say the score is always a double. So, for example, we use in Brushfire a chi-squared evaluator, which just runs the chi-squared statistical test. It assumes your predictions are going to look something like a vector of discrete values, so this wouldn't work for a regression, but it works for either a binary or a multi-class classifier. You have some number of splits, so now you have a matrix, like a contingency matrix, which you can run a chi-squared statistical test on to see whether each row in the matrix is in fact different in a statistical sense. You get back a p-value from that, you can use that as your score, and you just pick the split that gives you the best score.

The other interesting thing about evaluators is that they're composable, which is to say that if you have other things you want to do to influence the scores, you can build up a more complicated evaluator. This is an example of a min-weight evaluator, where you might want the restriction that not only does this matrix have good statistical separation, but also that it doesn't just represent three or four observations: there's enough there that we think it's interesting. So this becomes like a stopping criterion.
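A simplified sketch of that composability, in which an evaluator just sees the target distributions of a candidate split's children; the trait and the min-weight wrapper are illustrative, not Brushfire's exact types:

```scala
// Score a candidate split, given the target distribution in each child.
trait Evaluator {
  def score(children: Seq[Map[Boolean, Long]]): Double
}

// A wrapping evaluator: keep the base score (the p-value or whatever),
// but veto any split that would leave a child with fewer than minWeight
// observations -- the stopping-criterion behavior described above.
class MinWeightEvaluator(minWeight: Long, base: Evaluator) extends Evaluator {
  def score(children: Seq[Map[Boolean, Long]]): Double =
    if (children.forall(_.values.sum >= minWeight)) base.score(children)
    else Double.NegativeInfinity // never chosen as the best split
}
```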
So you wrap that around the base evaluator: the inner one produces the base score, the p-value or whatever, and the wrapper adds a hard cutoff that says, I never want to see splits that get me down to just a couple of observations, or whatever it might be. That lets you build up these more complex behaviors from individual parts.

Okay. Remember I said we were going to start with a single feature. But in fact these instances have multiple features, and every time we're trying to decide how to split the tree, we do want to consider multiple features. That all happens in parallel. We have this one path where we're building up these joint distributions for this one feature, but at the same time we're also creating these joint distributions for each other feature in each observation, each instance. Those are all separately totaled and then sent to, say, a different reduce node, which has a splitter of its own, which splits that feature into a bunch of candidate splits. And then what we want to do is look at all of the candidate splits produced across all of the nodes, for all of the features, and pick the best one.

Except that this is what it would look like if we were only at the root node. Remember that we're adding a whole level at a time, so in the general case we're actually doing this in parallel at every leaf node. Not only is each feature being evaluated in parallel, but this is happening for each leaf in the tree in parallel. As our stream of instances comes in, the first thing we need to do is basically come up with the prediction from the tree as we currently have it: we walk the tree with the instance to the leaf node it would currently go to, and then we use that instance in constructing the joint distributions and coming up with the splits for that leaf. So that's all happening in parallel.

Except that, in practice, in production, you very rarely want to build models that are a single tree.
Models that are a single tree tend to make too much of particular coincidences found in the data; we call this overfitting. It's generally much better to construct a large number of trees from random subsamples of the data, maybe random subsamples of the features, and then average out all of their predictions later on. So actually, even this is happening many, many times in parallel, where each instance is getting assigned to one or more trees, but probably not all of them, and then it's getting walked down through each tree to find the single leaf, and then that leaf has multiple features, and so on. There's a huge amount of parallelism going on here, so that we can build things like random forests out of this.

Which tree something goes to is controlled by the sampler trait. The sampler trait says how many trees we want; it also says, for a given instance and a given tree, how many times we want to include that instance in the training for that tree. It's "how many times" because, for example, for random forests, you're doing bootstrap sampling, and a bootstrap is sampling with replacement, so it can actually include the same instance multiple times in the same tree: it might be zero times, it might be three times, whatever. And to allow us to stably make these assignments from instance to tree over these multiple passes, in a distributed context, we need some kind of unique ID in the instance that we can hash on. You can see the instance case class down at the bottom, or at least I hope you can: it has the feature map, it has the target, it also has an ID, and it also has a timestamp. The timestamp is in there because it's very, very common, in deciding whether to use something at all in your training set or whether to hold it out and use it later for validation of these models, to look at time, with something like out-of-time validation. And so this leads again to a desire to have these samplers be composable. You might have a simple sampling strategy like k-fold cross-validation (I won't go into the details), but then you might want to wrap that in an out-of-time sampler that has a time threshold, and it will just have a hard cutoff: anything whose timestamp is greater than the threshold won't be used at all in training, but will be used later on in validation. So you can, again, build up these more complex behaviors.
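A sketch of the sampler shape just described; the trait and method names here are my own illustration, not Brushfire's exact API:

```scala
// Decides which instances train which trees, and how many times.
trait Sampler {
  def numTrees: Int
  // 0, 1, or more: bootstrap sampling is "with replacement", so the
  // same instance can legitimately appear several times in one tree.
  // Keyed off the stable instance id so assignments survive re-runs.
  def timesInTrainingSet(id: String, timestamp: Long, tree: Int): Int
}

// Wrapping sampler for out-of-time validation: anything past the
// threshold is held out of training entirely, to be used in validation.
class OutOfTimeSampler(base: Sampler, threshold: Long) extends Sampler {
  def numTrees: Int = base.numTrees
  def timesInTrainingSet(id: String, timestamp: Long, tree: Int): Int =
    if (timestamp > threshold) 0
    else base.timesInTrainingSet(id, timestamp, tree)
}

// A trivially simple base sampler: every instance, every tree, once.
// (A real bootstrap sampler would hash the id per tree to get a count.)
object IncludeAll extends Sampler {
  def numTrees: Int = 1
  def timesInTrainingSet(id: String, timestamp: Long, tree: Int): Int = 1
}
```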
To get back to a somewhat more refined idea of what the job would look like: when we're constructing the trainer, we in fact not only give it the training data, we also have to give it a sampler. And in the call to expand (this is really faint, I don't know if you can even see it, but there's a comment there pointing it out) there are actually implicit parameters for a splitter and for an evaluator. There are defaults for common types of value you might want to use and common types of prediction you might want to use, or you can explicitly construct your own splitter and your own evaluator and pass those in.

There's really just one more thing I wanted to mention, and this is something that maybe has been bothering people, maybe hasn't: what is the value type when you look at something like this, where you actually don't have a homogeneous map? I've got both numbers and strings in there, and yet I keep talking about this single value type V. What's going on there? Well, you can resolve that however you want; the framework doesn't, in some sense, care what it is. But it's useful to give you a default way of dealing with it, and so Brushfire includes this dispatch type, which is kind of like an Either, but an extended Either with four possibilities. Those four possibilities are for the four common types of features you use in machine learning: an ordinal type, which is to say a numeric type, or at least an ordered one, but with a discrete number of values; a nominal type, where there is no ordering and, again, a discrete number of values; a continuous type, if you're using a real number, a double; and a sparse type, which I use to flag that it's nominal but there's a sort of infinite number of values, or a very large number of values, like IP addresses or something like that. Often you want different strategies, different splitters basically, for each of these. So your instance ends up looking like this, where you have a map not directly to the value but to the value wrapped in the various subtypes of this dispatch type. And then (this is sort of a ridiculous piece of code) you can have a dispatch splitter. The dispatch splitter wraps up a separate splitter for each of these subtypes and appropriately dispatches the incoming data with pattern matching. It also has four separate joint-distribution types, which it wraps into a dispatched object, and so on. This lets you have multiple different strategies for how to deal with splitting, all coexisting in the same pass through the data, and lets you have these multiple different types of features in your data.

So, yeah, that's it, and I think I'm running long. I wanted to reference the PLANET paper out of Google, which was the initial inspiration for a bunch of this work; they talk about a particular (not generic, but particular) way of using MapReduce to learn decision trees, which is similar to the way I do it. Scalding and Algebird, out of Twitter, are libraries that this makes heavy use of. This isn't yet up on Stripe's GitHub repo, but it will be, and if people are especially interested and want to contact me about getting early access to it, that's cool. Thank you very much.