All right. So machine learning in 30 minutes is obviously a bit of an ambitious project, and I only started to appreciate the scope of this presentation once I got into making the slides. I have several goals for this presentation. One is to talk a little bit about the general trends in machine learning, and hopefully get you excited about some of the things that are going on there. The second is to communicate that even though there's a lot of academic, math-heavy literature in this space, it's actually very easy to get started, and a lot of it relies on very, very simple core insights that we can all make use of. And the last one is to hopefully get you excited about actually going out and exploring some of these ideas.

So first of all, what comes to mind when you hear machine learning? Terminator. All right, what else? AI. What else? Lisp. Chess. Roomba. OK. So when I ask this question, what usually comes out is this picture of very complicated mathematical formulas: linear algebra, a lot of calculus, optimization theory, all of it mushed into one field. And that's certainly true, because it does take a lot of heavy machinery to work out many of the concepts. But at the same time, I think, unnecessarily so. Part of the reason is that if you take any course in AI or machine learning, at a university, a college, or just on your own, you'll find that the textbooks often focus on the algorithm. Hopefully this picture is nothing new: we have inputs, we have a runtime (the runtime may be your CPU, or a GPU, or whatever), and then we have the algorithm. The algorithm is the core insight of how your machine is going to learn. And that's a complicated subject, because we all know, natively, that learning in general is hard, and here you're trying to teach your machine to learn something.

Now, I think this picture really should be expanded, for a number of reasons, and we'll talk about why. When you talk about machine learning, you should really start expanding that box and thinking about the data inputs and the runtime. There are several reasons for that. First, the runtime. My experience with machine learning was very theoretical: I took a number of courses and explored the field on my own, and I found that academics often treat the runtime as a mere practical constraint; they don't really care about it. If you survey the machine learning faculty at most universities, you'll find a lot of statisticians and optimization theory people. What they're concerned about is the mathematical proofs. Can you get the lower bound even tighter than the person before you? And that's interesting, but they often overlook the runtime. For them, it's a constraint: somebody will build a machine that can run this, and it doesn't matter if it needs a terabyte of memory, because one day we'll have it; what matters is proving the algorithm works on this data set. But in practice (and I've always come at this field from a practical point of view) the runtime is a major constraint. And as developers, we all know that.
And the best example I have of that is actually my own personal experience, where at one point I invented, or thought I invented, what I believed to be the best recommendation algorithm there is. On paper, it was supposed to be the best thing ever. Then I went and tried to implement it, and it turns out I couldn't run the algorithm on anything but the most trivial data set. Then I discovered that my local CS department had this fabled machine with 40 terabytes of memory and lots of CPU, and if I could just get access to that, all of my problems would be solved. So I spent the next two months corralling and bribing people to get me access to this cluster, and one day I finally got the SSH key. I log on to this machine, I run all the usual commands, free and so on, just to see what the memory is, and it says there are 768 megabytes of memory. And I'm like, what the hell happened here? You were promising me this 40-terabyte machine. It turns out they actually had a commodity cluster of 50 machines. And here I am thinking it's a supercomputer where I can just take my code, run it, and all my problems are solved. Not the case. So I quickly discovered a practical constraint, which is these distributed systems. And I figured, hey, this distributed systems thing, how hard could it be? I'll take a little yak shave and solve that problem. So here I am today: probably one of the longest yak shaves of my life.

This is all a roundabout way of saying runtime matters. Of course, today we have access to EC2 and all these cloud computing platforms, which you all know about, and that's really critical to this field. But there's actually not a lot of research there, so in fact you are ahead of most of the academic research in your expertise and knowledge of these distributed systems. The second one is the data input. Up until very recently, data has been largely scarce; you could not get your hands on a lot of it. If you look at natural language processing or other domains, you'll find that researchers have been working on data sets with millions of words, whereas today we have trillions of pages on the web. And in fact we, the Rubyists, are responsible for building a lot of the applications that are generating terabytes and terabytes of data, right? Unstructured data, structured data, all that kind of stuff. And that is very, very important. Now, here's the gotcha. We've got this runtime, which is capable of processing these huge data sets, but at the same time we're exploding the amount of data we collect by orders of magnitude. So what's the net win? Are we actually gaining anything? It turns out that we are. I'll let you digest this for a second. This is a paper published in the early 2000s by a couple of researchers from Microsoft, and the English translation of it is: more input versus better algorithms. What they discovered was very, very interesting. They took a number of off-the-shelf natural language processing algorithms (they didn't invent anything new), and what they wanted to test was: what happens if we throw orders of magnitude more data at them than anybody has ever done before? An order of magnitude more data, two orders, three orders. How would these algorithms perform?
And what they found is that as you increase the input size, the performance of all the algorithms went up. In fact, if you look at learner five, the one at the very bottom, starting at about 78%, it turns out to be one of the better performers once you throw a lot of data at it. Now, this is really interesting, because you didn't change anything in your code. All you've done is throw more data at the problem. So it turns out that having a lot of data is incredibly useful, and this created a whole sub-field called data-driven learning. This stuff existed before, but it's much more prominent now because it's actually relevant: we have access to a lot more data. And you can take this to an extreme: you can have your data be the algorithm.

So what's an example of that? Is there anybody in the crowd who can actually read what's on the slide? How many distinct words or concepts are there in this text? Twelve. Now, if you don't speak the language, there's nothing here that will tell you there are twelve different words or concepts, because there are no spaces in here. And this turns out to be a very big practical problem for a lot of people. Imagine you're building a search engine and you need to search for some specific concept. It's obviously pretty hard to treat this entire text as one big un-tokenized string, right? It's one giant input. So you want to segment it into distinct concepts. How do you go about doing that? How do you come up with an algorithm for that? Now, we can simplify the problem a little, because I think this is no different from me taking an English sentence and mushing it all together. And with a little bit of exercise, we could all figure out what it's supposed to say: "word segmentation is tricky." Now, how did you do that? What's different about these two problems, and how do we go about solving this? One approach: you go out, buy the Grammar for Dummies book, and start building a model. We can build a language model and figure out that word W is nonsensical in the English language, and encode that knowledge into some sort of model. Second approach: well, somebody must have done this before, and in fact many people have, so we'll just grab some toolkit that does it for us, run our text through it, and it'll work. And here's a third approach: we'll just take a guess. So let's explore "take a guess." Here's a very, very simple algorithm for taking a guess. Instead of trying to build a model of the language, I'm going to estimate the probability that W is the right first word, followed by the best segmentation of everything else. So I can recursively evaluate all the possible ways to segment the string, and then take the answer that gives me the highest probability. But here's the question: how do you estimate the probability of the word W in the English language? Well, one way that actually works remarkably well is you write a scraper for Google. You search for the word W, you count the number of hits, and you divide it by the number of pages on the web. And you don't even have to know the number of pages on the web; just pick a number. Call it a trillion, right? If you get half a trillion hits for the word W, its probability is one half.
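(If you want to see how little code that guess actually takes, here's a minimal Ruby sketch of the idea, in the style of Norvig's well-known segmenter. The word counts are invented for illustration; in practice they'd come from your scraper or from an n-gram corpus.)

```ruby
# Toy unigram counts: invented numbers, for illustration only. In
# practice these come from counting web hits or from an n-gram corpus.
COUNTS = Hash.new(0).merge(
  "word"         =>   800_000_000,
  "segmentation" =>    20_000_000,
  "is"           => 5_000_000_000,
  "tricky"       =>    90_000_000
)
TOTAL = 1_000_000_000_000.0 # "call it a trillion"

def log_prob(word)
  if COUNTS[word] > 0
    Math.log(COUNTS[word] / TOTAL)
  else
    # Penalize unseen strings, and penalize longer ones harder, so the
    # segmenter prefers known words over long unknown blobs.
    Math.log(10.0 / (TOTAL * 10**word.length))
  end
end

# Recursively try every split point, score each candidate segmentation
# by the sum of its words' log probabilities, and keep the best.
# Memoized, so it stays cheap on short strings.
def segment(text, memo = {})
  return [] if text.empty?
  memo[text] ||= (1..text.length).map do |i|
    [text[0...i]] + segment(text[i..-1], memo)
  end.max_by { |words| { |w| log_prob(w) }.sum }
end

p segment("wordsegmentationistricky")
# => ["word", "segmentation", "is", "tricky"]
```

Notice there's no model of English anywhere in there. Swap in counts for another language and the same code segments that language too.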
So as an exercise, try to write this. It's actually incredibly simple, and it works really, really well. And this is an example of pure data-driven learning. You can also just download Google's N-gram data set, which they released a couple of years ago and which lists all of these counts. They've already done all of this work; they use it for their own research, and they've made it public to all researchers. So it's sitting there, ready to be used. The algorithm is basically: let's scrape the web, count the words, and figure it out from there. Notice that we've just described an algorithm where the data does all the work. It's a very simple insight, but it works remarkably well. And perhaps the most interesting thing about it is that it's language agnostic: we could apply that algorithm to any language, not just English, because we never had to build a model of the English language. And that is in fact what Google does. The reason Google is able to offer their translate product in so many languages is that they don't have to build models of every language, and they don't have to hire linguists for every language. They just count all the words on the web in every language, and then they guess, and it works remarkably well.

So algorithms matter, and there are many, many different algorithms in machine learning. But one concept that I find very useful personally is to think about learning as compression. Intuitively, if we have some problem and we can identify the significant concepts in it, we can say that we've compressed that knowledge. An intuitive example would be you reading a page of text and underlining two lines: you've taken a page of information and compressed it to just two sentences, which will remind you of the core insight of that page. That is compression at work. Some of the best examples come from the natural sciences. You take a very complicated phenomenon like inertia or gravity, and you can capture the whole thing in a simple formula; it literally takes two or three variables. Those are some of the best examples of compression: the phenomenon is very hard to model, but very easy to express, because we know some of the underlying rules. And the interesting duality here is that if we can represent some data set in fewer bits, then we've learned something about that data. That is in fact how compression works, if you think about it: you take an original file and you squeeze out the redundancy, the repeated patterns, whatever they are, into a smaller file. And that is very much what machine learning is all about. Compression and learning are two very, very related concepts, and I want to walk through a couple of examples of how that can be applied.

So classification is first. Let's say somebody challenges you to build a model to predict a tasty fruit, and you hypothesize that there are two things that could possibly determine whether a fruit is tasty or not: feel and color, right?
So you construct this little thing, you go out and gather a bunch of data, and then you plot it. You get something like this, and you say, well, I can make a reasonable guess as to what makes a tasty fruit now, because I can just draw this line and say anything that falls on the right side of the line is in one set, and anything on the left side is in the other set. So when a friend comes to you and says, hey, I have this thing here, is it going to be tasty or not, you have a very good predictor, or a very clean model rather, for answering that. And what we've just described is the perceptron algorithm. It belongs to the class of neural network algorithms, but it's actually very, very simple and easy to implement. The question is: given a number of data points, how do we devise an algorithm that finds that line, that wiggles the line around until it separates the positive from the negative examples? I'm not going to go through the details, but it's a very simple algorithm that you can apply to any data set that can be separated by a line. Now, of course, you'll notice there are many different lines we could have drawn, and it's an interesting intellectual question why you'd pick one line over another. I'll leave that as an exercise, but in practice it turns out that you usually want the line that maximizes the margin between the negative and positive examples, and that's a very good improvement on the perceptron algorithm. You'd be surprised how many problems can be solved with this little guy here. It's kind of the foundation of the whole field, and it can solve many, many different problems.

Now, that's a clever algorithm, but it doesn't work for all cases. Imagine you have a different problem where you gather a bunch of points; before, we had two features, feel and color, but now imagine we only gather color. You say, okay, I'm going to apply this perceptron thing to it and draw a line. Well, of course, you're going to misclassify a whole bunch of points, because there's no way to draw a line through this data without making mistakes. No matter what you do, you're kind of stuck. But coming back to what we were talking about yesterday, applying some lateral thinking: nobody actually told us we have to stay in this one-dimensional space. So we're going to do a little trick. I'm going to arbitrarily make this a two-dimensional problem, and I'm going to say that the y value is the x squared value. Let me flip back and forth a few times so you can see what happens. I've arbitrarily drawn a y axis right smack in the middle, and I've made every y value the square of x. It seems kind of arbitrary, but what this allows us to do is draw a line through our examples, and all of a sudden we can separate all of the positive from the negative examples. Once again, a very, very simple observation, but it turns out we've just reinvented support vector machines. That is the core insight behind support vector machines.
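(Before we go on to SVMs proper: here's roughly what the perceptron looks like in Ruby. A minimal sketch on made-up "fruit" data, with the one-dimensional lifting trick noted at the end.)

```ruby
# Toy data: [feel, color] features, label +1 = tasty, -1 = not tasty.
# All numbers invented for illustration.
examples = [
  [[0.9, 0.8],  1], [[0.8, 0.6],  1], [[0.7, 0.9],  1],
  [[0.2, 0.3], -1], [[0.1, 0.1], -1], [[0.3, 0.2], -1]
]

w = [0.0, 0.0, 0.0] # weights for [bias, feel, color]

100.times do
  examples.each do |features, label|
    x = [1.0] + features # prepend the bias input
    prediction = { |wi, xi| wi * xi }.sum >= 0 ? 1 : -1
    next if prediction == label
    # Misclassified: wiggle the line toward this example.
    w = { |wi, xi| wi + label * xi }
  end
end

p w # the learned line: w[0] + w[1]*feel + w[2]*color = 0

# For the 1-D case that no line can separate, lift each point x into
# [x, x * x] first, then run exactly the same loop in two dimensions.
```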
Support vector machines have a lot of heavy math and optimization theory behind them, but that's the core insight behind the whole thing. Given some data, it throws it into some n-dimensional space and then tries to find a plane that separates all the positive examples from all the negative ones. And it's actually very easy to get started if you're a Rubyist. There's basically one canonical library that everybody uses, libsvm, written by a couple of researchers at National Taiwan University, and there are language bindings for virtually every language I'm aware of, Ruby and JRuby included. So all you have to do is create a problem, train a model, and go. If you want to do spam classification, it does really well on that kind of stuff. Once again, there's a link here to explore that a little bit further: a very simple insight, with complicated execution underneath.

Next one is recommendations. This is obviously a big one for a lot of web apps. The general problem is something like this: you have a bunch of users and a bunch of objects, and those objects could be anything, movies or code commits or whatever, and you get the users to rate each object: yes, I like it, or no, I don't. Then you have a user, Bob, at the bottom, who has rated three out of the four objects, and you want to predict whether Bob is going to like object D or not. How do we go about doing that? It turns out there's a very simple insight from basic linear algebra that can solve this problem. We won't get into the details, but the core observation is that if you paid attention in your linear algebra class, which I didn't, you would know that you can take any matrix that looks like this and decompose it into three components. The reason that's interesting is that the middle matrix allows us to do some really funky stuff: you can just delete values out of it, and when you remultiply the three back together, you get an approximation of the original matrix. That sounds very abstract, so let's look at a real example. A photo is in fact a matrix: you have pixels. Here we have a photo that's 512 by 512 pixels, and we're going to run SVD on it. We decompose that matrix into three components, and then, with this k value at the top set to 16, we delete over 400 of the values in the middle matrix; we just null them out. Then we remultiply the whole thing and we get the image on the right. What we have done is compress the image. We've just invented an image compression algorithm that uses nothing but very, very basic linear algebra. There's nothing about color here; we didn't talk at all about the fact that this is an image rather than some other type of matrix. Very, very clever. And it turns out this approach is the bread and butter of basically all vision systems. If you want to extract significant features out of an image, you basically keep compressing it until you lose everything, and then you can go back and figure out where the edges are, what the significant, big things in the image are. And the reason that's interesting is that by compressing, we're learning something about that data. We can discover the interesting regions, like where the big fuzzy hat is. Instead of looking at each individual pixel and seeing every straw in that hat, we just know that this is where the hat is.
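(Here's the shape of that in Ruby using the linalg gem. The matrix is a made-up toy, and the method names are the gem's API as I recall it, so treat this as a sketch to adapt, not gospel.)

```ruby
require 'linalg' # the "linalg" gem, a Ruby wrapper around LAPACK

# A tiny toy matrix. Think of it as a 4x4 grayscale "image", or as
# users-by-movies ratings. A real photo would be 512x512, as on the slide.
m = Linalg::DMatrix[[5.0, 5.0, 0.0, 1.0],
                    [5.0, 4.0, 1.0, 0.0],
                    [0.0, 1.0, 5.0, 4.0],
                    [1.0, 0.0, 4.0, 5.0]]

u, s, vt = m.singular_value_decomposition
v = vt.transpose

# The "compression" step: keep only the k largest singular values,
# null out the rest, and remultiply.
k = 2
u_k = Linalg::DMatrix.join_columns((0...k).map { |i| u.column(i) })
v_k = Linalg::DMatrix.join_columns((0...k).map { |i| v.column(i) })
s_k = Linalg::DMatrix[[s[0, 0], 0.0],
                      [0.0, s[1, 1]]]

approx = u_k * s_k * v_k.transpose # rank-2 approximation of m
```

The operation is identical whether the matrix holds pixels or ratings, which is exactly the point.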
And based on that, you can actually build a very useful, well-performing recommendation system. The way you do that in Ruby is, well, you just call one function, as the sketch above shows. You install the linalg library, which uses LAPACK underneath, you create a matrix, and you call the singular value decomposition method on it, and it does all the work for you. On some larger data sets you do need a little bit more to make this happen, because the exact decomposition requires a lot of memory, but you can approximate it as well, and there are tools for that. Once again: a very simple insight from linear algebra, and it can be applied to image compression, to recommendation systems, to basically anything that looks like a matrix. You can run it through SVD and do some really interesting stuff.

All right, clustering. Clustering is an interesting problem. You have some set of data points you've collected; here I'm plotting them on a two-dimensional plane, and I've even color-coded them. When I present this picture, most of us, even if I omitted the colors, would agree with a high degree of accuracy on what the clusters are. Our brains are very, very good at clustering this type of data, and we're not even aware of what the brain is doing. Basically what we're asking is: what's the similarity between these two points? If they're similar, I'll call them part of the same cluster. So the question in any clustering problem is: how do you define that similarity? You can obviously build very domain-specific similarity functions, but is there a more general way to do it? It helps to look at a simpler example, so forget the image on the right and look at the three strings here: a bunch of A's, a bunch of B's, and then A's and B's mixed. Even without discussing what the similarity function is, I think most of us will agree that the similarity between one and three is greater than the similarity between one and two. String one is a bunch of A's and so is string three, so they're more similar than completely different strings would be, and the same goes for comparing two with three versus two with one. But the question is: how did you do that? How did you know? I think some of you actually ran a compression algorithm in your head. You're not even aware of it, but that's what you did. If you're familiar with compression algorithms, what you did is say: if I treat that run of A's as one token, I can represent the string more compactly, as one A followed by a count. String three has the same run, so I can compress it the same way. You basically ran the Lempel-Ziv algorithm in your head. There are obviously other ways to do this, and we could even disagree; maybe one and two are in fact more similar. But that's one way to do it. So the next slide is a little code-dense, but it's actually very, very interesting, and it takes some time to appreciate what's going on. First, I have a method called deflate, which uses Zlib, gzip's algorithm, to take any number of files.
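(The slide code isn't in the transcript, so here's a reconstruction of the idea: a minimal sketch using Ruby's standard Zlib that matches the walkthrough below.)

```ruby
require 'zlib'

# Concatenate any number of files into one string, compress it, and
# return the size of the compressed result.
def deflate(*files)
  data = { |f|, 'rb') }.join
  Zlib::Deflate.deflate(data).bytesize
end

files = Dir['data/*'] # whatever happens to be in the data directory

files.combination(2).each do |a, b|
  # If a and b share structure, compressing them together takes fewer
  # bytes than compressing them separately, so higher score = more similar.
  score = (deflate(a) + deflate(b)).to_f / deflate(a, b)
  puts format('%s <-> %s : %.3f', a, b, score)
end
```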
I just read them, concatenate them into one giant string, compress it, and return the size of that compressed string; that's what the deflate method does. Then I read a bunch of files from my data directory, whatever they are, and I deflate each one. First I compress one file, then I compress the second file that I'm comparing it with. And then comes the trick: I concatenate both files together and compress the pair. The intuition is that if the two files are completely dissimilar, compressing them together buys you nothing, and the combined size will be about the sum of the two individual sizes. But if the two files have some commonality between them, compressing them together will produce something smaller than compressing them individually. A very, very simple observation, but it's incredibly powerful, because I can run this on virtually anything that Zlib will compress. Text, obviously, but also any binary data, maybe even MP3s. MP3s are already compressed, but they have some metadata inside them, so maybe I can squeeze that. This is completely domain agnostic, as long as I can compress the data. In fact, you don't have to use Zlib. If you have some magical algorithm for compressing some other type of data, you could use this to cluster that data too, because all you need is a good similarity function. All the actual magic happens in the last little bit, where we compute the score: take the compressed size of A plus the compressed size of B, and divide by the compressed size of both together. That sum should always be greater than or equal to the combined size, and the larger the score, the more the pair compressed together, which means the more similar the files. So going back to our clustering example, we now have a very good way to measure similarity between any two files. If you run this on a directory of files, say you've lost all of the file extensions so you have no idea what they are, it will discover the different types, because similar files compress well together. You can detect languages; you can do all kinds of stuff. So this is really, really cool, because there's no knowledge of the domain in here. This is generally applicable.

So: we've talked about the runtime, we've talked about the data input, and we've talked about the algorithm, and the idea there is that many algorithms that have complicated machinery underneath have very simple insights at the top. But there's another interesting trend happening right now, and that's what's called ensemble methods. This is another paper published in the early 2000s, and what the authors discovered is that instead of building one very complicated model of the world, say you actually want to predict movie taste and you try to build one model that has all of the features, whether you like the soundtrack, whether you like the actors, all of these other things, you build many discrete, very, very simple models and just make them all vote on an answer. It turns out that this is often a much better way to model anything than building one complicated model.
And the best example of this is actually the Netflix Prize. Anybody here participate in the Netflix Prize competition? No? You should have, because it was very, very interesting. It ran for a couple of years, and a lot of teams competed. The goal was to improve Netflix's recommendations by over 10%, and for the longest time people were stuck at around the 9% level; it was very, very hard to get over that 10% line. Then a bunch of teams came together and said: look, you have slightly different insights, we have slightly different insights, so what if we just take our algorithms, put them together, and make them vote? So this one team, BellKor, which was a combination of, I believe, three teams with seven members total, combined all of their stuff, voted on the answers, and surpassed the 10% line. And that caused a race. All of a sudden, all the teams below them said, okay, clearly we're not going to win a million bucks by ourselves, but maybe if we pool all of our algorithms together, we can do better. So in the span of about two days, this ensemble team was created, which took something like 20 or 30 solutions and pooled them into one, and they got the same score. Ultimately BellKor won the prize: the two teams' estimations were identical to four decimal points, but BellKor had submitted their answer 20 minutes before the other team. That's a tough way to lose, right? But it illustrates a very interesting point: you can do very interesting things with this approach. And intuitively, the reason it works is that different models capture different things. One can model the soundtrack, another can model something else, and you can combine them. You don't have to build complicated models of the world; you can build very simple ones and then just join them together.

And one of the best examples I have of that is the GitHub contest. Did anybody do the GitHub contest last year? Oh, come on, guys. Hopefully after this talk you'll be more inclined to try something like this. One of the most interesting things about the GitHub contest, to me at least, was the way they set it up. When you submitted your results, you actually made your answers public, because you committed them to a GitHub repo. Which meant that I could find all the submissions, download everybody's answers, and just build a voting algorithm: combine his answers, my answers, and his answers into a meta-answer built from 50 different submissions. I submitted that, and I was actually in the top three for a little while. Ultimately they disqualified everybody who did this. I disagree, I think I should have qualified, but they had a point. Still, this works remarkably well. Download the leaderboard, set up a voting mechanism, and go nuts.
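(A majority-vote combiner really is tiny. Everything below is invented for illustration; in the GitHub case, each "model" row would just be somebody's downloaded answer file, parsed into yes/no answers.)

```ruby
# Three models' answers to the same ten yes/no questions. Made-up data.
votes = [
  [1, 0, 1, 1, 0, 1, 0, 0, 1, 1], # model A
  [1, 1, 1, 0, 0, 1, 0, 1, 1, 0], # model B
  [0, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # model C
]

# For each question, go with whatever the majority of models said.
ensemble = votes.transpose.map do |vs|
  vs.sum * 2 >= votes.size ? 1 : 0
end

p ensemble # the "meta answer" that combines all three models
```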
So, in summary, three things. First, data-driven learning: aggregating a lot of data is really, really helpful, and you're all producing a ton of data, so think about how you can use it in your own way, in your own world. Second, runtime matters, and there's a lot we can do to improve the state of the art. Hadoop helps, and there's a really interesting project called Mahout, which is machine learning on top of Hadoop, but overall there's still not a lot of tooling out there to make this happen. And the last one is ensemble methods: build simple models, find simple insights, and combine them, and you can often do very, very interesting things without building the next language model or hiring 10,000 linguists. Here are a few other resources for following up on this stuff and learning how to make it actually work in practice. That's it, and I don't know if we have time for questions. Unfortunately we don't, but you can catch him at the break. Thank you very much, Ilya. All right, thank you.