I'm excited to announce our last speaker for the day. And then we have, oh, well, hold on. Give me a minute. He is an assistant professor in the information science department at Cornell. His research is on developing machine learning models and algorithms, focusing on applications in the humanities and social sciences. He is also the chief architect of the MALLET toolkit, which is an incredible resource if you're at all interested in machine learning. So please help me welcome David Mimno.

OK, thank you all for coming. This is a really exciting place to be, and I'm so glad. I'm sorry I'm the last thing that's between you and dinner and dinosaurs, which is a difficult place to be. What I want to talk to you about is my experience trying to teach machine learning with data visualization, and with D3 and JavaScript in particular.

I want to start by asking: what is machine learning? This is my core area of research. Like many people here, I'd say I'm a terrible data visualization person, and I'm starting to be not a terrible teacher, I hope. But machine learning is my core area. I like to think of it as a three-step process: we start with data, we learn a model that summarizes some pattern in that data, and the model then gives us some kind of insight about the data. Often the goal is a prediction, like: is this transaction fraudulent, or will this user click on this link? But machine learning can also be used for exploratory purposes. For example, in the recent trend towards data journalism, something people are really interested in is taking 10,000 documents that have been dumped on your desk by someone who hates you, with no index and no organization, and trying to find out what's interesting. Where should we spend our time in this document collection? That's actually something I try to do in my research on topic modeling, which I'm not going to discuss today. It's kind of a relief.

Three things have made machine learning an enormously powerful tool in the last 10 to 20 years. The first is a lot more computation: things that used to be mathematical curiosities are now being done at large scale by major corporations. The second is access to a lot more data: there are things that simply don't work if you don't have enough data, but if you have enough information, you can find signal there. And the last is that we've moved away from symbolic representations of knowledge towards statistical inference tools. The combination of data, computation, and statistics has proven incredibly powerful.

One thing the grad students I practiced this talk on mentioned is that you may not realize it, but your life is, in very serious ways, shaped by algorithms and models. If you're totally off the grid, maybe not. But if you have a phone, if you use a computer, your experience, what ads you're shown, what news articles you're offered, is shaped by algorithms. Even eHarmony is using algorithms, so our most intimate moments are controlled by algorithms to a large extent. So I think everyone needs to know about machine learning. In particular, I'm really interested in producing tools that will be useful for people studying history and literature.

Okay, so here's the problem. Go back about a minute and remember I said "statistical inference."
What I'm saying is that there's this incredibly powerful tool that I think everyone needs to know about, and it involves learning statistics, an area that most people respond to with something between mild discomfort and what I recently heard described as Lovecraftian horror. So I think this is an important time to think about how we teach statistical inference and machine learning. And I'm going to make the case that combining visualization with mathematics and with code, those three things together, is a very powerful way of conveying insight and intuition about math.

This is the traditional approach to teaching machine learning. It's very notation heavy. Notation is a very concise and powerful language for describing mathematical relationships, but it is a language, and if you don't speak that language, this means nothing to you. In fact, for a lot of the people I work with, it would be better if this were ancient Greek; they would be more likely to understand it. So by itself, mathematical notation, the traditional stand-at-the-board-writing-equations way of teaching or communicating about math, is not necessarily the best thing.

So what if we step back from the mathematics and talk more about code? Actually implement things and see how they work. I think that's also a very valuable tool, but in isolation it can be extremely confusing. Does anyone have any idea what this code does? Yes, sampling from the base distribution, that's a good guess; statistical inference, yes. This is probably the most obfuscated code I've ever written, and I can say that because I wrote it. It's very confusing, and it's not at all clear what it's doing or how it relates to the math. The problem with looking at code for machine learning models is that it's often written with an eye towards speed, efficiency, and conciseness, not with an eye towards clarity and readability.

All right, so the third leg: visualization. This is a wonderful resource I found that brings together a lot of visualizations of machine learning models. But I would suggest that it also is not sufficient in and of itself. What it shows you is very complicated visualizations of powerful algorithms, shown from multiple different angles, and they're beautiful. If you're mathematically prepared, I think they can give you some good insight. But by themselves, for a beginning audience, especially undergraduates, I don't think they're going to be effective.

So I propose, and I'm actively testing this hypothesis, that the best way to convey intuition about machine learning is to create little bite-sized nuggets that combine visualization, code that people can actually write themselves within a short amount of time, and mathematics that makes precise the relationships reflected in the visualization. That is, I won't say the most effective, but an effective way of communicating these ideas. I want to show you four examples of different concepts that I've been using in an experiment I've been running at Cornell. They've kindly given me 120 undergrad participants in this study; they think they're taking a course. I'm sure they'll see this video and complain. They're all doing great.

The first example: one of the earlier talks mentioned sentiment classification. Let's say you get a document, and you want to know: is it a positive Yelp review or a negative Yelp review?
The goal is to take that input and make a decision. One model for doing this is to take every distinct word and give it a numerical weight, either positive or negative. You make a decision by adding up the weights of each word in the document: if that sum is positive, you say it's a positive review, and if it's negative, you say it's a negative review.

One way to visualize this model is like this. I'm putting every word on a two-dimensional plot. The Y-axis just represents frequency: "and" is very frequent, "waited" is not very frequent. The X-axis, left to right, represents the ratio of each word's frequency in positive (five-star) Yelp reviews versus negative (one-star) reviews. So the words "horrible," "rude," and "worst" are about 20 times more likely in negative reviews than in positive reviews. And the actual X position, left or right of the middle line, is exactly the weight we put on each word. So I could take a document like "I waited some number of minutes, but it's always great," add up the X positions of those four plotted words (waited, minutes, always, great), and get a decision. Whether it comes out positive or negative would probably land somewhere in the middle, which you can see because the words sit about the same distance right and left.

So that's the predictive mode. You can also use this to learn something about the phenomenon of Yelp reviews. One thing I noticed is that there are a lot more words that are significantly positive than words that are significantly negative. Hey, there's some negative space, awesome. The positive words, fresh, favorite, delicious, amazing, friendly, love, tend to be adjectives about food. But the negative words, manager, told, minutes, didn't, she, no, would, and also rude, worst, horrible, waited, are mostly about service. There was actually some recent work by Dan Jurafsky at Stanford, which made the radio, describing how positive and negative Yelp reviews have different characteristics. So this is useful information: it tells managers that service is what really drives bad reviews, but it's generally useful too.

Here's the code that implements this. That's D3: I'm taking my words and setting their X attribute to the result of a function. Let's look at it from the inside out. I'm taking the number of times this word occurs in five-star documents divided by the total number of tokens in five-star reviews, and dividing that by the number of times it occurs in one-star documents divided by the total number of tokens in one-star reviews. So I'm taking that ratio of probabilities, taking the log of it in base 10, and then scaling it. That's all there is. This is using a very complicated algorithm called counting: I counted up the words, divided by the total number of words, and took ratios. So this is a very simple algorithm.
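To make that concrete, here's a minimal sketch of the same log-ratio computation. All the names here (fiveStarTokens, word.fiveStarCount, and so on) are hypothetical stand-ins of my own, not the identifiers from the actual course code.

```javascript
// A minimal sketch of the log-ratio weighting described above.
// Names like fiveStarTokens and word.fiveStarCount are hypothetical
// stand-ins, not the actual course code.

var fiveStarTokens = 1200000; // total tokens across all five-star reviews
var oneStarTokens = 300000;   // total tokens across all one-star reviews

// The weight for a word is the log (base 10) of the ratio of its
// probability in five-star reviews to its probability in one-star reviews.
function logRatioWeight(word) {
  var pPositive = word.fiveStarCount / fiveStarTokens;
  var pNegative = word.oneStarCount / oneStarTokens;
  return Math.log(pPositive / pNegative) / Math.LN10;
}

// To classify a review, add up the weights of its words:
// a positive sum predicts five stars, a negative sum predicts one star.
function score(reviewWords, weightByWord) {
  return reviewWords.reduce(function (sum, w) {
    return sum + (weightByWord[w] || 0); // unseen words contribute nothing
  }, 0);
}

// The D3 part is then just a linear scale from weight to screen position:
// words.attr("x", function (d) { return xScale(logRatioWeight(d)); });
```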
The next thing I want to talk about is a more complicated algorithm. What I want to do is cluster these points: I want to say that there are groups of points, regions of density. Now, we can see this. We can see that there's a cluster over here and a cluster over here and something over there. Does this seem easy or hard? It seems pretty easy. But that's an illusion. It's an illusion because we have incredibly powerful machinery for identifying clusters in two-dimensional space.

It turns out that the goal of creating clusters of points that have the lowest distance from each other and the furthest distance from points in other clusters is intractable. I'm not going to throw technical terms at you, but it's basically as hard as any problem you're likely to see. There's no way to find the best solution without essentially checking every possible partition of the points. But it turns out there's a class of algorithms that can get us a pretty good solution, and it's quite simple. It's this iterative pattern.

Why don't I just throw down some random cluster centers? Those are the red numbers. They have no relation to the data; I just sampled them from a uniform distribution. Now I'm going to assign every point the number of its closest center. Again, this is not a very good clustering; I just made it up a minute ago. Now I'm going to move the cluster centers so that each one sits at the centroid of the points assigned to it. So they moved. Now I'm going to re-cluster the points, and the ones that change are going to sort of grow and then shrink. I don't know if anyone can see any of this on the screen, but okay. I'm going to keep doing that. Okay, only a few of them are still moving now. And now it's converged.

This is the iterative pattern where you start with a terrible solution and slowly make it better, usually by alternating between maximizing two different objectives. That turns out to be incredibly powerful, and a large class of machine learning algorithms are just a slight tweak on it. So there you go. I promised the organizers I wouldn't try to teach anyone here much machine learning, but if it happens, it's an accident.

Okay, this turns out to be a really lucky case. Sometimes a center ends up in a not-very-dense region, like right here, so the algorithm is not guaranteed to converge to any particular solution. Also, the numbers of the clusters are arbitrary: if I permuted the numbers, the result would be exactly the same. These are important things to know about this algorithm: there's no guarantee you'll reach a global optimum, and it's invariant to label permutation.

Let's look at the code. As I said, there are two things we have to do. First, we cluster the points by finding each one's closest centroid. For each point, I set a variable recording the shortest distance to a cluster seen so far, starting at positive infinity. Then for each cluster, I calculate the distance, this is our friend Pythagoras, or rather the square of the distance, and if that distance is the shortest we've seen so far, I assign the point to that cluster. This is something undergrads can write themselves within a 50-minute class if you explain how to do it. Second, here's how we change the locations of the clusters. For each cluster, I grab the points assigned to it using a filter, and then I set the cluster's X value to the mean of the X values of those points. This is why it's called the K-means algorithm: you have K clusters, and you set each one to the mean. And I'd like to suggest that an essential part of understanding this algorithm was seeing it move. Is that fair? And having access to the code that lets you understand why the centers move where they move.
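Here's a minimal sketch of those two alternating steps, under the assumption that each point is an object like {x, y, cluster} and each center is {x, y}. The data structures and names are mine, not the course code; d3.mean is the standard D3 helper.

```javascript
// Step 1: assign each point to its closest cluster center.
function assignPoints(points, clusters) {
  points.forEach(function (point) {
    var shortest = Infinity; // shortest squared distance seen so far
    clusters.forEach(function (cluster, i) {
      // Squared Euclidean distance (our friend Pythagoras).
      var dx = point.x - cluster.x;
      var dy = point.y - cluster.y;
      var distance = dx * dx + dy * dy;
      if (distance < shortest) {
        shortest = distance;
        point.cluster = i;
      }
    });
  });
}

// Step 2: move each cluster center to the centroid of its points.
function moveClusters(points, clusters) {
  clusters.forEach(function (cluster, i) {
    var mine = points.filter(function (p) { return p.cluster === i; });
    if (mine.length === 0) { return; } // leave empty clusters where they are
    cluster.x = d3.mean(mine, function (p) { return p.x; });
    cluster.y = d3.mean(mine, function (p) { return p.y; });
  });
}

// Alternate the two steps until the assignments stop changing.
```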
Here's another example: a perceptron classifier. We've got a bunch of points. Can people see that there are blue points and red points? Okay. I want to separate the blue points and the red points. The real goal would be to do this in high-dimensional space, so I could train a classifier for fraud detection or something like that. And what I want to do with this example is show you what's called an online algorithm. In the previous example, the algorithm was iterative, but I alternated between moving all of the means, the model, and re-clustering all of the points; I processed the entire data set at once. If my data set is gigantic, I have to wait until I get through all of it before I can change anything. In practice, a lot of commercial machine learning is done in an online fashion, taking little bits of data at a time. That's how this perceptron algorithm works.

So let me just draw an arbitrary line. Is that clear at all? Okay, it's kind of going like that. I'm actually setting the background color with a line perpendicular to the boundary that has a stroke width of 2,000. If anyone can think of a cleaner way to do this, I'd be interested, but it does seem to work. What I'm going to do is hit the "sample and update" button, and it's going to grab a random point. If the point is on the right side of the line, so it doesn't have a little black outline around it, the algorithm ignores it. If it's on the wrong side, so it does have that black outline, the algorithm moves the decision boundary line in response to that point. Okay, the first one was correctly classified, so we ignore it. Second one too, we're getting lucky, okay. All right, there's one. Did you see that move? I'm just going to keep clicking, and it becomes increasingly unlikely, oh, there were two, that we'll get a misclassified point. There's one. Okay, at this point we have about the best classifier we're going to get for this data set.

What I want to show with this is, first, the pattern of taking small bits of data at a time. Imagine a stream of data coming in, where you're not even keeping anything you've seen before; you're just updating the classifier you currently have. The other important thing is: could I have drawn a line that separated these points cleanly? It's a little hard to see with the colors, but no, you can show that there is no single line that would have separated them. The students picked up on this immediately: the objective is not necessarily satisfiable.

Let's look at the code. I'm not going to talk too much about this because it requires a little bit of linear algebra, but it's really just adding and multiplying. The first part makes a prediction for the point. I take the inner product of the point with a vector that defines the classifier line; if that inner product is positive, the point is on the positive side, and if it's negative, it's on the negative side. Then we multiply that by the label, which is either positive one or negative one, and if the result is less than zero, so if we've gotten it wrong, we change the classifier vector so that it points either towards or away from the point we misclassified. So again, not very hard, but that's the core of this algorithm.
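As a sketch of that update rule, assuming each point carries a feature array and a label of +1 or -1 (these structures and names are my own, not the demo's code):

```javascript
// A minimal perceptron update, under the assumption that each point
// looks like { x: [x1, x2, ...], label: +1 or -1 } and the classifier
// is a weight vector of the same length. Not the actual demo code.

function innerProduct(a, b) {
  var sum = 0;
  for (var i = 0; i < a.length; i++) { sum += a[i] * b[i]; }
  return sum;
}

// Look at one point; if the current line gets it wrong, nudge the
// weight vector towards (or away from) that point.
function perceptronUpdate(weights, point) {
  // Positive inner product means the positive side of the boundary.
  var prediction = innerProduct(weights, point.x);
  if (point.label * prediction < 0) { // misclassified
    for (var i = 0; i < weights.length; i++) {
      weights[i] += point.label * point.x[i];
    }
  }
}

// Online training: stream points one at a time, never storing them,
// e.g. perceptronUpdate(weights, sampleRandomPoint());
```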
Our last example is linear regression. People familiar with linear regression? Okay, this is one of the models people are most familiar with. The data is also from Yelp. The x-axis shows, on a log scale, the number of reviews a business got, and the y-axis shows the average star rating for that business. You can see that if I fit the best-fit straight line, it's sort of going up. There seems to be a positive relationship, but it's maybe not a great fit: there's a lot of variability at small numbers of reviews, and maybe the best thing to say is that if you have a lot of ratings, you're probably not a bad business, which seems reasonable.

Mathematically, what this model is saying is that if you increase the number of reviews, then in expectation you will increase your star rating. The linear regression model says there is a linear relationship between two variables, an input and an output. Let's say we want to test that assumption and find out: is this actually a robust relationship, or is it just an artifact of outliers or a small sample size? We often do this with something like a t-test or an asymptotic test, but I think there's a much simpler way to show it, and it actually goes back 100 years to early work in statistics that at the time was considered a curiosity because it was too computational.

What we're going to do is see what would happen if there were no relationship between the input and the output, the x-axis and the y-axis, and get a sense of what that would look like. I'm going to do that by shuffling the y-values, plotting the linear regression line for that shuffled data, and then moving the points back. Let me do that a few more times. You can see from the transition that I'm keeping all the x-values of the points the same; I'm just changing their y-values, so they only move up and down, never side to side. And I'm building up an empirical distribution of what that line would look like if there were no relationship between number of reviews and star rating. What I notice is that my actual value, the thicker, blacker line, has a higher slope than any of those random permutations. That suggests this is a fairly strong result: it's unlikely I would have gotten it just by random chance. I like to show this visualization because it demonstrates the idea of randomly permuting one of the variables, asking what the model would look like, and building up a picture of the distribution of models, not by appealing to asymptotics or p-values, which people usually don't understand all that well when they come in, but with this empirical view.

Let's look at the code. I've taken some of it out, but the basic idea is that I have a randomized data set: I shuffle the indices of the data and create points by pairing each real x-value with a shuffled y-value. The x-values are the same, and the y-values are the same values in random order, so there are the same number of them and they have the same mean. You actually saw that the lines all passed through a single point; that's because the mean of the x's and the mean of the y's never changes, and the regression line always passes through that point. I move the circles, create a new linear model from my randomized data, draw a line for it, and then move the circles back. That's all it is.
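Here's a minimal sketch of that permutation loop. The fitSlope helper and the data layout are my own stand-ins, while d3.shuffle, d3.mean, and d3.range are standard D3 helpers.

```javascript
// A sketch of the permutation test described above, assuming
// data is an array of {x, y} objects. Not the actual course code.

// Least-squares slope: covariance of x and y over variance of x.
function fitSlope(points) {
  var xMean = d3.mean(points, function (d) { return d.x; });
  var yMean = d3.mean(points, function (d) { return d.y; });
  var num = 0, den = 0;
  points.forEach(function (d) {
    num += (d.x - xMean) * (d.y - yMean);
    den += (d.x - xMean) * (d.x - xMean);
  });
  return num / den;
}

// Break any x-y relationship by shuffling the y-values, keeping
// the x-values (and both means) exactly the same.
function permutedSlope(data) {
  var shuffledY = d3.shuffle(data.map(function (d) { return d.y; }));
  var fake = data.map(function (d, i) { return { x: d.x, y: shuffledY[i] }; });
  return fitSlope(fake);
}

// Build an empirical null distribution of slopes, then ask how often
// a random permutation beats the slope we actually observed.
var realSlope = fitSlope(data);
var nullSlopes = d3.range(1000).map(function () { return permutedSlope(data); });
```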
As I said, I've been teaching a class, and I'm far enough along in the semester that I was actually able to get some feedback from the students. I want to tell you a bit about them first. Their reported programming skill is pretty good: they've taken a programming class, and they know how to write code in Python or some JavaScript. They've been exposed to statistics, but a lot fewer of them feel they have a strong background in it. So I'd argue these are a fairly representative set of people who code, maybe took a stats class in college, know something about linear regression, but don't have a very solid feel for it. When I asked them anonymously whether "visualizations helped me understand algorithms and mathematical models," again, this is preliminary data, but they do seem to feel it was helpful. So this is encouraging.

I want to end by talking a bit about the limits of this type of method. I've told you that I want to take little bite-sized chunks of machine learning: things that have clear visual explanations, but that also have fairly easy, concise code attached to them, code that presents the mathematical relationships directly. So what does that leave us? How restrictive is that? I want to suggest that it's increasingly not restrictive; there is a growing body of work that can be presented this way. This is an example I did for a little graduate workshop. It's Hamiltonian Monte Carlo, which I'm sure you all did in high school, right? Okay, no. This is something that took me about 45 minutes to explain to graduate-level computer science students. It's sampling from a Gaussian distribution, so these points should have roughly the same density as those ovals. But it's still clearly visualizable: the idea is that you have a little point that bounces around in a physics-inspired way, and the algorithm either accepts or rejects where it lands, drawing a little point if it accepts. I'm not going to go any further into that, but it's an example, I think, of a non-trivial algorithm that I can present this way.

The other thing that's really important to remember, and this was the discovery that set me on this path, I was kind of shocked to learn it, is that because JavaScript is so important to the browser makers, JavaScript is really fast. I think the thing holding JavaScript back from being a serious scientific programming tool is really libraries, and that's something we can fix. So, getting back to the idea of teaching machine learning: we have really good capabilities, a lot of processing power, a lot of fast computation, in a platform, the browser, that's already installed on everyone's machine. Given some thought, and perhaps a sequence of examples that would build up local libraries and patterns and templates, I think we can go very far with this kind of approach. And I'm really optimistic about using this style of teaching, both in the classroom and in an ongoing MOOC setting, for example, or just putting out a lot of gists and letting people look at them. So yeah, I'm very optimistic, and I'd love to hear your responses, and I'll take any questions.
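For anyone curious what the Hamiltonian Monte Carlo demo mentioned above might look like in code, here is a minimal sketch for a standard 2D Gaussian. This is my own reconstruction under those assumptions, not the workshop's code.

```javascript
// Minimal Hamiltonian Monte Carlo for a standard 2D Gaussian:
// potential energy U(q) = 0.5 * (q[0]^2 + q[1]^2), so grad U(q) = q.
// My own reconstruction, not the workshop code.

function gradU(q) { return [q[0], q[1]]; }

function energy(q, p) {
  return 0.5 * (q[0] * q[0] + q[1] * q[1])   // potential
       + 0.5 * (p[0] * p[0] + p[1] * p[1]);  // kinetic
}

// Box-Muller standard normal, to avoid depending on any library.
function randomNormal() {
  var u = 1 - Math.random(), v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// One step: give the point a random momentum, let it glide under
// physics (leapfrog integration), then accept or reject the landing spot.
function hmcStep(q, stepSize, nSteps) {
  var p = [randomNormal(), randomNormal()];
  var startEnergy = energy(q, p);
  var newQ = q.slice(), newP = p.slice();
  for (var i = 0; i < nSteps; i++) {
    var g = gradU(newQ);                      // half step for momentum
    newP = [newP[0] - 0.5 * stepSize * g[0], newP[1] - 0.5 * stepSize * g[1]];
    newQ = [newQ[0] + stepSize * newP[0],     // full step for position
            newQ[1] + stepSize * newP[1]];
    g = gradU(newQ);                          // half step for momentum
    newP = [newP[0] - 0.5 * stepSize * g[0], newP[1] - 0.5 * stepSize * g[1]];
  }
  // Metropolis correction: accept with probability exp(-energy change).
  if (Math.random() < Math.exp(startEnergy - energy(newQ, newP))) {
    return newQ; // accepted: this is where we'd draw a little point
  }
  return q;      // rejected: the chain stays put
}
```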