Hi, I'm Julia. I'm from Google, where as of today I work as a software engineer on the open source team. Pretty excited about that. And I am so thankful that I get to talk to all of you: first of all, because I have not been able to participate in a community in a long time, and y'all are awesome, so thank you. But second, I get to talk about whiskey. And I get to talk about machine learning. And since I failed the Turing test, I identify really well with some of the algorithms I'll be talking about.

So, you know, back in the day, I did some stuff. This was a lifetime ago. It wasn't my first or second career; this was my third career. I did machine learning research. I got to play with robots. I set them loose in the lab and pretended that was my social life. And I did what any good theoretical computer scientist would do and built an autonomous aerial robot, otherwise known as a blimp. This was Bubbles. When we figured out what we needed the robot to do, we calculated how much lift we would need. And then we said, well, okay, we've got these constraints; what shape can we fit in them? It turns out you can fit a dodecahedron inside the constraints we had, and clearly that was the most optimal structure. So I made a pretty big dent in the world's helium supply, and I apologize for that. We did get a pity award thrown at us for innovative hardware design, which I think is code for "I'm not really sure how this works." And to be honest, I don't know how it works anymore either.

But that was a lifetime ago, and then I went into a PhD program for machine learning, and it really wasn't for me. And I was pretty sad, because in my naive world view this meant I could no longer do machine learning, because clearly they check whether you received your PhD before letting you run any of the algorithms. I realized, not all that long ago, that that was not, in fact, true. I can actually do machine learning, and I've gotten the chance to in the past few months. Because at its heart, all machine learning is is taking data, running it through an algorithm, answering questions, and getting some sort of insight.

When I first started, I thought, I want to solve the world's problems. I want to help identify disease and make people less sick. And then it boiled down to this: there are not enough hugs in the world. So let's try solving that. I built a system called Can I Hug That, which takes an image and tells you if you should hug it. I polled my colleagues, and they gave me examples of things they would or would not hug. Everybody says you should not hug a cactus. I agree. But it gave me cool things like this. One of the things I trained it on was pillows, so a crocheted version of an octopus was an "oh, yes, please do hug." But the real thing? No, not so much. I fundamentally disagree, but I can't argue with math. It also did things like tell me not to hug barbed wire, and that I should, in fact, hug a croissant, perhaps with my mouth. I appreciate this, because sometimes I need the sanity check.

But what about whiskey? How can I apply this to the other amber liquid that we love, which I don't have with me right now: whiskey, specifically Scotch? Well, I only got into Scotch about three years ago, and that happened to coincide with when I read a simple tutorial on doing machine learning with a whiskey dataset.
And if any of you are machine learning aficionados, you may have seen the one I'm talking about; there's only one of them. But it gave me pointers to some pretty good objective data. And while I'm not great at internalizing matrix operations, I can make computers do them for me, which lets me do some cool things.

So let's talk about the data that I have. Just to set some vocabulary: who is familiar with the term feature vector? I'm going to define it, so don't worry. If you're talking about a 2D space like this, your basic X, Y axes, and you have a point, your feature vector is (X, Y). If you add a dimension, you get (X, Y, Z). That is your feature vector. It basically just describes a data point. If you add in some more, say a bunch, it gets a little more complicated to visualize, because our brains fundamentally don't want to visualize in more than three dimensions. At one point I was dreaming in ten dimensions, and I don't really know what was going on, but it was not a pleasant time. But it gives us the ability to represent all the attributes of a piece of data in one succinct line.

So what data do I have? Well, take my first, my gateway Scotch. It is not, as I said for way too long, a Highland; it is a Speyside. And this is some of the information we have about it: how robust is the body, what kind of notes does it have. Some of the other attributes are smoky, medicinal, things like that. The two that did not come in the dataset were the region and the latitude and longitude. The region I painstakingly compiled from Wikipedia, and the latitude and longitude actually came in a different coordinate system. That's something I learned: the world does not necessarily operate on latitude and longitude alone; there's another system out there.

If we distill this particular one down into a feature vector, it looks something like this. We get the notes and the latitude and longitude; some of them are strings, some of them are numbers, ints or floats. And it goes to something like this: we assign numbers to the strings so we can differentiate between them, and we condense it all into something like this. So this is the Balvenie Scotch. Folks who have done some matrix operations in the past will note that this is in fact a fully fledged matrix, right? We can use traditional matrix math on it.
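As a rough sketch of that encoding step, here's what it might look like in Python. The column names, the region mapping, and the values are my illustrative stand-ins, not the talk's exact dataset:

```python
# A sketch of turning one scotch's record into a feature vector.
# Column names, the region mapping, and all values here are illustrative
# stand-ins, not the talk's exact dataset.
REGIONS = {"Highland": 0, "Speyside": 1, "Lowland": 2,
           "Islay": 3, "Island": 4, "Campbeltown": 5}

def to_feature_vector(scotch):
    """Flatten a record into one succinct line of numbers."""
    return [
        scotch["body"],             # tasting notes are already small ints
        scotch["sweetness"],
        scotch["smoky"],
        scotch["medicinal"],
        REGIONS[scotch["region"]],  # strings get mapped to numbers
        scotch["latitude"],         # floats stay floats
        scotch["longitude"],
    ]

balvenie = {"body": 3, "sweetness": 3, "smoky": 1, "medicinal": 0,
            "region": "Speyside", "latitude": 57.45, "longitude": -3.13}
print(to_feature_vector(balvenie))  # -> [3, 3, 1, 0, 1, 57.45, -3.13]
```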
Who's heard of TensorFlow? That's actually more than I was expecting. I've described TensorFlow as being mostly for folks doing deep learning research, and deep learning research is an area of machine learning that attempts to solve problems we always thought you needed human intelligence to solve. If you followed the AlphaGo competition, that's an example of deep learning. Go was held up as this bastion of "there's no way a machine will ever be able to do this," and I loved seeing the commentary, because that very much held true in the post-game analysis. But at its heart, TensorFlow is just for numerical computation. It does matrix math fast and pretty simply. It operates with a concept called deferred execution, which lets you define what you're doing without computing any of it; then, once you've basically coded up your graph and defined what your task looks like, you can start running it. I'll show you an example.

So this is my biggest fear right now, which is where I have to confess to you that TensorFlow does not have a Ruby interface. I'm so sorry. I know. But there is a call for one. I don't know if you know this, but we folks at Google don't necessarily have many Ruby experts among us, Aja notwithstanding. And, you know, if you would like to contribute, it is open source, so we welcome input.

So what does TensorFlow code look like? Well, it's got Python, C, and C++ interfaces. I'll use the Python one, because pointers. You can use it in an interactive form, which is what we'll do right now: we'll just kick off a session, so we don't have to mess with state. And we'll define one simple matrix. This is our whiskey_is_fun, and it is a matrix that has three rows and one column. And if we say, okay, beer is okay, we define beer_is_ok with one row and three columns. Which means that we can, in fact, multiply them. However, if we go ahead and print out matrices_omg, what we'll get is what's called a Tensor object, which just represents something we can use in TensorFlow. It does not have a number assigned to it, because we haven't actually evaluated it. All we've done is say what it will have, not what it does have. To actually evaluate it, we have to call an eval function, and this is when we'll get the results of our computation. This was a really big shift for me personally; it took me a while to really get used to something like this. Somewhere around 3 a.m. on Tuesday it started clicking, so it took me a while. And of course, let's close the session.
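Reconstructed as code, that demo looks roughly like this. It uses the TensorFlow 1.x-era Python API, the vintage the talk used (in today's TensorFlow 2 you would reach these calls through tf.compat.v1); the variable names come from the demo, but the matrix values are my placeholders:

```python
import tensorflow as tf  # 1.x-era API; under TF 2, use tf.compat.v1

sess = tf.InteractiveSession()  # interactive session: no state to juggle

whiskey_is_fun = tf.constant([[1.0], [2.0], [3.0]])  # 3 rows, 1 column
beer_is_ok = tf.constant([[4.0, 5.0, 6.0]])          # 1 row, 3 columns

# Legal multiplication: (3x1) times (1x3) gives a 3x3 result.
matrices_omg = tf.matmul(whiskey_is_fun, beer_is_ok)

print(matrices_omg)         # just a Tensor object: no numbers yet,
                            # only a description of what it *will* hold
print(matrices_omg.eval())  # eval() runs the graph; now we get the 3x3 matrix

sess.close()
```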
So what do we have? Well, we have some whiskies. Oh, no. Oh, my God. That did not go well. So let's go back. So we have some whiskies, and I have mapped out the coordinates here on a nice little Google Maps interface. If we hover over them, they're color coded by region. So this is a Lowland. Let's find one that I can pronounce. Let's see. Maybe not. Speyside. I can say Speyside. There is in fact a Speyside distillery, and guess what? It's in Speyside. So we can see roughly where each of the regions sits on the map. If you know your whiskies, if you know your Scotches, you're going to have some questions about this map, as you should. We'll get to those later.

The first thing I wanted to do was see what sorts of groupings I could find in this data, and to do that, we go to an algorithm called K-means. K-means has a couple of components. The first is the letter K. The other is the word means. K is the number of groupings that you want out of your data. It's as simple as that. So if we had data points in 2D, which this is (I believe it's one of the famous machine learning datasets, the Iris dataset), and we ran K-means over it with K set to 4, we'd get something like this. That's really intuitive to us looking at it visually, because, hey, I can draw circles around them. That's pretty easy. But if you're talking about 10 dimensions, that gets a lot harder, right? The algorithm itself is pretty simple. You pick your number K (that's really weird to say, by the way). Then, as long as you don't yet have a good grouping and you haven't exceeded how long you want the algorithm to run, you assign each data point to the closest cluster center. Once you've done that for every data point, you update the coordinates of each cluster center to be the mean of the points assigned to it. That will eventually iterate you to a roughly good grouping. But it has a big weakness. Can anybody identify it?

That's actually one of the first big weaknesses: choosing K. You have to know what K is, or else you're going to get really bad results. Anybody else? Yeah. That's a huge one: picking your initial cluster centers. If you get a bad first draw, you're really going to have a bad time, because you're not going to get good information out of this at all. It also means your results can vary pretty wildly depending on your initial state. There's good research on choosing a good first set. I didn't mess with that this time, because it was hard and kind of inexplicable for a conference talk.

So let's go to a container that we have. It's just a Docker container that has TensorFlow installed. And while you can absolutely install and compile TensorFlow on your own, you'll probably run into the same problem I always do, which is that you have to deal with Fortran. Every two years I have to deal with Fortran, and every two years I ask why I still have to deal with Fortran. Luckily Docker has helped us out with that. So what I'm going to do is run simple K-means, no fancy bells or whistles. I'm going to take a dataset that has the region (whether it's Speyside, Highlands, Lowlands, Islands, et cetera) and the flavor profile, those flavor notes. We're going to group it into four clusters, and we're going to let it go for 10 steps. That actually goes more quickly than I would like, so let's make it take longer. Okay. What it's outputting is basically the cluster assignments: we get the results, the assignments, a zero to three for each entry in this list, and I'm just zipping that up with the distillery names. If we scroll up, we can actually print out the centroids: where it thinks the epitome of each cluster sits in flavor space.

If we map those (this is a previous run, so let's just refresh), you can see that what we have is a little different from the original map. First of all, we had six regions, but we only have four clusters here. And if we zoom in, we can see that this Scotch is supposed to be similar to this one, the Arran Scotch, but they're pretty far apart geographically, so it's clearly not determining the cluster centers based on geography. That makes sense, because we're not passing it the geography. You would think, though, that Scotches in similar regions would have somewhat similar flavor profiles, but that doesn't prove to be true in this particular grouping.

Okay. I would love to show you the code for the K-means algorithm, and I link to an example at the end of the slides. It turns out it's a little too complex to show in a 35-minute slot, but there is a really good explanation out there.
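That said, a bare-bones version of the algorithm itself does fit in a few lines. This is my own minimal NumPy sketch of plain K-means, not the example linked from the slides, and it uses exactly the naive random initialization called out above as a weakness:

```python
import numpy as np

def kmeans(points, k, steps=10, seed=0):
    """Plain k-means: assign each point to its nearest center, then move
    each center to the mean of its assigned points, and repeat."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Naive initialization: pick k random points as the starting centers.
    # A bad draw here is exactly the weakness discussed above.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(steps):
        # Distance from every point to every center; the nearest center wins.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        for c in range(k):
            members = points[assignments == c]
            if len(members):  # an empty cluster keeps its old center
                centers[c] = members.mean(axis=0)
    return centers, assignments

# e.g. 86 scotches with 12 flavor dimensions, grouped into 4 clusters
# (random stand-in data here, not the real flavor notes):
fake_data = np.random.default_rng(1).random((86, 12))
centers, groups = kmeans(fake_data, k=4, steps=10)
```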
So who's experimented with neural networks? We saw an example of a diagram in Aja's talk yesterday. We're going to use a simple feed-forward neural network to try to do the same thing we did with K-means: pick four groups of Scotches that roughly go together. So what is a neural network, first of all? It's basically a graph that takes input and attempts to give you output. In this example, we're looking at a movie review and trying to determine if it was a good review or a bad review.

If we look at the text of the review itself, we would think it's actually a good review, but the neural network can't get there as it stands, because all we're doing is passing numbers along the arrows to the output node, and this is too complex a task for a network as simple as this one. It will be very confused about what the output should be. This kind of limitation actually made the whole research community grind to a halt back in the seventies, because they thought a neural network couldn't solve XOR. And apparently it can, if you add in a hidden layer like this. So what we do is pass the words to each input node. Those get transmitted to and evaluated by our hidden layer, and you can think of each of the lines as transmitting numbers along. What the hidden layer is really doing is trying to break down, at a very high level, what makes a movie review good or bad. Words like "great" or "good" or "amazing" or "excellent" would, with enough training, eventually get to a state of good. Words like "horrible" and "awful" and "I wanted to die after this movie" would get transmitted to bad.

The algorithm for a simple feed-forward neural network goes roughly like this. We initialize each of the layers to some set of numbers, and there are good papers out there on how to initialize them. Then we decide how we're going to optimize; for ours, we're just using simple gradient descent optimization. And then we train. We split the data we have into training and testing: for the whiskey dataset, 76 of our whiskey instances go to training and 10 go to evaluation.

So what does this do? We run our ANN, which is an artificial neural network. We give it the training data, the file that has 76 of the Scotches, and the testing data, which has only 10. We give it a certain number of rounds of training to go through, about 12 nodes in that hidden layer, and tell it where to put the output. Hopefully this isn't going to embarrass me too much, because it has in the past. Oh, awesome. Great. It did. Lovely. Line 193, of course. It's always whitespace, right? At least in Python. So it gives a similar output to the K-means algorithm, but if we scroll up enough, we'll see that we got really terrible accuracy at actually grouping things the way we want. I'm going to talk about that in just a minute. But first, I'm going to violate the cardinal rule of machine learning and see if we just get a better result if I run it again. And now we're on par with a coin flip. Yes. This is great. Oh, my gosh, I'm teaching everybody bad habits right now. And we go back down. This one has actually turned out to be a pretty decent run; the accuracy came out at about 0.7. And if you notice (I'm going to flip back and forth between the region map and the single malts), I believe one of these clusters is Highlands and one is Speysides, either the purple or the green, and it's done a pretty good job of differentiating between those. And for really, really good reasons. Really good reasons, I promise.
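For reference, a minimal feed-forward network along those lines (one hidden layer of 12 nodes, gradient descent, a 76/10 train/test split) might look like this in the same 1.x-era API. This is my sketch with random stand-in data, not the talk's actual script or the real Scotch files:

```python
import numpy as np
import tensorflow as tf  # 1.x-era API again

n_features, n_hidden, n_classes = 12, 12, 4  # 12 hidden nodes, 4 groupings

# Random stand-in data shaped like the talk's split: 76 training, 10 testing.
rng = np.random.RandomState(0)
train_X = rng.rand(76, n_features).astype(np.float32)
train_y = rng.randint(0, n_classes, 76).astype(np.int64)
test_X = rng.rand(10, n_features).astype(np.float32)
test_y = rng.randint(0, n_classes, 10).astype(np.int64)

x = tf.placeholder(tf.float32, [None, n_features])
y = tf.placeholder(tf.int64, [None])

# The hidden layer: the part that decomposes the input.
w1 = tf.Variable(tf.random_normal([n_features, n_hidden], stddev=0.1))
b1 = tf.Variable(tf.zeros([n_hidden]))
hidden = tf.nn.relu(tf.matmul(x, w1) + b1)

# The output layer: one score per cluster.
w2 = tf.Variable(tf.random_normal([n_hidden, n_classes], stddev=0.1))
b2 = tf.Variable(tf.zeros([n_classes]))
logits = tf.matmul(hidden, w2) + b2

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
correct = tf.equal(tf.argmax(logits, 1), y)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):  # rounds of training
        sess.run(train_step, {x: train_X, y: train_y})
    print(sess.run(accuracy, {x: test_X, y: test_y}))  # held-out accuracy
```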
So I had very different goals for this talk than what actually happened, and I learned some pretty valuable lessons. What I wanted to do was take the coordinates of all the distilleries and, given a new set of coordinates, have it tell us what the flavor profile of our Scotch would be. So if I really wanted a smoky Scotch with a hint of medicinal, it would tell me whether I was in a good place for it. That didn't happen.

Well, first of all, I've been pronouncing everything I've ever said relating to Scotch incorrectly. And this is especially embarrassing for me, because until I was about, well, this tall (I was a short child), I actually had a Scottish accent. My diction teacher would be absolutely horrified to hear this. But the real problem, not to shift blame here, was the data. We had about 86 data points for our entire dataset. And yes, they had a decent number of dimensions; it wasn't tiny, it wasn't just X and Y. But there wasn't enough data. With a machine learning system, the more examples you can throw at it to say "this input gives me this output," the better. If you're training an image classifier like Can I Hug That, even a very basic classifier is going to take a lot of images. Retraining one that had already been trained took 320 images, and if you think of each pixel as a data point, that's a lot of data. When I was putting this together and asked, okay, do you think it's reasonable for me to do this, one of my colleagues said, well, that's going to be really hard. I said, K-means is pretty easy, and this is about the most basic neural network there is. He said, no, the data is actually going to be the hard part. And I didn't believe him. He was completely right.

So when you saw the map of the different regions, who here saw a problem with it? I can tell, okay. It wasn't really clustered, and there's a good reason for that: the coordinates were wrong. Until about Tuesday, the map actually put one of my nice single malt Scotches in Germany. And I thought, data is never wrong; this has to be right; there must just be this outpost. Yeah, not so much. My little script to clean up from a coordinate system I didn't understand, and clearly still don't, had some trouble along the way. So when it put one of our smoky, peaty Scotches in the middle of Scotland, guess what? It doesn't actually belong there. This is still rough.
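The talk doesn't name the coordinate system, but UK geographic datasets commonly use the Ordnance Survey grid (eastings and northings) rather than latitude and longitude, so a cleanup script along those lines might look like this with pyproj. Treat the system, the library choice, and the values here as my assumptions:

```python
# A sketch of that kind of cleanup, assuming the source coordinates are
# Ordnance Survey eastings/northings (EPSG:27700); the talk doesn't say
# which system it actually was. Requires: pip install pyproj
from pyproj import Transformer

# always_xy=True keeps the argument order (easting, northing) -> (lon, lat)
to_wgs84 = Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)

easting, northing = 332900, 841900  # made-up values, roughly Speyside-ish
lon, lat = to_wgs84.transform(easting, northing)
print(lat, lon)  # sanity check: this should land in Scotland, not Germany
```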
When I was doing research, I had this whole system down pat for visualizing and transforming and thinking logically about my data. And that is basically a muscle: if you haven't exercised it, it's really hard. That said, all it takes is practice. And if you're where I was, and still am to a certain degree, saying, well, I don't know if I can ever get that knack, it takes a willingness to fail and ask questions and get better. I think we've heard sentiments like that echoed throughout both days of this conference, and it was nice to see some solidarity out there for mistakes. I really needed more data. I know a lot of people who really enjoy their whiskey, and the data is not that complex; what I should have done is start from scratch, compile my own dataset, and run it through my algorithms. I could easily have gotten at least 200 data points, which is over double what I actually had. In hindsight, I should have drawn on the people who were so willing to sacrifice their livers for me and just held a party. So clearly I just need to have more fun.

If you're interested, here are some resources on machine learning, TensorFlow, and the algorithms I talked about, including a really awesome gentle intro to ML that I started rereading about six months ago. It's great, a good high-level intro to some of the concepts, and if you're interested in getting started, definitely go through it. I described TensorFlow earlier as primarily for people doing research, but when people started publishing what they were doing with it, I was really inspired, because they were tackling problems I never thought existed. There was a great example of taking a sketch you drew and saying, okay, I want this sketch to be in the style of Monet, and running it through a neural network to apply that style to it. So clearly I don't have to suffer from my own artistic ineptitude anymore.

So, machine learning and whiskey: the takeaway is more whiskey. That's really all there is, more whiskey. To fill the requirement of the theme today, I always have a thank-you octopus up there, and now apparently he's going to be armed, so I'm going to back away. But if there are questions, I'm happy to answer.

Well, so that's a really good question, right? The question is: if you took the age of the Scotch into consideration, how would it change the data model, the clustering algorithms, et cetera? And that's super interesting, because what you would wind up with is just more pieces of data for each Scotch. You'd have another comma, and you'd have increased dimensionality for that particular set of Scotches. That would actually be a really good way to easily increase the size of the dataset, plus have an excuse to spend multiple hundreds of dollars on Scotch. So, unless anybody wants to volunteer...

There's a question up there? Right, so the question is: if we handed a human the Scotches, they'd be able to group them pretty easily. What's the difference between our capabilities and what the neural network can or can't do? Neural networks were biologically inspired, to an extent, but there's a lot of difference. When we are tasting something, we're not just tasting that thing in isolation. We're drawing upon years and years, decades upon decades, of experience. Maybe not tasting Scotch, but tasting things, good or bad. And that makes us more attuned and able to reason intelligently about what we're tasting. Think about the neural network I created and destroyed in a matter of ten seconds: that's a far cry from 30 years. It hasn't had the exposure and the training that we have. It has had very specific training, but it hasn't gained the ability to reason as well as a human has. Which is part of the reason deep learning is such a big area of research.

There's a question there? Yeah. So the question is: none of these datasets were streaming, they're all static; how does that change things? There are classes of algorithms known as online algorithms that take new data as it comes in and keep retraining and training and training. And again, the more data you have coming into the system, the better it will be, to an extent (some exclusions apply; see footnotes). But it does slightly change how the math takes place, and your logic.
But yeah, most of the time the industry seems to switch back and forth between batch training, which is what we did, and online training pretty regularly, at least in the time I've been keeping up with the research.

That's a good question. So, to paraphrase: how many nodes in the hidden layer is the right number, and what do they represent? To really dig into that, you'd have to go into the math behind it. There are good guides for choosing the number of nodes in your hidden layer. I did not follow any of them, so take that for what you will. One way to think about it is that the hidden layer basically decomposes your input data and tries to reason about it, like with the movie example; it will try to learn different aspects of your data. We can try to think logically about what those aspects would be for the Scotch dataset (maybe flavor, body, and, if I had the coordinates in there, the coordinates too), but that's me kind of guessing. I would have to dig into the math behind it a little more.

We're a little bit over time today, and we want to try to gain back some time, so I'm sure a lot of you have questions, because this is fascinating: please catch up with Julia in the hallway. Thank you.