[Pre-talk setup: the speakers sort out projector and network trouble, hand around the workshop's roughly two-gigabyte Docker images, and wait for the room to settle.]

All right. While we work on the projector: if you can get on the Wi-Fi, go to github.com/radanalyticsio/workshop, and I'll write that out here. If the wireless is working for you, there's a readme there, and you can download some Docker images that we're going to use for the hands-on portion of this workshop later.

I'll tell you a little about what we're going to do today first, without slides. My name is Will Benton, and this is Mike McCune. We're both engineers at Red Hat working on bringing data processing to OpenShift. If you want to build a data-driven application on OpenShift, that's what we're focusing on: figuring out the best way to do that and how it makes sense.

Today we're going to start by introducing the kinds of applications you might want to develop that have a data-processing component. We'll introduce Apache Spark, which is the technology we're using to build a lot of our applications internally. (Excellent, thanks, Steve; we're flying without slides for now.) We'll then give you enough data science to be dangerous, and go on to talk about how to actually build an application in OpenShift.

Can everyone get to the GitHub site? Yes? And are you able to pull down those images? Sort of. Mike and I will try to get a USB key around for people if the images don't download. For the hands-on portion, we have a notebook where you can work through some exercises with Spark, and another notebook with a prototype of the application we're going to build; then we'll actually build an application in OpenShift that uses Spark to do data processing. And if you don't have a machine that can run Docker images (maybe you just have a phone or a tablet, or you have an ARM laptop and can't run x86 Docker images), you can still follow along with the notebooks on GitHub: GitHub renders the notebook files, so you can see what they look like and try them out interactively later when you get a chance.

I'm going to switch to VGA and see if that solves our problems here.
And if it doesn't, hopefully someone will be able to help us with the projector pretty soon. If not, you may get to see how well I do this part of the talk with the whiteboard.

Just so I can get a sense of where people are coming from: how many people here are familiar with Apache Spark, or have used Spark? How many people know what Spark is? How many have used Spark? How many are using Spark in production to get work done? Awesome. I may adjust things based on those answers.

Let's see if this VGA works any better. There we go. I was hoping it would be 16:9, but it's even thinner than we'd expect, so that QR code may or may not work now that it's squashed like this. I don't know if there's a way to change the aspect ratio here or not, but this seems to be working for now, so we'll cross our fingers; if the code doesn't work, it just directs you to the readme, and I think everyone is already there.

(An attendee reports that the docker pull command in the readme fails.) Excellent, excellent. You've found the first bug in this participation exercise: when you do the docker pull, you'll need to specify :notebook, because there's no latest tag for that image. And a pull request! I love this; thank you so much. I will address that pull request as soon as I have Wi-Fi.

Can we change the aspect ratio on this? No? OK, thanks. It wasn't working before, but it's working over VGA, so I apologize for my compressed slides. You may be used to seeing things pillarboxed to 4:3, but we have it the other way around. Imagine that your eyes are anamorphic and it will be okay.

As I mentioned, we're going to start by introducing this concept of insightful apps. We'll do a crash course in Apache Spark and data science, starting with some basic data science concepts and then moving on to a quick introduction to Spark. But the real meat of this workshop is the hands-on section, where you get to play with Spark and try it out. We have a Spark notebook where you can use Python code to play with data from fedmsg. Does anyone here know what fedmsg is? Awesome. fedmsg is the unified message bus that all of Fedora's infrastructure runs on: anything that happens in Fedora turns into a message on that bus, and there's a small chunk of that data in the Docker image for you to play with. You'll actually get to train a machine learning model on it. Then we'll go on and build a different application. So we have a lot of hands-on material, and the cool part is that it's all in notebooks, so you can experiment with it and try different things.

We've already gone through the part about pulling the images; the details are there, and there's a pull request now. If anyone wants to review it and give it a +1, I'll merge it as soon as I can.

But let's start by talking about this category of insightful applications. Do we just mean applications where you trade data about yourself for something free? No, not necessarily. These are applications that learn from how they're used and from the data that people provide them: applications that get better the longer you use them.
And if you think about any application you pay money for on a regular basis, or any application you love using, it's probably one of these applications, where machine learning, or data generally, is being used to give users a better experience or to create value somehow from the data being collected.

Now, to build an application like this, you need an analytics component: something that's actually doing the data processing. In the old days, we had relational databases, and because you couldn't have one relational database that was good at everything, you'd have one database doing your transaction processing (really high concurrency, fast writes), and you'd periodically copy it over to another relational database that was optimized for fast queries but not for updates. If we look at this picture, we have compute-focused things in orange, data sources or streams in teal, and UI components in green. By having two databases like this, you could support business reporting: "what should we be investing in next quarter?" and other high-latency but interesting queries.

Now, this didn't scale out very well. Those of you who've tried to scale a relational database know it's an active area where people are trying to make it easier. So in the last decade or so, there have been approaches to taking this basic architecture and modernizing it. One of them, from the Hadoop project, is the so-called data lake architecture, where you scale out your storage and have compute jobs that migrate to where the data are: scale-out storage, with scale-out compute on top of it. Another approach, with both batch-processing and stream-processing components, is the so-called lambda architecture; people have heard of this? The idea is that you have precise batch analyses and imprecise streaming analyses that operate on some window over your latest data, and you federate the two, striking a balance between "this is your latest information" and "these are actually accurate results," and present that to the user.

There are various shortcomings to all of these architectures, and the biggest one is fundamental: they treat analytics and machine learning as something you do on the side. It's a workload; it's something you plan for and run separately, like a company doing payroll, something you do and then it's done. With the kinds of applications we care about, analytics is really just part of the application. We're dealing with a bunch of different data sources, we're federating them, we're training models on them, and we're using that to support a wide range of applications and user interfaces. Analytics is not something we run on the side anymore; it's just part of our applications. And the punchline is that OpenShift is really great for this, because you can manage your analytics and your application with the same infrastructure. Making sense so far? Cool.

So, how many people have some machine learning or data science experience? OK, so some of this may be review for those of you who've done this stuff before.
The idea is not that you'll leave here an expert data scientist. The idea is that you'll leave inspired, knowing where to go to learn more, with just enough knowledge and technique to be dangerous.

So what are the basic concepts of learning from data? I want to start by contrasting writing a conventional program to make decisions with training a machine learning algorithm to make decisions. I really like bicycling, so let's say I wanted to write a program to classify different types of bicycle. We take a bike and check whether its handlebars are drop or flat. If they're flat, maybe it's one of those fat-tired winter bikes, maybe it's a mountain bike, maybe it's one of those Dutch city bikes that people ride in a very upright position. If it has drop handlebars and small tires, less than 27 millimeters, maybe it's a bike people would race on the road; if it has knobby tires, maybe it's a cyclocross bike; and otherwise, maybe it's some different kind of bike. And if we get to the bottom and haven't figured out what we have, we have to throw up our hands and say, "Sorry, I didn't account for that possibility." This is not a very exciting program, but we've all written code that looks like this at some point, even if we're maybe not willing to admit it in a workshop room.

On the machine learning side, we proceed a different way. We give a bunch of labeled examples to an algorithm, which figures out how to distinguish between the different things and comes up with rules, a model, for identifying kinds of bikes based on their characteristics. The cool thing is that this technique is pretty resilient to things that aren't like anything we've seen before. If you see something that looks sort of like a road bike and sort of like a spaceship, the model can eventually say: well, it's a little like a road bike, but it's an outlier; it doesn't really fit. And once you label that thing, you don't need to add a bunch of extra clauses to a messy series of if statements. You just learn from a new example and identify that you have something else here. Right?

The first step of writing a program that works this way, and really, I think, the most important step, is feature engineering. That's the process of going from something in the real world, or an object in our program, to something a machine learning algorithm can operate on. And what machine learning algorithms really want to operate on are vectors of numbers, optionally with labels. So here we have a vector labeled "mountain bike" that has six elements in it. We have the label; then we have the handlebar type (zero for the drop-bar element because it's not a drop bar, one for the flat-bar element because it is a flat bar); we have the tire size in millimeters; we have whether or not the tires have knobs on them; and we have whether there's front or rear suspension. So we have an example of a mountain bike, and we could look at a different kind of bicycle and see a different vector corresponding to that kind of bicycle.
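To make that contrast concrete, here's a minimal sketch. The branch thresholds, field names, and encodings are illustrative assumptions, not the workshop's actual code:

    # Hand-written decision procedure: every unanticipated bike falls through.
    def classify(bike):
        if bike["bars"] == "flat":
            if bike["tire_mm"] >= 60:   return "fat bike"
            if bike["suspension"]:      return "mountain bike"
            return "city bike"
        if bike["bars"] == "drop":
            if bike["tire_mm"] < 27:    return "road bike"
            if bike["knobby"]:          return "cyclocross bike"
        return "sorry, I didn't account for that possibility"

    # The learning approach starts from labeled feature vectors instead;
    # handling a new kind of bike means adding an example, not another branch.
    # elements: drop?, flat?, tire mm, knobby?, front susp., rear susp.
    examples = [
        ("mountain bike", [0, 1, 56, 1, 1, 1]),
        ("road bike",     [1, 0, 25, 0, 0, 0]),
    ]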
There are some techniques here that are really important. Some of the things you want to encode are just numbers, and you can put them straight into these vectors; that's fine. But it's not always the case, and I want to go over some techniques for when it's not.

The first thing I want to call out is so-called one-hot encoding, which you use for categorical features where you know all the possible values. Maybe you have two types of handlebars you want to consider: drop bars and flat bars. We take those two possible values and convert them into a two-bit vector. This element is true if it's a drop bar, and this one is true if it's a flat bar, and only one of them is set, because you have one or the other. You could extend this to colors: if you had red, green, and blue, you'd have a three-element vector, and it works the same way. It's called one-hot encoding because only one of the bits is set, and it's a really nice way to encode features that have a bunch of possible values with no natural numerical interpretation.

The next thing to think about is value scaling. If we look at this vector, which do we think is the most important number, without knowing anything about what the numbers mean? Tire size. Very good. Here's a coffee; you drink coffee? Excellent. (If you don't drink coffee, you can still answer questions.) So yes: if we just look at this as a vector, tire size is what an algorithm will identify as most important, simply because it's the biggest number. So maybe we want to transform it. We're not going to see arbitrary real numbers for tire sizes: the smallest tire we're likely to deal with is maybe 17 or 19 millimeters wide, and the widest is probably 130. So we can scale those values to between 0 and 1, mapping from that wider range. Here we've just scaled linearly, but depending on what you're trying to do, you might instead want to scale so the data are normally distributed, falling on a bell curve.

Other useful techniques in feature engineering are approximations (this aspect ratio is just killing my photo here). Maybe you have a high-resolution value and you want to quantize it, so you're dealing with a lower-resolution one. Maybe you have a continuous value, and instead of looking at the real value or its distribution, you build a histogram and say "this falls in this bucket" or "it's in this range"; you discretize it some other way. This is super important for looking at real values and making yes-or-no decisions about them.

Another technique that's really important, and I think cool because it's easy to understand and works well, is so-called feature hashing. Say you wanted a feature vector for a document, and you wanted to use something like one-hot encoding, where for every word you could possibly encounter there's an element that's set if that word appears in the document. I took the first five and the last five words from the words file on my machine, which has about 260,000 entries in it. That's going to be a pretty large vector, it's going to be mostly zeros, and it's going to be a pain to deal with, for you and for the algorithm. And it's not necessary, because most documents aren't going to contain most of those words, right?
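Before we get to the hashing trick, here's a minimal sketch of those first two encodings. The domain and range (drop/flat bars, 17 to 130 mm tires) are the illustrative values from above:

    HANDLEBARS = ["drop", "flat"]        # the finite domain of handlebar types
    TIRE_MIN, TIRE_MAX = 17.0, 130.0     # plausible extremes of tire width, in mm

    def one_hot(value, domain):
        # exactly one element is "hot" for any legal value
        return [1.0 if value == v else 0.0 for v in domain]

    def min_max(x, lo, hi):
        # linearly rescale a number into [0, 1]
        return (x - lo) / (hi - lo)

    # flat bars with 56 mm tires
    print(one_hot("flat", HANDLEBARS) + [min_max(56, TIRE_MIN, TIRE_MAX)])
    # [0.0, 1.0, 0.345...]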
So here's a really cool technique that works well to narrow this down. Basically, we take two hash functions; there's Python code to do this, and it's very simple. One of the two hash codes decides which bucket we put a value in, and the other decides whether we add to or subtract from the value in that bucket. Presuming we have a good hash function, we're randomly spread out across all the buckets, and randomly spread between adding and subtracting. So the idea is that, over the whole vector, if we add a bunch of hashed things together, we expect their average to be zero, which gives us a nice scaling property and makes this work better with a lot of algorithms.

We can see how this works with a short sentence, say "the quick brown fox jumps over the lazy dog," and watch things being added to and subtracted from these 128 buckets. That's a short sentence, so you may not find it interesting, but I could do the same thing with a Wikipedia article or a novel and still get a 128-element vector out of it. Even without knowing the performance characteristics of a particular algorithm, you can be pretty sure that dealing with something that has a hundred or a thousand elements is way better than dealing with a 200,000- or 400,000-element thing, right?
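The slide code wasn't captured in this transcript, but the two-hash scheme just described is short. Here's a reconstruction as a sketch; the 128-bucket default and the use of salted md5 digests (so results are deterministic across runs, unlike Python's built-in hash) are assumptions:

    import hashlib

    def stable_hash(token, salt):
        # a deterministic hash; Python's hash() is randomized per process
        return int(hashlib.md5((salt + token).encode()).hexdigest(), 16)

    def hashed_features(tokens, buckets=128):
        vec = [0.0] * buckets
        for tok in tokens:
            index = stable_hash(tok, "bucket") % buckets          # which bucket
            sign = 1.0 if stable_hash(tok, "sign") % 2 else -1.0  # add or subtract
            vec[index] += sign
        return vec

    vec = hashed_features("the quick brown fox jumps over the lazy dog".split())
    print([v for v in vec if v])   # a handful of nonzero buckets out of 128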
OK, any questions about feature engineering?

Audience: If I set up these features, can I modify them later, or is that painful?

That's a great question, and it can be painful. What do you do to make sure it's not painful? Well, how do you write software that's reliable? That's a hard question, and not a trick question. How would I write software I could trust? I'd start with tests, I'd use source control, and I'd have CI, right? You can do really similar things for your machine learning pipeline. Maybe I say, oh, I want to hash this into 256 buckets, and for whatever reason, some freak accident, my hash function has a collision and I need to change it; or maybe I need to change the number of buckets because I just don't have enough to capture the thing well. I can't go from the 256-bucket vector to a larger vector, because there's no way to pull things back out of the hash and turn them into a different space. But what I can do, if I have a disciplined, repeatable way to run this pipeline, is say: rerun this with the same raw data and just use a different technique. It's a lot like infrastructure automation: if you think about running containers, declaring things and setting them up, and replacing the ones that don't work with ones that do, it's really a very similar technique. So having a way to manage that workflow (do you drink coffee?), having a way to manage that workflow is important, and focusing on automation and repeatability is how you solve the problem. Keeping the raw data around is important for that reason, too. Great question. Other questions?

So I want to talk about a few of the different kinds of algorithms. I'm not going to talk about specific algorithms; I want to talk about just a few of the kinds of questions you can ask with machine learning, and we can see what those things look like.

The first thing I want to talk about is classification, and usually we're talking about binary classification, where we divide some set of things into two categories: yes or no, true or false, fraud or not fraud, failing or not failing, or, if you get into medical applications, sickness and health. All kinds of ways to divide things. I'm going to use this plane of shapes as a running example. We have a few different dimensions here. We have a position on a two-dimensional plane, so an x and a y coordinate. We have a color, which is one of those things we'd probably one-hot encode from a variety of colors; or we could think of it as a point in three-dimensional space, red, green, and blue, or hue, saturation, and value, or Lab if you really like color spaces. Any photographers in here?
OK. And then you could encode shape a few different ways: as shapes, a finite domain of things; as a number of sides; as a number of corners; or as some other representation. There are really any number of things you could do, and depending on what kind of decision you're trying to make, the machine learning algorithm can wind up slicing this data up in different ways. That's an important thing to remember.

A natural interpretation is a linear separation: the yeses are the things below some line, and the noes are the things above it, or vice versa. That's one way to divide this data. Maybe our algorithm would instead decide that the most important thing was the color (I'm sorry, I'm calling this blue, but I guess it's teal on the projector), so we separate out the things that are teal. Or finally, maybe we'd look at the shapes and say the things with an even number of corners are the yeses and the things with an odd number of corners are the noes. When you have multi-dimensional data, you can make these decisions on any one of these dimensions, or any combination of them, as long as there's a clear-cut way to divide the data. And in fact there are other tricks for mapping the data into a different space, so there doesn't even have to be a line between the classes: it could be the things inside a radius versus the things outside it.

Another technique is called clustering, where we want to find some number of groups in the data such that all the things in a group are more similar to each other than to the things outside the group. With our shapes, maybe we'd cluster and decide that the most important dimension is actually the shape, and we'd get a clustering that looks like that.

A third kind of algorithm is the recommendation engine. This is a really important class of algorithm for commerce and for a lot of things you do. If you watch movies on Netflix or a service like that, it'll say "you might also want to watch this." Your music player may have a feature that says "make me a playlist of songs that sound great with this song." If you're buying something online, your commerce vendor will say, "Hey, you have those shapes in your cart; you probably want to add these other shapes to your cart." Basically, if you have a lot of data about which things frequently appear together, you can train an algorithm to identify things that go well with some subset of things.

The next thing I want to talk about is outlier detection. There are a few different ways to do this, but a natural way to think about it is: if you have some clusters, or some way of identifying common things, there are some things that are dissimilar to any group of similar things you'd identify; they're dissimilar to the normal thing. In this example we have an outlier. Do you see the outlier?
The green one, yeah. We have one sort of turquoise-or-green (the projector makes it hard to talk about color words) octagon here, and we don't have any other octagons, and we don't have anything else that color. There are ways it's similar to some of the things in this set of objects, but it's really far away from all of them.

This is sort of a weird concept, but I find a really helpful way to think about outlier detection is to think about maps. Say there are a bunch of things on a map in the real world. Here's a map of Brno, and here are three buildings on it. If I'm at the tennis court over there and I ask which of these buildings is closest to me, well, it's this one, the westernmost one, and it's maybe a five-minute walk away. It's not very far; the closest thing to me on that map is pretty close. Now, I could take a map of the whole world and ask which of these buildings is closest to me from anywhere in the world, and there's still going to be an answer. If instead of being at the tennis court I'm in Budapest, now it's the easternmost building, but it's not that much closer to me than the westernmost building or the northernmost building; pretty much all of those buildings are really far away from me. So this is a way to think about outlier detection: there's always something that's closest to any object, but the thing that's closest to you may not be that close. And if the thing closest to you is significantly less close than the thing closest to most other things, you're an outlier. Make sense? Good.

So, speaking of outliers and making sense of data: as you might imagine, when we're constructing these feature vectors, we get vectors with a lot of elements. Hundreds or thousands of elements is totally common; tens of thousands of elements is common. And we want to make sense of these data. For me, it's pretty easy to think about data with two dimensions: a vector with two numbers in it is a point on the plane, and that's easy to think about. Sometimes I'm clumsy, but I still don't have that much of a problem thinking in three dimensions; I may run into things, but a point in space is fine too. You can think of space over time and get to four dimensions. Once you get much beyond three or four dimensions, though, the intuitions we have for what a thing means start to break down; we really can't picture more than a few dimensions. Fortunately, there are disciplined ways to treat similarity and dissimilarity in these higher-dimensional spaces. And the other fortunate aspect is that most real-world data with a lot of dimensions only has a few dimensions that are really meaningful, because if you have a thousand dimensions and every one of them is super important, you have noise, not a real thing. Almost always, relatively few of these dimensions are meaningful. So first I'll talk quickly about how we can make sense of things in high dimensions even if we can't picture them, and then I'll talk about how we take something with a lot of dimensions and reduce it to something with a few, which is easier for us to think about and also faster for the algorithms. So let's think about similarity and distance.
If we think about how similar two points on a plane are, we think of Euclidean distance: how far apart the things are. You know how to calculate Euclidean distance for two dimensions, and you can generalize it to an arbitrary number of dimensions. This may be old news to some of you; I'm not asserting we all know it. Either you know it or you can take it on faith, and that's all I need here. Great. So you can generalize Euclidean distance. Manhattan distance, which in two dimensions you can think of as taking only horizontal or vertical steps on a grid of city blocks, generalizes to arbitrary dimensions as well; most of the spatial distances can be calculated for any number of dimensions. Something that's really useful for sparse vectors, like the kinds we'd get from one-hot encoding or from hashed features, is angular similarity. I have a hard time picturing what a hundred-dimensional angle looks like, but I don't have a hard time picturing a two-dimensional one, and there's a way to measure the similarity between two angles that works in arbitrarily many dimensions too. And if you only have binary vectors, which you do for a surprising number of applications, you can use a set-similarity metric like the Jaccard index. I won't walk through it, but it's pretty easy: it's intersections, unions, and cardinalities, and it basically turns into "how many things do these two have in common, versus how many things could they have in common," and you get a number. So that's pretty straightforward. OK, so there are a lot of distance metrics; obviously we're not covering all of them. This is just to give you the idea: don't panic, there are ways to deal with this stuff.

Now, most real-world data only has a few interesting features. No matter how many dimensions we have in our vector, we probably won't need all of them to make a decision. Even with this trivial bike example, where we have six elements in our vector, we don't really need to worry about a lot of them. We either have drop bars or flat bars, so we don't need two features to represent that; we can just say "yes, it's a drop bar." (Again, this is only for this training set; if we had more bikes, or more kinds of handlebars, it might be different.) Both of these bikes have knobby tires, so that feature doesn't tell us anything for this training set. And front and rear suspension: this mountain bike has all kinds of springs in it, and in these data, if you have a rear suspension you also have a front suspension (you don't see a rear suspension without a front one), so those features together don't tell us any more than just one of them, and we can eliminate one. We're left with just three features that we need to differentiate between these two kinds of bikes. We could do that here because there were only a couple of things to consider, and we can look at them and get an intuitive picture. But there are disciplined ways to do this on real-world data, and I'm going to cover a few of them very quickly, just to give you an idea of what they look like.
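Before the reduction techniques, here's a minimal sketch of the distance and similarity measures just mentioned, in plain Python and written so they work in any number of dimensions:

    import math

    def euclidean(u, v):
        # straight-line distance, generalized to any number of dimensions
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    def manhattan(u, v):
        # only "horizontal and vertical" steps, generalized the same way
        return sum(abs(a - b) for a, b in zip(u, v))

    def cosine_similarity(u, v):
        # angular similarity: 1.0 means "pointing the same way"
        dot = sum(a * b for a, b in zip(u, v))
        mag = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / mag

    def jaccard(s, t):
        # set similarity for binary vectors, expressed here as sets of indices
        return len(s & t) / len(s | t)

    print(euclidean([0, 0, 0], [1, 2, 2]))         # 3.0
    print(jaccard({"quick", "brown"}, {"brown"}))  # 0.5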
The first one is very, very simple. It's called random projection, and it works just like it sounds: you have relatively high-dimensional data, like these ten-dimensional vectors here, and you just multiply the data by a matrix to get it into a smaller dimension. And where did this matrix come from? I generated it randomly. It's just random numbers; we're picking things and combining them randomly. The trick is that the elements in each column add up to one, and if you do that, and don't try to go too small (where "too small" is a concept I'm not going to define any further), then you mostly preserve the distances in the original data. By doing this with these data, we get things that look like this. This technique actually works surprisingly well: you can use it to cut down your data by something like an order of magnitude without really losing any information, although there's no way to recover what the points in the two-dimensional space mean in the high-dimensional space just by looking at them.

Another approach: those of you who've done statistics have heard of principal component analysis. Does that ring a bell for anyone? People have done this before? Great. I'll give the few-second explanation; I just want to make sure I'm not saying things people already know. This is a very old technique, more than a century old now. The idea is that you automatically identify the features that have the highest variance and construct a transformation matrix so that those dominate the low-dimensional representation, while things that are correlated with each other, or that are low-variance, aren't reflected in it. I'm not going to show you what the transformation looks like for this, but I will show you what you get, and this works pretty well, again for reducing by about a factor of ten in dimensions. This is an example of running it on log data from infrastructure logs, about a thousand-dimensional data that I wanted to put into two dimensions, and you get these clusters of similar things. The downside to something like PCA is that the result doesn't really tell you anything about your data, unless you can look at a transformation matrix and say "this has told me a lot about my data," which I can't do. It's a linear approach: you can see there are sort of lines of clusters there, and that's why.

Another approach, and I'm not going to go into a lot of detail about how it works, but if you're interested in probability and you're interested in data science, I'd say Google for t-SNE: this is a very cool technique. Basically it trains a pair of probability distributions (it uses different distributions in the two spaces), and the idea is a conditional probability: if two things are similar in the high-dimensional space, then one distribution gives you a value close to one, and you train the distribution in the low-dimensional space so that things you map down are also similar there. If that doesn't make sense, that's OK; I'll just show you a picture. You can see it gives you an interesting result with the same log data we looked at with PCA, where things almost push each other apart in the low-dimensional space: things that are very similar clump together, and things that aren't push each other far away. Sort of a cool technique. There are a lot of cool demos of this online, and it makes for really nice visualizations, so if you're interested, look for more on it once you're no longer pulling down Docker images and have Wi-Fi again.
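As a concrete illustration of the random projection idea, here's a sketch with NumPy. The column normalization follows the description above; real implementations (for example, scikit-learn's random_projection module) choose the matrix more carefully:

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 10))   # 1,000 points in 10 dimensions

    proj = rng.random(size=(10, 2))
    proj /= proj.sum(axis=0)             # each column's elements add up to one

    low = data @ proj                    # the same 1,000 points in 2 dimensions
    print(low.shape)                     # (1000, 2)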
Another approach to reducing dimensionality involves trees. There's a machine learning algorithm called a decision tree. How many people have played the game Twenty Questions, or a game like it, a guessing game where someone is thinking of something? As a kid you think, "I want to limit the search space." Those are exactly the words you're thinking as a kid, right? But you play this game, and the way to lose is to ask questions that aren't specific; the way to win is to really cut down the number of things you have to consider. The algorithm for training these things works the same way. Typically, with a decision tree, you have labeled data, true-or-false data, and you look at the features you have and figure out the smallest number of questions you need to ask to differentiate between yes and no, and then get that result.

You can also use a similar technique by generating a whole bunch of decision trees, a so-called random forest of randomly generated decision trees, and looking at whether they can distinguish between real data and fake data. The features that wind up being considered across all of these random trees that distinguish real from fake, the things that tell you whether or not the data are legit or randomly generated, turn out to also be the features that are most important for answering other questions about these data. So that's a really cool technique.

The last one I want to talk about, just because it makes for a cool visualization, is the self-organizing map, which we've gotten a lot of mileage out of on my team. Basically, the idea is that you take data with more than two dimensions and you train a grid of two-dimensional cells to recognize and respond to the data, so that things that are close together in the high-dimensional space wind up being close together in the low-dimensional space. In this case we're only going from three dimensions to two, but it's easier to visualize colors than it is to visualize whatever a thousand dimensions is. So we get something that looks like this, and you can actually see what the training looks like in a short video; you can just watch it in action. The projector makes this a much less impressive demo than we might hope, it all looks washed out, but you can see that gradually the corners get more saturated, the area we affect with each training iteration gets smaller and smaller, and we eventually get a map where things that are similar are close together. If this looks like a cool technique and you're willing to deal with code written in Scala, we have a library that does this that you can use with Apache Spark; I have a link to that.

So, any questions about the 10,000-meter overview of data science?
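As a quick illustration of the Twenty Questions analogy above, here's a minimal sketch with scikit-learn (an assumption: it isn't part of the workshop images) that learns a small set of questions from labeled bike vectors and prints them as nested if-statements:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # columns: drop bars?, tire width in mm, knobby tires?, suspension?
    X = [[0, 56, 1, 1],    # mountain bike
         [1, 25, 0, 0],    # road bike
         [1, 33, 1, 0],    # cyclocross bike
         [0, 100, 1, 0]]   # fat bike
    y = ["mountain", "road", "cyclocross", "fat"]

    tree = DecisionTreeClassifier(random_state=0).fit(X, y)

    # the learned "questions"
    print(export_text(tree, feature_names=["drop", "tire_mm", "knobby", "susp"]))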
OK. So before we get into the hands-on section, I want to talk about Spark a little bit. How many people have done parallel programming with something other than Spark, any sort of parallel, distributed programming? What did you use?

Audience: Hive. Hive and Pig. MPI. POSIX threads.

Good examples. One thing a lot of these have in common: Hive and Pig are based on Hadoop MapReduce, this idea that you have composable, commutative, associative binary operations on things that you can execute independently, then shuffle the results around and combine them together. That's an execution model. POSIX threads are based on the idea that people understand how to manage access to shared state: we have multiple threads running in the same memory space, maybe we have some hardware support for synchronization, maybe we're just clever about it, and we control things so that we can mutually exclude threads from executing in the same critical section at the same time. With MPI, the idea is that if you can decompose your program into a bunch of tasks that run mostly independently, but that can also communicate in lockstep at certain points, you can get a result quickly. These things are all execution models (and maybe that's why you suggested them, because I had this on the slide). But how are these things to program? I've written programs in all of them; it's okay.

The cool thing about Spark, from my perspective, is that instead of starting with an execution model (well, maybe they did start with an execution model), they have a programming model on top of it: an abstraction that lends itself to a distributed execution model, but one you might actually choose for writing serial programs too. If you've done functional programming before, it looks a lot like the sort of collections you'd have in a functional language, or even in Python or Java. It's really just a collection, but it has these special properties.

The first property is that these things are partitioned. Spark's fundamental abstraction, the resilient distributed dataset, is a partitioned collection. That means different parts of the collection can live on different machines, and we can operate on them independently.

The second thing is that these collections are immutable. How many people have programmed in a language that encourages immutability, like Clojure or Haskell or Scala? The idea here is that instead of modifying something in place, you construct a new collection. And how many people have programmed in a lazy language, or used laziness? Excellent. With laziness and immutability together, you don't just make a copy; you make a recipe for making a copy. Instead of doing something, I construct a sequence of instructions to do it, and I only do it when I have to. My kids are like this: "clear your dishes after supper," and there's a delay. It's not eager evaluation.

We can see what this looks like in the dependency graph we build up when we do these operations on these collections. Say we have a collection of just one, two, and three. Maybe we want to filter it, keeping only the things that are odd numbers; then we want to multiply everything we have left by three; and then we want to do this flatMap operation, which is an interesting operation that takes each element, returns a list, and concatenates all of those lists together. So you can filter something out by returning an empty list, or you can turn something into a whole bunch of elements by returning a big list. We're not using it for anything that interesting here; we're just turning each element into itself and its successor.
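In PySpark, that recipe looks roughly like this (a sketch, assuming a SparkContext named sc):

    numbers = sc.parallelize([1, 2, 3])              # a partitioned collection
    odds    = numbers.filter(lambda x: x % 2 == 1)   # keep the odd numbers
    tripled = odds.map(lambda x: x * 3)              # multiply by three
    succs   = tripled.flatMap(lambda x: [x, x + 1])  # each element and its successor
    # Nothing has actually run yet: these lines only build up the recipe.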
And at this point in our Spark program, we haven't done anything yet. We've just constructed this recipe, saying: start with this collection I have somewhere, and then do these operations on it to get a value. The easiest kind of thing we can do with it is a collect operation, which just materializes the result as an array in the driver's memory. Now, if we do that again, then because Spark is lazy, it will do all those calculations again. So if we want to do a different thing with an intermediate result, say save it as a text file, Spark lets us say "hey, I'm going to use this intermediate result more than once" and put it in a cache.

We can see what the execution model looks like for this. We have a so-called driver, which is our application program; we have a cluster manager; and we have a number of executors, which are just things that run tasks on our collections, each owning some part of the data we're operating on. The driver ships functions out to each of these executors, they execute them, and then we can optionally tell these executors to cache intermediate results and hold on to them.

Since we have a distributed program, we're unable to avoid communication. But more fundamentally, we can't avoid failures. Once you have more than one machine, you're going to have failures; there's just no way around it. I'm reminded of this every time I send text messages: text messaging is pretty good, but it's not a reliable communication channel. You send a text message and maybe someone gets it a week later. Does this happen to people, or does it only happen in the US? Anyway, any time you have machines that can crash independently, these partitions can go away. But because we have that combination of laziness and immutability, we haven't modified anything in place, and we have a recipe to reconstruct what was lost, so we can have another executor spin up and just pick up where the failed one left off. This looks a lot like microservices (you're ready for the third part of this workshop): these things are basically stateless microservices.

Just to give you a flavor of what this looks like in actual code and not just in pictures: this is Python code for word count, which is sort of the "hello world" of distributed data processing. The equivalent program in Hadoop is about eight pages, plus a bunch of XML, so this is a lot cleaner. Basically, if we step through it: we create an RDD backed by the lines of a text file; we turn the RDD of lines into an RDD of words; we turn the RDD of words into an RDD of word occurrences, key-value pairs saying "I saw this word once"; and then we add all of those together, which involves communicating and shuffling things around so that all of the tuples for a given word are in the same partition, where we can add together all the occurrence counts. Addition is a commutative, associative binary operation, so we can just do that across all the partitions. And because we're lazy, we don't actually compute anything until we need the result externally, which here we do by saving it as a text file.
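The slide itself wasn't captured in this transcript, but the program being described is essentially the standard PySpark word count. A sketch, with hypothetical input and output paths:

    lines  = sc.textFile("input.txt")                  # RDD of lines
    words  = lines.flatMap(lambda line: line.split())  # RDD of words
    pairs  = words.map(lambda word: (word, 1))         # RDD of (word, 1) pairs
    counts = pairs.reduceByKey(lambda a, b: a + b)     # shuffle, then add per word
    counts.saveAsTextFile("counts")                    # the action that runs it all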
When Spark actually runs this job, we see that it can do those first few steps without any communication, so it coalesces them together; you save a little time because you don't have to talk back to the application program. So this RDD is actually not bad to program with, but it's really the assembly language of Spark. If you're writing Spark programs today, you won't be writing a lot of this; you'll be doing higher-level stuff. On top of the RDD and the Spark scheduler, there are libraries to do all kinds of useful things: graph processing; structured queries, with SQL and with a query-like DSL (if you've used LINQ, or pandas data frames, or data frames in R, you'll be familiar with that); a machine learning library; and a library for stream processing as well. And you can schedule these in a lot of different ways: you can use Spark's standalone cluster manager, you can run it on Apache Mesos, and if you have a Hadoop installation you can run it on YARN. We've actually had great luck running self-managed Spark clusters in containers on top of OpenShift, which is what we'll be doing later today, and there's also ongoing work to get Spark running on Kubernetes directly, using Kubernetes as the cluster manager for Spark.

I don't want to go into a lot of detail, because I want us to get to the hands-on part. Does everyone have the images ready to go? Great. The thing you need to know about machine learning here is that there are a lot of algorithms in Spark. It's not every machine learning algorithm, but there are enough that you can get good results if you put in the time on feature engineering, and there's support code for doing a lot of the things we talked about, like one-hot encoding, value scaling, feature hashing, and so on. For all of the techniques we talked about, there's at least one algorithm in Spark.

With streaming data, the way Spark does it is sort of interesting: it uses the same abstraction for batch and streaming, so you can take your code and make only a few changes. Spark takes a stream of data and turns it into a whole bunch of tiny RDDs, you write your program over those tiny RDDs, and it can compose the results together. Sort of a cool approach.
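For example, here's roughly what a streaming version of the word count looks like with Spark Streaming's DStream API. This is a sketch: the one-second batch interval and the socket source are illustrative assumptions.

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 1)    # chop the stream into one-second RDDs
    lines = ssc.socketTextStream("localhost", 9999)

    # the same logic as the batch job, applied to each tiny RDD in turn
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()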
How many people really like programming in SQL, or are using databases? Whoa. How many would rather write SQL than write C code to manipulate gigabytes of data? OK, fair; I'm with you.

So Spark lets you do this too. The basic idea behind why you'd want to is that with the RDD interface, you're basically just passing in a function and saying "run this on a whole bunch of different things." Spark is not going to look at that function and figure out what you meant to do. It's like your compiler, like a C compiler: it will do a lot of optimizations, but it's not going to tell you that your algorithm is wrong. If you use a higher-level interface on top of Spark, Spark can infer more about what you mean, what you want to do, and rearrange your program to make it better, in the same way a database would. People are familiar with the concept of query planning in databases? Your database is not going to do exactly what you tell it to do; it's going to do something that's the same, but faster.

As a simple example of what this winds up looking like: say we have a huge collection and we want to join it with another huge collection, taking all the things that have a key in common, and then we want to filter the result on some really rare condition, so we're maybe going to throw away almost everything we get from the join. That will run; it'll do something. But it might be really slow, because joining two huge things generates a lot of records, and we're processing a lot of records, generating a lot of things we're not going to care about. If instead we filtered first, if we pushed that filter ahead so we do it before the join, then we're joining together far fewer things, and we're doing the expensive operation with a much smaller constant factor. This is the kind of thing query planners will do trivially for you right now, so you can get a lot of benefit from writing your Spark programs this way. And it's pretty good, because Spark runs these things in parallel, which most relational databases don't do a great job of yet.

There are a few different ways to do this with Spark. You can actually put SQL text in and run that, if you really like SQL text. There's the so-called data frame interface, which is untyped: if you try to do something that violates the type system, you'll get an error at runtime. And there's the dataset interface, which is not available in Python but is available in Scala, where the idea is that as many of the operations as possible are checked at compile time. You can't be perfect, because you're converting from untyped external data; there's always a point, when you're dealing with the outside world, where the types break down. But as much checking as possible is really nice if you're the kind of person who likes to lean on type systems wherever you can. I like to say that the query planner takes dumb programs and makes them faster, and the type system takes dumb programs and keeps them from running at all. Not that anyone here makes mistakes; it's just me.
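In the DataFrame API, the join-then-filter example looks roughly like this. A sketch: big, other, key, and rare are hypothetical names.

    # written "badly": join two huge DataFrames, then keep only the rare rows
    result = big.join(other, "key").filter("rare = true")

    # Catalyst's query planner will typically push the filter ahead of the
    # join anyway, producing the same plan as the hand-optimized version:
    result2 = big.filter("rare = true").join(other, "key")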
All right. At this point, let's take a quick five-minute break and make sure everyone is set with the images and can get started running the notebook image. I'm happy to take questions too. Is anyone not set with the images who would like to be? If you're having problems pulling the image from Docker Hub, I've got the images on the thumb drives here, so you can just copy them off. Does anybody else need the images?

OK, so if the docker pull didn't work for you, try it with :notebook on the end; sorry about that. And yes, you do want to pull the notebook image, because the data files are in it. And you'll need to be running Docker; what system are you on?

[The remainder of the recording is inaudible crosstalk during the break.]