Let me say a little bit about what I'm here to do. I'm Trey Causey. You can find me on Twitter there; you can tweet questions at me about anything I'm talking about tonight and I'll respond, or you can email me at trey at dato.com. I work for a company called Dato, which you probably haven't heard of because it's a new name. We used to be known as GraphLab, which you may have heard of. GraphLab started as an academic project about seven years ago, was based out of Carnegie Mellon for quite a few years, and became a company about two years ago; we renamed to Dato in January. The founder is Carlos Guestrin, who, if you travel in the machine learning world, is a well-known maestro of machine learning. We're now based in Seattle, and you can find out more about us there.

Quick show of hands: who here works, broadly defined, in the data science slash machine learning world? OK, another quick show of hands: who would like a quick explanation of what machine learning is? All right, awesome, there are enough hands, so I'm going to give you a quick definition. It's unfortunate that Olivier is actually going after me, because this is going to be embarrassing to do in front of Olivier. Broadly conceived, machine learning is the subfield of computer science where we try to use data to learn things; we basically try to write programs that write programs. We take data we have on some given thing (maybe it's website traffic, people rating movies, people clicking or not clicking on things, genomic data; you can imagine a bunch of different data sets we might be interested in) and we use it to make predictions about data we're going to see in the future. That's a terribly short definition, but that's fundamentally what we're trying to do most of the time in machine learning: make predictions about things. Those predictions might be whether you're going to like a movie, whether you're going to develop a debilitating disease, or whether you're going to click on an ad; it ranges from the banal to the very important. I'm not going to get into the substantive uses of machine learning tonight; I'm going to talk about how you can do big, big machine learning very fast on your own computer. Is that a sufficient explanation for people? Cool. All right. This is a blank slide.

All right. Right now it's undeniable that Python is coming to the forefront as the language people do machine learning in. It used to be that a lot of machine learning was done in academia, caught up in MATLAB and Octave and software like that. Python is becoming the go-to language in the world of machine learning, thanks in large part to a gentleman sitting here. The problem is that machine learning in Python often does not scale to large data, and everybody knows you've got to have big data these days. So this is a problem for a lot of people doing machine learning in Python. One of the problems we're trying to solve is to make it so that you can build machine learning models in Python on your own computer and then take that same model and ship it off to production.
Be that in an app, a web app, or something sitting somewhere making predictions that isn't customer facing, we want the same model that you build on your computer to be the model you put into production. It sounds like it shouldn't be that big a deal, but as it turns out, it actually is kind of a big deal.

The typical Python machine learning workflow has about five steps. The first step: you pull up your favorite Python development environment (maybe an IPython notebook, maybe PyCharm, maybe you're working in Vim) and you build your model. You prototype it against a sample of production data you've pulled down, iterating on it to see how well your model does. You do this for anywhere between hours and months, and you finally get to a model you're pretty happy with. If you're diligent and you're a responsible data scientist and/or machine learning engineer, you're pulling in new data constantly to make sure you're not overfitting to old training data. Overfitting is when your model essentially memorizes the old data, connecting the dots, and then doesn't do well on new data (there's a minimal sketch of that check just below). The second step, for scalable machine learning, is that you pull in the C++ or Java or C# engineers at your company, have a meeting with them, and say: OK, we're going to spend the next days to months implementing this model in production. The third step is spending those days to months implementing that model in production. If you're lucky and some of those engineers have a machine learning background, they know how the algorithms work and they start implementing it with you. Or it might mean you're now giving a crash course in machine learning to people who have no interest in machine learning, and they care about things you don't care about, like corner cases: what happens if the model returns no prediction, or returns the same prediction for everything? These are the kinds of things that keep engineers up at night but don't necessarily keep someone who's trying to build models quickly up at night. The fourth step is softly crying yourself to sleep every night as you realize all of this has introduced new bugs, and then you get to the fifth step, where you return to the first step, and you're screwed. This is the process a lot of enterprise-level machine learning walks through these days, and there's just so much wasted time and effort in it.

These switching costs are real. Every time you switch between environments, you run the risk of introducing new bugs into your code. Say you're switching from Python to C++, or from Python to Java: now you have to remember whether Java deals with doubles the same way Python does, and let's not even talk about Unicode and all the other things that can go wrong as you move between languages. Those are real costs, and there are monetary and time costs on top of them. Handing your work over to someone else is a cost too.
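Here is that overfitting-check sketch: score the model on held-out data it never trained on and compare. This is a minimal illustration, not from the talk's slides, assuming scikit-learn and synthetic data.

    # A minimal sketch of the overfitting check described above: compare the
    # score on the data the model trained on against held-out, unseen data.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    X_old, X_new, y_old, y_new = train_test_split(X, y, random_state=0)

    model = DecisionTreeClassifier().fit(X_old, y_old)
    print(model.score(X_old, y_old))  # near 1.0: the tree memorized the old data
    print(model.score(X_new, y_new))  # noticeably lower on data it never saw

If the second number is much lower than the first, the model has memorized rather than learned.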
Back to those costs: say you built your model, it's in production, it's doing great, but you want to add new features to it. Say you now have new instrumentation in your web logs (oh, I didn't realize that was there, thank you). You have new instrumentation in your web logs and you want to start incorporating new features into your model so you can make new predictions based on them. Well, if that model is written in another language, now you have to go back and coordinate with those engineers, and it's a royal pain in the butt.

So what we're trying to do is streamline this whole process, and I'm going to give you the technical details of how we do that, because I think it's pretty interesting. We see building machine learning models and putting them into production as taking place in three steps. First there's the engineering step: you get a data set, you start hacking on it, you run descriptive statistics, you visualize your data, and you try to figure out what the data actually contains. The next step is building machine learning models. Say it's a recommender system; if you're not familiar with those, it's the classic Netflix or Amazon "you may also like" kind of thing. You want to build those models, make predictions with them, make them more sophisticated, make them faster, and so on and so forth. Then ultimately, especially if you work in data science in industry, you want to deploy those models into some kind of production setting where they're going to handle many requests per second, maybe hundreds or thousands of requests a second depending on where you work. Or maybe there are just peaks in the day when they have to serve thousands of requests a second and the rest of the day they don't, so the service needs to be elastic and robust to big spikes in traffic. These things need to be always on and always ready and responding, and when they break you need monitoring on them, and so forth. What we would like to do as Python programmers is do all of this in the same environment. We don't want to switch environments; we want to do it all in Python, and we want the model we built to be the one that's in production.

There are two different problems there that I glossed over. I'm going to spend the bulk of the time (not a lot of time, because these are short talks) on the technical details of the first one: most existing data-processing packages and libraries in Python require that you load all of your data into memory and hack on it there. There's work going on in the open source world right now to address that, and I'll talk about it in a second. But really, the problem is that you might want to use production data, and that production data might be sitting in Spark, on HDFS, or in SQL, some kind of relational database somewhere. You want to build your model against the same production data the model is going to see when it's in production; you want it to generalize well to that data. So, to stay in the same environment and work all the time in Python, we have a data structure called the SFrame. The S stands for scalable. If you've ever used pandas or R or anything like that, you'll be familiar with the concept of the data frame; this is just a tabular data structure.
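To give a first taste before the internals, here is a minimal sketch of creating an SFrame; the graphlab package is assumed installed, and ratings.csv is a hypothetical file.

    # A minimal sketch of creating an SFrame (ratings.csv is hypothetical).
    import graphlab as gl

    # Build one in memory from columns...
    sf = gl.SFrame({'user_id': [1, 2, 3],
                    'movie': ['Alien', 'Brazil', 'Clue'],
                    'rating': [5, 3, 4]})

    # ...or ingest a CSV once; after that it lives on disk in a compressed,
    # columnar format and reloads almost instantly.
    sf = gl.SFrame.read_csv('ratings.csv')
    sf.save('ratings.sframe')
    print(sf.head())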
So we built this data structure to reside on disk. Now that we have SSDs with caches, disk-based processing can actually be very fast, so we reduce the reliance on RAM. It's super fast, compressed, and binary, and the thing that makes it so fast, primarily, is that it's columnar. A lot of other people have independently come to the discovery that columnar data structures end up being pretty fast, for a number of reasons. I should say before I continue that this is all open source, so if you want to check out the implementation, you can go to GitHub: our organization is called dato-code, and the SFrame is implemented in the repo named Dato-Core. If you want to look at the source code, see how we built this data structure, and figure out how we did it, it's all there. In fact, let's see, can I switch? Yes, here is everything you'll find in our repo. One nice thing I haven't even mentioned: we also have a parallel data structure called the SGraph, which is for graph analytics; if you're familiar with graph theory, social networks would be a good example. GraphLab sets the benchmark for fast graph analytics in academia and in industry. And if you're familiar with Spark and GraphX: GraphX was created by one of the co-founders of GraphLab, so we're all in good company here. You should check that out. There's also a lot of C++ in there, sorry. Let's see here.

OK, so why are columns fast, and why are SFrames a big deal? SFrames are a big deal because they mean that, on your laptop, you can start working on production-scale data. We can ingest just about anything you want: JSON, CSV, XML, RDDs, ODBC, NumPy arrays, and so on. In fact (I haven't been with the company that long) it has the best JSON handling I've ever interacted with. It's pretty awesome: usually in two or three lines of code you can go from really gnarly JSON to a nice tabular data structure that's not painful to read. Sorry to any JSON fans. The nice thing is that an SFrame is basically just a bunch of columns, and each column has a single data type; this is one of the things that makes it fast. The columns can hold basically any type you can imagine. There are strong schema types, like ints, doubles, strings, and images (we can ingest a whole directory of images and turn it into a tabular data structure complete with resolutions and thumbnails), as well as weak schema types like lists and dictionaries. That means you can have a column of the SFrame that's a column of dictionaries, which themselves contain data of multiple types. Pretty flexible; in fact, it's called flexible type. The columnar architecture ends up being very fast because it allows us to do vectorized operations, which we've known for a long time are fast because they make good use of the cache. It also makes for easy feature engineering. The thing that's hardest for most people to wrap their heads around, if they're coming from pandas or R, is that columns are actually immutable. If you're a functional programming fan, that's kind of fun, but you have to get used to the fact that you can't modify values in place. If you're going to modify values in a column, you're going to be rewriting the column.
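A sketch of what flexible types and that immutability catch look like in practice; the column names and values here are made up for illustration.

    # Flexible types: strong schema columns next to weak schema columns.
    import graphlab as gl

    sf = gl.SFrame({
        'id':   [1, 2],                               # int column
        'name': ['a.png', 'b.png'],                   # string column
        'tags': [['cat', 'gray'], ['dog']],           # list column
        'meta': [{'w': 640, 'h': 480}, {'w': 800}],   # dict column, mixed contents
    })

    # Columns are immutable: you never modify a value in place, you replace
    # the whole column. Only this one column gets rewritten on disk.
    sf['id'] = sf['id'] + 100

    # Vectorized, per-column operations like this stay fast however wide sf gets.
    sf['area'] = sf['meta'].apply(lambda m: m.get('w', 0) * m.get('h', 0))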
Now, the nice thing about rewriting at column granularity is that you only rewrite that one column; you don't rewrite the entire data structure every time you make changes to a column. This means you can have extremely wide data sets without suffering any performance breakdown as the data set gets wider. Does that make sense? Modifying a column takes time that doesn't grow with the number of other columns. The columns are lazily evaluated, which means you only ever load data into memory as you need it, and it does that in a very smart way. You can also do extremely fast summary statistics, sketch summaries, and visualization of your data.

Sorry about these text-dense slides; these are the only two slides with a lot of text on them. When we were building this data structure, which I think is a super fun way to interact with your data, we had to make some choices balancing speed against ease of use, so we gave up some performance in some places to make it easier to use. Some things you might be used to doing in R or pandas, like row-based filtering, end up a little slower, because they involve loading a lot more data into memory and effectively joining the data against itself, but it's not bad. And like I said before, we don't require rewriting all of the data, and column operations are fast no matter how big your data set is. I have a screencast where I ran this kind of test on a data set that was terabytes in size, and it was super fast; even on a laptop you could sum up columns in a few seconds. The nice thing about making things column-based is that people naturally think in vectors a lot of the time. Especially if you've been living in the NumPy world or the R world for a while, you're already used to thinking in vectors and vector operations, and you don't have any restrictions in how you deal with your data, except for that catch about immutability.

We have a few seconds, so let me quickly show you what an SFrame looks like. You can just pip install GraphLab Create, and it will ingest the CSV here and create an SFrame out of it. The other nice thing about SFrames is that once you've created one from whatever base data source you have, it's stored on disk in a compressed format, and loading it after that is extremely fast, because it only loads in the data you need. Once the data is loaded from the SFrame, you can do all the operations you would normally do on any tabular data structure. This should all look very familiar, and there's a good reason for that: we want it to look familiar so that it's just as easy to use on big data sets. Here is a sketch summary of, whatever, two million rows or so, and I summed up a column there. I was doing all of this on about two million rows, and it's very fast. That's actually not a very big data set, but it's just for demonstration purposes. As your data gets very big, it will approximate those summary values using probabilistic data structures, so you still get very fast summary statistics on the data you're using. We can also do Boolean operations here; you can see this one returns all songs that have more than 10 listens. And there's a visualization platform that uses the same sketch summaries; that's over two million rows, pulling up histograms and things like that.
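The on-screen demo isn't captured in this transcript, but it boils down to roughly the following sketch; the file name and column names are placeholders.

    # Roughly what the demo does; usage_data.csv and listen_count are placeholders.
    import graphlab as gl

    songs = gl.SFrame.read_csv('usage_data.csv')

    # Approximate summary statistics (quantiles, counts, and so on) computed
    # in one fast pass using probabilistic sketch data structures.
    print(songs['listen_count'].sketch_summary())

    print(songs['listen_count'].sum())          # columnar sums are quick

    # Boolean filtering: all songs with more than 10 listens.
    popular = songs[songs['listen_count'] > 10]

    # Browser-based visualization (GraphLab Canvas), built on the same sketches.
    songs.show()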
Pretty fun. It's not that impressive on a data set of that size, but I just wanted you to see what it looks like. OK, so if you're really interested in the technical details of building a data structure with this performance at this scale, Yucheng Low, who's one of our co-founders, has an awesome technical document in which he lays out what it's like to build a new data structure that's both easy to use and very fast. Here's the totally intuitive and easy-to-read bit.ly address: 1-A-X-J-K-O-V, and that is an O, not a zero. You can check that out.

OK, so the other problem I mentioned (I'm about three-fourths of the way through here): the first problem was that you want to work with production data, on your laptop, staying in the same environment. The other is that you also want to push that model to production. I'm not going to spend much time on this, but we have another product called Dato Predictive Services, which lets you deploy any machine learning model you built in GraphLab Create as a REST API with about five lines of code, hosted either on-premises or on AWS. Basically, it creates a service that's always on, listening, elastic, and scalable, that waits for requests and returns JSON objects. Those might be predictions, recommendations, or whatever your model's output is; they come back as a JSON blob. And the nice thing is you can ship arbitrary Python code up there, so even if you're not building machine learning models in GraphLab Create, you can ship your code up to AWS as a service. This is, start to finish, the code it would take to ship a model up to AWS; in this case, a recommender system built on a standard factorization machine model. You can see that the vast majority of that code is just arguments, basically login arguments to AWS. If we had created a dictionary for all of those (which would have been cheating), it would have looked really short. That's all it takes. It automatically spins up a load balancer on AWS and gets your nodes running, and it also automatically versions all the models you send up there. So you can call up that same visualization platform I showed you, see what version is running on each of your nodes, see how many requests have been made, and so on and so forth. Pretty cool. It also works with scikit-learn models: this is something we're working on right now, being able to deploy a scikit-learn model as that arbitrary code you ship up, turning it into a service sitting on AWS, waiting for requests and making predictions.
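The slide's code isn't in this transcript; the flow looked roughly like the following sketch. The API names are from GraphLab Create's deploy module as best they can be reconstructed, and the credentials, S3 path, and data file are placeholders.

    # A rough sketch of deploying a recommender with Dato Predictive Services.
    import graphlab as gl

    data = gl.SFrame.read_csv('ratings.csv')
    model = gl.recommender.create(data, user_id='user_id', item_id='movie')

    # Most of the "five lines" are really just AWS configuration arguments.
    ec2 = gl.deploy.Ec2Config(region='us-west-2',
                              instance_type='m3.xlarge',
                              aws_access_key_id='<KEY>',
                              aws_secret_access_key='<SECRET>')

    deployment = gl.deploy.predictive_service.create(
        'recommender-demo', ec2, 's3://my-bucket/predictive_service')

    deployment.add('recs', model)   # models are versioned automatically
    deployment.apply_changes()      # spins up load-balanced, monitored nodes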
All right. If you want to get started, you can download the software, GraphLab Create. It's free for personal and academic use. Feel free to hack on it; the source code for a lot of it is sitting there on GitHub, and you can download it at dato.com. Things I didn't talk about: there's also a suite of awesome machine learning toolkits in the software. If you're into deep learning, there's a great neural network package in there. I don't know anything about deep learning, and I was able to build a really nice image classifier on the ImageNet data set in about 10 lines of code. So it's pretty fun to play with.

We're also hiring. So if you're interested in the Pacific Northwest (I was told it's also called the Pacific Northwest in Canada, but I don't know if that's true or not), the Pacific Northwest of the States, in Seattle: we are hiring. Please visit dato.com/jobs to find out more about what we're doing. I think I have a couple of minutes for questions if anybody has one.