No promises. OK. Who's ready? Already? Good? OK. I'm not big on intros, so Eugene offered, but I politely declined. Thank you anyway, Eugene. So thanks, everybody, for coming. Who knew that this was a technical meetup? Hopefully everybody. OK. So it's not going to be super, super technical, but it may be more technical than ones that you've been to in the past. Who would like to see more technical topics and meetups from Data Science Singapore? That's everyone here. Not surprising. How many of you have given, or have offered to give, a technical talk here? Oh. One hand. Thank you. Now we see the problem. Who's going to give these technical talks if it's not us? Because it's quite hard to find technical speakers, right? Super hard. So I'm sure you know some stuff that we would all be interested to learn. I'm positive. I'm really, really sure. So if you want to see more technical talks, it's up to us to give these talks. So please, if you have some topic that you think is interesting and technical, talk to Eugene. Where's Kaishen? Talk to Kaishen. Talk to whoever, somebody, anybody. Just shout out and say, I have this technical thing that I want to talk about, when can you give me a slot? And I'm sure they'll be very happy to do it. So just a bit of preliminary: if you want to see more tech talks, please volunteer to give tech talks.

Next step. Today I'm going to talk a little bit about data processing on laptops, because that's what I like to do. I don't like clusters. I'll go on the record saying that. I'm on video. So I'm on video saying I don't like clusters. I'm not a big fan of Spark. I'm not a big fan of Hadoop. They're really good tools when you need them, but most people don't, most of the time. Some companies do, but not always. And my claim is that you definitely don't need them for machine learning. Do not. You might need them for pre-processing of data, but for actual machine learning tasks, it's very unlikely that you need them.

So that's me. Can you read that? Is it small? It's kind of small. You can find me on Twitter, @aadrake, and my website as well. So if you're on Twitter and you think what I'm saying is really awesome, you're like, this guy, he's cool: @aadrake, #hacker. If you're on Twitter and you think this guy is nuts, he's crazy, and I don't even know how he got the position to talk to us: @aadrake, #thoughtleader. No one else will know, but we will know. It's a nice voting mechanism. And this is not mine, by the way. I stole it from Josh Wills, who was formerly with Cloudera, now at Slack. So hacker, good; thought leader, bad.

So these are my claims. These are important. The RAM that you can get in a machine is growing faster than the data that you need to process. This is an important one. There are lots of techniques for dealing with so-called big data that don't involve big systems like Spark and Hadoop. And yes, I know you can run Spark in local mode on your laptop. I don't mean that. I mean the big stuff. So you can do everything on one machine. I do pretty much everything on my laptop, all the time.

This you may not be able to see. This is a slide from Cray. Who knows what Cray is? Did anyone know this? You're laughing, so I know you know. OK, how many? One. They build really, really, really big, fast computers. And they've been doing it forever, I don't know, like 40 years or something.
One of the recent ones that they released could accommodate 500,000 CPUs, which is really a lot. So my favorite part of this slide is the little bit at the bottom that says Cray confidential, do not distribute. That's maybe my second favorite. But what I thought was really cool was that if you look at some of these claims, and you do data science work in industry, you've seen some of these. You have all these different processing systems, and the data is siloed, and it's everywhere, and it's hard to get to, and oh, this team has this data and that team has that data. Thank you for being on time. You're awesome. We're not shaming you publicly. I'm supporting you. But this is part of the problem: companies and teams set up these big, big, big systems. And I'm not saying Cray is doing the wrong thing here. They're actually marketing some pretty cool stuff, but their slide is really indicative of some of the problems. So you get this big, huge cluster thing for your data prep and ETL. Then you get your stream processing system, maybe, which is Storm or Samza or whatever, I don't even know how many there are now, and I think Twitter is releasing a new one every few months. Then you have your data mining stuff, and you have some interactive query thing, which is your Teradata or your data warehouse or whatever. And all these things are separate, and usually the data doesn't agree, and things like that. So the idea is that instead of getting bigger machines, we'll be a little more intelligent about how we do this.

So what if you have too much data? And too much is subjective. There was a Kaggle competition recently, I think it just ended a day or two ago, or maybe today, for Expedia hotel recommendations. Was anyone looking at this? Am I the only one here? Who here has ever done a Kaggle competition? Just so I'm curious. OK, so quite a lot, quite a lot. OK, cool. So there was one that just ended, for Expedia. I think the data set was about four gigabytes, plus minus. And when I looked on the forums a couple of weeks ago, some people were asking questions like, how can I process this much data? How do you start up a Spark cluster on Amazon? And credit to them: if they're doing it for fun, for learning, that's awesome. But I mean, this laptop, by the way, is not new, it's like three or four years old, and it's got more than four gigabytes of RAM in it. So I'm confused about why they were asking. But it's a very common tendency that I've seen on Kaggle: people want to use Spark, they want to use Amazon, they want to have clusters and do all this stuff.

Well, one thing you could do is just add more RAM. Is that an option? And you can kind of see that. This is the frequency of people who have worked with certain sizes of data sets. And if you look at it, I mean, 10 megabytes is Excel size, so that's no problem. There are some people that have worked with 10 terabytes or one petabyte, but it's not common. So if you chop out the top 10% and the bottom 10% that are kind of skewing this, and then you run a regression line through it, you get basically a 20% year-on-year increase in the size of data sets that people are working with. So it's not massive, right? It's not like the data sets we're dealing with for machine learning are doubling every few years. And the source is a KDnuggets poll, so in the interest of full disclosure, there could be some sampling bias here.
It's the people who are actually responding to these polls and surveys, so that's something to keep in mind, but maybe it's representative. So let's say that data sets are going up 20% year on year. This is the progression in availability of RAM on Amazon EC2 instances: you can now get a two-terabyte EC2 instance. I would be super surprised if you have two terabytes of data that you need as your working set for your machine learning. Does anyone here have two terabytes of data they actually need for machine learning? Not just two terabytes of anything, because maybe you have movies and stuff, but two terabytes you need for the machine learning. Anybody? I don't, I don't. And I know some companies that do, but it's not that common, right? Maybe you have two terabytes of logs that you need to process into something that's usable for machine learning. But usually you don't have a real need for two terabytes of working data. And I put this note on here because I think it's cool: Tyan actually has a new system available that you can put six terabytes of memory in, six terabytes of RAM. That's a lot. When I looked at the case and all the RAM slots, the case is mostly RAM slots. It's insane. So even if you use EC2, you're getting something like 50% year-on-year increases in RAM size, which is more than 20%. So clearly one option is just: add more RAM.

But that's kind of lame, right? That's the argument about vertical scalability versus horizontal scalability. Who here has heard the word scalability at work or school or something in the last three days? Yeah, it's crazy, right? Everyone's saying that word. It's kind of odd. Usually when people are talking about that, they're talking about horizontal scalability: we want to add systems and we want to be scalable. And oftentimes they're using it in the generic sense that they want something to be fast, or fast even when we get more data, or fast even when we get more users, or something like this. But that's not always needed.

So one of my favorite tricks is to just ignore some of your data, or as statisticians call it, sampling. Just ignore some stuff. The classic case of this, and there are many, many cases, is when you have terabytes and terabytes of log data, for example clicks and views for online advertising. Is anyone here working in online advertising or something like it? Nobody here works in online advertising, or are you just embarrassed? I worked in online advertising before. OK, one. I feel you, I feel you. Sorry? There was one. OK. Was it Avito? OK, last year, right?

So, just as an aside, if you don't do Kaggle competitions, it's a good idea to check them out. The problems are a bit contrived, right? Because the data's already been processed, so a lot of the legwork's already been done for you. But you can experiment with a lot of different machine learning techniques and feature engineering and all kinds of other things that might be really useful. So if you haven't tried a Kaggle competition, I suggest you check it out.

So yes, there was one last year for Avito, which was about click-through rate prediction. And most ads don't get clicked on. I mean, I don't click on ads. I have ad blockers. I don't want to see them. They're not relevant or interesting for me, and they take up my bandwidth and they make my browser slow, and it's just not great. So no surprise that few people click on ads. But I mean, really few.
20 in 10,000? It's not that many. So if you have a data set of clicks and views, you can get rid of like 90% of the views, still train a classifier on that data set, and it'll be fine. You're not gonna have to worry about it. You can literally throw away 90% of your data, use the remainder as your train set, and it will be OK.

Now, that's a very coarse way to do it. There are a lot of theorems in mathematics that you can dig into if you want guarantees about your sub-sampling and how you do it, whether you want to under-sample, or over-sample the under-represented cases, and all that kind of stuff. But for this particular example, you can pretty much get rid of most of your views and it'll be fine. You look skeptical. Are you skeptical? OK, I thought you were looking at me like, no. That's the way I look at people, sorry. That's good. It's always good to be skeptical. Oh, by the way, if anyone has questions or something you want to bring up along the way, please do. This is not really a lecture. This is more like a knowledge-sharing thing. So if you have questions... oh, that was fast.

How do you know the sample is representative, when you say you'd throw away 90% of it?

Because the difference is so big. In this case, 20 clicks per 10,000, I'm not gonna worry about whether or not it's representative. Because the probability of a click is so low, you can get rid of views. If you just take away a random sample of the views, it's not really gonna matter that much, because your classes are so unbalanced. It's not like 60-40 or something. It's really, really skewed.

So you mean we sample only the click data?

No, I mean take away 90% of the views, or a lot of them. I don't know what makes sense in a particular case; it's case by case, right? But take a big chunk of the views and just ignore them. Just pretend they were never in your train set to begin with. Seems weird, right? Who would ignore data? Why would you do that? Statisticians have been doing it for 300 years. They're experts in how to ignore data effectively. Actually, they do it backwards: they start with the sample they care about. But it's the same idea.

They use sampling techniques to make sure it's representative of the whole population. They don't just throw data away.

Well, your population of people that click is so small that you can get rid of 90% of the views, and the chances that you'll get anything different are tiny. Unless, OK, maybe you have something weird, like you only have two features in your whole data set. There could be some contrived, pathological example where it wouldn't make sense. But in general, you can do this and it's not a problem. If you take a uniform sample of the much bigger class, it's not gonna have any effect on the accuracy of your classifier. Assuming you're doing some kind of classification problem, which the ad click problem is. Yeah. If you want to learn more about sampling, there's a lot of cool stuff in the literature if you want to look into it. There are many, many papers on how to do effective sampling if you want mathematical guarantees. So you can do that, but I won't go into it today.

Sorry, I'm skeptical just because I think you would skew the distribution severely the other way. This is a rare-events case, so you're taking a data set with rare events...
No, not the whole data set. OK, two things. One: if you down-sample the whole data set uniformly, the chances that you're gonna down-sample a click are basically zero. Do you agree? Yes. OK, so wait, wait: that was only my first claim. Claim number one: even if you took the whole data set and just randomly threw away X percent, the chances are extremely high that you would only be throwing away views. That's claim one. Claim two: I'm not actually talking about doing that. What I'm talking about is: take your views, and get rid of some of those. What's the difference? Not much, but you get a guarantee that you're not throwing away clicks. Non-clicks, views, yes, sorry. So in this kind of problem, for anyone who's worked on it, usually you have tons and tons of views. Think of them as training examples with label zero; and then you have very, very few clicks, which are training examples with label one. And you can get rid of most of those label-zero training examples with no negative impact on the accuracy of your classifier. Sounds weird, right? Super true. Sorry? Not in this talk, but we can talk about it afterwards if you like.

OK, so one option: get more RAM. But that's kind of cheating, right? Another option: sampling, which can be done in a lot of ways, as we just discussed. You can do uniform sampling, but basically any time you have a classification problem with really unbalanced classes, there are ways to cut down the data volume. In fact, it can even improve the accuracy of your classifier, because depending on what metric you're using for success, on a problem like this you can make a really "good" classifier that just always says no click. It'll be right, whatever, 99.99% of the time, basically. So you have to be a bit careful about how you structure these problems, but in general, you can get rid of data. So sampling, that's one option.

But what if we don't want to do either of those two things? Let's say we want to deal with data sets of arbitrary size, we don't know how big they are, and we have constraints on hardware. Let's say we want to handle data sets of arbitrary size on this four-year-old MacBook Air. What do you do? One: you have to stop thinking in terms of data sets and start thinking in terms of records, because you're not gonna fit your data set, whatever it is, in your RAM. Assume it's infinity gigabytes or something. So you have to start thinking about records. And you may not know exactly which features are coming in on this data set, so you need a way to do feature extraction on a per-record basis. This is the part where I'm gonna ask: does everyone understand what I mean by feature extraction on a per-record basis? OK, so say you get a CSV file. It's got five columns, right? You know you have, let's say, five features. But that's a nice example, and you don't always have that. Sometimes people push data into the system, and maybe it has five features, or maybe seven, or maybe ten, or maybe 5,000. So what do you do in this case? You don't necessarily know in advance what's in the data, right? Because it's of infinite size. There could be things in there that you haven't seen before. So you need a way to deal with features that you haven't seen.
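(A quick detour to pin down that sampling idea in code before we go on. This is a minimal sketch of uniform view-downsampling, assuming a hypothetical train.csv whose first column is the label, 1 for click and 0 for view; it's not the actual competition format or any slide code.)

```python
import random

# Keep every click, and only ~10% of the views, chosen uniformly at random.
# Assumes (hypothetically) a CSV whose first column is the label: 1 = click.
def downsample_views(path, keep_rate=0.1):
    with open(path) as f:
        for line in f:
            label = line.split(",", 1)[0]
            if label == "1" or random.random() < keep_rate:
                yield line

# for line in downsample_views("train.csv"): feed it to your trainer
```

One caveat worth knowing: if you need calibrated probabilities rather than just a ranking, you'd correct the model's intercept or predictions for the sampling rate afterwards.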
So, features: the way you deal with them can't depend on any other records. If I get a training example, the way I deal with its features should only depend on that particular training example. Basically, that's the case where you can only fit one record in memory, which is not usually true, but it's good for our purposes. Does that make sense? Because if you need to see the next five million training examples to figure out what the features are, maybe you can't fit that in your memory. So you need a way to extract features that depends on only one record at a time. That's what I mean by stateless: you're not keeping state in the feature extractor. Make sense?

So where do we find this kind of data? What do you mean? I mean, what kind of data fits this description? Generally we know, like, five features, or whatever number of features is possible. But you're saying every record might have a different number of features. What kind of data is like that?

The favorite example I've seen is also the one I like least, which is anything in JSON. Sorry, I just don't like JSON. It's really nasty. I saw a funny picture one time where somebody said, oh, I had big data, and then I converted it to XML, and now I have even bigger data. It's true. Has anyone ever converted JSON data to CSV? You save a lot of space, right? Just because you don't have the keys and stuff. Anyway, if you're dealing with JSON data, you don't always get all the keys in every record. Sometimes, depending on the system, people will send keys with some kind of null value, but that's not actually required by JSON, right? So in those cases you don't know in advance.

So: you need a data source which you can think about as a stream instead of a big file. That's one thing. You need a way to process your features without knowing what came before or what comes after. And you need some kind of model which supports incremental learning. What I mean by incremental learning is: you have a model, you give it some records, some examples, some training stuff, and it learns those. Then you give it more, and then more, and then more. Once you have these three things, you're not limited anymore by what you can do on your machine, because you can process a stream, you don't care what features come in, and your model doesn't care how many records you give it at one time. You can give it one record at a time, over and over again. So if you have those three things, you have a lot of flexibility in the size of the training set that you can manage on this machine.

OK, data source. Let's talk a little bit about functions in Python. Somebody asked what the language was gonna be in this talk. Are you here, whoever you are? Ah, Python, right? Yes. Python. Yes, thanks. It's gonna change. It was a trick. OK, so functions and generators. Is anyone comfortable with the difference between a function and a generator in Python? Let me try this a different way. Raise your hand if you've ever written a Python function. And raise your hand if you've ever written a Python generator. See, that worked a lot better. OK, so I'm gonna assume that not everyone's comfortable with generators, which is cool. A generator is like a function, except when you call it, it doesn't go away at the end. It sort of hangs out, and it keeps its state.
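(Roughly like this, as a toy sketch, not the slide code: the counter shows the state-keeping, and the second generator is the same idea pointed at a file.)

```python
# A generator hangs around between calls: each next() resumes where it
# left off, so local variables survive from one call to the next.
def counter(start=1):
    n = start
    while True:
        yield n
        n += 1

# The same idea turns a CSV file into a stream of records: nothing is
# read until you ask for the next line, so file size stops mattering.
def csv_stream(path):
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n").split(",")

c = counter(2)
print(next(c), next(c), next(c))  # 2 3 4
```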
It keeps its state. So if you have a generator that increments a number, every time you call it, it will increment the number and return the next one: it will return two, then three, then four. Because it's keeping its state internally instead of just returning a value. Which is kind of convenient, because then you can make a generator that reads lines from a CSV file, and every time you call it, you just get the next line from your file. So now we have a data stream. It's magic. It's really cool. I use this all the time. This code, I think, is from the Criteo competition that Kaggle did a while ago; I don't know if they've done more than one. But I use this stuff all the time. If you have a CSV file, you just give it the path and how many columns are in the file, and magically you now have a stream of data. So you've already fixed problem one, right? You've transformed your problem from "I need to load this big file into RAM" into "I'm going to process this line by line."

Or, for people who like pandas, there's a chunksize argument to read_csv. Who knew that? Not so many. Why are you laughing? Yeah. OK, so a couple of people knew, but not that many. Does anyone here use pandas? I do sometimes; OK, some people. So let me rephrase: who hasn't used pandas before? OK, cool. Pandas is a very convenient library for Python. It has all kinds of nice stuff in it. If you've never used it, I highly suggest you check it out, because it will make your life much easier if you're working with data. One of the things it does is read CSV files, which is very nice. And if you can't read the whole file, you can supply this chunksize argument and it will just read in that many lines. So it's kind of the same thing we just did, except instead of reading one line at a time, it's reading many lines at a time, and then you can do something with each chunk.

OK, so we've got our streaming part. Did we start right at seven? This is about to get intense. OK, feature extraction. Has anyone heard of the hashing trick? I love things in machine learning that are called the whatever trick. Makes it seem so fancy. OK, so the idea is: you take your features, you hash them, and you use those hash values as indexes into an array containing your weights for those features. Did I get it right? Did I get it wrong? Who's heard of the hashing trick? Did I get it right? OK, cool. That's the other thing: if I say something you think is totally crazy, I'm depending on you to tell me it's totally crazy. So now you have a way where features can come in, you don't know what they are, you get an index into an array, and then you can keep weights for each feature in your array of weights. You can also do another nice trick with this that we'll talk about later, but this is what gives you the ability to deal with arbitrary features in your data set. And if you think about it, it makes sense, right? Because if you have three columns, like first name, last name, location, the information contained in "first name: Adam", if you just concatenate that together into one string, is the same amount of information. And if you hash it, assuming the hash is unique, it's still the same amount of information. So there's no information loss happening here. You will get information loss if you have hash collisions, but we won't talk about that. So this is a very nice thing to do. I use this all the time. So: hashing trick.
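(In code, the trick is something like this; a minimal sketch with made-up column names. One real-world caveat: Python's built-in hash is salted per process on modern versions, so for anything reproducible across runs you'd swap in a stable hash, say from hashlib.)

```python
D = 2 ** 20          # number of weight slots, fixed up front
weights = [0.0] * D  # one weight per slot

def feature_indices(record):
    # record is a dict like {"first_name": "Adam", "location": "SG"}.
    # Unknown columns are no problem: they just hash to some slot in [0, D).
    return [hash(col + "=" + str(val)) % D for col, val in record.items()]

idx = feature_indices({"first_name": "Adam", "location": "Singapore"})
```

And at 8 bytes a weight, 2^20 slots is about 8 MiB and 2^25 about 256 MiB, so picking D is really picking a memory budget up front; that comes up again later.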
If you haven't checked it out, definitely check it out. And the third thing we needed was models that support incremental learning. Actually, there are quite a few of them. Who here has used scikit-learn? Maybe about the same people; pandas and scikit-learn seem to go together. So all of these are in scikit-learn, and they support incremental learning. What that means is they have a partial_fit method. And if you use the chunksize argument from pandas, you can supply each chunk to the partial_fit method of any of these and be just fine. So now we have all three things that we needed.

Caveat: not all of these models can deal with arbitrary features. Some models need the set of features to be declared in advance, so just be careful if you're using one of those. However, if you're using the SGD-based ones, the SGD classifier and regressor, those don't care. You can throw whatever you want into those models and they'll be totally happy to learn for you.

Or you can just write your own, which is kind of nice. It's not that complicated. And it looks chopped off, which is awesome for this projector, but you can still see the code. I think it's like 13 actual lines of code. And this is not my code. This came from, I think it was the Criteo competition on Kaggle. There was some very cool person who posted this on the forums; it was called "beat the benchmark with less than 200 megabytes of memory," something like that. Those of you who are laughing probably remember reading it on the internet and face-palming. Because, I mean, I was doing some crazy stuff with Vowpal Wabbit, and then they posted this, and I was like, ah, OK. So this will actually get you into a top-X% place on Kaggle, and it's like 13 lines of actual code. It's stochastic gradient descent with logistic regression, no fancy stuff. And you can't really see the log loss, but it's like 0.463, which was pretty good.

But the problem is it doesn't use all the cores. It was Python, so it's gonna use one core on your machine. And if your machine has four or eight, or if you have one of those big servers with six terabytes of RAM, maybe you have 100 cores or something. Or one of those POWER8 CPUs from IBM with something like 96 hardware threads. You won't be able to use those if you're using Python in the default, normal way. So the problem is: it performs well, the log loss is 0.463, but the runtime on my laptop is like an hour and seven minutes. Which is OK, but for people who do a lot of data analytics work, one of the things that really slows you down is the lag time between runs of your models. If it's an hour-plus every time, that's the modern version of the old "I can't work, my code is compiling": well, I don't know, my model is training. So the faster you can train models, the better. And there are other ways to do that: you don't have to train on the whole data set, you can do some sampling like we talked about earlier. But if you want to run on the whole set, on my computer, this takes over an hour. Kind of long. So ideally I'd like to use all my CPUs, which I was not using in the Python version. And if I share any memory, I want to use locks to make sure I'm not writing to the same area of memory from different cores at the same time, because that can be bad.
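(Putting the pandas chunking together with partial_fit looks roughly like this. A sketch, assuming a hypothetical train.csv with numeric feature columns plus a label column; note that recent scikit-learn calls the logistic loss "log_loss", while older versions call it "log".)

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss")  # "log" on older scikit-learn

# Stream the file in 100k-row chunks; each chunk goes to partial_fit,
# so the whole file never has to fit in RAM at once.
for chunk in pd.read_csv("train.csv", chunksize=100_000):
    X = chunk.drop(columns=["label"]).values
    y = chunk["label"].values
    clf.partial_fit(X, y, classes=[0, 1])  # classes required on first call
```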
But then there was a question from Kaishen: what about the GIL, the global interpreter lock? Has anyone ever battled with this in Python? A couple of people. Now we're getting into more esoteric territory. This is good. So in Python, when you're executing all these instructions, the interpreter has a nice little lock that only lets one thread execute Python instructions at a time. So if you run Python, your script runs one instruction at a time, and if you have, whatever, eight cores on your machine, it's not going to use those other cores normally, even if you spin up more threads. But the important part is that this lock is per interpreter. If you have totally separate operating system processes, like eight interpreters running, then you're actually OK, because each interpreter has its own lock. And all you have to do is share the memory. All you have to do.

So there's this nice library in Python, multiprocessing, which you use to spawn these multiple processes. So you can get around this global interpreter lock problem. And there's a nice structure in there called RawArray, which gives you a raw C array in memory that can be shared between these processes. Which is quite nice, because now it seems like we're starting to get around all of our problems, right? We bypass the lock, we're going to use all of our cores, we're going to make it faster, we have access to shared memory, and we can run multiple processes. So if you run it, you'll get this. This is like 10 processes; each one gets access to the same chunk of memory and just increments some integer by whatever its process number is. Basically, just to show that all these processes are accessing the same memory. So it's not that hard. And once you do that, you can have a queue. If you remember, before, we had those records that we were generating from the file. You can put those records in a queue, and then all of these processes can pull from the queue and do the work in parallel. Sounds pretty sweet. It can look just like this. It's not very hard.

But that's actually slower. That's slower. Even though you're using all the cores now, it's slower. So when something is slower, more experienced tech people: what's the first thing you do when something's slower? Somebody's got to answer.

Kill the process?

Yeah, that's step zero. Good point. After you kill the super slow process that's annoying you, what do you do? And don't say go have a beer or something like that. What do you do, technically, to try to fix it?

Check if you're maxing out all the CPUs?

Yeah? And why is it slow? I don't know. I don't know. That's the point: we don't know. So what most people do is they just start changing stuff. Bad idea. First thing you do is always profile. Always profile. So I profiled that code. Why is it slow? Let's run the profiler and see why. I don't know if you can see, but this is the code. And it's calling out to RawArray, which takes, you can't see it because it's fuzzy, like 0% of the time, and some other stuff. And then there's put, to put an item into the queue. We created that queue, right? And we were putting stuff in the queue so all the workers could work on it. And this put method is taking up, you can't see it, but 74% of the runtime, plus minus. So my cores were 100% maxed out, fighting to put stuff in the queue.
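(That shared-memory demo, roughly, as a minimal sketch rather than the exact slide code: ten processes, one shared raw C array, each process writing to its own slot so no lock is needed for this part.)

```python
import multiprocessing as mp

def worker(i, shared):
    shared[i] += i  # each process touches only its own slot

if __name__ == "__main__":
    shared = mp.RawArray("i", 10)  # a raw C int array, no lock attached
    procs = [mp.Process(target=worker, args=(i, shared)) for i in range(10)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(list(shared))  # [0, 1, 2, ..., 9]: all wrote to the same memory
```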
So why? If you follow over to that yellow box, it says "method acquire of blah, blah, blah, semlock." What that means is that the queue has a lock on it. And all these processes are fighting over who gets to use the queue. The main process is trying to put something in, the other ones are trying to take something out, and they're all just sitting around fighting over this lock on the queue. It's often called lock contention. It's a big problem. So, awesome: I made my multi-core Python code that can process this stuff really quick, except it can't process it really quick. Because even though I bypassed the interpreter lock, there's still a lock on the queue. If I want to get around this, I need to do some really complicated magic, probably with some C module in Python, so that I can have a queue without a lock. And now we're getting to the point where it's starting to be silly to try to do this in Python. So now we're going to change. Sorry. But there was Python, right? So it's OK.

This happens to me. It's not unusual. It happens when I'm helping companies. For those of you that don't know, most of the time I don't get to write code and work on technical stuff. Most of the time, I help companies with leadership, operations, organizational structure, and all that kind of thing you would expect somebody from McKinsey or whatever to do. That's what I normally do: I advise growing, mid-size, and large companies on how to do this. But sometimes I get to play with tech stuff. And when I do, I run into things like this a lot. And if you're using something like Python, which is awesome, I mean, you can see I even got the stickers, and I'm actually not a sticker person, I just did that because I wanted to get rid of the stickers. I use Python a lot, but there's a point, and this is an example of that point, where it doesn't make sense to use Python anymore. You're at the end. You've got to move on.

Yes? So Jython, that's a nice question. Jython and IronPython do not have a global interpreter lock, because they run on the JVM and the .NET runtime, the CLR, is it? I don't remember the acronym. Anyway, IronPython and Jython do not have the same global interpreter lock problem. However, I bet you if you try this code, you'll still see they have a lock on the queue. And there's no easy way to get rid of that. But I didn't try it, to be fair.

How about Cython? Cython. So, if you created your own queue data structure that was lockless somehow, and you did this in Cython or in C or whatever, and you used it as a module from your other Python code, you could do that. In Cython, I don't know if it would work. If you write it in C, and then use it as a C module from Python code, then maybe. In Cython, there's a with statement, with nogil, that allows you to bypass the global interpreter lock, but you'd probably still have to write your own data structure from scratch. And at that point, I would argue: just give up on Python, man. No one's going to come to you; Travis Oliphant and Guido are not going to come to you and say, why did you do this to us? You betrayed us.

So you could go to C, but then wrap it as a Python extension? OK.
So if you start doing stuff like that, then you have a lot more options. But most people, I'm guessing basically all the people here, are not going to write C modules for their Python code just so they can have access to a lock-free queue. That's kind of a lot of work.

Yeah, what tool did you use? So this is the profiling tool from PyCharm. If you don't know it, PyCharm is a development environment for Python from JetBrains. I used the Community Edition for a long, long time. And then they changed the licensing, and I got really angry. And then they made a modification, and I got less angry. And I paid them for it, to show them how less angry I was. But they make good stuff in general. I like PyCharm a lot. It has a built-in profiler, so if you have code, you can just say Profile. And you can do this without PyCharm; you don't need PyCharm to profile. But it's integrated. So that's the tool I used.

So I switched to Go, because I like Go. Does anyone know Go? The Go language, from Google? A couple of people. OK, I like it a lot. It gives me a lot of things that I can't get from Python. The multi-core support is better, the concurrency, la, la, la. We can talk about it later. The Go people, I'm sure, will be happy to evangelize, right? OK, cool. So what I did was basically rewrite the code in Go, which wasn't hard. It took like 20 minutes. And you can't really see, because it's chopped off on the screen, but it's not really any longer. So I rewrote it in Go, and I started with the base case: no concurrency, just one core. Same as the Python version, basically. Read one record from the file, do the machine learning stuff, la, la, la, Go, la. So this was the Python version; this was the Go version. They're not that different. But man, was it faster. Just switching from Python to Go dropped the runtime to like eight minutes, which is already not bad. I mean, an eight-minute cycle time to train a model is pretty OK. I think this is about 12 gigabytes of data, I don't remember exactly, plus minus. So with 20 minutes of effort doing this in Go instead of Python, I've shaved off quite a bit of time. We're on a good path. Same data set, still my small four-year-old laptop, eight minutes.

But actually, we wanted to use all the cores. That was the original idea. There are three other cores on my machine that could be used. And that's pretty easy to do in Go, because it's kind of built in. If you want to have one worker, that's fine. If you want four workers, that's also fine. And if you want four workers, that's what it looks like: you just make a for loop that counts up to four, and then you have four workers on your machine at the same time, all using the same data. It's all very nice.

So Go is, it's compiled, right? It is a compiled language. But even if you ran that code with PyPy, it's still not that fast. And I'm a big fan of PyPy; that's my sneaky tool that I use when my Python is too slow: I try PyPy. If you don't know PyPy, it's a JIT for Python. As long as you're using standard Python code, it will just-in-time compile your Python code while it's running.
And it can give you really, really good speedups. So if you haven't checked out PyPy, I can highly recommend it, as long as you're not using a lot of libraries, because the library support isn't always 100%. Does that answer the question? Cool. So this is the idea: we want to use all the cores.

Just now, it was also on a single core? Sorry? Just now, it was also on a single core? Yeah, it's exactly the same thing. And which part of the code got the speedup? If you profile it, that should be a good way to check.

That's a really awesome question. I was so happy that it was faster, I didn't check. I'm just being honest. But that's actually really good; I didn't think about it. That's my fault. I should have asked, OK, why is this faster? But I was like, oh, it's so much faster. If I had to guess, I would say it's probably string parsing and arithmetic operations. Because when it pulls the records in from the file, it has to split on the comma, and that's actually kind of a nasty thing for code to do. String parsing is always nasty. I think I complained about that on Twitter a while ago. So I don't actually know why it was faster, but that's an awesome question. I think tonight I will find out. Come to me afterwards so we can talk, and if I find out, I can tell you. OK, cool. Because now I'm curious. I can't believe I didn't think of that. Sorry, was that another question? I was just getting you to repeat your question. Oh, so the question, which was awesome, was: why was it faster? Which, if you think about it, is a very good question. And the answer I gave was: I was so happy that it was faster, I didn't check why. That's my fault. But the follow-up is: I'm going to check why it's faster, because now I also want to know. But yes, it went from an hour and six minutes to about eight minutes. My fault. I don't know exactly why. I'll find out.

OK, so this was the idea: we want to use all the cores that we have, which is very easy to do in Go. It's not so easy in Python, for the reasons we already discussed. You can use the multiprocessing module like we talked about, but if you have certain needs for shared memory or queues or something, and you're not spending a lot of time in your CPU, you may not get a benefit in the Python world, due to lock contention. So that's where we are. But then our Go version is not really that much faster. Even using all the cores, it's not that much faster. Why? It should be quite a bit faster, right, going from one core to four cores? But it's like, what, I don't know, a 9% difference or something like that? Not a lot.

And now, the math. I think I said last week there was going to be math in this talk, and I actually took a lot of it out. But there's still something there. And I'm actually starting to hit the time limit; I think I've got about three minutes left, so I'm not going to go too deeply into this. But stochastic gradient descent is ubiquitous for scalable machine learning. I just used that word, scalable. But it's kind of true, because it's quite fast, it's easy to work with, and in our particular case, we're only operating on one record at a time, which is a perfect fit for stochastic gradient descent. And the code that mirrors this math is that: it's where we're updating the weights for our model.
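(For reference, the per-record update is roughly this; a sketch in the spirit of the 13-line Python version, with a plain constant learning rate standing in for whatever the slide code actually used.)

```python
import math

# One SGD step of logistic regression on a single hashed record.
# indices: the active weight slots for this record (hashed features, x_i = 1)
# y: the label, 0 or 1. The gradient of the log loss w.r.t. each active
# weight is (p - y), so one step only touches a handful of slots.
def sgd_step(weights, indices, y, alpha=0.1):
    wTx = sum(weights[i] for i in indices)
    wTx = max(min(wTx, 35.0), -35.0)  # clamp to keep exp() well-behaved
    p = 1.0 / (1.0 + math.exp(-wTx))  # predicted click probability
    for i in indices:
        weights[i] -= alpha * (p - y)
    return p
```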
The nice part about that is, if you think about these updates to this big vector or array of weights, it's pretty unlikely that two cores are going to try to update the same weight at the same time, with some loose assumptions on how sparse the data set is. Because if you have a bajillion features and lots of records, the chances that two processes, especially if you only have four, are going to try to update exactly the same weight at exactly the same time are really, really small.

So how can we make this faster and use all of our cores? Well, one option would be to leave the locks on the memory and have the cores update it in a round-robin fashion. So you keep the locks, but you synchronize, kind of structure, the way the cores try to update, so that they're not really fighting over it; they're updating the memory in an organized way. You can do this. It's a bit faster, but not much. Or you could just ditch the locks. If the chances that you're going to update the same memory location at the same time from two different processes are really, really small, then why not ditch the locks in the first place? Because that's what's slowing you down. We basically traded one problem in Python for the same problem in Go, except instead of the queue, it's the locks on the memory. It turns out you can do this, and it's not that big of a deal. In fact, it can actually be helpful. There's a nice paper on it that I'm not going to go into, because I'm almost out of time, but I'll try to put the slides online and you can read it. The basic idea is: if you have a problem that's sufficiently sparse... who's comfortable with the word sparse in this context? Sorry? Sufficiently sparse, yes: sparse enough that you don't have a risk.

OK, so for people who haven't run into this before, a good example of sparsity is something like a recommender system on an e-commerce website. You have tons and tons of products and lots of users, but most users have not purchased most products. So if you put this in a big matrix of users and products, with ones in the places where a user has bought the product, it's going to be mostly zeros. Mostly zeros. That's a really good example of a sparse problem. So when you're doing stuff like that with machine learning, and you have sufficient sparsity, for some nice definition of sufficient, which they talk about in the paper, you actually don't really need the locks. Which is cool. So let's take off the locks.

Now we go from eight minutes to four. Which is better. I mean, we didn't need to go down from eight minutes, right? It was still pretty good. But we go from eight to four. The original Python was an hour and six minutes; the multi-core Go with no locks is 3:55. And one thing that I do is I usually think in requests per second, or records per second. When I build systems, I usually don't care about how much data, how many quintillion zillion bytes or terabytes or whatever; I care about how many requests per second, how many records per second, I can push through. So the Python version is about 11,000 records per second. The Go version is about 200,000. Nice. A 20-times speed improvement, plus minus. So on my laptop, this is perfectly doable now. You can give me twice as much data and I'm still happy. It's no problem. And to be fair, there are still locks in here.
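(To put a number on that mostly-zeros picture, a toy example with made-up sizes: a thousand users, fifty thousand products, around ten purchases each.)

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n = 10_000  # ~10 purchases per user, hypothetically
rows = rng.integers(0, 1_000, size=n)    # user ids
cols = rng.integers(0, 50_000, size=n)   # product ids
m = sparse.csr_matrix((np.ones(n), (rows, cols)), shape=(1_000, 50_000))
print(f"non-zero cells: {m.nnz / (1_000 * 50_000):.4%}")
# ~0.02%: the matrix is overwhelmingly zeros, which is what sparse means here
```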
About those locks in the Go version, do I have it in here? The code's not in there. So in Go, there's something called a channel, which is the same as a queue, and a channel in Go actually has a lock on it. So even though the Go version is still using some locks in some places, it's still way, way faster. So if you have a laptop, even a four-year-old laptop that doesn't have too much memory and maybe only four cores, like mine, you can still do quite a lot in terms of data analysis. You don't need a cluster. You don't need to worry about Spark. You don't need to figure out how you're going to pay for AWS stuff. I see people doing this on Kaggle too: like, I spent $2,000 last month on Amazon. What? Why? It's 200,000 records per second on my old MacBook. So you don't need all that. You can get by just fine on a small machine, processing as much data as you want. So I'm happy to take any other questions you have.

What about the library support in Go?

Do you have a specific concern? Like, is there a library that you need that's not there?

At the beginning of the process, for the exploratory part.

You mean for exploratory data analysis specifically? So there are tons and tons of libraries for Go, I guess that's the first point. The other point I would make is, the way I typically work, I do a lot of exploratory analysis in Python, and when I'm actually ready to build something, then I do it in Go. So Python, for me, is an exploratory tool most of the time. Sometimes I use it for things that are in production or bigger projects, but most of the time, for me, it's a tool for exploratory analysis.

Like you said, for the library support: Python's got pandas, it's got NumPy, all these things.

There are some things like that for Go, but it's not a language that's been focused on data analysis the way Python has been recently. I mean, NumPy has been around for a while; pandas is not that old. Does anyone know when pandas was released? It can't be more than six, seven years, right? I think it's about six years, plus minus. So until six years ago, there was no pandas for Python. And I haven't found it to be a hindrance. But if you're doing exploratory analysis, if you're still understanding your data and what you want to do with it and what kind of models might work, then it's really hard to beat pandas and scikit-learn. Really hard to beat.

There was another question. Why are the two log loss values different?

Because the hashing functions are different. So, when I talked about the feature extraction... sharp. I was wondering if anyone was going to catch that. In the Python version, do I have the hash in there? Yeah. So, code line 123: index equals, la, la, la, "weakest hash ever." That's not my comment, by the way, but I think it's a good one. So that hashing function is different from the Go one. In the Go one, I wrote my own hashing function. And since the hashing functions are different, the weight updates might be slightly different. So that's one reason why the log loss might be different. However, I will call attention to the fact that it's better. And it's basically the same, right? It's 0.45 versus 0.46; not that much different.

Is that a lot, ranking-wise, in Kaggle? How many places?

This will, I mean, in Kaggle, 0.01 will take you from 10,000th place to first place. Roughly. Roughly.
People are laughing because they get the Kaggle joke. So the thing with Kaggle is, I never play to win with Kaggle, ever, because you will lose large chunks of your life trying to do that. You'll probably have more fun playing World of Warcraft or something. Seriously. You may develop a larger social community online that way. You might learn more doing the Kaggle one. But anyway, the point is, Kaggle, as I mentioned earlier, is really contrived. You're getting a data set that's already been prepared. It's already nice. The loss function is given to you. You know how you're going to be evaluated. Everything is very straightforward. And then there's a leaderboard, so you can rank against other people. But what happens is, once I get to about the 80th percentile, 70th percentile, something like that, after that: no point. Because in the real world, you're not going to do that. In the real world, if you're solving an actual data problem in a company, or a university, or wherever you are, they're not going to let you spend four months getting a 0.001% improvement in your model. Nobody cares. What they're going to ask you is: is it better than what we have? OK, deploy it. And that's going to be it. Maybe you do another version of something later, but you're not going to spend a long, long time on it. So, what would be the ranking change on Kaggle? A lot. What would be the ranking change in the real world? Nobody would care. Usually. Unless you're like UPS; they have some places where they would care, for routing problems.

So can we say that you build ensemble models in your work?

I'm sorry?

Do you ensemble your models in your work?

Do I ensemble them?

Because you mentioned that nobody really cares about a 0.001% improvement, and ensembling takes a huge amount of time.

So are you asking, do I personally do it, or have I seen it done? No. But I have seen it done. There was a company that was using a Spark cluster, not a huge one or anything, it was like six machines or something. And they were spending about 25 hours of compute time with Spark to process all these models, do cross-validation and all this kind of stuff, and then ensemble them together at the end. And it was quite cool, what they did. But it was replaced with XGBoost on a small machine, and that ran in 25 minutes. That's pretty much what I do in a lot of companies when I help them with technical stuff. I like to make the joke that I've shut down more Hadoop clusters than I've ever set up. And it's true. And that's an example of it. Sometimes people start with a really big approach to a problem, but you don't need it. Just like we saw tonight, you can do a lot with a laptop. This is not a powerful machine. It's just a regular four-year-old laptop. And I typically don't ensemble models in the real world, for the problems I've seen. Does anyone else? Is there anyone here that's seen significant improvements from ensembling models?

For Kaggle, yeah.

Yeah, but that's pretend world. For Kaggle, yeah, because you're fighting over that tiny, tiny improvement. But in the real world, I've never seen it. Maybe some people do.

What about data sets with hundreds of features? Sorry, does it make a difference? Sorry? Data sets with hundreds of features.
Hundreds of features? Does it make a difference? Does what make a difference?

Whether this works with data sets that have hundreds of features.

Like this? This method? Yeah, that hashing trick is perfectly fine for that. And the other thing I like about it is that when you're using the hashing trick, you're pre-allocating how much memory you're willing to use for the problem. So it's quite sneaky. Maybe it didn't show up in there. You want to look at the Go version or the Python version? Probably the Python version. OK, so on that third line, there's value plus key, la, la, la, mod D. And what I'm actually doing is saying I want to use two-to-the-D slots of memory for this. So I can say in advance, I want to use one gigabyte, five gigabytes, eight gigabytes, whatever gigabytes of memory. And then when the features come in, it's going to hash them, mod whatever. So if they don't fit in there, it's going to put them somewhere else. So you may have collisions with your features, depending on how much space you're willing to use: hash collisions, or memory location collisions. But that's OK; it usually doesn't cause a huge problem. Now, if you say I'm only going to use two bytes of memory or something, then yeah. But in general, it's no problem. You can put in many thousands of features. That may be more of a model question: which models are able to handle many, many features and automatically deal with the ones that are less important or more important, so regularization, these kinds of things. But in terms of the hashing, no. No problem.

I meant the overall runtime.

Yeah, it won't matter for this one. It will run at the same speed. Because if you have many, many features, it's still going to hash them into the same memory, so it won't matter. There was, you had a question?

A comment, from work. We do a lot of this at work.

Oh, OK.

It doesn't work so well when a lot of your features start coming from very different sources, say some from images, some from graphs, some from logs. If you just train one model on all the features at once, it doesn't really make sense, because the different features come from different data sources, different components. So what we really do is train separate models on specific subsets of the data.

OK. So you're basically partitioning your training set, and then you're training different models on each partition, and then ensembling them later. OK, OK. And you do that at work. OK. Anyone else do that at work? Oh, be proud, man. Come on. OK, so a couple of people are doing ensembling stuff at work.

But you have to know what you're doing. It's not that you just blindly ensemble.

Yeah, it's not the Kaggle approach, where you do an ensemble of 1,000 different models, each one with a tiny different version of some parameter. I mean, it's ensembling, but you're really doing some kind of model selection by hand for each of those partitions. Sorry?

It's more like stacking.

Stacking?

You're building a bunch of models on different things, and then building a model over them that uses those as features. It's not so much ensembling as stacking.

But you're not using those features directly, right?
You're actually taking the outputs, and you're just...

Yeah, the pseudo-features.

Sorry?

It's stacking, because of the pseudo-features.

Yeah, you're using those model outputs as features. So for example, there's a really high-level model that, depending on the data, routes to...

OK, sorry guys, we're getting a little too specific. You too? Let's talk afterwards, OK? Cool. Anybody else?

What would be the worst case for the memory? Assuming the very rare scenario where the cores try to access the exact same memory space: would the process die, or would you just get corrupted data?

So the answer is: it depends. It's a deeper technical question. But if you have a CPU that supports certain atomic operations on your memory, common ones are increments, like an atomic add, adding something to a memory location atomically, then in some sense you still have a lock, but it's enforced by the CPU. One CPU operation actually does that add for you, so it's impossible for two processes to do it at the same time. You're basically moving the lock out of your code, and maybe down to the architecture level of your machine. I was actually arguing, debating, whatever you want to call it, with somebody a couple of days ago about whether or not that's an actual lock. Because it's one CPU instruction, right? So is it still a lock? Maybe. I don't know. But in general, it's not an issue, as long as your CPU supports those atomic instructions. If it doesn't, well, if I built a system that doesn't use locks like this on an architecture which doesn't have the atomic operations, shame on me, because I made a bad choice as a developer, as an engineer, as a tech person, whatever. You have to be aware of the architecture issues. But in general, it's not a problem.

I'm wondering whether you've tried a higher-order optimization method, instead of just stochastic gradient descent. Because in theory, something like Newton's method would be a lot faster in terms of convergence. And on a single computer, wouldn't that be the better method?

So, I'm jogging my memory now. Newton's method converges faster, yes. I think it's quadratic convergence versus linear. It does converge in fewer steps. But the processing requirement for each step is quite heavy, if I remember correctly. You're right that Newton's method will converge faster in terms of steps, how many steps it needs, and the linear method, if you use SGD, will converge more slowly. But the computational requirement to execute one step of Newton's method can be quite heavy. We can check afterwards. I haven't looked into it in a long time, so honestly I can't remember the details, but I seem to remember that being the reason. So if you have less data, then yes, Newton's method can work quite well. But if you have a lot of data, it's maybe not the best option. And in this framework, where we don't know how much data we have, it could be 10 gigs, it could be a terabyte, we don't know. If you have one megabyte of data, maybe Newton's method is ideal. But we're not making the assumption that we know in advance how much data we have.

So in what use cases have you seen Hadoop, or just Spark, make sense?

Who laughs? You think I'm going to say something funny? Never makes sense, right? That's not true.
So what's the use case you've seen for Hadoop, or Spark for that matter? Does it ever make sense? Who's laughing? You think I'm going to say something funny? That it never makes sense, right? That's not true. It does make sense where you have a really large amount of log data or something like that that you need to preprocess so that you can then do machine learning with it. If you have, I don't know, petabytes or something. I don't have petabytes of data. But if you have a lot of data, it can make sense. Yeah.

Oh, so excluding ETL stuff, when does it make sense? I've never seen it make sense. There may be cases where it does. I've never worked at Facebook. I've never worked at Google. I've never worked at Twitter or Microsoft. So maybe those companies have cases where it makes sense for them to do their actual machine learning on Hadoop or with Spark or something like that. But I'd be kind of surprised, because I have acquaintances who work at Amazon, and they say the same thing: there's no way you need more than an EC2 instance to do machine learning. Just no. And they were doing image recognition, image-related machine learning, for Amazon. I've never worked there, so I can't say with 100% certainty. But my guess is that, with extremely rare exceptions, there's just no need for the machine learning case. For the ETL case, I think it can make sense.

And besides the machine learning, what about the feature extraction? In this case, we do streaming feature extraction, actually. So the feature extraction we don't have to worry about, because we do it when we see the record. We're basically doing the hashing once we get the record, on a per-record basis. So in that case, it's not an issue. But I'm not saying there isn't a case; I'm just saying I haven't seen one. Yep.

But what if I have a lot of data and I just want to explore it? Doesn't this approach take really long for that? So if I'm doing some kind of exploratory stuff, I would just take a sample of the data. I won't try to do all of it, because I'm just trying to understand it, right? What kind of features are in there? What does it look like? Probably look at some histograms, because I love histograms. Those kinds of things. I'm not really worried too much about doing full-scale model training at that point. If I needed to do that, then I probably would use a different tool. But for the exploratory stuff, I use Python, pandas, and scikit-learn, pretty much. Sometimes I'll use PyPy. There have been times where I wrote some Cython for some problems, because there were some computational things that were a bit heavy, and writing five lines of Cython fixed that problem. So that's a good example where extending and using Python more made sense. But in general, just the same stuff, the normal PyData stuff: Python, pandas, scikit-learn. Other questions?
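(A minimal sketch of that streaming, per-record pattern with scikit-learn; stream_of_batches is a hypothetical generator standing in for whatever feeds you records.)

    from sklearn.feature_extraction import FeatureHasher
    from sklearn.linear_model import SGDClassifier

    hasher = FeatureHasher(n_features=2 ** 20)   # 2**D slots, fixed up front
    clf = SGDClassifier()

    # stream_of_batches() would yield (list of feature dicts, labels) pairs
    for records, labels in stream_of_batches():
        X = hasher.transform(records)            # hash each record as it arrives
        clf.partial_fit(X, labels, classes=[0, 1])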
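(And for the exploratory workflow just mentioned, a tiny sketch; the file name is made up.)

    import pandas as pd

    # Pull a 1% sample out of a file that's too big to eyeball whole
    chunks = pd.read_csv("big_logs.csv", chunksize=100_000)
    sample = pd.concat(chunk.sample(frac=0.01, random_state=0) for chunk in chunks)
    print(sample.describe())
    sample.hist(figsize=(12, 8))     # because I love histograms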
Yeah? You said that the bottleneck is usually the computation? I'm sorry? The bottleneck is usually the computation in Python, not the IO? The bottleneck of what? In this case? I mean, it definitely wasn't IO for me, because I'm pulling from SSD, so I'm still fine on IO. If you have an IO bottleneck, and this is actually a good question, if you have an IO bottleneck and it's not computational, then the global interpreter lock is not an issue, because the global interpreter lock is released during IO and during certain computations that are done out in C land. So for some NumPy stuff, you don't have to worry about the global interpreter lock. If you're doing a web-based API or something, then you may have more of an IO bottleneck. But that's kind of a different problem, right? What we're describing is how much data you can reasonably process on a small machine. And for that, you pretty much have to discount the possibility that that small machine is... well, discount's not the right word. But on my laptop, I'm not worried about running an HTTP endpoint for my model; I'm just pulling the data from disk. So in this case, no, there was no IO bottleneck. It was all CPU. Other questions?

So I'm curious, what's the limit in terms of the machine? Oh, this will work on anything. You can do this on a Raspberry Pi. It's not specific to the machine. What is specific, and I won't go back to the slide, is that you have to decide in advance how much memory you want to allocate for the weights, and that's machine dependent. But other than that, this approach will work on any machine in general. You mean in terms of CPU and stuff like that? Yeah, it's not specific. Sorry? So you could run this on a Raspberry Pi if you wanted to? Yeah, I mean, you can run this on a Raspberry Pi if you want. You won't have as much RAM, but that's OK. You just won't have as much space for the weights. Any other questions? Yeah.

So what kind of machine learning algorithms does this approach lend itself to, and which ones doesn't it work for? Right, so you mean for the feature extraction or for the partial fitting? The partial fitting. Which kinds of problems, or which algorithms? Which algorithms? That's a really excellent question. I can't think of any off the top of my head. I mean, there are some; I'm not saying I can't think of them because they don't exist, I'm saying I can't think of them because I don't know. Does anyone else know? What's the question? Oh, the question was: are there any algorithms where you can't do this, where you can't partially fit them? I mean, if you want to do regular full-batch gradient descent, then you can't, right? Because you have to calculate the whole gradient over the whole data set. So that wouldn't work. Sorry? I don't, I really, I genuinely don't know. But you can kind of get that from regularization, right? You can add regularization terms to SGD and logistic regression, and then you get the same result. Maybe we talk after. It's not parametric, that's why. OK, maybe we talk after. So the answer is almost certainly yes, there are some, but I don't know which. OK, other questions? Cool. Thank you very much. This is my information. You can find me around here. Thank you.
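(Following up on that last question about which algorithms can be partially fit: one way to check, not shown in the talk, is to ask scikit-learn itself.)

    from sklearn.utils import all_estimators

    # Print every scikit-learn estimator that supports incremental fitting
    for name, cls in all_estimators():
        if hasattr(cls, "partial_fit"):
            print(name)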