 I'm super happy to be here. I've been using Ruby and Rails for ages, and it feels great to be speaking at RailsConf. So, as Collab did yesterday, you should follow me on Twitter. You followed him, so I think you can just follow me just as well. So, yes, I'm actually Italian. I can speak Italian, I can cook Italian dishes; I think that makes me pretty much Italian. I started doing Ruby in 2010. In the last couple of years I've been doing mostly JavaScript. I got really interested in machine learning, and I climb a lot. I'm not a good climber, I just climb a lot. We're trying to organize a climbing session tomorrow, so if anyone is interested, just follow me on Twitter and hit me up. Okay, cool. Right now I'm working at a company in London called Erlang Solutions, and they were very kind to send me here, so I'm going to tell the world about how awesome they are. They were the first company who actually believed in Erlang as a technology, back in 1999, which seems like a century ago. I guess it is. And of course they love Elixir; they love all sorts of technologies that run on the BEAM: Riak, RabbitMQ, all that sort of stuff. Okay, that was my business blurb. So, my goal today is to do some live coding. Let me just first ask you... oh no, VLC has crashed. Clear. Can you guys in the back see all this? Could it be bigger? A little bit bigger? More? Good. I'm just going to keep going. No? Is it okay? Less? Okay, I'm just going to go with this, but if you can't see it, let me know. But before I start, I just want to show you my desktop background. I think it's really nice, because there's this obvious Photoshop in the front, such high-quality Photoshop, and then you have these two guys in the back. They're not even looking at each other. What's the point? I don't know. I hope you're familiar with the movie. If you're not familiar with the movie, raise your hand. 
Keep it raised, please. Well, okay, I can turn it off now. Anyway, what I wanted to do today: first of all, there's this amazing WebM which I showed you before, this guy playing the flute. But there's also a train.csv file, and this is actually real historical data about the Titanic passengers. There are 892 lines, minus one for the header, so 891 passengers. And for each one of these passengers we have this information: the passenger ID, whether they survived, their passenger class, their name, their sex, their age, the number of siblings or spouses aboard, the number of parents or children aboard, which ticket they bought, the fare (how much they paid for it), the cabin, and where they embarked on the ship. So what we're going to do today is, well, I have to admit it, the main goal is just to make sure that the movie didn't lie to us. That was my main scientific goal when I started on this adventure. We're going to use some Python libraries. I think the guy who spoke before me already explained why Python is the way to go when you're doing machine learning, and I think he's right. There are some really amazing libraries, and I hope today I can show you just how good they are. So yeah, let's start. I'm going to create a file called visualize.py, and we're going to use this library called pandas, which is an amazing name because everyone loves pandas, and it's one of the best libraries I've ever seen for dealing with CSV files, for example. We're also going to import matplotlib.pyplot, which is a visualization library. There's this concept in pandas of data frames: whenever you load your CSV, it gets converted into a pandas DataFrame. So I'm just going to call it df, and then with pandas you just do read_csv, pass it the file, and it will automatically read it in. 
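As a sketch of that read_csv step: the real train.csv isn't bundled here, so this substitutes a tiny in-memory CSV with the same column names (the rows are invented for illustration):

```python
import io
import pandas as pd

# A tiny stand-in for train.csv: same columns as the Kaggle file, made-up rows
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
"""

# The same call works with a file path: pd.read_csv("train.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)              # (3, 12)
print(list(df.columns[:4]))  # ['PassengerId', 'Survived', 'Pclass', 'Name']
```

With the actual file you would just write `df = pd.read_csv("train.csv")` and get an 891-row DataFrame.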
So for example, if we wanted to see a little bit more about the distribution of the survival rates, we just print df "Survived", this column is picked up automatically by the library, and then we can just call value_counts. Okay. Feel free to interrupt me if anything is unclear. Okay. So if I just run this, you'll see, oh, okay, of all these lines, we only have 342 survivors, and the rest unfortunately did not survive. But it would be nice if we could actually see this, right? To do that, we just call plot, we say what kind of graph we want, in this case a bar chart, and I'm just going to set some opacity. And actually I should create the figure first, so I just do plt.figure and pass a figsize, 18 by 6. Okay, good. And at the end I just do plt.show. So hopefully this is going to work. And here it is. Okay. So in two lines, we basically transformed this information into this graph. And of course, as humans, we can't reason too well with raw counts, so I want to see the percentages. In order to do that, I just have to write normalize=True here. And if I do that and run Python, you'll see this is what happens. So we can see that on our dataset, about 38% of people survived and about 62% unfortunately did not. Okay. But what's better than a single graph is loads of graphs, and in order to do that, we're going to use subplot2grid, which basically creates a grid of subplots. Does that make sense? I don't know. But the only thing we need to do is set a title for this one, and we're going to say this is "Survived". Okay, cool. And basically this structure just means: this is a rectangle, two rows, three columns, and this is the first cell, right? So if I run this, you'll see that it basically just puts the graph up there. 
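The value_counts and normalize=True steps can be sketched on a toy Survived column (the five rows here are invented, not the real data; the plotting calls from the talk are shown as comments):

```python
import pandas as pd

# Hypothetical stand-in for the Survived column (1 = survived, 0 = did not)
df = pd.DataFrame({"Survived": [0, 0, 0, 1, 1]})

counts = df["Survived"].value_counts()                # raw counts
shares = df["Survived"].value_counts(normalize=True)  # fractions instead of counts

print(counts[0], counts[1])  # 3 2
print(shares[1])             # 0.4

# With matplotlib you would then draw the bar chart from the talk, roughly:
#   import matplotlib.pyplot as plt
#   plt.figure(figsize=(18, 6))
#   counts.plot(kind="bar", alpha=0.5)
#   plt.show()
```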
I really hate the fact that by default the graphs are not full screen, so I have two lines of really nice code which will do that, and I'm just going to paste them in. Okay. Okay, cool. So now we can keep making more graphs, and I'm just going to use the good old copy-paste. What we want to do now is see if there is a relationship between the survival rate and the age. There is a tool called a scatter plot which will show you this. The only thing we need to do, let me just remove these two lines... okay, so we take the survival rate and compare it to the age. And I'm going to pass a little bit of opacity here as well, because otherwise the dots would just get too clumped together. And here I'll say this is the age with regard to the survival rate, okay? And if I plot this, we'll see that actually, and this was quite unexpected, at least for me when I started doing this, there is no apparent connection between age and survival rate. You can see the main lumps of people are between 20 and 40, both on the left-hand side and the right-hand side. As I said before, one means survived and zero means perished. And if we take a look a bit more closely, you can see that older people perished a little bit more, and there are younger people here who might have survived more, but other than that, I think the distribution still doesn't allow us to draw any conclusion. Okay. Something else we might want to look at is the distribution of the passenger classes. So here I'm just going to change this to the next cell, and here, instead of Survived, I'm going to use the passenger class, and the same here. Okay? And if I run this, we'll see that most of the passengers were in the third class, and then we have people in the first class and people in the second class. 
And I think this is quite what we expected, because there were more people who couldn't afford the more expensive tickets. Something else we might want to look at is the relationship between the age of the passengers and the class they were able to buy. And in order to do that, I'm going to do something which will probably make you very, very afraid if you've worked with HTML at some point. Tables. Colspan, anyone? Yeah? Yeah, colspan rocks. Anyway, we're going to use this feature of Python called a list comprehension. It's basically [1, 2, 3].each, right? And for each one of those, we want to display the age where the passenger class is equal to that number. When you use the square brackets like this, you're filtering: you extract the age, but only for the rows whose passenger class was x, okay? And for each one of those, I'm going to create a new graph, in this case what's called a kernel density estimation. You can look up on Wikipedia afterwards what it actually means, but it looks pretty, so that's why I've added it. And here the title is going to be age with regard to Pclass. And I'm just going to add a little legend to the graph so that you can actually tell what is going on. First, second, third, okay? And if I run this, you'll see this is a really nice graph, right? Anyone agree with me? Anyone disagree, most importantly? Okay, so we can see that the third class passengers are way younger, their average age is around 20 years old; the second class passengers are around 30 years old; and the first class passengers are around 40. So as people get older, they get richer, and therefore they can buy more expensive tickets. 
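That list-comprehension filter can be sketched like this, with a hypothetical mini-dataset standing in for the real one (the KDE plot itself is left as a comment):

```python
import pandas as pd

# Invented mini-dataset: just ages and classes
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3],
    "Age":    [45, 40, 30, 28, 22, 19, 24],
})

# Boolean-mask filtering: the Age values for each passenger class
ages_by_class = [df["Age"][df["Pclass"] == x] for x in [1, 2, 3]]

for x, ages in zip([1, 2, 3], ages_by_class):
    print(x, ages.mean())
# With matplotlib you would draw one curve per class, roughly:
#   ages.plot(kind="kde", label=str(x))
```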
Since I was really interested in understanding more about the movie, I actually figured out that there is a glaring historical omission, which is that the ship actually made two more stops. It started in Southampton, but it made two other stops, and we can look at the Embarked column. I'll just put it in the next cell; I think this is okay. And if I run this, you'll see that about 70% of our dataset embarked at Southampton in England, but then the ship made a pit stop in France, in Cherbourg, and then it made another stop in Ireland, in Queenstown. So yeah, if you're Titanic fans, I think this is a good piece of trivia. At parties especially, I think it's quite cool. But I think it's quite nice that with these 15 lines of Python code, you can actually drill down into the dataset, try to put some of these values in correlation, and see what's going on. Instead of building a prototype and taking at least a week or something, using these libraries it just takes you five minutes, I think. I think that's pretty powerful. Something which is missing from these graphs is the gender of the passengers, and the reason it's missing is because I think it's quite important, so I think it deserves a dashboard of its own. So I'm going to do some nice copy-paste: gender.py, copy everything, delete this. And actually I don't really need most of this stuff, so I'm going to comment it out and just leave this, okay? What I want to do now is look at the difference in survival rates between men and women, right? But I want to make more graphs, because more graphs are more awesome, so I'm just going to do this quickly. And here I want to show the survival rate but only where the sex is male, okay? And here I'll write "Men survived", and then I'll do the same thing here with sex equal to female, and here it will be "Women survived", okay? 
To make things a little bit nicer, I'll just create a color just for female and make it something different. Color, female color. I think this looks good. And we can see here that the graphs have pretty much the same shape. In total, about 38% of people survived; the men have about a 20% survival rate; and the women look quite similar, unless you look at the numbers underneath. You probably can't really see it, but if I zoom in, it says one, right? So actually about 70% of the women survived, at least in this dataset. By itself, this information doesn't mean that much, because if you think of a room with 100 men and one woman... sorry. This information by itself is not really significant. So maybe we can check whether that's the case or not. And to do that, we just take the same code we used here and say: show me the sex column, okay? And we can see that... oh, I think I made a mistake somewhere, I should have updated this to three. Okay, and we can see that there are more men, but the difference is not that big: it's something like 65% versus 35%. So I think the dataset is quite balanced in that way. It's not an inconsistency of the dataset, so there must be something else, right? Before, we looked at the passenger class distribution correlated with the age, and we can do the same thing correlating the passenger class with the survival rate. So I'm just going to uncomment this code we had before, and here, instead of the age, I'm going to look at the survival rate. I think the rest can remain the same apart from this, and I'm just going to make the colspan larger. Every time there's a colspan, it's funny. And here we have the graph where you can see, on the left, that the passengers in the third class are in such a bad spot. 
And then instead the passengers in the first class, they have a good survival rate. So I think this is quite cool, because we can see on the first row the difference between the genders, and here the difference between the passenger classes. So maybe what we can do is try to combine the first two rows and see if there's some striking feature of the data. In order to do that, I'm just going to copy this. Don't worry about the code, I'll put it online; I think it's already online, actually. What I'm going to do is look at the survival rate of all the men, but I also want to add another condition. So I'm going to use this & and say that the Pclass is equal to one, and here I just say "First class men survived", okay? And then I'm going to do the same thing, but change this to one, ooh, to one, and this to three, and this is going to be "Third class men survived". And if I run this, you'll see that this pretty much confirms our suspicions from before. The first class men have about a 35% survival rate, and the third class men have a little over 10%. So if you're a third class man, you're not doing great. Instead, if we want to do the same thing for women, I'm just going to copy this. There's a lot of copy-pasting in this talk, as you can see. Well, one could say the same thing about programming in general. And so there's a color, which is the female color, and here I'm just going to write women, and the last one is going to be women in third class, okay? So third class. Okay, cool. And I think this is quite striking, because for the first class women, I actually had to check the dataset, because when I first ran this I couldn't really believe it. I think there are 78 first class women in the dataset and 77 of them survived. 
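Combining the two filters with `&` looks roughly like this (toy rows, invented for illustration; note that each condition needs its own parentheses, since `&` binds tighter than `==` in Python):

```python
import pandas as pd

# Invented rows mimicking the columns used in the talk
df = pd.DataFrame({
    "Sex":      ["male", "male", "male", "female", "female", "female"],
    "Pclass":   [1, 3, 3, 1, 3, 3],
    "Survived": [1, 0, 0, 1, 1, 0],
})

# Two conditions combined with &, each wrapped in parentheses
first_class_men = df["Survived"][(df["Sex"] == "male") & (df["Pclass"] == 1)]
third_class_men = df["Survived"][(df["Sex"] == "male") & (df["Pclass"] == 3)]

print(first_class_men.mean())  # 1.0 on this toy data
print(third_class_men.mean())  # 0.0 on this toy data
```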
And instead, if you're a third class woman, the distribution is more even, like 50-50. But, especially given my initial scientific goal for this exploration, I can say that the movie is actually confirmed. There is some historical accuracy to the movie, because this third class man would be Jack and this first class woman would be Rose, so it's not really a surprise that Jack perished and Rose survived. It's not just a fictional mechanism; it's actually how things went. So we can take a look at the picture again. No? Okay, but as you can see, it's super nice to be able to do this kind of data slicing and visualize these things in a fairly easy way. But what would be even cooler is to try to run some predictions, right? We all want to be wizards one day. So what we're going to do is create the most basic heuristic we can, which is just to predict that if you're a woman, you're going to survive, and if you're a man, you're going to die. I'm sorry for all the male audience in the room. So we're just going to create a new file. Actually, I'm just going to call it predict.py and import pandas as pd. And here I'm going to call this data frame train, because that's what machine learning calls the training set, the initial data that you use to build the model, right? So here I'm going to do read_csv, train.csv. And what I want to do is create a new column, and I'm going to do that just by assigning to it. This way, it's going to create a new column called hyp, for hypothesis, but it's like a cooler hypothesis. Thank you. And I'm going to initialize the whole column to zero, except that when some condition applies, I want that column to be one. So I'm going to use this function, which is called loc. And this function takes, well, it's not really a function, but anyway, this thing takes two arguments. 
The first one is the condition, and the second one is the column you want to update. So the condition is df sex equals equals female, and the column I want to update is the hypothesis, and I'm going to set that to one, okay? And so in this way we've basically created our first guess at how to predict the outcome we want to predict. But now we want to check how accurate our prediction is. So I'm going to do something quite similar: I'm just going to create a new column called result, and inside the result, I'm going to check the survived column against our hypothesis, and here I'm going to set the result, okay? And basically now we can do the same thing we did before: we just extract the result column and run value_counts on it. And I'll just run it like this for now, okay? No, df is not defined. I think it's because I called it train. Train, sorry. Okay, this should be all right. And we can see that this thing was correct 701 times and wrong 190 times, which in percentages means something I can't remember off-hand, but it's around 78%, okay? So if you consider that if you were only guessing, you'd have an accuracy of about 50%, you just improved your heuristic by 28 percentage points, with a very simple guess, okay? And you got to that guess because you took a look at the data and tried to understand a little bit how the data works. And of course, if you use more advanced algorithms, you'll be able to improve that number. But I think it's quite cool just to see the huge difference between when you have a basic understanding of the data and when you don't. So I think this is cool, but something even cooler is to let the machine do the whole thing, right? Luckily, there's a set, an ensemble of libraries, which is super well known, called scikit-learn. And inside scikit-learn, there are basically so many machine learning models. So what we're going to do is use a model which is called a linear model. 
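The whole hypothesis-and-result check can be sketched on a made-up five-row training set (real train.csv swapped for invented rows):

```python
import pandas as pd

# Invented stand-in for train.csv
train = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "male"],
    "Survived": [0, 1, 0, 0, 1],
})

# Hypothesis: every woman survives, every man does not
train["Hyp"] = 0
train.loc[train["Sex"] == "female", "Hyp"] = 1

# Result: 1 wherever the hypothesis matched reality
train["Result"] = 0
train.loc[train["Survived"] == train["Hyp"], "Result"] = 1

# Share of correct guesses, like the ~78% figure in the talk
accuracy = train["Result"].value_counts(normalize=True)[1]
print(accuracy)  # 0.6 on this toy data
```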
So I'm going to create linear.py, and I just want to quickly explain how a linear model works, right? So if you see a graph like this, where all these are the data points, we as humans would guess: okay, the data is exhibiting this sort of trend, right? And I think this is quite simple; if you ask my cousin, who's five years old, he could probably do it. But the problem is that in real life, datasets have way more features than two. So for example, if you're doing image recognition, every pixel of the image is a different feature. So if you're analyzing images which are 50 by 50 pixels, you're facing a problem which has 2,500 dimensions. And if I asked you to solve the same problem in 2,500 dimensions, I don't really know who would be able to do that. But the advantage of having a numerical approach which does the same thing is that, apart from performance, of course, the numerical approach doesn't care: it will solve the problem just as well. And I think that's why most people find machine learning quite scary: it tries to condense this sort of human knowledge into some numbers somewhere, and there's no really good explanation for us of what those numbers mean. But I don't really want to scare you with my dystopian future tales. So I'm just going to import pandas again as pd, train equals pd.read_csv, train.csv. I'm just going to show you (I said I was going to live code, but I actually lied) some helper functions I wrote before, because I didn't really want to bore you with these details. But basically, of course, one of the things you always have to do with data is clean it up. So for example, in this case, some rows don't have the fare information, some rows don't have the age information, so I'm just filling them back in with the average value. 
And then another thing is that most of these numerical approaches work really well with numbers, but they can't work with strings. So I'm just converting the sex to a number: if you're male, you're going to be zero, and if you're female, you're going to be one. The same thing applies to the embarked information. Okay. So here, what I'm going to do is import these utils and just do utils.clean_data on train. Okay. And the way most of these algorithms work is that you tell the algorithm: these are all the inputs, and there is one output. The inputs are called features. So what we're going to do is extract these features. Let's say I want to use the passenger class, the age, the sex, and let's say the fare, okay? And I'm going to extract those values. And then I want the target, which is the survived information, and I'm going to extract that as well. Okay. And since this algorithm is going to try to decide whether a row goes into the survived bucket or the deceased bucket, it's usually called a classifier. And here I'm going to use scikit-learn. So from scikit-learn, I'm going to import this linear model, and this linear model has this little thing called logistic regression, okay? Which is basically the same thing I described before: trying to figure out a good line to separate two datasets. And it's super simple. You take this classifier, you say, okay, fit these features against this target, and then print me the score of these features against this target. Okay. And if I run that, it's about 79%. And as you can see, I didn't tell the algorithm anything at all. It just figured it out: okay, these are the inputs, this is the output this guy asked me to predict, I'm just going to do my best. And the model it constructed was a little bit better than our naive intuition. 
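A minimal sketch of that logistic regression step, with a small invented, already-cleaned dataset standing in for the real features (sex already encoded as 0 = male, 1 = female):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Invented, already-cleaned rows standing in for the real train.csv
train = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3, 1, 2, 3, 3],
    "Sex":      [1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
    "Age":      [38, 35, 27, 22, 30, 19, 54, 14, 40, 26],
    "Fare":     [71.3, 53.1, 21.0, 7.25, 8.05, 7.9, 51.9, 30.1, 7.9, 7.8],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
})

# Features in, one target out
features = train[["Pclass", "Age", "Sex", "Fare"]].values
target = train["Survived"].values

clf = LogisticRegression(max_iter=1000)
clf.fit(features, target)           # learn weights for a separating line
print(clf.score(features, target))  # accuracy on the training set
```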
Of course, the linear model isn't always the right answer, because we know that in real life there are problems which are not linear. So if I asked you to describe this data, you wouldn't say it's this, right? I mean, unless you have a thing for straight lines. Ah, maybe you'd do something like this, which is not too bad, I'm not judging, but I would say that usually you'd go for something like this, right? So luckily, most machine learning experts recognize this hatred for straight lines as well. And they created this module, which is called preprocessing, where you can basically manipulate your features and transform them into a sort of polynomial transformation. So basically, you can make them quadratic if you want to, and it will take your data, combine the features, multiply them together, and create new columns for you. So what I'm going to do is just create one of these transformers: PolynomialFeatures. And you have to pass the degree; in this case, I'm just going to try the quadratic one. And now we can transform our existing features using fit_transform on the features, okay? So basically, we had these features, and in this way we transform them into their quadratic versions. And now we do the exact same thing. We just say, okay, fit these poly features against the target, and then print me the score of this other classifier. Poly features, target, okay? And we see this thing already improved a little bit. I think the mental model is that now the algorithm is trying to match against data which behaves in a quadratic way. Usually, if you add more information, the algorithm will behave better. So for example, I can add the number of parents and children and the number of siblings and spouses. I think this should be enough, and let me try to run it again. 
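The PolynomialFeatures transformer can be shown on a tiny array: with degree=2 it keeps a bias column and the original features, and adds the squares and pairwise products as new columns:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two features per row, two rows, invented for illustration
features = np.array([[1.0, 2.0],
                     [3.0, 4.0]])

poly = PolynomialFeatures(degree=2)
poly_features = poly.fit_transform(features)

print(poly_features.shape)  # (2, 6): columns are 1, a, b, a^2, a*b, b^2
print(poly_features[0])     # [1. 1. 2. 1. 2. 4.]
```

These transformed columns are then what you fit the classifier against, exactly as with the original features.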
And you can see that with even more information, the algorithm is able to make better predictions, okay? And all this in about 20 lines of code. I think it's pretty cool. How much time do I have left? Six minutes. Okay, cool. So I just want to show you one more thing, which is what are called decision trees. Basically, there are these algorithms which build decision trees: each row in our CSV is run through a series of questions, and depending on the outcome of those questions, the row gets classified. So let me just do some good copy-paste. predict_tree.py. And no, I don't think my paste worked. Let me try again. Yes, this pattern. And instead of linear_model, we're going to import tree. I think we don't need this, so I'm just going to change this to DecisionTreeClassifier. All the rest remains unchanged. And in this way, the algorithm is going to build a tree and try to match against that. And if you're a normal person, you see this number and you're like, wow, this is so cool. But if you're a software developer, you know there's something wrong, because we all know there's no such thing as a free lunch. 98% is way too good. It can't be, right? So basically, what the algorithm did in this case is a phenomenon in machine learning called overfitting. We pass in some data, and the algorithm finds a very complex solution which matches all the data points exactly. And that's why afterwards, when we ask it, oh, how are things going? It's like, yeah, everything's great. I got it, bro. But when you see the solution, you're like, nah, I don't think so. Luckily, the scientists recognized this problem as well. And it's quite a big problem in machine learning, this problem of overfitting. So we have to figure out ways to make the machine learn the generalized properties of the system, not the specific properties. 
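Overfitting is easy to reproduce: give an unconstrained decision tree pure noise and it will still score perfectly on its own training data, because it can memorise every point (synthetic data here, not the Titanic set):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
features = rng.rand(100, 4)           # random, meaningless features
target = rng.randint(0, 2, size=100)  # random labels: nothing real to learn

clf = DecisionTreeClassifier(random_state=0)
clf.fit(features, target)
print(clf.score(features, target))    # 1.0 -- the tree memorised the noise
```

A perfect score on data with no signal is the tell: the tree learned the specific rows, not any general property.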
So in scikit-learn, there's this little thing called model_selection. And with it, you can basically say: okay, take this model and try to hide some data from the algorithm, so that it builds a model which is more generic. And I'm going to use this function called cross_val_score. You can explore later what it actually means, but I'm just going to pass the features and the target. You also have to pass a scoring methodology; I'm just going to pass accuracy. This is not great, don't do it in real life, but I think it's simple enough. And I'm going to say, okay, run this process 50 times. So basically, this thing is going to randomly sub-sample the dataset, hide some of it from the algorithm, run it again, and see how it behaves, right? So if I print this... scores, print, scores. Whoa, whoa, whoa, that's not how Python works, I'm afraid. There we go. And you see, when data is hidden from the algorithm, the actual average of the scores is quite abysmal. And this is what we expect, because we're building a solution which is too specific, so it can't really generalize. The good news is that there is a way to fix this, and it's basically to tell the algorithm: well, don't be too smart. So what I'm going to do is, first of all, pass a random state to the algorithm and then specify a max depth. So if you imagine this as a tree, the tree will never go more than six levels deep, okay? And then I'm going to pass this other little thing, min_samples_split, which basically controls how many elements have to reach a node before the tree is allowed to split it further. It's not super important, but I think it's nice. And if I rerun the same code again, you'll see the difference: the initial performance isn't as great, but the general average is much better, okay? 
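The cross-validation plus constrained-tree setup can be sketched like this (cv=5 instead of the 50 runs from the talk, to keep it quick, and synthetic data again):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
features = rng.rand(100, 4)
target = rng.randint(0, 2, size=100)

# Constrain the tree so it can only learn general patterns:
# at most 6 levels deep, and a node needs 20 samples before it may split
clf = DecisionTreeClassifier(random_state=0, max_depth=6, min_samples_split=20)

# Each fold is hidden from training, then used for scoring
scores = cross_val_score(clf, features, target, scoring="accuracy", cv=5)
print(len(scores), scores.mean())
```

On random labels the mean hovers around 0.5, which is honest: there is nothing to learn, and cross-validation exposes that where the training-set score did not.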
So basically, in this way, we've created a model which is more resilient, because before we were cheating: we were just passing in all the data. But in real life, you train the model and then score it on new data, so it has to work well for data it's never seen before. And I still think this is quite opaque. So luckily there is a really nice... no wait, it's called tree. There's a really nice utility inside the tree library to export the tree to a Graphviz file. So actually, let me just double-check how I do that. Okay, gotcha. You just pass the tree, you pass the feature names, which I don't have, so I'm going to have to add that, and you set the out file to a .dot file, okay? So I'm just going to cut this into feature_names: feature_names equals the thing I just cut. I think this should work. Okay, cool. So if I, Ctrl-Z, if I ls, there's a file called tree.dot. If I cat it, it's just a little bit weird, but I can convert it to a PNG file: tree.dot into tree.png. And if I open this folder, you'll see that there's a really nice graph now. This is the actual tree the algorithm built, just visualized, and these are all the decisions the tree makes in order to classify your row. And as you can see, the top-level decisions are the most important decisions, right? It's not a coincidence that the first decision the algorithm makes is: is the passenger a man or a woman? And the one immediately after that is: is it a child? Is the age of this passenger less than 6.5? And the other immediate decision is: what is the passenger class, okay? So basically, the algorithm figured out a lot of things which we figured out on our own. But as I said before, this thing can do it in a thousand dimensions. So yeah, that's all I have.
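That export step can be sketched with a toy tree; export_graphviz writes Graphviz source that the dot tool can turn into a PNG (the feature names and data here are invented):

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Tiny invented training set: passenger class, and sex (0 = male, 1 = female)
features = [[3, 0], [1, 1], [3, 1], [1, 0]]
target = [0, 1, 1, 0]
feature_names = ["Pclass", "Sex"]

clf = DecisionTreeClassifier(random_state=0, max_depth=3)
clf.fit(features, target)

# Writes Graphviz source; render it with: dot -Tpng tree.dot -o tree.png
export_graphviz(clf, feature_names=feature_names, out_file="tree.dot")

with open("tree.dot") as f:
    print(f.read().splitlines()[0])  # the file starts with a digraph declaration
```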