Welcome to the data science workshop by UpCode Academy. First of all, before anything happens: those of you who brought a computer and want to go through the presentation with me can go to datascience.marcoag.com, and you will see something like this. This is called Binder. It will set up a virtual machine for you with an environment containing the same tools I'm using, so you'll be able to follow the presentation and run the code I'm going to show you. It's actually a very cool tool. If you click on the show part, you can see that it's actually launching the server, so in a bit it will set up the whole thing, and it should look something like this. This is the environment we're going to use, and if you click here, you get the presentation. This presentation is interactive: it contains code and you can actually run it. That's why we're using the Binder option. I think this one has already launched, so once you get here, you can click through. Because Binder is in beta and not a final version, sometimes it doesn't work properly, so you might have to reload it and so on. Please be patient, because this is really new stuff they're doing, and they're providing it for free; just bear in mind it's not very stable, okay? So I'll go to my presentation and introduce myself. My name is Marco, and I'm the Head of AI and Chief Data Scientist at UpCode Academy. I came to Singapore to finish my PhD in robotics just across the street, with the guys from A*STAR, and after that I stayed around; now I'm working full-time at UpCode Academy, in charge of the data science track. So any questions you have about the courses or anything else, I'm here to help.
So, enough about me; if you want, you can check my full profile on the UpCode Academy website. I want to get an idea of your backgrounds, so I can go faster or slower and explain different things. First of all, who here knows any programming language? Raise your hand. Okay, quite a few. Who knows Python? Good. Who knows Jupyter Notebook? Anyone? All right, a few. And who hasn't done any coding ever? It's fine, no worries; it's just so I know what to explain. So, what I'm using here is a Jupyter Notebook. For those of you who have never seen one: it looks like a presentation because I'm using an extension called RISE, but it's actually a Jupyter Notebook. You can exit presentation mode and see the whole thing. Each cell is written in a markup language, and the cells become slides, so you can play around with it if you go to the website. There are also some other options if you want. The source code is online on my GitHub, at github.com/marcoag, under the repo called Hacking Online Dating. The README is not very up to date, but it's pretty much the same content. The first way to run the code is the one I showed you: it's the easy one, because it sets everything up for you, but as I said, it's not very stable, and there are size limits on the data sets. You can still get the code at home, run it there, and it should be fine; either way, you'll be able to go through pretty much the whole presentation. Then you have the second option, which is the advanced option.
For those of you who already have Jupyter Notebook: you can get the code using git and run it on your own computer. This is the most advanced version, but it gives you more flexibility, and because some of these algorithms can take some time, running them on your own computer will be faster. And the third version, which is the not-recommended one because it's read-only: you won't be able to execute the code, but you will still be able to follow the presentation. If you click here, you get a non-interactive version of the Jupyter Notebook. It doesn't look as nice as the rest of the presentation, but you can follow along and see the code inside; you just won't be able to execute it, okay? So let's go back to the presentation. Everyone should be running, or trying to run, the Binder. Once you run it, just click on the notebook, which is the one with the logo that looks like a little book. See, now it seems to be working. There you go. Here you should be able to run the code and everything; it takes a while because it's slow. As I said, Binder will launch the presentation. You can get out of presentation mode by clicking on the X, and you can go back by clicking on the RISE extension up here. You can go back and forth, and we'll actually do that throughout the presentation, because sometimes it's easier to read code this way than in presentation mode. This RISE extension is also very new and not very stable either; writing code in presentation mode is awkward, at least for me, and the cursor jumps around. So we'll do it here, right? For those of you who don't know what a Jupyter Notebook is, it's basically a tool that runs in your browser.
We usually use it locally, not online like this, but it helps us teach, especially Python. You can use other languages, but it's most famous for Python. It's good because we can mix images, text, and code, and you'll see that we have code at some point. To execute code here, you only have to hit Run: there's a Run button here, and there's also a shortcut, Shift-Enter, which executes the code. For example, this here is a cell. Every now and then you'll find cells in the notebook that I put there; you won't have to write code yourself, because some people here don't know how to code, so the code is already there for you. But if you do know how to code, you can change the code and try things different from what I'm doing, so it's also interactive for people who can code. For example, if I want to print something, I can just write print("hello") and I get a hello. This is already Python code being executed, and you should be able to do it in your own version that you got through that URL. Another thing: to create a new cell, you just click here on the plus. So I think that's pretty much it for setup. Does anyone have any questions about this or the Jupyter Notebook? If you have any questions at any point, feel free to raise your hand, because it's better to answer them as soon as possible; I might skip over something, and especially in programming, all this knowledge builds on previous knowledge, so if you miss the starting part, you might lose other parts at the end. So: Jupyter Notebook with the RISE extension. I'm going to go into presentation mode for a while.
So: data science. What is data science? Let's have a look at it. The way this presentation works is that the main points go across, and if you go down, you get the sub-points; you can see the four arrows at the bottom for moving around, and pressing the space bar should move you through in order. Data science is an interdisciplinary field that uses scientific methods, processes, and algorithms to extract knowledge and insights from data in various forms. We have a bunch of data, and we use the different tools we have, mainly statistics, data analysis, and machine learning, and lately a lot of deep learning, to answer some question or give a solution to some problem. That is the basic idea of data science, and for that we use data. As an output, we usually say we get data-driven solutions: you have a problem, and you find a solution based on your data. "Data is the key": that belief is widely spread, and people tend to think that's how it is, that what matters is how big your data is and how fast you can process it, plus a bunch of buzzwords around that. Your data is so big it doesn't fit in Excel, so you use other methods, you use deep neural networks, you use Hadoop, EC2, Google Cloud, Pig, a bunch of other tools. Then people go around fighting about whether Python or R is better. That is the old way of thinking about data science. The truth is, and I keep coming back to this slide, the truth is that science is the key to data science.
The main point of data science, and why it's useful, is that it's able to answer a question. So it is very important that science is part of the equation. It is very important that you focus on what you want to achieve: you study the field you're trying to work in, and then work with the data in that field. It leans heavily on the science part: you have a question, you have data, and then you have to produce an answer. It's very important to understand the correlations you will find in your data, and to understand whether those correlations really matter. A lot of a data scientist's work is research, not just running algorithms: you have to actually understand what the problem is and how the problem relates to your data. So, why data science? Data science is very famous nowadays; it's one of these buzzwords everyone is saying, and it's supposed to be very useful. There are some examples I want to share with you. Data science helps you answer questions with data; we know that already. Here's an example: has anyone seen the movie Moneyball? Anyone? It's a good example of how data science can help you out. It's a movie in which they basically try to build a baseball team using data. The problem is that they don't have enough money to sign the players they want, so they use data and statistics on players to build the team. It's a good movie if you want to check it out, because it's very related to this field, and it's based on a true story, so it showcases the possibilities of data science. Another example is the Obama campaign, which was really data-driven.
It was so data-driven that people say it won because of that; the data science team of the Obama campaign is actually very famous. One of the key things they did the second time he ran for president: they tried to get the people who wanted to vote for Obama to actually come to the polls, and they discovered that the populations that were going to turn out were not the ones they expected; they were not the moderate voters who were supposed to be the target Obama voters. So sometimes you get cool outputs you don't expect from the data. This is an example where you get something that was not in your hypothesis, because usually you start with a hypothesis, you go through the data, and then you either confirm it or not. Another example is the Netflix Prize. Netflix ran a prize of $1 million, mostly for movie recommendation: given a movie, what are the other movies you might watch? One team won the prize with the best results, but one thing that was not taken into account in the contest was the engineering part. That's something we sometimes have to consider, depending on the problem. That team combined a lot of machine learning algorithms into an ensemble, so they got the best result, but the resulting algorithm was so heavy that it couldn't run on Netflix's servers. So the solution never got implemented in real life: they won the prize, but Netflix couldn't deploy the actual solution. That's another example of how you have to think about your whole problem, not only the algorithm, even if you get the best possible result. And yes, data scientist has been called the sexiest job of the 21st century, so that's another reason, right?
Why learn data science? Because it's really well paid and there's a lot of demand for it. I looked at some data: according to Forbes, it's one of the most in-demand jobs of 2018, and IBM predicts demand will rise by 28% by 2020. According to Glassdoor, it's the best job in the US; I think they only have statistics for the US and the UK, so I took the US one. It has a really good rating, 4.8 out of 5, a job satisfaction of 4.2 out of 5, a really good median salary, and a lot of job openings on their website. So this is another reason why you might want to learn data science. So how do we do this? How do we learn data science? There's a first rule of presentations, which is: don't demo, because it will always go wrong. But here at UpCode Academy we like to break this rule, so I'm going to go through some code and do some demos. For data science projects, you usually have a problem or a question, which is what I'm going to present here. Then you start gathering data for that problem in that domain: the data you think will answer your question, or that has the answer to your question. You gather that data in different forms; we'll see what those forms can be. Then you prepare that data: you go through it, check whether you have outliers or different kinds of data, put it in the correct form, and so on. Once you have that, you create a model, using statistics, machine learning, deep learning, any of those techniques; the specific technique is not the main point. The main point is creating a model that allows you to obtain a solution and finally make a data-driven decision, which is what you actually want out of data science: you want the data to back up your decision.
So, the problem. The problem I'm going to present here is that I'm a lonely guy: I spend too much time with computers, I'm alone at home, I live with my parents, and I don't have a girlfriend. So what happens? I went to OKCupid. Since I don't usually go out, I go to this website, join it, and try to start dating: I fill out my details, put up my picture, and all that. What happens is that I get pretty mediocre match percentages, only 24 girls are interested, and none of them are answering me. So that's a problem for me, and the result is that I'm still alone, still at home with my parents. I really want a girlfriend, and I'm a data scientist, right? So maybe I can do something about it. I already know Python, so I think that's a good way to start: I can do some data gathering and see what happens. Now comes the first step: I have the problem, now I need the data. Where do I get the data? There are different sources. You can gather it yourself, for example from a sensor; for self-driving cars, we have the nuTonomy guys right here, always driving around gathering data. Or you might want to get data from the Wi-Fi; there are different types of data you might want to use. Another very good source of data is the government. They have a lot of APIs, and they release a lot of data sets. For example, Singapore has data.gov.sg, which is a good source of data that you can use, and you can actually draw a lot of conclusions from that data. In the US, there's the Freedom of Information Act, so you're actually allowed to ask for data stored by the government.
You can file a request and they will have to send you the data. Actually, I just read a post today about a guy who wanted to figure out the best day for his parents to fly around Thanksgiving. He filed one of these requests with the San Francisco airport and got the Wi-Fi information, so he could check how many people signed up for the Wi-Fi per day, and from that, which days the airport is most crowded. That is not exact, and you could have better solutions, but it's quite close to reality: not everybody signs up for the Wi-Fi, but if you have more Wi-Fi users, you definitely know there are more people in the airport. That is the part where you have to think about whether your data really matches the reality behind your question. Data sets are another source; that's the one we're going to use today. Data sets are the most common and easiest option, but sometimes a data set might not match your domain. There are a lot of data sets released online; they're good for testing things out and trying different kinds of learning, but they might not match what you actually want to do. And sometimes you scrape data from the internet: you create a Python script that goes through different websites and extracts data from them. For example, you could scrape Wikipedia, save some of its data to a file, and then read that file and use the data to run your algorithms. That is also an option. So this here is already code: for those of you who don't know Python, these are the Python tools I'm going to use in the first part. For the first part, I signed up for OKCupid, and I'm selecting OKCupid because one of the founders is a mathematician, and he is actually very data-driven.
If you go to the OKCupid blog, you're going to see a lot of this; through the presentation, I think there are some links to the blog, because he was posting a lot about the data science behind OKCupid. One of the keys of OKCupid is their matching algorithm. This matching algorithm is essentially an attempt to put a mathematical formula to love: it takes two people and tries to come up with a percentage that says whether those two people will like each other. We'll see how the algorithm works, because it's actually public; there's a link to the founder explaining how it works, and then we'll try to do some hacking around it. First of all, we're using these tools. Beautiful Soup is a tool we actually teach in Data Science Introduction; it's meant for web scraping, going through website content and saving it to files. Requests is a Python library we use to make HTTP requests, the same kind our browser makes. re is for regular expressions; for those of you who don't know them, we use regular expressions to match certain patterns of text so we can extract parts of it. json is for JSON; for those who don't know, a JSON is basically a data structure with different named fields that you can extract, and I can show you how one looks. JSONs are used a lot to send information on the web; most websites, and a lot of servers and applications, use them. And we use time, I think, just to add some waiting so we don't make too many requests to the website. So I think we can execute this. Yeah, we can execute this.
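As a tiny standalone illustration of the last two tools (this snippet is not from the workshop notebook; the text and field names are made up), here is how `re` can pull a pattern out of text and how `json` turns the matched JSON string into a Python structure:

```python
import re
import json

# A fake page fragment, just for illustration.
text = 'window.data = {"username": "marco", "age": 30};'

# Regular expression: capture everything between the braces, inclusive.
match = re.search(r'\{.*\}', text)
raw_json = match.group(0)

# json.loads turns the JSON text into a Python dict with named fields.
data = json.loads(raw_json)
print(data["username"])  # -> marco
print(data["age"])       # -> 30
```

This is exactly the pair of steps we'll use later: a regex to isolate the interesting chunk, then `json.loads` to get a structure we can index by field name.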
You can run it by pressing here, or you can hit Shift-Enter. Nothing visible happens, because it's just loading the libraries, but you'll see the number on the left changes. If the number on the left changes, it means the cell was run. If there's an asterisk, it means it's still running; and if that asterisk stays there for a long time while you're not doing something heavy, there might be a problem. In that case you can come up here and press the restart button; that's just in case your notebook hangs, okay? So back to my data sources; yes, we were here. I'm going to stay out of presentation mode so we can actually see the code, because from now on it's mostly code. So, what I did, and you have to do it too if you want this part to work, because I'm not going to give you my password, is sign up for OKCupid. I went and signed up, and I got a username and password; you need that if you want to keep going. In the code you have, there's probably a default username and password that hopefully doesn't work, because I'm using my own. So you have to create a new cell and write username equals your username, and then your password. This is because I'm going to try to gather data from OKCupid profiles, and to access OKCupid profiles I need my username and password. But don't worry if you don't want to sign up; it's fine, because this is only for the data-gathering part, and you'll be able to execute the code for everything else. I have my username and password loaded already, and I created a little function that basically performs a login on the website. [Audience: it says we're missing a library?] Oh, you might need to have that module, yeah. You might have to install it on the Binder. You have it on Binder, right?
You're running on the server, so you might have to add it to the conda environment; it's a bit tricky, yeah. So for this, I created a login function, and it basically just performs the login: it creates a requests session, uses the login URL of the site, which is basically okcupid.com/login, and returns the session. So if I run this code, I'm now logged into OKCupid; I have a session, and I can use that session to request other pages. Then I created a URL with the search that I want, with a bunch of filters; if I remember well, it's any gender, anywhere in the world. The filters I set up are encoded in that URL. So I can run this, and I get a page. For those of you who don't know, this is HTML; this is what a page really looks like, plain text. That's what your browser receives and then renders into the nice-looking page you see. If I go through this code, it's quite big, but at some point you'll see it says ethnicity, drugs, smoking, so it's showing some user information there. If I check this one... no, not this one, because this one is mine: the display name is Marco, that's my profile. I have to go further down. We can search for "real name"... yes, this is the information from the users. I searched for some of the fields and realized that this is the user information: location Berlin, status single, and all that. But if you look at it, it doesn't actually come in the HTML; it comes in JavaScript. JavaScript, for those of you who don't know, is another type of code, one that gets executed in your browser.
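The login-then-search flow just described can be sketched as below, using the `requests` library. Note this is a minimal sketch under assumptions: the endpoint path and the form-field names (`username`, `password`) are placeholders for illustration, since the site's real login form may use different field names and sites change their endpoints over time.

```python
import requests

# Assumed endpoint, for illustration; inspect the real login form to confirm.
LOGIN_URL = "https://www.okcupid.com/login"

def login(username, password):
    """Create a session, post credentials, and return the logged-in session."""
    session = requests.Session()
    # Field names are an assumption about the login form.
    payload = {"username": username, "password": password}
    session.post(LOGIN_URL, data=payload)
    return session

def get_search_page(session, search_url):
    """Fetch a search-results page with the authenticated session."""
    response = session.get(search_url)
    return response.text  # plain HTML, as shown in the presentation

# Usage (requires a real account; the filters are encoded in the search URL):
# session = login("my_user", "my_pass")
# html = get_search_page(session, "https://www.okcupid.com/match?...")
```

The key idea is the `Session` object: it keeps the login cookies, so subsequent `session.get` calls are made as the logged-in user.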
What happens is that this JavaScript gets executed in your browser and changes the website once you have it, loading the user information into the page. So we have to do some little tricks to get that information out of there. The structure this data comes in is a JSON: it's a JSON variable inside the script. So I went down and basically realized it's inside this function. Here is one of those regular expressions I told you about; it basically selects the chunk of the page where the JSON is. I run this code and get a smaller version of the page, and the next step extracts the JSON from that chunk of code: it removes the rest of the function and outputs just the JSON. So this is the JSON structure. For you to see it, I'm going to copy the whole thing; you can do the same with whatever information you got, and you probably already have it there. There are a bunch of online JSON viewers that will format the whole thing and make it look nicer for you. So I paste the JSON there and format it, and then I can see the whole JSON with its different fields. You can see there's a parent here, then total matches, and I can look for the users; this looks like pictures, and this is a username. So I have user ID, username, gender, age, and so on. I've got some user information there already. Now I need to get that information out of there, right?
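The extraction step can be sketched like this on a toy page. The HTML and the variable name `searchData` are made up for this example (the real page uses its own names, so the pattern has to be tuned to the actual markup), but the technique is the one described above: a regex isolates the JSON chunk inside the script, then `json.loads` parses it.

```python
import re
import json

# Toy HTML with user data embedded in a <script>, the way the real page does it.
html = """
<html><body>
<script>
  var searchData = {"total_matches": 1,
                    "users": [{"username": "harlan", "age": 29,
                               "location": "Berlin", "gender": "M"}]};
</script>
</body></html>
"""

# Step 1: the regex selects the chunk of the page where the JSON sits.
# re.DOTALL lets '.' match across newlines; the pattern is illustrative.
match = re.search(r'var searchData = (\{.*?\});', html, re.DOTALL)
raw_json = match.group(1)

# Step 2: parse the chunk into a real data structure.
data = json.loads(raw_json)
print(data["users"][0]["username"])  # -> harlan
```

On the real page, the same two steps turn the JavaScript blob into a dict you can walk field by field.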
So I can load the JSON into a JSON structure, because up to now it was just plain text. This cell is basically just loading the JSON; I can run it and print the data. I think I got only one user, because if you go to the OKCupid website and paste the search we were using, the number of users you get is random. If you copy this URL and paste it into the OKCupid website... oh, I need to sign in first. Right, thank you. So if you come here and paste the URL, you can see the search that we're getting: I'm looking for all genders, everyone, and so on. And it's very random; every time you reload, you get a different number of users. So we have to take care of that. Here, we're accessing the data, and I want to access the first user. This is the first user, with a real name. This is a function I created that prints the information; this is how you access JSON fields. This whole thing accesses the information for one user; we can print it out and check that it's working. I run the function and I see that I got one user; that's why I was trying to access the second user, but I only got one. I'm not showing the real name and display name, but I'm showing the rest of the information, location and all that. This way, I already know that I can scrape some data from the users. Then I can save it to a file, run this for 60,000 users, and design some algorithms to decide what to do. So then I created this loop, which here only scrapes 10 users, because it takes too long.
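A sketch of that kind of loop, with a hypothetical `fetch_one_user` stub standing in for the real request-and-parse steps so it runs offline (the small ID pool mimics how the site keeps returning the same users). The important parts are the deduplication by user ID and the `time.sleep` pause between requests:

```python
import time
import random

def fetch_one_user():
    """Stub standing in for the real request + parse; returns a user dict."""
    uid = random.choice(["u1", "u2", "u3", "u4", "u5"])
    return {"userid": uid, "age": random.randint(20, 40)}

def scrape_users(target=3, delay=0.01, max_tries=200):
    """Collect `target` distinct users, sleeping between requests
    so we don't hammer the website."""
    seen = {}
    for _ in range(max_tries):
        user = fetch_one_user()
        if user["userid"] not in seen:   # skip users we already have
            seen[user["userid"]] = user
            print("got user", user["userid"])
        if len(seen) >= target:
            break
        time.sleep(delay)                # be polite: wait between requests
    return list(seen.values())

users = scrape_users(target=3)
print(len(users))  # -> 3
```

Because most fetches return users you already have, the loop spends most of its time waiting, which is why scraping a large sample this way takes hours or days.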
But if you run it, it takes a while, because you keep getting the same users every time. What the loop is doing is basically the whole process we've just done, several times, taking different users, and I'm just printing the user IDs. It takes a long time because those user IDs come up the same all the time, so you have to wait. If we keep running this for a whole day, or leave it overnight, we might get a decent number; I got 11 users there. So that's how you can gather data using web scraping. The other option is the one I actually used to get more data, because this takes forever, and because it's not very ethical to pass you the data I scraped from people's OKCupid profiles. So I got a data set. This data set has about 60,000 online dating profiles, it was published in the Journal of Statistics Education, and sharing it was explicitly allowed by an OKCupid co-founder. So we're OK to share it, and you're OK to get it and play around with it. There are some links there that you can check: a link to the paper, which is actually a very nice paper, because it's an introductory paper to data science; they use R instead of the Python we're using here, but they use this data set to introduce data science to a class. There's also an article from Wired about the ethical issues and all that; I'll let you check that out. So now, the data set. We can download it; for those of you using the Binder, hopefully this part works. You can execute this and you'll get the CSV file of the data set, which sits in a GitHub account, so you're going to get the data set with all the profiles.
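The download-and-unzip step can be sketched with the standard library. The GitHub URL below is a placeholder, not the real location of the file; to keep the sketch runnable offline, it builds a tiny stand-in archive and unzips that.

```python
import urllib.request
import zipfile

def download(url, zip_path):
    """Fetch the archive; used with the data set's real URL in the notebook."""
    urllib.request.urlretrieve(url, zip_path)

def unzip(zip_path, extract_dir="."):
    """Extract the archive in place; returns the names of the extracted files."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_dir)
        return archive.namelist()

# Offline demo: build a tiny stand-in archive so the unzip step runs here.
with zipfile.ZipFile("demo_profiles.zip", "w") as archive:
    archive.writestr("profiles.csv", "age,sex\n30,m\n")

names = unzip("demo_profiles.zip")
print(names)  # -> ['profiles.csv']

# Real usage (placeholder URL, not the data set's actual location):
# download("https://github.com/.../profiles.csv.zip", "profiles.csv.zip")
# unzip("profiles.csv.zip")
```

After the real download and unzip, you should see a profiles.csv file sitting in your working directory.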
Once we get the data set with all the profiles, we can go ahead and unzip it. This cell actually executes the unzip command, and it will unzip profiles.csv on your Binder, so you should see a profiles.csv file in the directory of your Binder machine. So congratulations, we got data; that's the first step in data science. Now we're going to play with it a little bit. We have a few libraries here that we're going to use. The first one, pandas, is, I would say, one of the most important here, though actually all of them are important: it's the one we use to mess around with the data, to read it, check it out, and plot information. It's also one of the ones we show our students in Data Science Introduction; pretty much all of these get used there, because they're very important. matplotlib is for plotting; Seaborn is also for visualizing data; and NumPy is a basic Python library that we use for arrays, to do mathematical computation with arrays. Once we load those... you can see that my cells are showing an asterisk, right? That means my notebook has somehow hung. That happens sometimes. What you have to do in that case, as I said before, is come up here, restart the kernel, and you're good to go; but then you have to rerun some of the cells you ran before. In my case I don't have to, because this part is totally independent of the previous part, so I'll just rerun the imports to load the libraries. And now I'm going to load the data. This takes a bit of time, and if the file is too big, as we'll see at the end when I use another, bigger data set, it can take too much time. For this one, I can check the last part of the data.
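The loading step, sketched with a small inline CSV so it runs without the real file; with the actual data set you would pass the file name, as in `pd.read_csv("profiles.csv")`. The column names here are a small subset chosen for illustration.

```python
import io
import pandas as pd  # the library we use to read, inspect, and plot data
import numpy as np   # arrays and mathematical computation
# (the notebook also imports matplotlib and seaborn for the plots)

# Stand-in for profiles.csv; the real call is pd.read_csv("profiles.csv").
csv_text = io.StringIO(
    "age,sex,orientation,location\n"
    "29,m,straight,san francisco\n"
    "35,f,straight,oakland\n"
)
data = pd.read_csv(csv_text)
print(data.shape)  # -> (2, 4): two rows, four columns
```

`read_csv` parses the whole file into a DataFrame in one call, which is why loading a ~60,000-row file takes a noticeable moment.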
I can check the last part of the data, because we said there are 60,000 users, right? They removed, I think, the names and the user IDs, some information, for privacy purposes, but they left the rest. You can see most of the users in this dataset are from the San Francisco area. When we use pandas, we load the data into a variable called data, and then we play around with that. When we call data.tail(), we get the last part of our data, the last few rows. So if we run data.tail(), you can see we get all this information. You can see this guy is athletic; that one is diet, he eats mostly anything; he drinks socially, does drugs often, is working on a college or university degree, blah, blah, blah. So you can see we've got data there, and now we can play around with it. We can check the shape of the data: we've got 31 columns and, as I said, almost 60,000 users. So we've got 31 attributes for almost 60,000 users. We can check data.size, which is basically the total number of fields we have, and we can check the columns, that is, all 31 fields that were scraped. We have age, body type, diet, drinks, drugs, and so on, right? And we can access different information by field: we can get the age column, and I can compute the mean of the age. I can check that the mean age of the people on OKCupid, in San Francisco, is around 32.3 years old. Take that into account. As I said, it's very important to know where our data stands: this is not about the whole OKCupid community, and it's not about the world. This is about OKCupid users from San Francisco, and specifically from 2012. So we have to be very careful with the conclusions we draw from this, all right? So let's see what we have: 60% males and 40% females on this dating app.
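The inspection steps just described look roughly like this. I'm using a tiny in-memory stand-in for the real profiles CSV (which has almost 60,000 rows and 31 columns); the column names here mirror a few of the real ones, but the values are made up.

```python
import io
import pandas as pd

# Tiny stand-in for the real profiles file; values are invented.
csv = io.StringIO(
    "age,sex,body_type,diet,drinks\n"
    "31,m,athletic,mostly anything,socially\n"
    "29,f,curvy,vegetarian,often\n"
    "35,m,average,anything,socially\n"
)
data = pd.read_csv(csv)

print(data.tail(2))        # last rows of the frame
print(data.shape)          # (rows, columns)
print(list(data.columns))  # all the field names
print(data.age.mean())     # mean age, accessed by field
print(data.sex.value_counts(normalize=True))  # share of m vs f
```

The same calls work unchanged on the full file; only the load line (`pd.read_csv("profiles.csv")`) differs.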
There are usually more males than females on dating apps, right? And we can check different things, like for example the offspring field: how many people want kids, might not want them, might want them, and all the different options they have. For this, basically what I'm doing is getting the offspring field. If we go up, there are different fields here, and there's one called offspring, which corresponds to the part where the user says whether they want kids or not. I'm just checking that and summing up all the users that have the same value in that field; that's basically what value_counts does. I can do the same with sexual orientation, but as a plot. So this is a plot of the sexual orientation of the population in my dataset: most people are straight, there's a smaller group of gay users and a smaller group of bisexual users. Thinking about San Francisco, again, we might expect the gay population to be slightly larger than in the rest of the world, right? That's a hypothesis we can keep in mind when working with this data. We can also get statistics about age. As I said, the mean is about 32 years old, and we get that the 25th percentile is around 26, the 50th around 30, and the 75th around 37. And, wait a minute, there's someone who is 110 years old and on OKCupid? Might be, who knows, right? But actually, if you check, there are two users older than 80 on OKCupid, and that's a bit fishy. I don't know how many people over 80 know about OKCupid, or how many people over 80 are into online dating, but it doesn't seem right. This is part of the data preparation step we talked about: we got the data, we're working with it, and we've figured out that there are two fishy entries in our data.
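Spotting those fishy entries is a one-liner once the data is in pandas: describe() gives the summary statistics, and a boolean filter pulls out the suspicious rows. The toy ages below are invented, with two implausible entries planted to mimic the real dataset.

```python
import pandas as pd

# Toy ages standing in for the real column, with two implausible entries.
data = pd.DataFrame({"age": [26, 30, 37, 32, 28, 110, 109]})

print(data.age.describe())        # count, mean, quartiles, max, ...
suspicious = data[data.age > 80]  # boolean mask selects the fishy rows
print(len(suspicious), "users older than 80")
```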
This is very important, because you want to clean the noise out of your data, noise that might drive you to conclusions you don't want to reach, right? So we're going to check on these guys. We're going to prepare the data by handling the outliers, so we select the users who are more than 80 years old. Apparently we have a 110-year-old person from Daly City, California, who has filled in pretty much nothing, so the profile doesn't feel very real. And the second one is a 109-year-old with an athletic body type, so that one might not be very good either. We can assume these are not real profiles, right? This is the part where, as I said, the science comes in: you have to make the decisions and say, OK, what's going on? An athletic 109-year-old is clearly not real. So what I do here is basically remove these users. There's a d here that should be data, so I fix that, and I'm still getting an error on the print line, there's a print of the length of data, and there you go. Now I've removed those users and the maximum age is 69. I removed the outliers; now we've prepared our data, so we can do some extra analysis. I can check the age distribution for males and the age distribution for females. This is a nice way to look at the data we have: you can see the age distribution is centered around 30, as we said, and for females it's maybe slightly lower, and you can see there are actually fewer females than males. We can also get the mean and the median age for the males. I'm going to keep going, because otherwise we won't have enough time, but I'm going to plot both in the same chart, so you can see males and females together, and you can see the percentage of males in each age group. One small conclusion we can draw here is that women over 60 are the majority on OKCupid: in that range, there are more women in the community.
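Dropping the outliers is the same boolean-filter idea, just assigned back to the variable. This is a sketch on invented toy data, not the notebook's exact cell, but the filtering pattern is the one being described.

```python
import pandas as pd

# Invented toy rows, with two implausible ages planted as outliers.
data = pd.DataFrame({
    "age": [26, 30, 37, 32, 28, 110, 109],
    "sex": ["m", "f", "m", "f", "m", "m", "m"],
})

print(len(data))            # rows before cleaning
data = data[data.age < 80]  # keep only plausible ages
print(len(data), "rows left, max age:", data.age.max())
```

On the real dataset the same filter drops the two fake profiles and the maximum age falls to 69.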
And this can also be due to the fact that, if we look beyond this group at the general population, using Wikipedia's figures, there are actually slightly more females than males in that age range. So this somehow suggests that our OKCupid data might actually match the real population, or at least a slice of the population of San Francisco. Now we're going to look at how tall they are. Basically, this code is just getting the height: you can see some accesses to the height of the males and the height of the females. Then we plot that information, we execute the code, and we get this, right? So we execute the code; I'm missing one d somewhere on line 21, it should be data, and if I run this, I get both, right? So you can see males are taller than females, and this is the self-reported height, in inches. Men are, obviously, taller than women, but we're going to check this data against real data from other sources. What I did is get information from the Centers for Disease Control and Prevention, which publishes charts about the US population. It's not the same population, but we can still get some idea. This is the dataset we're using, and for all of these datasets, for those of you who don't know, we're using CSV files. CSV files are basically text files where the information is separated by commas, so you can open them with any text editor and see all the values. I don't recommend opening the bigger ones, because they have a lot of entries and might hang your computer, but you can check them out. Now I'm going to run this: it reads the CSV directly from the URL, and we can see a chart of how tall people are at the different percentiles of height in the US.
So we're getting that from the CDC. Now I'm going to convert this data into inches so we can compare it; making the units consistent is something very important that you have to do. Now I've got the data in the same units we were looking at, and we can check how tall people are in the US: you get 64, 66 and so on, depending on the percentile, for males or for females. And now I'm going to compare both sets of percentiles, the CDC data against the users I got. You can see there's a slight gap between my users and the CDC figures. So if you run this, there you go. What is this slight gap between my users and the actual US population? We can draw different conclusions here. We could think that people in San Francisco are taller than people in the rest of the US; that might be possible. Or we could think that people are reporting a slightly greater height than their real one, so maybe we caught some liars there. Another option we might consider is that shorter users are more likely not to report their height at all, because they might not want to say it on a dating application, right? So let's check that: here I'm counting how many users have a null height field, meaning it's empty, and summing them up. Only three users didn't report how tall they are, so we can scratch that hypothesis. If we do some research online, there's actually a post from an OKCupid founder called "The Biggest Lies in Online Dating", where they ran this same test, with much more data, because they have the actual OKCupid database, and they confirm that users do have a tendency to write down a larger height and exaggerate it a bit, right? So this is an example of how we caught people lying on a website, right?
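The missing-height check just mentioned is an isnull-and-sum pattern. Here it is on a few invented heights, with NaN marking users who left the field blank:

```python
import numpy as np
import pandas as pd

# Toy heights in inches; np.nan marks users who left the field blank.
data = pd.DataFrame({"height": [70.0, 64.0, np.nan, 68.0, np.nan, 66.0]})

missing = data.height.isnull().sum()  # True counts as 1, so sum() counts nulls
print(missing, "users did not report their height")
```

On the real dataset this comes out to three, which is what lets us rule out the "short people stay silent" hypothesis.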
Actually, here's another use of data science that I'm just remembering. I was talking to one of the Grab data scientists, and they have a lot of problems with drivers cheating with the GPS. The drivers will hack the GPS so they can report different locations, and report pickups and drop-offs that never happened. It's hard for Grab to detect this, and they use data science for it: the same way we did with these fake profiles, they're doing it with the guys driving cars for Grab, so they can check whether a trip is real or not. Same with GrabFood: apparently there are also fake food deliveries going on that they have to detect. So, more things we can do. This next part is basically checking the body type. There's not much to say here, except that guys like to use the term "a little extra" more than women do, and women like the term "curvy" much more than guys. Apparently there aren't many guys who will put "curvy" on their profile. More things. Actually, I have to load something first. I'm going to load this now because it takes forever: it's another dataset that I'll use later, but it's really slow to load, so I'll start it now and go back to explaining. So yes, there's "curvy" and "a little extra". So we've had a look at the data, and we've prepared it. We could, for example, run a machine learning algorithm on it, which is something we haven't done here. I'm not doing it because it would take a lot of time to train, and that's not feasible for a small workshop like this, but we teach it in the data science intro. For example, you could build an algorithm that tells you whether a profile belongs to a male or a female based on whatever they wrote, right? The algorithm would probably take different features into account.
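The body-type comparison described above amounts to counting each label separately per sex, which pandas can do with a cross-tabulation. The rows below are invented, arranged so the "curvy"/"a little extra" split shows up:

```python
import pandas as pd

# Invented toy profiles, arranged to show the label split by sex.
data = pd.DataFrame({
    "sex":       ["m", "m", "f", "f", "f", "m"],
    "body_type": ["a little extra", "athletic", "curvy",
                  "curvy", "average", "a little extra"],
})

# One count per (body_type, sex) pair.
counts = pd.crosstab(data.body_type, data.sex)
print(counts)
```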
And one of those features that might be very interesting to look at would be this body type, because it looks very discriminative between male and female. So, let's go over the OKCupid story. This part is based on this guy, Chris McKinlay, and these are two videos about him. Actually, the second one is not really a video, it's a talk about an hour long, but the first one is very short, around three minutes, so you can watch it if you want. Basically, this guy had access to a supercomputer in his lab at university, because of a job he got there, and he was very unsuccessful at dating, pretty much like me. So he decided to run algorithms on OKCupid data from his profile on that server, and he came up with some ideas on how to get more matches and be able to date girls, right? That's how he explains how he did it, and we're going to try to do something similar here and see what happens. So what I did is go online and do some research about this match algorithm, because one of the things he says in the talk is that the match percentage is very important, not only on the dating site but also on the actual date: when he goes on a date, if the girl has seen a high percentage, she's more likely to be open towards him, right? So his point was to maximize this match percentage, because if you have a higher percentage, then OKCupid will try to match the two of you together, since you're supposed to have higher chances of succeeding in that relationship. Now, Christian Rudder, the co-founder of OKCupid, is, as I said, very into data science, and there's a TED educational video where he explains how the matching algorithm works. That's where I got the idea, and I'm going to explain it a bit here.
It's better if you watch the video later, at home, so you can get a full idea if you're interested and want to try this for yourself. As I said, he was a math guy, he majored in math, so his goal was to hack the dating scene and come up with an algorithm that could check whether two people are actually going to fall in love or not, right? The first thing he realized is that to perform this matching, he needed data, and data from both parts: whoever is going to be in that couple, he needs data from both of them, right? So what he did was ask a few questions: he started collecting answers from people and then tried to match them. But then he realized that sometimes, say, you might like horror movies, and that's fine; but you might also have a certain personality, and a partner with the same personality may actually not be a very good match, right? So what they moved the questions to is this: you give your own answer, you answer what you would like the other person to answer, and then you answer how important that question is to you, right? There are different levels of importance. And the way this works is that we're going to walk through an example of this matching algorithm, which only takes into account the questions that both people have answered; otherwise we don't look at them. So we look at the questions both people answered, and the way points are assigned to those questions is: a question is worth 0 if you decide it's irrelevant to you, 1 if it's a little important, 10 if it's somewhat important, 50 if it's very important, and 250 if it's mandatory. They came up with those numbers through all their testing. Actually, I didn't check how old the video is.
So those exact numbers might not even apply anymore, because I expect them to keep changing the algorithm. If we go down, I'm going to check an example. We have person A and person B, and we want to check how much person B satisfies person A. We have two questions. The first question is very important to A, and B got it right; right means B gave the answer A wanted, either the same answer or the answer A asked for. Because it's very important, that's worth 50 points. But the second question is only a little important, and B got that one wrong. Still, that means B scored 50 out of a possible 51, which is about 98%: B satisfies A at 98%. Then we run the same thing in the other direction. For B, the first question is a little important and A got it wrong; a little important is only one point, so A loses one point there. The second question is somewhat important and A got it right, and that's worth 10 points. So A scored 10 out of 11, which is about 91%. After this, we combine both scores. Because we have one score in one direction and another score in the other direction, we take a geometric mean: we multiply both scores and take the nth root, where n is the number of questions we have. That gives us an idea of how compatible they are, and here we come out at around 94%, so this match would be about 94%. They say they apply some margin-of-error correction, which is probably how they keep improving the algorithm, adding different tuning based on the data they get, because they have much more access to data than we do.
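The worked example above can be put in a few lines of code. This is a sketch of the scoring scheme as described in the talk (the point values 0/1/10/50/250 and the geometric mean), not OKCupid's production formula; the `satisfaction` helper and its inputs are my own framing of the example.

```python
from math import sqrt

# Importance weights from the talk: irrelevant, a little important,
# somewhat important, very important, mandatory.
POINTS = {"irrelevant": 0, "a little": 1, "somewhat": 10,
          "very": 50, "mandatory": 250}

def satisfaction(questions):
    """questions: list of (importance, answered_right) pairs for one person."""
    earned = sum(POINTS[imp] for imp, right in questions if right)
    possible = sum(POINTS[imp] for imp, _ in questions)
    return earned / possible

# How well B satisfies A: Q1 very important and right, Q2 a little and wrong.
a_side = satisfaction([("very", True), ("a little", False)])      # 50/51
# How well A satisfies B: Q1 a little and wrong, Q2 somewhat and right.
b_side = satisfaction([("a little", False), ("somewhat", True)])  # 10/11
 
match = sqrt(a_side * b_side)  # geometric mean over the 2 common questions
print(f"match: {match:.1%}")
```

With these numbers the match comes out around 94%, matching the figures in the example.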
They can probably also tell whether people are messaging each other, and whether they actually go on a date or not; they have that data. For this part, I'm using a second dataset, which comes from this paper; you have the link there. The data is on Mega. I didn't download it for you, and I didn't provide any way of downloading it here, so you'll have to go to Mega and download it yourself, because this data is not as nice as the other one: it doesn't have permission from the CEO, and it was gathered by scraping the questions through a user account. The first dataset was scraped from public profiles; this one goes through a logged-in account, the same as I did with my own account, which I kept for myself and didn't publish online. The link is still there, so if you want to try this at home, you'll have to download the code, put the dataset in place, and run through the notebook yourself, OK? It's pretty straightforward: it comes as CSV files like the other datasets. The difference is that in this case there are 61,000 users, and instead of the profiles, well, I think the profile data is also there, but we're not going to care about that. What it has is 2,541 questions from the users, and they gathered users who have answered at least 1,000 questions, from all around the world. So we can use some of this data for ourselves. Now, my problem is that we don't have the whole picture, right? If we go back to the matching example, we don't have other people's preferences: we don't have those two extra answers. We don't have what they would like someone else to answer, and we don't have how important the question is to them. But my assumption is that if you answer something, you usually want the other person to answer the same thing.
I wouldn't say that's always the case, but I'd expect it to hold maybe 90% of the time, and that's the research assumption I'm making here, because with the dataset I have, I can only work with people's own answers, right? So what I'm going to try to do is match my answers against other people's answers, and if a question is not very important to them, that's not a problem for me; but if they want me to answer something different, I won't get a good score there. Still, I expect to get a good score in around 90% of the answers, because if you like horror movies, you usually want your partner to like horror movies, right? That's the hypothesis I'm building on, and we'll see how it plays out in the results. Because this data is huge, that's why I started loading it before: I think it's around one gigabyte of plain text, and that's a lot of data. When I was trying it before, it was pretty much hanging my computer at some point, so that's a warning. Let's see if the data loaded properly. Hopefully it did. Yes, it did. As I said before, tail shows the last lines of the data; now I'm doing head, which shows the first lines, right? You can see part of the profile, whether it's a woman or a man and all that, and then you can see the questions, numbered from 2 up to some big number: each question has an ID. The answers are in these columns, and you get a lot of null answers, so we'll have to clean that up when we work with it. That's something we have to take into account. So what can I do with this? Basically, what I said: I want to see what the most common answer is out there. I'm going to load this other CSV, which contains the questions, because they have two different files.
One file contains only the answers, and the other contains the information related to the questions. This one loads very fast because it's way smaller. If I check the head of the questions, you can see a lot of questions there, for example one regarding food and plans, asking what's more interesting to you, with options like sex, love and nothing else. If you look at the top, in this dataset you get up to four different options per question. So I'm going to work with this data; there are topics like religion and all sorts of things, right? So I can pick one question. I think this one, Q26, is "Have you ever owned sex toys?"; that's the question I'm checking. If I run my code, I can see all the answers that are not null: the dropna call just removes the people who didn't answer that question, so I'm only looking at the people who actually answered. Now that I have the people who answered, I can count, summing up all the people with the same answer; that's what the value_counts call does. You didn't see any output because I didn't print it, I just saved it to a variable. But now I can get the percentage of people who answered yes and the percentage who answered no, right? That's what I'm doing here: taking the counts I calculated and turning them into percentages of the total number of people I have, OK? So I can see, and note that I didn't save this under a Q26 name, I saved it to a generic question-answer variable, what the maximum is, the most selected option. And that's probably what I want to put on my profile if I want to get more matches, under my hypothesis, right?
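The dropna, value_counts and percentage steps just described fit in a few lines. The answers below are invented stand-ins for one question column, with NaN marking users who skipped it:

```python
import numpy as np
import pandas as pd

# Toy answers to one question; np.nan marks users who skipped it.
answers = pd.Series(["Yes", "No", np.nan, "Yes", "Yes", np.nan, "No"])

answered = answers.dropna()         # drop users who didn't answer
counts = answered.value_counts()    # tally identical answers
pct = counts / len(answered) * 100  # turn counts into percentages
print(pct)
print("most common answer:", counts.idxmax())
```

idxmax() gives the label of the most selected option, which is the answer the hypothesis says to put on the profile.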
Then, I actually built this little tool where you can change the question you want to check. I don't remember what this one is, but you can run it and you get, say, Q44, "Some religions are more correct than others." If I run this for Q44, I get that about 80% of people think yes and 20% think no, so I might want to answer yes, because they might want to have that same connection with me. A few things to take into account here. I did this to my profile: I changed my answers, I put some of these majority answers into my profile, and there are a few mistakes here that I want you to learn from, because they happen, right? As I said, the main point is not the algorithms; we got some conclusions that might look good, but there are also ways to improve this. For example, there are decisions I made, and this is really the science part. I chose to run this over the whole dataset. If I like women, I could have selected only women, right? Because their answers might be different. That's something I could have taken into account. Another thing I could have considered is the location of the people in the dataset. So there are different factors you can take into account, and still you get a pretty good idea of what the result might look like. So I went to my profile, changed a bunch of answers based on this, and you can see that it changed my match percentages: they're a bit different, and I get 445 matches there, with much higher percentages. So the probability of me finding love on OKCupid might now be a little higher, and it was all basically thanks to data science. So you can see it's actually something useful. That's the whole demo, and I want to end with something more serious, because data science has a lot of applications.
This was a fun one, which might or might not be useful to you. But there are, for example, very interesting applications where data science actually saves lives, and there are facts proving it. For example, there are projects running today on intelligent staffing, which means taking data and checking how much staff is needed in different hospitals. That reduces the cost of healthcare, and so it gives more people access to healthcare: by using data to improve operations, you can give a better service to users and actually reduce the price of care. They're also working on real-time alerting: you can get data in real time, process it against a bunch of other information, and alert doctors about things that are happening or that might happen at some point. Data science is also being used to prevent opioid abuse, by gathering information to see which risk factors make people more likely to get addicted to opioids. It's also fighting cancer: a lot of information on cancer is being gathered, and it's run through statistical algorithms and machine learning to try to detect what the causes are and what factors might lead to a certain cancer. It's also being used in telemedicine, so patients at home can gather data, compare it against other datasets, and see what their status is and what the outcomes might be for them. So this was a funny presentation, but data science is also a very important field, for very serious reasons.
Finally, I want to announce that here at UpCode Academy we have a full data science course track. We run different courses, starting with Python development for people who don't have any programming knowledge at all; then an introduction to data science; and then Data Science 1, which is for people who have a bit more background in probability, have been programming for a while, and can handle the different Python libraries properly. That would be the second data science course. We also have a data visualization course with Tableau available for you, and there are government subsidies on the courses. We're currently starting the data science introduction on the 15th of December; it's a six-week course that runs from 1 to 3.30 p.m. We also have Python development, for those of you who don't know how to program yet, from 7.30 to 10, starting in two weeks. I think it runs Monday, Wednesday, Friday, yes, at that rate. So I think that's it for now. Thank you very much, everybody. You can reach me at my email, Marco at UpCode Academy, if you have any questions; I'll be happy to answer them.