All right, hello everyone, and welcome to our last session before lunch. Please join me in welcoming on stage Ignacio, talking about how everyone can do data science in Python.

Hi everyone, thanks for being here. My name is Ignacio Elola, and I'm going to talk a bit about how to do data science in Python, and about what data science is for me. A quick overview of what we are going to do: I'll talk about why I'm here and what I do, give a bit of an overview of what data science means for me, what is, let's say, the flavor of data science I'll be talking about, and then we'll do a quick run through the data science cycle with some examples in Python: data acquisition, cleaning, processing, and also using that data to make predictions.

So that's me, with a bit less facial hair. Who am I? I'm not a software developer by training; I studied physics, so I come more from a maths background and point of view. I've done some research in systems biology and complex systems, always very interested in how things interact with each other, and that drew my attention to big and small data not so long ago. I started coding in Python around three years ago. Bear in mind that all my previous coding experience was Fortran 77 during university, and I'm not kidding, and it was not so long ago. They are probably still teaching Fortran 77 in physics, I'm sure. And yes, 77, not even Fortran 90. I fell in love with Python very easily, and I also became engaged in the startup world, doing a lot of data science and that kind of thing. I'm also a huge advocate of pragmatism and simplicity, and you will see that in everything I talk about today. That's why this is also pretty much a beginner's talk on data science, because I believe that with very few tools you can do a lot. You can't solve everything.
That's for sure; there are still problems that will need very clever people working on them for a long time. But most problems can actually be solved quite quickly by most of us.

Now, as a big advocate of pragmatism: I've done, for the very first time, all these slides in a Python notebook, because, well, it's a Python conference, I thought I should give it a go and do my slideshow in Python. It makes sense. It took me forever, so it was not very pragmatic, but I'm actually quite proud of the result, even if it doesn't look as good as if I had used PowerPoint or whatever. One more thing: I'm also the man standing between you and lunch, so I will try to be a bit fast and make this a bit fun, because, yeah, I'm looking forward to the food too.

I also work at import.io. This is relevant because of some of the things I will be talking about, and also because of the vision of data that I have and the kind of data science that I do. And what is import.io? It's a platform with two different sides. On one hand, it has a set of free tools for people to use to get data from the web, to do web scraping without having to code: there is a UI you can interact with, so without really a lot of technical knowledge you can get data from the web, even building crawlers and things like that. On the other hand, it's an enterprise platform for just getting data.
So we use our own tools and other things, and we generate very big datasets that we sell. I've been working there for a couple of years, first as a data scientist and more recently as head of data operations, basically heading the data services that put those datasets together and deliver them to customers.

Now let's get into the topic: what do we talk about when we talk about data science? There is a lot of hype around data science, which obviously comes with good things and bad things. On the good side, there are a lot of jobs around it, so it's easy to find a data science job and you can get very well paid to do it. But there are also bad sides: a lot of roles are ill-defined, so under the same tag you can find jobs that are really, really different, and the expectations can sometimes be quite unfair to what the work actually is.

To define what I mean by data science, I'm just going to talk about the cycle of data science as I see it, much like a development cycle, and we will see as we go what I mean by data science. I'm going to start that introduction with this nice picture. This is called the hero's journey, which I took from Wikipedia; I'm not even sure if the context of the image was movies or books or whatever, but it's a very nice metaphor, I think, for most agile development cycles, and a very good one for data science. The thing called "the call to adventure" in that diagram is what I call the problem to solve, the business question. Everything needs to start with that. Every piece of data science work needs to start with a business question, with a problem you need to solve; otherwise you're just doing things for the sake of it. I will be coming back to this
probably two or three times over the presentation, because it kind of obsesses me: I see the opposite a lot. So yeah, here is where my pragmatic side comes in. That's always the starting point.

Then that threshold between the known and the unknown is when we actually start collecting and cleaning data to try to solve that problem, those questions. We then need to do exploratory data analysis, which usually drives us fast to some kind of revelation, where we actually start having insights and knowing what we can and cannot do within the framework of the business we are working in. Then come the algorithms and machine learning: trying to use all of that to make some predictions. Last but not least, at the end we need to answer the questions we set out to solve, or produce a kind of MVP. And we need to remember that this is a cycle: when you arrive at your first model, that's usually just the first step towards making it better, the first step towards actually solving the issue.
You might then realize that you have learned something, but that the model is not the right kind of model, or that you need to change the kind of data you were using. As long as you have learned something from the first iteration of the cycle, you're going in the right direction.

I also want to mention that when we talk about data science, especially in tech talks like this one, most of the time we focus only on the machine learning and the algorithms. Which is fine, because it's a lot of fun, and people who come from mathematical or programming backgrounds get really deep into this kind of stuff because we find it fun, myself included. We find it fun to play with Google's Deep Dream code and things like that. But actually, most of the time that we do data science we are not playing with that kind of stuff; we are doing many other things. Data cleaning and exploratory data analysis usually take much longer than playing with algorithms or tweaking them, and not everybody talks about that, and a lot of the pitfalls are there.

I'm not going to read all of these, but I think it's a very nice list of sentences, and I agree with most of them. I'll just highlight a few: data is never clean. Most tasks will not require deep learning or anything like it; most tasks can actually be done with very simple tricks, and we will see that. This is basically a lot of what I believe. I didn't write it, and I credit the person who wrote it there, but it's very pragmatic and I like it a lot.
I think there are a lot of truths about data science in there.

So let's go inside that cycle, see some examples, and try to do some stuff and see how it goes. The cycle is basically: you get data, you process the data, you digest the data, and then you use it. That's like a mantra. We need to be a bit careful with that mantra, because you can be biased by the data that you have ("because I have this kind of data, I'm going to predict this kind of thing, because that's what I can do"), or biased by your toolbox ("oh, I really want to do a neural network right now, so I'm going to do that"). Those kinds of things happen all the time. What you should actually be driven by is the business question: OK, I'm trying to solve this issue, I'm trying to predict this thing, so what data do I need for that, and what kind of algorithm or model do I need to make that prediction? That's the right approach. Sometimes you might end up using the data you already have and doing that cool neural network; other times you might be doing a very simple regression, or just graphing some KPIs, and that's fine. The goal is always to have an action follow from what you have done. Your goal is that when you have finished your work, something is going to change: something in your business, or in how people use your product, or in how you see your product, or whatever. There needs to be an action. If it's just knowledge for the sake of it,
something is going wrong and you need to fix it.

So, let's get into getting data. This is a very important part. I'm not going to dwell on it, but it's very important because we can also be biased in getting data, and not a lot of people talk about this. We can get data from an internal data store, which could be a MySQL database, and getting data then might mean running a SQL query, or a series of SQL queries, and putting the result into your Python code or into a file that you are then going to process and make predictions on. This matters because, later, when you are into the machine learning and doing cool stuff with the data, you usually don't think again about how you got the data. If you made a mistake, or if there is some kind of bias in how you got it, you will be conditioned for the whole rest of the cycle. This is the very first step of the funnel, so you need to be sure you're doing it right, or at least, if you're doing something you have doubts about, that you have written those question marks down, so you know where to go in the future if you need to review it. As I was saying, we can get data from internal sources, like the database holding data about your web page or your customers, and we can also get it from external sources, which for me (and obviously I'm biased here, because I work on this) can be things like web data: data you get from crawling and so on.
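For the internal-source case, the acquisition step is often just a query loaded straight into pandas. Here is a minimal sketch of that idea; the table and column names are made up for illustration, and an in-memory SQLite database stands in for the MySQL store mentioned above:

```python
import sqlite3
import pandas as pd

# Hypothetical example: in practice this would be your MySQL connection;
# here an in-memory SQLite database stands in for it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, country TEXT, queries INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, "UK", 120), (2, "ES", 3), (3, "UK", 45)])
conn.commit()

# One SQL query is often the whole "data acquisition" step.
df = pd.read_sql_query(
    "SELECT country, SUM(queries) AS total FROM users GROUP BY country",
    conn)
print(df)
```

Everything downstream (cleaning, analysis, modeling) inherits whatever choices this query made, which is exactly the bias point above.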
The next step is to process that data, and by processing I mean digesting it: going from the data you got from a SQL query, or wherever, to the actual ndarray you are going to use in Python to make a prediction or a plot. That's when the data is ready, and there are steps in between where things can go wrong, or just take time.

So let's do a very simple example. There is a web page called Speakerpedia, which I found by pure coincidence some time ago, and it's basically like a Wikipedia, or at least a listing, of speakers from around the world on all kinds of topics. You can find, I don't know, Obama there, and things like that, and how much they cost if you want to bring them to your conference. This was a surprise for me, because I didn't know people charged to speak at places, but apparently some people do. So I crawled the whole site and made a database of the stuff, just to do some analysis and get some quick, fun insights into that strange world of people who receive money for speaking. I did that with import.io, but I'm not going to go into how I crawled the site; it's pretty easy, and if someone is interested I can show you, it probably takes ten minutes or so to set up. I'm using pandas to look at the data and also to clean it a little. As you can see here, I'm just consuming the CSV that was the output of my crawl. We got more than 70,000 speakers and a lot of information; I'm showing just some of the columns here, like the speaker name and the fee.
We also have the location, tags, and so on. There is a lot to clean here, which is very common when getting data from the web, and in some cases you can do the cleaning while you extract the data; it's the same when you are querying a database or crawling. If I had used the right regex, I could have had those fees come out here as floats rather than strings, but I did it in a plain, naive way on purpose, to show how these things happen and how we deal with them. The same goes for the Twitter handles and many other fields. I'm only showing a few columns here, but there are many others.

So I'm showing here how we can clean, for example, the fee data, because the very first thing I'd like to see is how much people charge for speaking, how many people actually charge, and things like that. I can very easily replace that string case with zeros, and then reload that column of the DataFrame as floats, and then it's ready to be used, to be consumed. That's what I call processing the data: getting it ready for use.

Now, there is a lot to do with the data before getting to predictions, and this dataset is a good example. The thing called exploratory data analysis is basically: OK, I have this dataset, I thought it was cool, now we need to make something out of it, and we need to know where to start. I'm breaking my own rule here, I know: I have no business context or question for this problem. This is just for fun; I just love this dataset.
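A minimal sketch of that fee cleanup, with a toy DataFrame standing in for the scraped CSV. The column names and the exact placeholder string being replaced are assumptions; the real data would have its own variants:

```python
import pandas as pd

# Toy stand-in for the scraped Speakerpedia data: fees arrive as strings,
# with a placeholder value ("N/A" here, made up) for speakers who do not charge.
df = pd.DataFrame({
    "speaker": ["Alice", "Bob", "Carol", "Dan"],
    "fee": ["20000", "N/A", "5000", "N/A"],
})

# Replace the placeholder case with "0", then reload the column as float.
df["fee"] = df["fee"].replace("N/A", "0").astype(float)

print(df["fee"].mean())  # the column is now numeric and ready to use
```

After this one-liner the column behaves like any numeric Series, so means, medians and plots all work directly.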
I don't really have an objective here, it's for fun; we will see other examples later where I do have an objective and which are more like real-world examples. This is not one of those, but the exploratory data analysis part is very similar: you need to see what your data looks like. If I want to see what the data from the previous example looks like, well, I can print the mean, the median, and the mode of the fees, and we see very easily that we have an average fee of, what is it, more than $12,000. But the median and the mode are zero, which is already telling us that a lot of people actually charge zero, so that average is probably meaningless in that sense. If we do a box plot, we see that, but we also see something else: the box plot is not even a box, it's just a line, because there are so many values close to zero, and also because we have three outliers here with really crazy numbers. So crazy that I suspect they're not even true. Maybe, and I don't know exactly how Speakerpedia works, but if it's like a Wikipedia and people can edit things, those numbers might not be true; someone might have put in something crazy, ten million or whatever. This is why we need to go back to the source and think again about this kind of thing. And even if they're true, they change everything: I have 70,000 people here, and just those three are going to change all my numbers. So I might want to exclude those outliers from any further analysis. One more thing to comment here:
I really love box plots. I think they are among the most important plots you can think of. If I could choose only a few plots to work with for the rest of my life, it would be three or four, and I think I could get by with those: a scatter plot, a box plot, a line plot, and a histogram. Who needs anything else? I don't know, journalists, to plot pie charts, but not people doing actual work. Now, after saying this, tomorrow I'll probably use something else and decide it's super important, but that's what I think.

We can go deeper and say: OK, let's actually look at the histogram, but excluding those crazy outliers, to see how this is really distributed. This distribution is something we would expect by now. And if we again calculate the median, the mean, and the mode, we see the average is much lower, but the picture is the same, because there are a lot of people charging zero. A lot of people on Speakerpedia are not charging at all; they are just there because it's a list where you can browse people by location, by topic, and so on. So what makes even more sense is something like this, where I look at how many people do not charge anything, how many people do charge, and what the average is for those who do, which is around $20,000 for a talk. But we see that only about one in four people on Speakerpedia charges.

Now, this brings me back to my earlier point of always knowing what your data sources are and how you are biased from the very beginning, because the right conclusion here is: 25% of the speakers on Speakerpedia charge, at an average of $20,000. It is not that 25% of all speakers charge, because most speakers don't charge at all; they are just not on Speakerpedia. I'm not, for one.
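That whole pass (mean versus median versus mode, dropping the implausible outliers, then looking only at the speakers who actually charge) can be sketched with pandas on toy numbers. The values and the one-million cutoff are illustrative, not from the real dataset:

```python
import pandas as pd

# Toy fee data: mostly zeros, a few real fees, one implausible outlier.
fees = pd.Series([0, 0, 0, 0, 0, 15000, 20000, 25000, 10_000_000])

# Mean, median and mode tell three very different stories here.
print(fees.mean(), fees.median(), fees.mode().iloc[0])

# Exclude the crazy outlier before any further analysis.
clean = fees[fees < 1_000_000]

# Then look only at the speakers who actually charge.
charging = clean[clean > 0]
share = len(charging) / len(clean)
print(share, charging.mean())
```

The point survives the toy scale: the mean is dominated first by the outlier and then by the zeros, while the share-who-charge plus their average is the honest summary.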
I'm not And that's a very important point It's kind of obvious in this case and maybe it's not so obvious when you are working with your database on how But it's actually the same and you need to to have it clear Other things that we can do we can do here and we are not going to do but we could do stuff like repeat this kind of analysis for a speaker topic and see how different topics have Charge different maybe or have a different ratio between people who charge and people who don't charge That's something very easy. We have a column already for the topic We can do I don't know we can do location versus fee how how fee correlate with the location of the speaker All those kind of crazy stuff Very interesting. Basically when we do exploratory data analysis, we always want to do that that kind of thing of Knowing what is our median? What is our mean? What is our mode? What are the percentiles? Plotting the data to see how it actually looks like which of the wires that we have and also Which variables correlate or can correlate with with others? I'm not going to speak a lot about correlation, but I'm going to Give you at least one comic about it Which I think it's kind of important This we can do a whole talk just about this but I think I think the the comic plot it makes the point even better So okay, we were using data This is that was an example of a very quick and dirty exploratory data analysis and other things before we're going into predictions is KPIs K performing the cater so what are the right? What are the the metrics of the thing that you're trying to solve or or the thing that you are measuring? 
Because sometimes just monitoring the right metrics can save your business, and very simple things can have a huge impact. So we shouldn't be afraid of sometimes reaching for simple tools to do simple jobs; every tool is right for some job, and we shouldn't be afraid of things like Excel. The fact that we can consume data in pandas and do really cool stuff doesn't mean that sometimes Excel isn't the right tool. I'm saying this because it's actually how most people consume data: CSV is how most people consume data, and it's how most people are going to read your data. A lot of the time the output of an analysis or a report is going to end up as a CSV, and it's important that we also know how to work with those tools and make good use of them; it's not difficult. There is even a whole book by John Foreman, called Data Smart I think, which is about doing data science only in Excel, with a lot of material about modeling and machine learning only in Excel. When I talk about Excel here, I just mean anything that gives you a graphical interface for viewing and editing a CSV, not really Microsoft Excel specifically, even if I chose that picture because I think it's kind of amazing.

OK, let's now get to actually making predictions, to doing some machine learning and modeling. I'm going to do super simple stuff here, but with different examples and a whole bunch of different algorithms. First of all, at this step we separate the data into what are called a train set and a test set. And this means everything in data science, because it's the basis of how, in theory, you can prove that your predictions are correct.
It means that all the data we prepared before gets split into two pieces: one piece is used to train our algorithm, to train the machine learning model, and the other is used only to test the results. It's the one we run through the model to see whether we were right or not, because we know the answers for that piece: we can compare what the algorithm says with what we know, and get some kind of accuracy for our predictions. It's very easy to get biased by this. It's also very easy for your dataset to not be representative enough: your sample is actually not good enough for the problem you are trying to solve, but you divide it, you train your model, you test it on your test set, and you say, wow, I have 90 percent accuracy; and then when you go to a real dataset outside your own big dataset, the accuracy is completely wrong. That happens a lot, and it's a very big problem. So we need to be doing this all the time: the train set and test set are what tell us how good the algorithm is, but it's not magic. It's still biased by what your original dataset was, and where and how you got it.

After doing that, from my very simplistic approach, we have basically only one question to answer: do I want to predict a category, or do I want to predict a number?
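In scikit-learn the split itself is one call. A minimal sketch on synthetic data; the 25% test share is just a common illustrative choice, not something prescribed in the talk:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared ndarray.
X = np.arange(100).reshape(-1, 1)
y = (X.ravel() > 50).astype(int)

# Hold out 25% of the rows. The model never sees them during training,
# so the score on them is an honest estimate - within the limits of how
# the data was collected in the first place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(len(X_train), len(X_test))
```

The held-out score only protects you from overfitting your own sample; as noted above, it cannot protect you from a sample that was unrepresentative to begin with.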
If I want to predict a category, I'm in a classification problem; if I want to predict a number, it's a regression. So there are basically only two things to do. I'm being simplistic and leaving edge cases aside, but we can put almost everything into those two buckets, and they are very differentiated: it all depends on whether the output is going to be a number or a category.

Let's start with regressions, because I think that's what everybody has done: everybody in high school has used least squares. And least squares is a machine learning algorithm: it makes predictions, it predicts new points for the data using some training data that we have. There are others, things like lasso or support vector regression, for example, and we will see an example. But a least squares fit is basically a machine learning algorithm, and any other regression we do is basically the same in theory. The only thing that changes, most of the time, is how we define the distance between the dots and our fitted line or curve. How you define that distance, whether it's this thing or that thing or
any other crazy thing, is what changes between having a very simple algorithm and having a more complex one. But in the end we are basically doing this; maybe we are doing it in 20 dimensions instead of two, and maybe we have a whole bunch of other problems, but in the end this is what we are doing.

I'm going to do another example here. The data I'm using now is more business oriented: hard drives, hard drive prices that I also scraped from the internet, so I have a whole CSV with features and prices for hard drives. After dividing my data into train and test sets, I can very easily do a linear regression, which I think is least squares, this thing I'm doing here. I can see, more or less, what the variance score is for that linear regression, and see what it looks like. And very easily, using scikit-learn, doing more complex regressions, a support vector machine, is just two lines: two lines to train, two lines to print a score, and maybe 20 lines to make a plot. In the end it's very easy to do, and we get some results. We see that the results here are not much better than the results from least squares: just a 5% improvement or so, which might mean a whole world in a business context, but it's actually not a lot.

Now, very quickly, some classification problems. As an example, let's try to get our heads into how people use a platform. Here, again, I'm doing a real-world problem: I'm trying to get to know the users of the import.io free platform better, plotting and dividing up how they use our product. So I'm going to look at how much people are using the platform, and what volume of queries they do.
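Before moving on, the regression workflow just described can be sketched end to end. This is not the actual hard-drive data or code from the talk; synthetic prices that are roughly linear in capacity stand in for the scraped CSV, and the SVR parameters are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for the hard-drive dataset:
# price roughly linear in capacity, plus noise.
rng = np.random.RandomState(0)
capacity = rng.uniform(0.5, 8.0, 200).reshape(-1, 1)        # TB
price = 40 + 25 * capacity.ravel() + rng.normal(0, 5, 200)  # currency units

X_train, X_test, y_train, y_test = train_test_split(
    capacity, price, random_state=0)

# Least squares: one line to train, one line to score on the test set.
linear = LinearRegression().fit(X_train, y_train)
print("linear R^2:", linear.score(X_test, y_test))

# A support vector regression is just as short to try.
svr = SVR(kernel="rbf", C=100).fit(X_train, y_train)
print("SVR R^2:", svr.score(X_test, y_test))
```

On data this close to linear the fancier model buys little, which mirrors the "just a 5% improvement" observation above.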
And how often they do that usage, that volume of queries. I can try to divide that into clusters that tell me something I didn't know about the dataset, and hopefully help me make better decisions in the future. We again load some things from scikit-learn, we load the data with pandas, and we do a quick model using mean shift, one way to do clusters, one clustering algorithm. We plot it, and I don't know what that is supposed to look like: we get bands of stuff, and basically the only clustering it has done is along one of the axes, which doesn't sound right. So I'd say, let's do k-means; if you Google for clusters, most people use k-means. So let's try it, and we find basically the same thing. The issue here, which is very obvious to anybody who has done some clustering before, or even some machine learning before, but not to a real beginner, is that you cannot do this. This is absolutely wrong. You cannot work with one axis that goes from zero to who-knows-what and another that goes from zero to one. It's never going to work, especially in clustering. So we need to clean the data. I'm not going to do it here, but we basically just need to normalize the two variables we were trying to plot, and then repeat the same thing. We now have two axes that go from zero to one, and we actually get clustering that makes more sense visually, but also when I go back to the data. Because if I now use this to see which user is which, with real examples, it makes a lot of sense: one of these users might be, I don't know, the person who uses Python, has connected an application to our API, and is doing millions of queries, versus the person who is using the UI to do crawling without even knowing what crawling is. And making that distinction can be very valuable, because you can feed it into, I don't know, your help desk system.
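A sketch of that normalization point on synthetic usage data. The numbers, the cluster count, and the use of MinMaxScaler are my assumptions; the talk's point is only that both axes must be on comparable scales before clustering:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Synthetic usage data: query volume spans a huge range while
# frequency lives in [0, 1] - the scale mismatch described above.
rng = np.random.RandomState(0)
volume = np.concatenate([rng.normal(100, 20, 50), rng.normal(10000, 500, 50)])
frequency = np.concatenate([rng.uniform(0.7, 1.0, 50), rng.uniform(0.0, 0.3, 50)])
X = np.column_stack([volume, frequency])

# Rescale both axes to [0, 1] first; otherwise the volume axis
# dominates the distance metric and the frequency axis is ignored.
X_scaled = MinMaxScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

print(np.bincount(labels))  # cluster sizes
```

Without the scaling step, k-means would effectively cluster on volume alone, which is the "bands along one axis" symptom from the talk.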
Then the customer support person working at your company can know at a glance, when a support ticket comes in, whether that user is a very technical person or a less technical one, whether they do this kind of usage or that kind. That will improve the experience and the support the user gets, and also the life of your friends at the support desk.

The last thing I'll talk about, very briefly, because we're running out of time, is a web page classifier using a decision tree, which is another way to classify things. In this case the context is: I'm trying to know what kind of website a website is, just by looking at very simple attributes of it. By type of website I mean classifying the content: trying to know, OK, this is an e-commerce website, or this is a map, or a job application board, or an events site, things like that. Again, with scikit-learn it's just two or three lines to make a decision tree, and also to plot it. We plot this thing here, and again I'm making a very naive mistake: a decision tree is supposed to be simple to read, simple to interpret, simple to know what it's telling you, and when you see something as big as this, it's because you are doing something very wrong. You are overfitting your whole dataset into a lot of very small conditions that fall into this huge list of categories and decisions that then make the classification. We can change that very easily, by doing a lot of things actually, but the simplest is just to set the maximum number of leaf nodes you want, and then you get a much simpler decision tree, which you can read and sanity-check. You can then make a prediction, also in only one line, with your test data, and see how it actually works out. And that's it.
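A sketch of that cap on tree size. The Iris dataset stands in for the website features (the real features and labels aren't available here), and `max_leaf_nodes=5` is an arbitrary illustrative limit:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris stands in for the per-website attributes and content labels.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Capping the number of leaves keeps the tree small enough to read
# and guards against overfitting into a forest of tiny conditions.
tree = DecisionTreeClassifier(max_leaf_nodes=5, random_state=0)
tree.fit(X_train, y_train)

print("leaves:", tree.get_n_leaves())
print("test accuracy:", tree.score(X_test, y_test))
```

Prediction on the held-out set is the one-liner `tree.predict(X_test)`; the score above is just that prediction compared against the known answers.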
The recap: always know what problem you are trying to solve. Clean your data and get it ready to use. Beware of very common problems like overfitting (I tried to give an example of that) or forgetting to normalize your data (I tried to give an example of that too). And always try to have an output that is actionable, something where you say: OK, we finished this analysis, and now we need to change this in our business, or in how we do support, or in our product, or in how we deal with this data. If there is no action of that kind, basically the whole thing has failed, and you need to learn from that cycle, go into the loop again, and make it better. So, that was it. I'll just mention that we are hiring a lot at import.io; there are a lot of different positions, DevOps, front-end, QA, even Python, with a lot of data connotations in the roles. So anyone who wants to talk about that, or about data science, or Python, or web scraping: I will be here for the next few days and will be very happy to have any of those conversations. Thanks for your attention.

Do we have any questions?

I've just seen that you jumped over the abyss in the adventure cycle, the abyss with the death and the rebirth. Is there something like that in data science too, like in the hero's journey?

Oh, in the cycle? Sorry, what was the question? I can't hear you very well.

You didn't reference the abyss, the rebirth, the death at the very bottom.

Ah, sorry, now I know what you mean. Yeah, I didn't talk about that, but I think that's precisely the moment of revelation. I actually have words for all the stages there, so I have the metaphor very clear in my head. The abyss, basically, is that moment of realization where you understand what kind of problem you are really trying to solve from a mathematical point of view,
so you know what algorithm is going to work. Because when we are just doing exploratory data analysis, or data cleaning, for a complex problem we might not even know at that point whether we are going to do a regression or a classification, and even less which algorithm is best for that classification or regression problem. That's the point of the revelation: when you think you have an idea of how to solve it, and then you just need to apply it, which is much easier.

And what is your experience with scikit-learn as a beginner? Do I have to try around with different parameters until I get a result, or do I have to know the internals of the algorithms?

It's very easy to use scikit-learn. In the documentation there is even a tutorial, a kind of map, on how to approach it: depending on what kind of problem you have, which algorithm do you need to use? It's a great map into how to do machine learning with it. Once you know which algorithm you're going to use, it's usually just a few lines of code to put in there. Knowing which parameters to use is, to be honest, a very hard problem; the whole field is basically about how you fit those parameters. But from a simplistic point of view, it's not so much: you can just use the defaults, or something almost random, or you can do a loop and iterate through different parameters and see how it looks. You always need some output from your model, either a plot or a prediction, or even better both, so you can say: OK, I put in these parameters, this is my output, do I like it or not?
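That loop-over-parameters idea from the answer can be sketched like this; the model, the candidate values, and the use of a held-out score to compare them are all illustrative choices, not something prescribed in the talk:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Simplistic parameter search: try a few values, score each on the
# held-out data, and keep whichever looks best.
scores = {}
for leaves in (2, 3, 5, 10, 20):
    model = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0)
    scores[leaves] = model.fit(X_train, y_train).score(X_test, y_test)

best = max(scores, key=scores.get)
print(best, scores[best])
```

This is the "try it, look at the output, decide if you like it" loop in code form; scikit-learn also has more systematic tools for it, but the simple loop is often enough to start.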
Then you change the parameters until you get something you think makes sense. That would be a simplistic approach to tuning parameters and fitting the right things with scikit-learn.

Thanks. All right, do we have one last question? No? Thank you, thank you for a good talk, and let's all head out for a pretty fabulous lunch. Thank you very much.