I'm Julia, and I work as a data scientist. I spend basically my whole day in Python analyzing data, except when I have to use Java. I'm going to be talking about a couple of tools that you can use to analyze data, like maybe rap lyrics. If you want, you can follow along with this talk on your own screen instead of looking at me.

So IPython Notebook is a web-based user interface to Python, or IPython. It gives you a lot of pretty graphs: you can type in some stuff and get graphs right away. You can do a kind of literate programming where you intersperse code and text. You can make slideshows; I made this slideshow using IPython Notebook. And you can do version-controlled science. Pandas is a data analysis library for Python. It's a little bit like R, for Python. It gives you really good data structures and a lot of useful helper functions for cleaning up your data and transforming it. It's really fast. If you want to do machine learning, it integrates well with scikit-learn. It's basically the best, and I would never do any data analysis without it, ever.

If you're installing this stuff, don't use the Ubuntu packages. You can use pip, or you can use Anaconda. Anaconda is a packaging tool that will set up all of your scientific Python everything by magic. It's the best, and it's free. The way you start IPython Notebook is you run ipython notebook --pylab=inline, and then it'll start up and open in your browser, and you'll have it.

The data we're looking at is going to be some data about how many people are biking on each of Montreal's bike paths. Right now it's getting kind of cold: I just got back from San Francisco yesterday, and I discovered that it's cold now in Montreal, which is not normal. So at this time of year, fewer people might be biking. I downloaded the data from the Montreal open data website, and you can get it yourself. And I got this, right?
So the encoding is wrong, it's using semicolons as separators, the dates are a mess, and some of the columns are missing. Pandas gives you a fantastic read_csv function where you can tell it all of these things about your data: what the separator is, how it should parse the dates, and that you have to put the day first instead of the month first, because we know how to write dates, unlike Americans. The default is to parse dates the American way. No offense, Americans. You're great.

OK, so we look at the first five rows of the data, and we find out that apparently the first day of January was worse than the third day. Sure, great. So we have our DataFrame. A DataFrame is the basic data structure in pandas. It's a little bit like a table: you have columns, you have rows. It's kind of intuitive. It's a little bit like a SQL table, if you want, or like an R data frame if you know R. I actually don't know R, but if you do, I hear it's awesome. Python is more fun to program in, though, as we all know.

Plotting the data in a DataFrame is really difficult: you do bike_data.plot(). And you get a beautiful graph, which shows you that, really surprisingly, people don't really bike in the winter, like in February. And then people start to bike a lot in March. This was last year, and I don't know if you remember that it got really hot in March. I think I'll, like, die if I go all the way over there, but you see that spike, right? So it got really hot, it was like 20 degrees, and I was on vacation, and I missed it, and I was upset. And then it got cold again, and people stopped biking. And then here's the summer. Good. If I want the median number of bikes on each path, I have some numbers here; or I can take the median and then plot it. So this is the typical kind of one-liner you write. You're like, I want the median. Oh, actually, I wanted a graph. OK.
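The read_csv call she's describing might look something like this. This is a sketch: the column names ("Berri 1", "Rachel 1") and the counts are made up for illustration, and the real downloaded file would also need encoding='latin1' (here a StringIO stands in for it):

```python
from io import StringIO
import pandas as pd

# A tiny stand-in for the Montreal bike-counter CSV. The real file is
# semicolon-separated and latin1-encoded; these column names and numbers
# are invented for illustration.
raw = StringIO(
    "Date;Berri 1;Rachel 1\n"
    "01/01/2012;35;51\n"
    "02/01/2012;83;153\n"
    "03/01/2012;135;248\n"
)

bike_data = pd.read_csv(
    raw,
    sep=";",               # semicolons, not commas
    parse_dates=["Date"],  # turn the Date column into datetimes
    dayfirst=True,         # day/month/year, not month/day/year
    index_col="Date",      # use the dates as the row index
    # encoding="latin1",   # needed for the real file, not for StringIO
)

print(bike_data.head())    # the first five rows (here, all three)
print(bike_data.median())  # median count per column
# bike_data.plot()         # draws the full time series in a notebook
```

With the real file you'd pass the path instead of the StringIO, and bike_data.plot() gives the graph with the March spike.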
You can look at only some of the columns if you want. This is not that exciting; we're going to do more exciting stuff, I promise. OK. So I wanted to know: do people bike more on the weekdays or on the weekends? Are we a commuter-bike city or a biking-for-fun city? What do people think, actually? Do people have ideas? If you saw this talk before, you're not allowed to answer. Commuter? OK. We'll see if you're right.

So the first thing I did was add a weekday column: a number for each day of the week. If you do bike_data.index, that's the date column there, and pandas was built for financial time series analysis, so anything to do with times is super easy. You say index.weekday, and it gives you all the weekdays, and there's a whole bunch of other stuff like that. So we have the weekday for each day, and now we can use groupby and aggregate to add up all of the numbers for each weekday. groupby here is a little bit like an SQL GROUP BY, if you're familiar with SQL; if you're not, it's fine, it does kind of the obvious thing that you would want. It finds all of the rows where the weekday is zero, and then you aggregate that with a sum. You can use any function you want there: if you want the maximum, or the minimum, or to take the first three values and add them up, you could do that too. I don't know why you would want that.

So we get a whole bunch of numbers, and I hate tables of numbers. I mean, I love them, but are you learning anything from this? I'm not learning. Plot, right? So we fix our index to have actual day names instead of these numbers (there was probably a smarter way to do this, which I didn't do), and then we plot them. And you guys are right: we're a commuter city. We bike on weekdays. Yeah, I don't know why; I guess people are kind of tired on Monday and then on Thursday. But it could just be that Thursdays that year were warmer than Mondays, or less rainy.
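The weekday-column-plus-groupby step can be sketched like this, with made-up daily counts (the "Berri 1" column name is an assumption):

```python
import pandas as pd

# Hypothetical daily counts for one bike path over two weeks.
dates = pd.date_range("2012-01-02", periods=14, freq="D")  # Jan 2, 2012 is a Monday
bike_data = pd.DataFrame({"Berri 1": range(100, 114)}, index=dates)

# A DatetimeIndex knows its weekdays (Monday=0) for free.
bike_data["weekday"] = bike_data.index.weekday

# Like an SQL GROUP BY: collect rows by weekday, then sum each group.
weekday_counts = bike_data.groupby("weekday").aggregate("sum")

# Swap 0..6 for readable labels before plotting.
weekday_counts.index = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]
print(weekday_counts)
# weekday_counts.plot(kind="bar")  # the bar chart from the talk
```

Any aggregation works in place of "sum": "max", "min", or your own function.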
I don't know. There's more investigation to be done here, just saying. All right, so let's go back to this graph. There are a bunch of spikes here, right? And I was like, what are all these spikes about? Maybe it's the day of the week, but it's obviously not just the day of the week. Maybe over there in the winter it was, because it's kind of this regular pattern, but over here there are more serious things going on than it being Thursday.

So I wrote a little function to get the weather for each day of the year. You don't have to read this; the point is it didn't take very long, and it wasn't that much code. I scraped weather.gc.ca and wrote my little get_weather_data function, and it takes like a minute to run. So we have a bunch of stuff here: fog, visibility, humidity. I don't really know what to do with humidity; maybe people don't like biking when it's humid. No: temperature. We're going to look at the temperature, OK?

So you'll notice here that we have the temperatures every hour, and that's too often, because we only know the number of bikers every day. Because pandas is built for time series analysis, it's really good at this. All we need to do is say resample ('D' is for day), and you say how equals mean. You can also put a function there, like your own function, like we talked about before, but if you say mean, it'll just work. So this gives you the mean temperature every day. And we could change this if we wanted the temperature at 8 AM, if we decide that people make their biking decisions for the day based on the temperature at 8 AM and not on the average temperature, because they don't care about the temperature at 2 in the morning; they're not biking then. But OK, the thing I wanted you to see: did you notice something here? Can you see a similarity between these two graphs? Do you see that spike over there in March, how it got warm, and then people started biking?
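The resample step might look like this. Note that the how='mean' spelling from the talk is the older pandas API; current pandas chains the aggregation method instead. The hourly temperatures here are invented:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly temperatures for two days (hour h has temperature h).
hours = pd.date_range("2012-03-01", periods=48, freq="h")
temperatures = pd.Series(np.tile(np.arange(24.0), 2), index=hours)

# The talk's resample('D', how='mean') is the old spelling; modern
# pandas chains the aggregation method onto the resampler.
daily_mean = temperatures.resample("D").mean()
print(daily_mean)  # one mean temperature per day

# Or keep only the 8 AM reading, if you think that's what riders act on.
at_8am = temperatures[temperatures.index.hour == 8]
```

Any aggregation can go where mean() is, including your own function via .apply().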
And then it got cold again, and everyone was like, oh no, never mind, I didn't mean it. So the temperature gets warm. But I think these two graphs don't seem to have a lot to do with each other in the summer: the fact that the temperature goes up from, like, 20 to 25 degrees, I don't think that's causing these huge spikes downwards. You can believe me or not believe me. Yeah, we should test it, but not right now; I want to talk about the rain.

So to get the raininess, I looked at whether or not, so this is the biggest one-liner ever. Oh, and there are warnings in here. OK, to find out if it was rainy, I looked at this weather column, and I was like, does it contain rain or not? So then you get a True or False, a 0 or 1. And then I did this thing where I said lambda x: int(x), which is not very smart, because you can just say int. Don't do this. It makes you look dumb, and it also makes your code way slower. Sometimes pandas does magical things where it vectorizes your functions, so instead of getting the cool, fast NumPy vectorized version, you'll get the slowest possible thing. So just map int, or don't even do it, because pandas will handle the type conversion anyway. And then we resample it again, and this gives us the fraction of rainy hours that day. So it's like: here, it was raining for 0.60 of the hours in the day, so 60%.

And then if we plot this again, you can see it. I don't know if you're going to believe me here, but I feel like these things match up, and this spike down is because of this raininess here. This looks plausible to me. Does this look plausible to you? Do you believe me? In July? OK. So, as people in Montreal all know: we don't bike when it rains. I mean, you probably do, you're probably more hardcore than me, but I don't. OK. So the last thing I want to look at is a couple of data points.
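The raininess one-liner could be sketched like this, using str.contains and astype(int) rather than the lambda she warns against; the hourly weather strings are made up:

```python
import pandas as pd

# Hypothetical hourly weather descriptions: 6 rainy hours out of 24.
hours = pd.date_range("2012-07-01", periods=24, freq="h")
weather = pd.Series(["Rain"] * 6 + ["Clear"] * 18, index=hours)

# True where the description mentions rain; astype(int) gives 0/1
# without a slow Python-level lambda.
is_rainy = weather.str.contains("Rain").astype(int)

# Fraction of rainy hours per day: 6/24 = 0.25 here.
rain_fraction = is_rainy.resample("D").mean()
print(rain_fraction)
```

Resampling with mean() turns the hourly 0/1 flags directly into a daily raininess fraction.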
So I want to talk about slicing a bit here, because often you have this big data frame and you want to look at just a small part of it. One fantastic thing is you can look at the first five rows, like that. But you can also say, I want everything from May to September, and you can just give the dates as strings, and pandas will parse them, compare them to the date objects, and give you what you want, which is abnormal. And then you can say: take the summertime data, get the Berri column, and find out whether or not it's less than 2,500. This is a pretty heavily travelled bike path, so if it's less than 2,500, that's weird. I found that out by looking at the graph; I was like, let's pick 2,500. Very scientific. And then what I can do is take that "summertime less than 2,500" result, index my data frame with it, and say: show me all of the days where there were fewer than 2,500 bikes. And this is all basically one line of code.

And then we can see that those are kind of rainy days. So I believe myself a little bit more now, except maybe this day. But I had a theory here. What was it? Oh, this day wasn't too rainy, but it was kind of cold. And also, it was the day after St. Jean, and maybe that had something to do with it, and maybe not. I don't know. You have to be careful when you're looking at numbers, because you're just going to start making stuff up. That's data science. Don't just make stuff up.

All right, that's all I have to say about pandas. I have a little bit of advice. One: read a little bit of the documentation. The documentation is quite good, and there's a lot of it, with a lot of examples. Every time I read it, I'm like, oh no, here are a million things I should have known about, and why don't I know about this? And I use this every day. It's like a 460-page PDF. You don't have to read the whole thing, but just read a little bit, and you'll learn things.
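The date-string slicing and boolean indexing might look like this, with invented counts; "Berri 1" as the column name is an assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical daily counts rising through the year (day i has 20*i riders).
dates = pd.date_range("2012-01-01", periods=366, freq="D")
bike_data = pd.DataFrame({"Berri 1": np.arange(366) * 20}, index=dates)

# Date strings get parsed and compared to the DatetimeIndex for you:
# this keeps everything from May 1 through September 30.
summer = bike_data.loc["2012-05":"2012-09"]

# Boolean mask: which summer days had fewer than 2,500 riders?
quiet = summer["Berri 1"] < 2500

# Index the frame with the mask to see only those days.
quiet_days = summer[quiet]
print(quiet_days)
```

The last three statements collapse naturally into the one-liner from the talk: bike_data.loc["2012-05":"2012-09"] followed by the comparison and the mask.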
Wes McKinney, who wrote pandas, has a book called Python for Data Analysis. It's quite good; if you see it lying around, you should read it. Two: always use vectorized operations. I haven't talked about this too much, but don't write your own loops over data frames. It will be slow. Though apparently Numba makes it fast; I don't know about that, so you should learn about it and teach me. But yeah: always use vectorized operations, and operate on the whole thing all at once instead of writing your own loops. And that's it.
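As a tiny illustration of that vectorized-vs-loop advice, with a made-up Series:

```python
import numpy as np
import pandas as pd

counts = pd.Series(np.arange(1_000_000))

# Slow: a Python-level loop touching every element one at a time.
# total = sum(x * 2 for x in counts)

# Fast: one vectorized expression over the whole Series at once,
# pushed down into NumPy's C loops.
total = (counts * 2).sum()
print(total)
```

Both compute the same number; the vectorized version just does it without a Python-level loop.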