Okay, I'll try to rein it in. For those watching the video later: you missed the intro, sorry. I think of the process like this. We have a question, then we write some code. Then we run the code and we have results. We interpret the results, and maybe we have answers, maybe we have more questions. Each of these things lives somewhere: the question lives in our brain, the code is text — Python, pandas — and the results run on the computer. A big challenge with this, I find, is the interfaces. The arrows between those stages are where the work actually gets done. The first arrow goes from brain to text, from our brain to the programming language. Next we have text to computation, and then figures, back from text to brain. That's a lot of steps, and we can think of it as a pipeline. And what do we want to do as engineers? We want to optimize our pipeline, right? We look for the slowest stage. Optimizing the results-to-brain step would mean speeding up my brain, and that's going to be a little difficult for me. The text-to-computation step is, in a lot of cases, already great: we can write better compilers, and at least for small data sets — maybe not for giant ones — code runs almost instantaneously as soon as we've written it down. So today we're going to talk about optimizing the first part: the interface from having a question to having some language that represents that question in a way the computer can understand. Now, it looks like almost everybody in this room has used pandas at some point to go through this process.
You have some ideas, you represent them in your pandas code, you run it, and you have some answers. pandas is fast and very powerful, but I've found it can be a little overwhelming at times. So here, you know: import pandas, foo is a pandas DataFrame, foo-dot — and here are all the attributes on foo, 219 of them — and that sometimes has me feeling like that. So let me go for a second to the Zen of Python. If you open up your console and type import this, you'll see a beautiful little poem, and here are two lines from it: simple is better than complex; sparse is better than dense. Recently I learned this great library called dplyr, which is really fantastic, but it's in R, not Python — and I love Python, and I think many of you in this room love Python. I found that dplyr maps a lot better onto the way I think about data analysis, making me much faster at writing code to analyze data. So the goal of dplython is to bring the ideas of dplyr to Python. All right — when I say it's Zen, or it's simple, it's because there are really only six main things you do in dplyr, or now in dplython. I'll read these out, and hopefully everyone can see them okay: sift, select, arrange, mutate, summarize, and group by. For those of you familiar with SQL, a lot of these will sound kind of SQL-ish as well. Sift basically means you remove some rows. I'm assuming here that every row of the data represents, say, an observation, and the columns are variables of that observation — this is called tidy data format. So you can remove some rows, or you can select specific columns — maybe you don't want all the information — and that will look a lot like a SELECT statement.
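The verb vocabulary can be sketched in plain Python over a list of dicts standing in for a tidy table. This is a toy illustration of the six ideas, not dplython's actual API (which operates on a pandas DataFrame subclass with deferred `X` expressions); all names and data here are made up:

```python
# Toy versions of the dplyr/dplython verbs over a list-of-dicts "table".
# Not the real dplython API; just the vocabulary.

def sift(rows, pred):        # keep only rows where pred(row) is true
    return [r for r in rows if pred(r)]

def select(rows, *cols):     # keep only the named columns
    return [{c: r[c] for c in cols} for r in rows]

def arrange(rows, key):      # sort rows by a key function
    return sorted(rows, key=key)

def mutate(rows, **new):     # add columns derived from existing ones
    return [{**r, **{name: f(r) for name, f in new.items()}} for r in rows]

def summarize(rows, **aggs): # collapse the whole table to single values
    return {name: f(rows) for name, f in aggs.items()}

def group_by(rows, key):     # bucket rows so later verbs act per group
    groups = {}
    for r in rows:
        groups.setdefault(key(r), []).append(r)
    return groups

# A tiny made-up diamonds table.
diamonds = [
    {"carat": 0.23, "cut": "Ideal", "price": 326},
    {"carat": 1.04, "cut": "Premium", "price": 5324},
    {"carat": 2.06, "cut": "Ideal", "price": 14762},
]

big = sift(diamonds, lambda r: r["carat"] > 1)
slim = select(big, "carat", "price")
rich = mutate(slim, usd_per_carat=lambda r: r["price"] / r["carat"])
stats = summarize(
    rich,
    mean_usd=lambda rs: sum(r["usd_per_carat"] for r in rs) / len(rs),
)
```

Each verb takes rows in and hands rows (or a summary) out, which is what makes them compose so naturally.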
You can sort by different columns; you can add new columns derived from existing columns; you can summarize, which collapses a column to one value — if you're taking the mean of a column, you summarize it into one value; and you can group by, which has your operations done on groups instead of overall — so if you want the mean for a few different categories of objects, you would group by before you summarize. Now, there are a lot of different mathematical functions, different things you can do inside all of these, but these six verbs are how we manipulate our data. Okay, so I'm going to give a few examples using a data set of diamonds. If you're familiar with the ggplot2 package from R, I believe this data set ships with it; it's used for example visualizations. Each row is a diamond, and there are a few different columns describing it. Carat is a measure of the size of a diamond; you have the type of cut, the color, and some other attributes — I don't know much about diamonds, so I don't know much about those. So let's talk about sift. This is a really simple thing that says: we only want to keep a few rows. Say I'm just interested in really big diamonds — I just want the diamonds that can make me rich. Sift is a function whose first input is a data frame, and — this is going to look a little funky — the additional inputs represent columns, and operations on columns, of your data frame. I'll describe some of the guts, what's going on behind the scenes, in more detail later, but basically X is a stand-in for diamonds. I'm going to take a second to run this in the console. That font is way too small. So here's our data frame — lots of different columns — diamonds, and now we can call sift on it. And yeah, thank you. I thought I'd be better at one-handed typing.
Okay, so now we've filtered, and we've kept just the rows where the carat is greater than four. We can also do other operations on carat. For example — I don't know why you'd want to do this — maybe you only want rows where the square of the carat is greater than four. Now we see a lot more rows, because anything over two carats gets displayed. So we can do other operations on top of carat. And another thing with sift: we can stack filters. Say we add an additional condition, X.cut is Ideal — now all the rows we get back will have an ideal cut. So that's idea one: break every task into these little verbs that each do one thing. The next idea is piping. The X thing might look strange, a little funky, to all of you Python experts, and this piping might also look a little funky. Rather than the function call we had before, what we can do instead — I'm going to put the mic down and be loud — is take diamonds and pipe it into sift. That works as well: we drop the data-frame argument, write this pipe instead, and the data frame goes straight in. It does the thing we want to do, and you'll see the reason for it in a little bit. This might also remind a lot of you of the Unix command line. Okay, so here we have select and mutate, and now maybe you can see a little more why it's nice to have this piping. I'm just going to copy and paste this into the command line. Okay: select, just like a SQL SELECT — and there are ways to do this in pandas, as many of you know — will just keep certain columns. As you can see here — I'll blow this up a little more; can everyone see this okay? Sorry, you can't see the bottom? Okay, thank you everybody, it takes a village. This is a little messy because we lost the indentation, but: you have your diamonds, you pipe them into select.
So if we just run that, it selects the two columns: just carat and price. Next, what I've done is mutate. Mutate adds a new column. Say I'm interested in the relationship between carat and price: if I get a much bigger diamond, am I going to be much more rich? So here I've added this new column — and again, you can't see that, so I'll hit Ctrl-L — usd_per_carat, which is X.price divided by X.carat. And head just shows me the first five rows, so I don't have to keep scrolling up. So now we have carat and price, and we've added a new column, usd_per_carat. The reason this pipe operator is very useful is that it helps with composing these operations together. When I think of what I want to do to the data, I think of it very sequentially — maybe not everyone thinks this way, but I do: I'm only interested in these two things; I want to make this new variable; maybe I want to summarize; et cetera. And I've always found that each thing I want to do maps really well onto one of these six verbs. So we'll go through a few more examples. This next one is summarize. We have the same thing we did before, usd_per_carat, but now I'm curious: what's the average dollars per carat? And maybe I want to know the standard deviation and the median. So this takes what we had before — the same select and mutate — and now we summarize: X.usd_per_carat.mean() on this new column we created, plus its median and its standard deviation. And it gives us a data frame as output — three columns, one row — with the values we're curious about. Okay, but wait: my diamond retailer comes to me and says color has a really big impact on diamond prices. So let's say we use group by.
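That summarize step amounts to collapsing a derived column down to a few scalars. A minimal plain-Python sketch of the idea, using the standard library's statistics module and made-up usd-per-carat numbers rather than the real diamonds data:

```python
import statistics

# Hypothetical usd-per-carat values standing in for the derived column.
usd_per_carat = [4000.0, 4500.0, 5200.0, 3900.0, 6100.0]

# summarize: collapse the column to one row of scalar statistics.
summary = {
    "mean": statistics.mean(usd_per_carat),
    "median": statistics.median(usd_per_carat),
    "sd": statistics.stdev(usd_per_carat),  # sample standard deviation
}
```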
This is where group by comes in. Again we take the data we were just using, and now we add this little group_by. I'm making a new column here called bin — actually, I'm sorry, we won't do the color example; maybe later. This says: I want to know whether this is different for different-sized diamonds. Maybe for really big diamonds there are diminishing returns — people don't care how much bigger it is, so they're not willing to pay as much — and maybe for the smaller ones they care more. So X.carat, integer-divided by one: basically I'm cutting off the fractional part of the carats. Is it at least zero carats, at least one carat, et cetera. Just by adding this little group_by, we now have these bins — the rough size in carats — and we can look at the mean, median, and standard deviation for each. Okay, and I'm going to go back into presentation mode. We can also hook in external code. Say I want to visualize something. Right now this is a little clunky and I'd like to improve it, but you can call this as a decorator, or as a function you call on a function. So I'm importing pylab to do plotting, I run this decorator on the scatter function, and now I can pipe my data directly into it. sample_frac samples a fraction — here it's sampling 10% of the diamonds — and then this is a scatter plot where the x-axis is carat and the y-axis is price. Here's another example: ggplot is now a Python package, a port of the ggplot2 library from R, and this shows the same thing — we can run transformations on our data and pipe it directly into plotting code. Okay, so maybe a lot of you are saying: what is going on? What is this X? What is this piping? How are you doing this? This is black magic. The ideas behind it are actually fairly simple.
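The group_by step — bin each diamond by its whole-carat size, then summarize within each bin — can be sketched in plain Python. Toy data again, not the real diamonds set:

```python
from collections import defaultdict

# Made-up diamonds: carat size and price.
diamonds = [
    {"carat": 0.4, "price": 900},
    {"carat": 0.9, "price": 2800},
    {"carat": 1.2, "price": 6200},
    {"carat": 1.7, "price": 11000},
    {"carat": 2.3, "price": 17000},
]

# group_by(bin=X.carat // 1): bucket each row by its whole-carat bin,
# collecting the usd-per-carat value for each diamond.
groups = defaultdict(list)
for row in diamonds:
    groups[int(row["carat"] // 1)].append(row["price"] / row["carat"])

# summarize: mean usd-per-carat within each bin.
mean_per_bin = {b: sum(v) / len(v) for b, v in sorted(groups.items())}
```

With real data you could also check whether the per-carat price flattens out in the bigger bins, which was the diminishing-returns question.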
So this isn't exactly our code, but it's a baby version of what's going on behind the scenes. Let's get another show of hands: how many of you know what operator overloading is? Okay, looks like about half, so I'll explain it really briefly. If you have a custom Python object and you want to use the plus sign or the minus sign — if you want to use operators on it — you can, and it's actually pretty simple. All you do is write this method, underscore-underscore-add; there's a special one for each operator. That lets you run custom code for adding. So if you want to make a vector and do vector addition, that's how — it's how a lot of the nice stuff in pandas or NumPy works: if you have a NumPy array and add one to it, it'll just work. Here, this object I've called Later has a queue of functions that are going to happen to it as soon as it gets data. Whenever we do something to it — say, X.carat plus one — __add__ gets called, with other being one in this case. We create a little lambda, x plus one, append it to the queue, and return ourself. So for x plus one, we just return the same object. Then later on, when the data is present and we actually want to run it, we call this evaluate method. That takes our object — in the case of dplython, a data frame — and for each of the lambdas, the functions in our queue, keeps applying them to the output. It looks sort of like a reduce, for those of you familiar with functional programming. Then we return the output. You can copy and paste this and it will run in your console: foo equals later plus 10, give it an input, and it'll give you the right thing. And these objects are reusable. Okay, I'm running a little short on time, so I'm going to skip some of this. For piping, it's kind of the same thing.
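That baby version can be made runnable. This is my reconstruction from the description above — a Later object that queues up deferred lambdas via operator overloading and replays them in evaluate — not dplython's actual source, and only + is implemented:

```python
class Later:
    """Records operations now, applies them to real data later."""

    def __init__(self):
        self.todo = []  # queue of deferred functions

    def __add__(self, other):
        # `later + other` doesn't compute anything yet: it just
        # enqueues a lambda and returns self so expressions chain.
        self.todo.append(lambda x: x + other)
        return self

    def evaluate(self, data):
        # Replay the queued operations on the actual data,
        # threading the output along like a reduce.
        out = data
        for f in self.todo:
            out = f(out)
        return out

foo = Later() + 10 + 5    # nothing computed yet; queue holds two lambdas
result = foo.evaluate(1)  # now runs (1 + 10) + 5
```

Because the queue is kept around, the same deferred expression can be evaluated again on different inputs — that's the "reusable" point.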
We just overload the right-shift operator, __rshift__. And we have a lot of other features that are very useful, such as joins — you saw the head statement — implementing a lot of what's in dplyr, for those familiar with dplyr. All right, the future. dplython is working right now; it's up on GitHub, and there are things we still want to do. It's very close to pandas' speed in a lot of cases. I should also mention that it uses pandas on the back end, so it's going to be very compatible with pandas code — the data frame we're using is just a subclass of the pandas DataFrame. We're not trying to replace pandas or anything like that; we're just trying to build this additional API on top of it. I didn't mention that earlier, but I'm mentioning it now. So we're very close to pandas in most cases. When we have a lot of groups and you summarize on those groups, it's a little slow at the moment, but we're working on fixing that. So that's one goal: get it as fast as possible. We'd like it to be a little more usable, to integrate more seamlessly with built-in Python functions like max, min, et cetera. And full features: for the R fans, tidyr is another great data-manipulation tool, and we'd like to implement some of its features; there are other features inside dplyr that we're not done with yet. So these are some plans. It's actually being used in a few places — I've heard from people at Square and people at Twitter who are using it currently. Hopefully you're the next person to use it. I don't know why it's all Jack Dorsey startups; maybe he's a fan, who knows? And yeah, we're up on GitHub — you can Google it pretty easily, and I'll show the link at the end of the slides. So, in conclusion, go back to that first slide. You have a question, you have your brain, and your brain needs to turn that question into text. Hopefully dplython is a great way to do that.
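The piping trick works the same way: overload __rshift__ on a small wrapper so that data >> step >> step reads left to right. A minimal sketch of the idea, not dplython's actual implementation (which puts __rshift__ on its DataFrame subclass and takes deferred verbs on the right-hand side):

```python
class Pipe:
    """Wrap a value so `>>` feeds it through functions left to right."""

    def __init__(self, value):
        self.value = value

    def __rshift__(self, func):
        # `pipe >> func` applies func and re-wraps, so pipes chain.
        return Pipe(func(self.value))

double = lambda x: x * 2
inc = lambda x: x + 1

result = (Pipe(3) >> double >> inc).value  # (3 * 2) + 1
```

Reading the pipeline in the order the steps happen, rather than inside-out as nested calls, is exactly the Unix-pipe feel from earlier.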
With that, I'd like to say a big thank you — particularly to the dplyr people and the pandas people, who do excellent work, to our dplython contributors, who have been an awesome, awesome help, and to you, the audience. And with that, I have a few minutes that I saved for questions. Thanks.