The Python landscape has been growing quickly. You have traditional packages like NumPy and SciPy; if you've used Python in the scientific space for a while, you've most likely used those, plus Matplotlib for visualization. But then you have new packages coming out that are having a really large impact. Things like Pandas, which is a data frame for Python, a lot like R's data frame. There's scikit-learn, which is this really nice machine learning library. It's not the first, and it's certainly not the only one, but it has a great interface, an API that lets you parallelize quickly if you'd like, for example. There's IPython Parallel, which is a very fault-tolerant way to execute thousands of simulations. There's Spark, which I'm actually really excited about, so I put in some Spark examples today. Spark is a MapReduce-like framework, but it's in memory; it's very fast, and it has more functionality than just map and reduce. So we'll talk about that today, hopefully. And then you have traditional things: HDF5 has a few Python interfaces, like PyTables and h5py. There's also mpi4py, and we're not going to talk about those. So there's a lot happening. And then on top of all that, if that wasn't good enough, there's also the IPython notebook, which is a very, very cool web-based interactive framework, and you can do all these things from the notebook. So we'll do some of that today. So the objectives: really, one of the big goals is just to become familiar with the IPython notebook and run it if you like it. I'd be happy if that's all we did, but I'd also like to introduce some of the packages out there that you could use. And unfortunately, with just two hours, maybe even a little less, we probably can't go too deep into any of these. So here are a couple of questions just to think about. Talking about data analysis, how do you currently wrangle data?
And that could be reshaping it from its current format into something you can plot or something you can analyze. How do you visualize your results? You probably have a tool you like to use for that. How do you perform statistical analysis? Or if you do some machine learning, like PCA or linear regression, what tools do you use for that? And then, how would you run, say, 1,000 simulations in parallel? That's becoming such a common thing, and it's surprisingly difficult for people to have a tool they can do that with. Data is getting bigger quickly; the tsunami is coming, apparently. A tsunami of data, that is. Do you have a tool in your toolbox that allows you to plow through a terabyte of data interactively and reasonably quickly? The answer for a lot of folks is no. So what I'm excited about is that Python is providing a lot of capabilities in this area. So here's an outline. I'll talk about the IPython notebook a little bit, probably fairly quickly, and I might skip a few things. We'll log into an Amazon instance, and you can run through the code I'm discussing while I run through it. So you won't have to actually type any Python code, but you can see it in action. That's a good intermediate approach, and it goes a little quicker. Then we'll work on functional Python, because the functional programming model is becoming really popular for distributed computing, and we need it for IPython Parallel and Spark. We'll talk about the Pandas library. Then we'll do some parallel examples, and we might do them all, or we might have to skip one so we get to Spark. And then you can just sit back and relax and enjoy the Spark portion, because that's not on Amazon. Okay, does anybody have any questions before we get rolling? I haven't explained that yet, yeah. So if you got a piece of paper, I haven't explained it yet. That's a great point; I'm gonna hold off a minute on explaining it.
I will say that if you go to the web page, you don't have to do any of the Amazon things; you can click on the outline and you'll be able to follow along in just the HTML, and it's got the answers. I also put in some extra things that we're not gonna have time to cover, but I just couldn't leave them out, and we might have time if we hurry. And then I have some links at the bottom to everything we're talking about. So I'll get to the Amazon part in about 20 minutes or so. Any other questions? Okay, so before we talk about the notebook, let's just talk about Python. There are a lot of different languages. How many people use Python? Yeah, okay. So you don't even have to raise your hand. Just yell out your favorite thing about Python. Where'd he go? I can't hear you. You have to really yell. What's that? Interactive? Bam, I love interactive Python. Yes, what else? It's free. It's free, yeah, as in beer and as in open. What else? Easy, yeah, the syntax is easy. It's easy to learn. All the good ones. Anybody else? What's that? ArcGIS uses it, yeah. That's right. What's that? For fun. Think of all the fun we're having already. Okay, that's a good list. I made a little list also. It has very simple, clean syntax. We've already hit easy to learn; it's interpreted, strongly and dynamically typed, and it runs everywhere. I think that's pretty cool; you can run it on just about any box. It's free, and it's expressive, and I think that's one of the things I love about it: with very few lines of code, you can do quite a bit of work. And there's one I didn't put in here: it has a lot of abstraction. Of course, you've got several options in programming styles you can work with: procedural, object-oriented, and functional. If you haven't used functional, we'll talk about what that means in Python at a very high level. Here's an abstraction example I just wanted to show you. I hope you can see that.
I'm reading in an HDF5 file, two of them actually, and then I'm multiplying them together with np.dot. So this is the NumPy package and the PyTables package, and it's, I don't know, 10 lines of code or something like that. What I love about Python is that I can write this almost without going to Google to figure out how to do it, right? Most of this I can just do. And I compared it to my C++ implementation, which I ran on the Janus supercomputer. On the x-axis I'm making the matrix really big; I think the 25,000 by 25,000 case is probably like six gigs or something, and I've got three of those in memory. Time is on the y-axis, on a log scale. I'm comparing my C++ implementation, where I use HDF5's C library and the Math Kernel Library from Intel, with NumPy using the exact same stuff underneath: I compiled NumPy with MKL and I'm using PyTables. And voilà, you can't tell the difference. And that is so amazing. I get all this benefit without actually having to learn the interfaces of these libraries, and believe me, when I write the C version, either with Intel's library or with HDF5, I always have to go to Google. So I think Python's pretty awesome that way in terms of abstraction. So, let's hit the notebook for a few minutes. The notebook is a web-based Python environment, so you can execute Python code through a web browser. But you can also embed text and video and LaTeX and HTML and all kinds of things inside the notebook. You can run it locally on your Mac, or you can run it on your favorite cluster resource. Very cool. And it makes it very easy to share the results. So here's one traditional setup: you're logged into your favorite machine, which could be your local machine, just a terminal, you've got an editor open, maybe you have an IDE, and then you want to do some interactive data analysis.
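The actual cell uses PyTables to open the HDF5 files; since that library may not be on hand, here's a minimal sketch of the same read-then-multiply pattern with plain NumPy .npy files standing in for HDF5 (the file names and sizes here are made up for illustration; the real version would swap np.load for tables.open_file):

```python
import os
import tempfile

import numpy as np

# Write two small matrices to disk, standing in for the talk's big
# HDF5 files (the benchmarked version read these with PyTables).
tmpdir = tempfile.mkdtemp()
a = np.random.rand(100, 100)
b = np.random.rand(100, 100)
np.save(os.path.join(tmpdir, "a.npy"), a)
np.save(os.path.join(tmpdir, "b.npy"), b)

# Read them back and multiply. np.dot dispatches to whatever BLAS
# NumPy was compiled against (MKL in the talk's benchmark).
x = np.load(os.path.join(tmpdir, "a.npy"))
y = np.load(os.path.join(tmpdir, "b.npy"))
c = np.dot(x, y)
print(c.shape)  # (100, 100)
```

The point survives the swap: the whole read-multiply workflow is a handful of lines, and the heavy lifting happens in the compiled libraries underneath.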
So you've got a plot window that pops up, and then you alt-tab around, or you click and point around those things. What the notebook does is let you stay in a single window, and you get inline plotting and inline commenting. And the comments aren't like a pound-sign Python comment; it's LaTeX, or it's Markdown, which is an easy way to generate HTML. So you can have really rich comments and really nice displays, all inline in a notebook that's very easy to share. We'll be playing with that today. This is just to say you're not separating your code and your text and your results; they're all together in the same document, and it's great for telling stories. It's not as great if you're doing an object-oriented programming model where you're building some large object; for that I'd work in files, but you can always run the model from the notebook. Just to say that you can run it locally or remotely; you can run it anywhere you want, and it looks the same from your perspective. And that's actually kind of an interesting point. In the traditional model, where you pop open a window and you have an editor and a terminal, you've got to be careful to ssh -X or -Y into that session so you can do the X-forwarding thing. With the IPython notebook, you don't have to do that. You just log in through the browser and you get everything you need to work on your results. That's a little convenience there too. You can export these in a variety of ways: the slides you're looking at are actually an IPython notebook just running as a slideshow. The slides you might have looked at online were just generated from the notebooks quickly, and I'll show you how to do that. You can also export straight to LaTeX if you want, and it'll give you a PDF or a LaTeX file. So there are a lot of ways to share.
There's also an online service called IPython Notebook Viewer where, if you like GitHub or gists, you can just save these as gists and display them through a web viewer. There are some keyboard shortcuts, and I think the only one you need to know today is Shift-Enter, but if you're going to use this, you'll want to memorize a few others too. I highlighted, I don't know, the 10 that I use all the time. Shift-Enter just says run this cell and go to the next cell, and I'll demo that in a second. For today, that's really all you have to do: Shift-Enter. You can double-click on cells to edit them and things like that. There are different kinds of headings, and you can move cells around with keyboard shortcuts, but Shift-Enter is what we need today. I'm going to skip over some of this because we have so many interesting things to do; I don't want to get bogged down here. But I do want to show this video. Here I'm using the IPython display YouTubeVideo object to display a YouTube video. This is Fernando Perez; he started the IPython project, and he was a CU student in Boulder. This is actually from 2011, and I just want to start it here. He's showing how... oh, you're not going to hear this. Let me see if I have a... He's talking about the very nice visualizations that the New York Times does with D3, which is a JavaScript library out of Stanford, and he demos an extension that loads D3 into the notebook, so that he can simply display the same object after having loaded it. Okay, so what's the point there? Has anybody followed the New York Times and their new visualizations with D3? It's this incredible JavaScript library; it's highly interactive in the web, and it's kind of changing visualization for the web. Well, it's been hooked up to Python, so if you're in the web browser, in the IPython notebook already, you can actually do some of your analysis with D3 if you'd like.
And of course, everybody got really excited at PyCon there. So, the video... let's see. I think: embedded plots. Part of what this example is showing is that if you have some sort of data structure, in this case just a NumPy vector, and you want to see what's in it, you just type x and hit return. And just like in the shell, you can see the contents, and it does all the right things: it doesn't display the whole thing, and it puts the ellipses where I need them. And then if I want to look at a histogram of this, I can just type hist and it shows me the histogram right below, all embedded in the notebook. I think I've hit that point already. One other thing I like about it is that it's totally customizable. This is just a web interface, so you can modify the CSS to make it look different, and you can add JavaScript if you want. You don't have to just accept the current look and feel of the IPython notebook. You can also roll your own output formats if you'd like. There are these things called magic commands, and I'm gonna plow through these; I'd encourage you to look through them again later. I just want to point out, for those of you who love R and ggplot2, that you can execute R from Python and from the notebook. So if you'd prefer to use ggplot2 for something, you can send your data over with the R magic commands and plot your stuff in ggplot2 if you'd like. And you can do that with Bash, and with Julia, and with Ruby. So there are a lot of ways to use Python with other languages through the notebook. I'm gonna skip this slide too; it just has the steps you need to take if you do want to run this on a remote cluster. Most of the time you have to log in, launch the notebook, and then tunnel back to it, because rarely are your cluster supercomputers going to open ports that would allow direct access.
And there's some help here if you'd like to see it. I'm gonna demo logging into Janus in a second. Yeah, I think that's good. So let's do... oh, not conclusions yet; come back to that slide. What I want to do now is have you all log in to the Amazon instances. So I've got one going here, and I apologize: you have to type in that long Amazon text string. I couldn't figure out a way to give people an online link without everybody trying to log into the same one accidentally. So type that string in. It should bring up a certificate warning, a big yellow page that says this site isn't verified. You should just accept it and trust that nothing bad is going to happen. And hopefully when you do that, you get to a screen that looks like this. What I might do is just take five minutes for everybody to get to that screen, and I'll come around if you're not there. Yeah, so what you're gonna see once you accept it is the file browser for the IPython notebook, and you can go anywhere in your system you want with that now. Is anybody having issues? Just raise your hand, yeah. Oh, you can't share; let me get you one. Go to the one that has Amazon in it. And sorry about the double http, that's a mistake. Here you go. Would anybody else like a login? Doesn't have one? I've got a few extra; this is your chance. Go there, and your password is RC. Anybody need some help or a login? Yeah, so you mean if you wanted to run it locally? Yes, yeah. So you can create your own notebooks if you'd like, and these are your instances, so if you wanna do something else while we're talking, you can. Yeah. Oh, if you go back to that page, you can hit New Notebook and it will bring one up. Okay, we probably didn't need five minutes. Everybody running that wants to be? Okay, hold on. Anybody else? Okay, I'll fix it. Oh, you know what, actually 39. Let me just give you a new one... well, let me fix it.
Is yours still empty? Hit refresh and see. Oh, yeah, so if you run these on your local machine, you don't have to log in. For example, and I'll just show you, I'm on my Mac here. I'm gonna kill this and just restart. Here, a little bigger so you can see it, is how I would start it: I would just say ipython notebook, because I have IPython on my Mac, and there you go, it brings up this window. I didn't have to log in; I'm just running locally. The reason we're on Amazon is, one, I wanted everybody to have a couple of cores to play with, and two, it's really hard to get everybody on the same software version. This is such bleeding-edge code that you could be one point release off and it won't work. So I tested everything this morning. Man, I hope it works. So let's do this. Go into the folder... oops, I'm in the wrong spot; let me go back to Amazon here. Go into this csdms folder, go to Initialize, and just run that notebook. The way you do that is Cell, Run All, and it should be really quick. This just grabs data if you don't already have it (you should have it) and gives you some more pleasing Matplotlib options. So again, you just go up to Cell and Run All. There's nothing we really need to talk about here; we just wanna do that one step. Okay, I'm gonna close that, and you can shut that notebook down when you get back to the page. Then let's go to the Python notebook, and you should see something that looks like this. Everybody's kinda seeing this? So I thought we would do the scientific hello world. If you click in the box and hit Shift-Enter, it should execute this code, and there's a ton going on here, so let's really quickly talk through it. I import the print function from __future__ because I like it. You don't have to do that; it just means you put parentheses around print's arguments instead of a space.
Then import math. So in Python, and some of this is gonna be review, I'll go quick: if you wanna use a package that isn't already loaded when the interpreter starts, you just import it. In this case, I'm importing math. There are a lot of ways you could do that: you could say from math import *, or import math as m if you wanted. I think plain import math is very common, and if you do that, you have to reference all of its methods and variables through the math namespace, which is a good thing; it keeps namespace collisions from happening. So the first thing I do is cast the string "4.2" as a float. In Python, you can cast variables back and forth; in this case, I just say float with parentheses. I store that value in r, then I say math.sin(r) and store that value in s. Really I'm just flexing scientific Python muscles here a little bit, right? We're not doing anything useful. Then I call print, and I'm using the new-style string formatting. So I say hello world, the sine, and I have curly brackets with a zero, meaning the first thing I'm gonna pass to the format command, and then curly brackets with a one and a colon, the one meaning the second thing I'm gonna pass, with a little formatting spec after that, 0.2f. Then I say .format(r, s), and that's it. It passes the value of r, which is 4.2, and the value of s, which is about -0.87, and it formats them for me. So that's a pretty compact little chunk of Python code. Does anybody have any questions? Yeah. What's wrong with your notebook? What user number are you? Should be like user00. Yeah, do you have your username? I can restart your notebook. It should be on your paper; it says user five. Okay, let's try this. You can try it again here in a second. Okay, any other questions on the code? Okay, so let's do some functional Python.
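Spelled out, the cell looks something like this (the exact message string is my reconstruction from the talk; note that sin(4.2) is actually about -0.87):

```python
from __future__ import print_function  # print as a function, even on Python 2
import math

r = float("4.2")   # cast the string "4.2" to a float
s = math.sin(r)    # reference sin through the math namespace

# New-style formatting: {0} is the first argument to format(),
# {1:0.2f} is the second, shown with two decimal places.
print("Hello world, sin({0}) = {1:0.2f}".format(r, s))
```

Running it prints something like `Hello world, sin(4.2) = -0.87`.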
How many people have done some functional programming? Anybody? Like Lisp users, or Scala, or something like that. So in Python, there's a quote, I think off of Wikipedia, that Python acquired its lambda, map, reduce, and filter from a Lisp hacker who missed them. So we've had these for quite a while in Python. The reason I wanna talk about them is that we're gonna use them quite a bit for parallelization and distributed computing. So here's the map abstraction. This is such a common pattern that Python actually elevated it into syntax: you don't just have map, you have list comprehensions, if you're familiar with that term. Here's how it works. Let's say you have a function; in this case, square(x), which takes a number and multiplies it by itself. And I have a list of numbers, which is just one, two, three in this case. The pattern is: I wanna apply that square function to every single number in my list, right? One way I could do that is with a for loop: for x in nums, call the square function, put the result in some results list, and pass back the results list. This happens so often, in so many languages, that there is a map function. The map function just says: I notice I have a container of something and a function, and I wanna apply the function and get the results back. So all I have to say is result equals map, then the function I wanna apply, and then the container. And that's always the case with map. So that's the map function. I don't know what else to really say about it. Does that look too familiar, or foreign, or does it feel okay? Yeah, yeah, right? I know from my own experience that once you see it and use it a little bit, you're like: oh, that's awesome.
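Here's that pattern both ways, loop and map, plus a hedged taste of why it matters: the standard library's concurrent.futures exposes the same map shape for running it in parallel (the talk uses IPython Parallel and Spark for that; ThreadPoolExecutor here is just a stand-in to show the shape):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # Apply to one element: multiply the number by itself.
    return x * x

nums = [1, 2, 3]

# The explicit-loop version of the pattern.
result_loop = []
for x in nums:
    result_loop.append(square(x))

# The map version: one function, one container.
# (On Python 3, map returns an iterator, hence the list() call.)
result_map = list(map(square, nums))

# A parallel version has exactly the same shape: same function,
# same container, just a different map.
with ThreadPoolExecutor(max_workers=2) as pool:
    result_parallel = list(pool.map(square, nums))

print(result_loop, result_map, result_parallel)  # [1, 4, 9] three times
```

Once code is written as a map, swapping the serial map for a parallel one is a one-line change.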
Yeah, so I think the reason it's important is that almost every parallel for-loop facility you'd use in Python has a map function. Learning to write your code like this means you can parallelize it like that. The other thing is that map is really efficient; it's an efficient way to do this kind of for loop. So I don't have a great answer for that; I'd say use it and see if you like it. I think it's a pretty cool paradigm. This next one's a little weird, I'll be honest; we're gonna move on to anonymous functions. Yeah, it's just a shortcut, exactly; that's a great way to say it. And the thing it's doing is applying whatever function you pass it: you're applying the square function to every single element in numbers and passing back the result. You always have to write the function you wanna pass, unless it's anonymous. That's right. That's just for the example, yeah. And if you really wanna do vector code, you're probably in NumPy; this is really just for standard Python lists and pairs and things like that. Yeah. Oh, okay, let's fill those back up. Tell you what, let me fill them up when we take a break so we can keep moving; it'll take me a second to do that, and we'll take a break in like 20 minutes. This isn't crucial for you to be executing, I don't think. Okay, so map is a little weird; let's do something even weirder. Instead of always writing the function, let's just write it on the fly, right? So we don't actually define the function; we just have an anonymous function, and that's called a lambda expression. It's so popular that Java 8 is getting it, and they're really excited about it. So here's how it works. And I should say the reason this works is that in Python, functions are data, essentially.
And so you can define a function as a variable, you can pass functions to other functions, and functions can return functions. That is a little weird, but here's what it gets you: I can say lambda x, and then x times x. So instead of doing def square(x), I just say lambda x, and it's all on one line. This says I have an anonymous function, x times x, and I wanna apply it to every element in my range of 10. So now range(10) is kinda like my nums. If I wanted to do this in a single line, I could define numbers as the range function, which gives me a list from zero to nine, and then apply this anonymous function to it and get my result back. And of course I could store the result in a res list and then print that too. So that's an anonymous function. When we talk about Spark, we're gonna use those. I mean, we don't have to use them, but you kinda wanna use them because it makes things a lot more concise. But you don't need them for parallelism. Okay, let's quickly do reduce. Reduce just takes two arguments at a time from your container and cumulatively applies a function to them. So if you have, for example, a function add(x1, x2), and you do a reduce with that function on your numbers list, it'll just go through the list and add the pairs cumulatively. If you had subtract, it would subtract them; multiply, et cetera. So it gives you a single value back, and of course you could do that with a lambda expression also. With reduce, just think: it always takes two values and applies the function consecutively. Okay, filter is actually kind of a break from what we've been doing. Filter just says: you give it a predicate, a function that returns a Boolean value, and if it's true, the element is included in the result, and if it's false, it isn't.
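Both of those in one snippet; note that on Python 3, reduce lives in functools rather than being a built-in:

```python
from functools import reduce  # a built-in on Python 2, in functools on 3

numbers = list(range(10))  # [0, 1, ..., 9]

# Anonymous function: square each element without defining square().
squares = list(map(lambda x: x * x, numbers))

# reduce takes two values at a time and cumulatively applies the
# function: ((0 + 1) + 2) + ... , i.e. the sum of the list here.
total = reduce(lambda x1, x2: x1 + x2, numbers)

print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
print(total)    # 45
```

Swapping the lambda for `lambda x1, x2: x1 * x2` would give the product instead; the shape of the reduce call stays the same.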
So it's just a way to take a huge list of numbers and filter it by some criterion, and that criterion you define in your function. That's the filter function. So for example, I can filter my results to those less than 10. And again, if I'd like, I can swap out the less-than-10 function with a lambda expression, and then I don't have to define the function if I'm only using it in the one spot. Okay, I would say those are new concepts for a lot of folks, so just digest them and don't worry too much about understanding every little bit. I think you'll see them enough today that they'll start to make sense. Res is just a list that we created somewhere up here. There, we created res; this was the value of res when we passed it in. So it's just a list, and we're filtering for numbers that are, in this case, less than 10. If the predicate is true, it keeps the element. You know, I just realized kind of a bummer: we're gonna do Spark right around noon, everybody's gonna be hungry, and it's gonna be a bunch of functional programming. So, I don't know, I'm open to skipping Spark if we're having fun with other things. Let's see how it goes. So let's do sort of the hello world of data analysis in Python: let's read in the Hamlet text, count all the words, and make a frequency distribution. You should have hamlet.txt in your data directory, and I would just skip over this next block; it basically gives me a list of words using some regular expressions, which we don't need to talk about. A common way to calculate how many unique things you have in a list is to take the set of it, and then you can convert it back to a list. So you take your list, which has duplicates, you take the set of it, you convert it to a list, and bam, then you take the length, and you've got 4,086 unique words in Hamlet. So let's do a frequency distribution.
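Since hamlet.txt isn't reproduced here, here's the same set-and-filter machinery on a tiny hand-made word list (the words are stand-ins; only the counts differ from the real exercise):

```python
# Stand-in for the word list parsed out of hamlet.txt.
words = ["to", "be", "or", "not", "to", "be", "that", "is", "the", "question"]

# set() drops the duplicates; list() turns the set back into a list.
unique_words = list(set(words))
print(len(words), len(unique_words))  # 10 8

# filter with a lambda predicate: keep only words longer than two
# characters (this is the next step in the talk).
long_words = list(filter(lambda x: len(x) > 2, words))
print(long_words)  # ['not', 'that', 'the', 'question']
```

On the real Hamlet word list, the same two lines give the 4,086 unique words and the short-word filtering the talk describes.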
And again, this is partly for review, but here we can throw out all the words that have a length of two or less, and I'm gonna use the filter function. So I have my words list, and I say: if the length of the word, so x refers to an individual word in my words list, if len(x) is greater than two, keep it; otherwise filter it out. If I do that, I can see... and actually let me just print some words here. You can see, oops, there's that fancy new print function. You can see I've got a bunch of words in here, things like to and of. In fact, I can select out and print just 10 of them to make it cleaner. And then when I filter, I get rid of, say, of, because it has length two. Okay, the Python dictionary is one way we could do the counting. We have a list of words with a lot of duplicates, and we wanna store key-value pairs. If you haven't used the Python dictionary, you should learn this one for sure; it's a very popular data structure. You basically stick the word in as the key, and then you just wanna increment the count every time you see it. We could certainly do this with a map function, but let's do it in a for loop, because I think that's cleaner. I initialize the dictionary, and there are a few ways you can do that; this is just one way, I just say dict with parentheses. So I have an empty dictionary; that's what d is, and I can even print d here just so you can see it. Then, for every word in my list, I use this really helpful dictionary method called get: I ask my current dictionary for the word's value, and if the word isn't there, get returns zero. So I can just walk through these words and add one to the value from the get method. In other words, the first time I see a word, get gives me zero, because the word wasn't there yet.
The second time around, I get that value, which is now one, and I just add one to it and have two, et cetera. You don't have to use the get method, but if you didn't, you'd have some if statements: if this word is not in the dictionary, assign it the value zero; else get the value and add one to it. The get method wraps that up. So there's our empty dictionary, and then we just filled it with the for loop. Good. Let's look inside it really quick. Now I can say something like: for k in the keys, and I'm only gonna look at the first few keys because otherwise things will get long, print the key and its value. That's the key and the dictionary value for the first 10. And you can see that we have words with counts going. I'm just gonna skip over this next block; it's how you'd sort a dictionary, and I think it's kind of horrendous, but basically it just says: hey, sort my dictionary and return some tuples for me, words and counts. And then of course I can wrangle them with a map function. Right now I have them as a list of pairs, but I can't plot that; I need them as lists of values. So I can use a map function to grab those, or I could use the beloved list comprehension in Python. And again, just so we can keep going, I'm gonna skip over that a little bit, but feel free to ask me questions at lunch if you'd like. Here's a little utility for plotting these, then. And so with not too many lines of code, we took the text of Hamlet, counted the word frequencies, and threw out small words, and then we can see that the word Hamlet is as popular as, or more popular than, the word that: we hear the word Hamlet at least as much as we hear that or it is. Okay. So the point here is that we have this process of taking raw data and processing it, and that could be very large data, very large scientific data.
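A minimal version of that whole counting-and-sorting pipeline, again on a stand-in word list:

```python
words = ["the", "king", "the", "queen", "the", "king"]

d = dict()  # empty dictionary
for word in words:
    # get(word, 0) returns the running count, or 0 the first time a
    # word shows up -- this replaces the if/else version.
    d[word] = d.get(word, 0) + 1

# Sort into (word, count) tuples, most frequent first.
counts = sorted(d.items(), key=lambda kv: kv[1], reverse=True)
print(counts)  # [('the', 3), ('king', 2), ('queen', 1)]
```

The sorted list of pairs is what then gets split apart (with map or a list comprehension) into the separate word and count lists for plotting.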
And we did some wrangling, and we might wanna do some exploring and some visualization. There are a lot of ways to do that, and I think Python with the notebook is a really excellent way to interact with your data, and there are so many new tools, which we'll talk about after the break. Hopefully it'll inspire you to connect a little bit more with the notebook. Okay, here's my proposal: it's 11:22. How about we take... do you guys want a break, or do we just wanna fix some notebooks and get going? How many people would like a break? Just raise your hand. I'm with you, let's keep going. Okay, who's got notebook issues? Okay, hang on just a second. So I'm gonna do this: I'm gonna restart everybody's notebook and copy the data into everybody's folder again, okay? So you're gonna get kicked out of your notebooks, but it's easier that way. Here they go. Oh, I maybe shouldn't have done that; that's a lot of... let's see, that's gonna take a minute. Well, okay, let's stop that. Okay, just yell out usernames for me, or user numbers, if you've got one: zero-one-one? Okay, and you are there, and... what was yours, five? And what was yours? Okay, anybody else? Nine, okay. Anybody else? Two, okay. These notebooks are pretty stable. Anybody else? Okay. Yeah, you'll have to re-accept this; it generates a new certificate, so you have to say: yep, give me a certificate. Okay, and I need to copy some data over really quick, so let me do that, okay? Yeah, it just kicks you out completely, and then you'll have to accept a new certificate. Yeah, there's a cost. I have about 40 nodes running, and they're large nodes; they each have four cores, and it's about 10 bucks an hour, so, yeah. Okay, that one needs a reboot. It should work on Firefox. Is it an older version of Firefox? You might have to clear your cache. Okay, try clearing your cache and see if that works.
Hopefully that works; if it doesn't, I think I have a few other instances, maybe you should just try a different instance. Anybody having extreme malfunctions with their notebooks? I'm gonna try a new one. Okay, I'm gonna give you guys one at least. It's just broken, I'll swap with you. Yep, and then you're... It asked for a password? Yep, yeah, the password's RC and then the username. Made it. That one? I'll restart it. Okay, it's totally possible this notebook is just gonna bomb and this might just be a demo. Is anybody having success with their notebooks? Okay, so some people's are working, some are not? Okay, I apologize to those for whom it's not working. You know, I woke up this morning and I was like, I'm gonna do something really simple: I'm gonna fire up a cluster on Amazon and launch a bunch of notebooks with users and copy some folders around. You'd think that would be easy. Do you wanna try a new notebook? Yeah, looks good. Okay, let's go do some pandas. And you guys, okay, is it working? You might have to type it in again. I'm gonna leave the three remaining ones up there, and if you wanna swap out, just put your old one here and I'll restart it. Oh, okay, so good, okay, yeah, I think it is, yeah. If it's not, you can do the swap-out trick. Okay, so I admit the functional thing was a little weird; let's do something a little less weird. So I'm gonna do this pandas library. So that's number three. You can just click on it to fire it up, and I'll make it a little bigger in case you wanna just watch. So what did pandas bring to Python? You can think Excel spreadsheet: it gives you basically a table in memory that's a lot like a spreadsheet, and it's incredibly efficient. So it's a lot like an R data frame. There's a lot of other languages that are trying to duplicate the data frame type thing.
It's built on top of NumPy, and Wes McKinney, the author of the code, is very conscious of speed, so he does a bunch of speed tests. It's had huge growth recently, and it's great for medium data. And when I say medium: I've killed Janus nodes that have 20 gigabytes running, gone up to a high-mem node with 100, and pandas is just fine, it works great. So I don't know where the threshold is past 100 gigs, but for that size data set, it's totally appropriate. I put some resources on here. There's some great videos online for pandas if you wanna watch them later just to get more familiar with it, and I would say this is one of the Python libraries that you just have to commit to using a little bit, because it's a lot of syntax and it's a little different, and I think if you just try to do your workflow over and over again, you'll get it. The other thing you can do, which is a little nerdy, but just consider it, is set aside some time in the morning and just do the same pandas code over and over again. It's kind of like a karate kata: you're just practicing moves by yourself, and if you do that, you'll be really quick with pandas. So you can consider that. This could be your first kata, for example. Let's see, let's skip some of that. So here's an outline we're gonna shoot for. We wanna just reshape our data a little bit. Pandas is great at reshaping, so we'll talk about what that means. If we have time and there's interest, this might be a group choice where we do more pandas and less functional programming. We might do this movies example where we could do some group by and some joining, and this is a lot more like database or spreadsheet kind of work, where we can take two tables that have keys and join them together; pandas does a great job of that, and then we can group by and summarize in different ways, and the example there is kind of fun. So let's just get started. This is a modified version of a link I put here, which is I think a video.
Yeah, so if you wanna watch this data-wrangling kung fu with pandas. And I'm gonna skip the first couple of lines; they're not really important to talk about other than just creating some data to play with. So here's what our data looks like. It's a bunch of dates in column one, and then we have some models, I think we have A, B, and C, and then we have some value for each model, right? And this is kind of a stacked format, and the idea is, first off, we just wanna read that into a data frame. So pandas has a great method called read_csv, and we just pass it our CSV file, and I'm gonna store it in the variable df and print df, and you can see what we get. We get a column on the far left side which is an index, and we can set that index if we want, and then we have three columns: a date, a type, and a value. So what we wanna do is manipulate these columns a little bit to kind of see what we get. So we wanna reshape this table with pandas. Yeah, oh yes, good. So the only thing you need to do is you can either hit this play button, which is just run cell, if you wanna do that, or, I like to do shift-enter: I just hold down the shift key and hit enter and it will also run that command. Okay, so who's got some thoughts on why you might store your data this way? Anybody? You've got some models, you've got some dates, you've got some values. And maybe ultimately what you wanna do is see a comparison of models, right? So one reason this is popular is you may not know how many models you're actually gonna have. It kind of seems silly for this example, but I was helping somebody recently with a project at work, and they said, I think I've got 200 different columns. I don't wanna reshape this into columns. And when we reshaped it, it was like 2080. And I was like, does that sound about right? And he's like, oh, I didn't realize I had that many. So this is a great format.
If you don't know ahead of time how many columns you're gonna have, you just store it in this format and then you reshape it with pandas and get exactly what you need. The other reason is maybe you wanna have more than one metric associated with each model. So then you just have a duplicate row or something. So you could say, in the bottom row here, it's model C, seven. That could be model C, my value type, seven, and the next row could be model C, my other value type, 20, or something. And it's a very compact, robust way to sort of gather data, and then you can reshape it later if you like. So here's the idea. It's called pivot, and it's very much like a pivot table in Excel. I think once you see it, you'll get it. What I'm gonna do is take my data frame, whose variable name is df, and call pivot, and I give it a row, a column, and a value. So I want the rows to be the date of my new table. I want the columns to be the type, which is the models, so models will be across the top. And I want my values to be kind of what's in the actual table. And so if I do that... it's gonna take a second, more than a second. There we go, just had to do it again. You can see what we got, right? We have a new representation of that data table, in a way that maybe we wanted to see it originally when we gathered the data, but it wasn't efficient to gather it that way. And then in pandas, if you wanna see things like, what are the titles of my columns, you can get a list back with columns, or you could get a list back with the index, which in this case is the date. And if we wanna access certain things we can: we can say, give me the results of a specific column. So we just give it the column name and it will give us just that column of data, which is kind of handy.
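The stacked-to-wide pivot he's describing looks something like this. The dates, model names, and values below are made up for illustration; newer pandas takes `index`/`columns`/`values` as keywords where the talk's era used positional row/column/value arguments.

```python
import pandas as pd

# Stacked format: one row per (date, model, value) observation.
df = pd.DataFrame({
    "date":  ["2014-01-01", "2014-01-01", "2014-01-02", "2014-01-02"],
    "type":  ["a", "b", "a", "b"],
    "value": [10.0, 20.0, 11.0, 21.0],
})

# Reshape: dates become the index, models become the columns.
wide = df.pivot(index="date", columns="type", values="value")
print(wide)
print(wide.columns.tolist())  # the model names
print(wide.index.tolist())    # the date strings
```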
If we wanted just the actual values in that column without the attached date index, we can just say the results with the column name and then dot values, and it will just give you an array of the values without the index included. Okay, for rows, there's a few different access methods, and the one I remember is this ix method. We can't access rows with just a name, because we're using that for columns, so Wes had to provide a method, and the method is ix. And so you can say ix zero and it gives you the first row. You can also say ix and give it the name of the index, which in this case is a date string. It's a string with a date in it, I should say; it's not a datetime object. And that will give you the same row as well. So you can get to rows or columns with indices like zeros and ones, or you can get them by name if you'd like. You can also, and this is kind of NumPy-like, say I want the third, fourth, and fifth rows and just the columns from one on, and it will give you a smaller subset of the table. So I'm slicing into my table in a very specific way. There's a bunch of different ways to look at statistics, so I can say mean or sum or count. And just like in NumPy, where you specify an axis of zero or one, you can do the same thing with pandas, and it has the same interpretation. So here I took the mean of all the rows; I averaged the rows with axis equals one. Here I'm counting how many things I have in each column, so now I've got columns with a count of the number of items. So now let's create a new data frame, and you can kind of glaze over this step if you'd like; it's not that important. But I'm creating a data frame from a dictionary, a Python dictionary that looks like this. And now I just wanna put that on my original data frame. So I'm gonna use the concat method and put it right on the bottom.
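A sketch of the selection and axis operations above, on a small made-up table. One caveat: the `.ix` accessor from the talk was later removed from pandas; the modern equivalents are `.loc` (by label) and `.iloc` (by position), which is what's used here.

```python
import pandas as pd

wide = pd.DataFrame(
    {"a": [10.0, 11.0, 12.0], "b": [20.0, 21.0, 22.0]},
    index=["2014-01-01", "2014-01-02", "2014-01-03"],
)

col = wide["a"]                 # one column, date index attached
vals = wide["a"].values         # bare array of values, index dropped
row0 = wide.iloc[0]             # first row by position (the talk's ix[0])
rowd = wide.loc["2014-01-01"]   # same row, selected by its index label

sub = wide.iloc[1:3, 0:]        # NumPy-style slice: rows 1-2, columns from 0 on
row_means = wide.mean(axis=1)   # average across each row
col_counts = wide.count(axis=0) # non-missing entries per column
print(row_means.tolist())
```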
So now I can use df.shape to look at the shape of my data frame: instead of having 12 items, I now have 14. So my concatenation went well. And, you know, we could drop, say, row two if we wanted, and we do that with axis equals zero here. The head command also is handy. There's a head and a tail command; it just lets you peek at the top or the bottom of your data frame, so that's useful. And we can also drop a column if we wanted to. Yeah, that's a great segue. So the question is, so far we've been working with data where everything's present, right? So let's do some missing data. So if I take my now 14 rows and I reshape them again, I am gonna be missing some data, right? Because I only added two values in there. And so here's the missing data thing. We reshape with the pivot and we get these NAs. And the way pandas handles this is brilliant. Just, I mean, I think it is. You can still take the mean and the count, and unlike NumPy, it doesn't just blow up on you, right? It just says, I'll just skip it. And if you wanna find those missing values, there's two handy methods: one is isnull and one is fillna. And so you can see if you have any null values, and if you wanna fill them, you can. So here, isnull gives me a Boolean matrix, basically, that says, yep, I've got two missing values. If I wanna just fill those missing values with zero, I can also do that. It automatically did that for me, yeah. The other thing that's kinda nice, and I don't have a demo of this, but if you were merging two data sets together and one of the columns had a bunch of missing data because it didn't have values, you can choose to fill that linearly. You can choose to fill forward, fill backward, fill it with zeros. I mean, it's very handy for that kind of work. I can see that everybody's probably thinking of scenarios where they can use this.
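Here's a small sketch of that missing-data behavior, with NaNs placed by hand rather than produced by a pivot:

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# Unlike a raw NumPy mean, pandas skips NaN instead of propagating it.
means = wide.mean()            # column means, missing values ignored
counts = wide.count()          # non-missing entries per column
mask = wide.isnull()           # Boolean frame marking the holes
filled = wide.fillna(0)        # copy with NaN replaced by zero
print(means["a"], int(mask.values.sum()), filled.loc[1, "a"])
```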
So, I don't know, maybe at lunch we can have a special group to talk about the scenarios. The one thing you'll notice is, when we do this reshape, our index is now the date string. And to be clear, this is just a string; it's not a datetime object. So one thing that's really handy: with pandas, if I wanted to actually make a copy of this table, I actually have to say results.copy, otherwise I'm getting a view of it, and changes I make to the view will persist in the original one. So this is a lot like NumPy in a sense. So here I filled NA with zero and I added this inplace equals true. And this one actually gets me a lot. If you don't include the inplace equals true keyword, it fills the values with zero, but it doesn't do it to the original set; it returns a copy. And the copy in this case just goes nowhere; I didn't assign it to anything. So it's quite common that you might say temp.fillna(0), not assign it back to temp, and it doesn't work. So basically what I'm saying is you can do this, or you have to do something like this, where you have temp over here. So those are equivalent: one does it in place, the other one makes a copy and overwrites the original variable. Yeah, yeah. And inplace isn't available for everything, so sometimes you have to just look in the documentation to see if you can do it. I use reset_index a fair bit, too. If you do some sort of reshape or group by, it might give you an index, and if you wanna reset that index, you certainly can. And so let's see what happens here. So here I called reset_index and I did an inplace equals true, so that would do it to the original. And now you can see that my new index is just integers and my date string is actually a column. And there's an as_matrix command which will get you to a NumPy array, basically a two-dimensional NumPy array.
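The inplace-versus-copy gotcha in code, on a made-up frame. (One modernization: the `as_matrix` method mentioned above was later removed from pandas in favor of `.to_numpy()`.)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan]}, index=["2014-01-01", "2014-01-02"])

# These two are equivalent: mutate in place, or reassign the returned copy.
tmp1 = df.copy()
tmp1.fillna(0, inplace=True)   # modifies tmp1 itself, returns None
tmp2 = df.copy()
tmp2 = tmp2.fillna(0)          # returns a filled copy; must be assigned back

# The classic mistake: calling fillna and dropping the result on the floor.
tmp3 = df.copy()
tmp3.fillna(0)                 # no inplace, no assignment -- tmp3 unchanged

# reset_index turns the date index into an ordinary column.
flat = tmp1.reset_index()      # new integer index; old index becomes a column
print(flat.columns.tolist())
```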
So this is quite handy if you're cleaning up data with pandas and then you wanna pass it to scikit-learn, for example; scikit-learn wants a NumPy array, a two-dimensional array. And so you can clean it up and pass it around. The opposite of pivot is melt, and that's just a term taken from R. So if we wanted to take our pivoted data and just melt it back into a stacked format, we could do that. And so here I'm storing the pivot as the results variable, then I'm going to reset the index and melt it back, where I pass the id_vars of date. And so we get the original back. So we can pivot and melt all we want. The other thing I'll say about pivot and melting is: pivot only works if you don't have duplicate data, that is, if you wouldn't have to aggregate data. If you want to do some sort of pivot with a group by, it's a different method; it's called pivot_table, and we'll do an example of that in a second, I think. Okay, you can write values to files very easily. It's just called to_csv, and you can either include the index or not. I usually don't include the index if it's an auto-generated one. And that's a very, very short introduction to pandas. It writes it in the directory where the notebook's running. So yeah, if you want to see it, you can kind of come in here; let me actually just write it. One thing that's nice about these notebooks is you can just do some Linux commands with a bang. So I type exclamation point ls and I can see what's in the directory, and here's the file that I just wrote. So it's just in the same directory as my notebook. Okay. So it's 11:45, and we just don't have time for it all. So we're gonna have to do some group voting, which is gonna have to be a yes/no kind of thing. So there's really three more things that I think would be awesome to talk about.
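A sketch of the melt and to_csv steps, again on an invented wide table rather than the notebook's data:

```python
import pandas as pd

wide = pd.DataFrame({
    "date": ["2014-01-01", "2014-01-02"],
    "a": [10.0, 11.0],
    "b": [20.0, 21.0],
})

# Melt back to the stacked format: one row per (date, type, value).
stacked = pd.melt(wide, id_vars=["date"], var_name="type")
print(stacked)

# Write it out; skip the auto-generated integer index.
stacked.to_csv("stacked.csv", index=False)
```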
And I think they each take more than 15 minutes, and it's gonna be lunchtime soon and people are gonna lose interest. So here's some options. We can do more pandas. The pandas is kind of fun. We'll do some group bys and we'll figure out the best movie for date night. So if you haven't seen that one, it's an interesting way to use pandas to increase your social status. So that's one option, and I'm not pushing any of these. The other thing we can do is look at parallel Python. And I think you get to pick two, so the vote is gonna be: which one don't you wanna see? I think that's how we're gonna do it. So pick two. The second topic is parallel Python. I'll just show you some really quick and easy ways to parallelize your code once you have it in the map format. And this has been successful at CU, you know, running thousands and thousands of simulations on hundreds of cores. I think that's pretty cool. And one of the examples is a machine learning classification example I took from scikit-learn. So that's an option, that's option two. So option one, date night. Option two, parallel with some machine learning. Option three, this, yeah. Yeah, but I think we really only have time for two. I mean, I'm all for trying to cram them all in; the option we vote out we'll just smash in at the end if we have time. The third option is Spark. And it's a very functional paradigm, but it's sort of the only thing in Python that lets you do really massive data sets. So we'll talk about MapReduce, if you've heard that term, and we'll do some very simple examples in the Spark framework. And just so you know, I've recently run a few hundred gigabytes on Janus interactively over 40 nodes with Spark. So it's a very, very cool framework. It's not something you can download and just go play with on your local machine, because it's not really built for that. It's a cluster tool.
So you have to have your favorite cluster folks install it and give you access to it. Okay, so those are the three. I'm just gonna... what's that? Yes, please. Yeah, the first option is more pandas, and it's date night with movies. Yeah, pandas is like the R data frame. So R is a statistics language. And yeah, it would be a different library. And so, if you wanted to emulate R in Python, you would have Python plus pandas for the data structure, plus scikit-learn for the machine learning, plus statsmodels for the statistics, plus matplotlib for the plotting, that kind of thing. Yeah, pandas is just getting you the data structure, and you can summarize and count, you know, put things in different buckets, but it doesn't give you linear regression. It doesn't give you principal components. That's scikit-learn. So, all right, who wants to analyze some movies with pandas? Oh wait, no, I'm sorry, we're voting for the one we don't want. All right, how many people don't care about pandas anymore? You've had enough. Okay, fair enough. I won't hold that against you, yeah. Okay, that might come at the end. How many people don't care about parallel Python? Oh, it breaks my heart to ask it that way. Okay, how many people could care less about Spark? Oh, okay, just give me a minute. I'll digest that one. Okay, here's what we're gonna do. We're gonna go really fast so we can get it all done. I'm just kidding. All right, fair enough. Okay, let me just say this about Spark. Just keep in mind that when you have massive amounts of data that you wanna analyze and you don't know how to do it, and you don't wanna bust out mpi4py and try to fiddle with who gets the data where, then just know that Spark is ready when you are. Okay, movie night. This is actually fun. This is a good break. We've been kind of getting into some intense pandas with fake examples, and now we can do something with some real data.
All the credit goes to Wes McKinney; his Python for Data Analysis book is fantastic if you wanna learn more about it. And here's the question. We have this ratings data set of males and females and movies and ratings, and we wanna know: what's the best movie for a date night? Cause it's true that men and women have slightly different tastes in movies, as we'll see. So what we're really trying to do is pick a movie that everybody's gonna enjoy, and by everybody, I mean male and female. So let's do this. Let's read in the data. So this is an ugly function, in a way. But it's because of the movie data: it's three files, and it's not comma separated, it's not tab separated, it's double-colon separated. I've never seen a double-colon-separated file before this, but it makes sense, cause you might have a comma in the title and you don't wanna accidentally split on that. So why not pick two characters you never see together? Hence, I think, the double colon. The other thing is these files don't have a header in them. So I'm reading in the table, I'm passing it how I wanna separate it, and I'm saying header equals none, but use the names I give you, and I just put the names as a Python list above the read call. All that to say: I'm gonna just read these in. If you haven't seen this notation before, this is called Python unpacking, and it's where, if you give back three things, you can just assign them to three things. It's a very popular way to do it in Python. Okay, so I've read the data and I can just look at my users, for example, my ratings and my movies. And what I notice is that users have user IDs, and then in the ratings, each user can rate multiple movies, so they have a user ID as well, and I wanna merge those together, and obviously pandas is gonna fill in the missing values cause these are different-sized tables. Users is smaller and ratings is much larger, and then the same thing goes for movies.
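The read step he describes, sketched on a tiny inline stand-in for one of the double-colon files. The column names here are illustrative, not necessarily the ones in the actual data set; note a multi-character separator needs the slower Python parser engine.

```python
from io import StringIO

import pandas as pd

# A two-row stand-in for a '::'-separated file with no header row.
raw = "1::F::25\n2::M::31\n"
names = ["user_id", "gender", "age"]  # illustrative column names

# header=None because the file has no header; supply names ourselves.
# A multi-char sep like '::' requires engine='python'.
users = pd.read_csv(StringIO(raw), sep="::", header=None,
                    names=names, engine="python")
print(users)
```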
My ratings has a movie ID, and that links to a table with IDs that has one ID for every single movie, and what I'd like to do is just link all those together so I have one huge table with all the information, and then I can just apply pandas. So I'm gonna skip this little bit of code. This is what I call Perl Python, in that it looks like Perl, meaning when you come back to it in a month you don't even know what you did. So I'm gonna skip that. I'm glad some of you found that funny; if you haven't used Perl, I mean, I tried for a while and it was hard. So here's the cool part about pandas. There's a merge method on the data frame, and I can just give it the data frames, and in most cases it does the right thing if the column headers are correct. So the first thing I do is I say merge ratings and users, and don't create a data frame for that, just take the result and then merge that with movies. So I've got this kind of nested merge going on here, and when I do that... there, it's done, and in fact I'm gonna look at the shape of this thing. I've got a million-some rows with 12 columns now. So I've got all the columns and I've got all the rows, and you'll just have to trust me, but it did it correctly. This would be very difficult to do if you didn't have pandas in Python. I'll leave it at that. Let's see what kind of columns we have here. We've got now all the columns we hoped to have: movie ID, rating, user ID, and a bunch of other things. And what I wanna do is just select out a few columns to work with. I don't want all the columns. I want the short title and I want the rating. And so I'm just gonna grab those two values and create a new table called temp. All right, so here I've got the short title and the rating, and notice I've got multiple rows per movie because I had multiple users. So now what I'd like to do is ask some simple questions, like: what's the average rating per movie?
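The nested merge, sketched with tiny invented tables in place of the million-row MovieLens ones. pandas joins on the shared column names (`user_id`, then `movie_id`) automatically:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
ratings = pd.DataFrame({"user_id": [1, 1, 2],
                        "movie_id": [10, 11, 10],
                        "rating": [5, 3, 4]})
movies = pd.DataFrame({"movie_id": [10, 11],
                       "title": ["Movie A", "Movie B"]})

# Nested merge: ratings+users on user_id, then that result +movies
# on movie_id -- one row per rating, all columns joined in.
data = pd.merge(pd.merge(ratings, users), movies)
print(data.shape)
```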
So I take all the movies and I group by title, and when I do that group by operation, I just wanna take the mean of the ratings. So that's a pretty straightforward thing to do. And in pandas there's a groupby method, and I'm gonna print the result there just to make it clear. So in pandas, when you say groupby and you pass it a column name, it creates a groupby object, and that groupby object is essentially a collection of data frames. So we can apply things to those data frames if we'd like. So in this case, I'm gonna take my groups and just apply a mean. So I've got everything grouped by titles; I walk through each of my groups, I take the rows that share the same group, and I just take the mean of the column, which in this case is just ratings. And so now I'll get a data frame, and I'll look at the head of it here. Now I've got a data frame that has one movie on each row and the average of the rating, right? And that's what I wanted. Oh, we're getting close. And now I just need to say, well, I really just wanna see which one's the highest rated. So I need to sort this new data frame, called mean rating. So here we go, we can use a sort method, and that's pretty straightforward. I take my data frame, I call the sort method, and I pass it the column I wanna sort on, and then I can say ascending true or false depending on how I want it sorted. And then I'm just gonna look at the top 10. So this line here, and I'm gonna roll this up a little bit, is gonna give me the 10 highest-rated movies that, if you haven't seen them, you should go watch, right? So there they are; let's make sure they're big. So there are the highest-rated movies; they all got an average rating of five. If you haven't seen them, I bet they're incredible. Let's walk through them really quick. There's Bittersweet Motel, I haven't seen that one. Actually, right, so how many people have seen at least one of these, right? So what's the problem, really? Is that, yeah, we've got a couple of big fans, right?
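The group-by-and-mean step in miniature, on invented titles and ratings. One API note: the bare `.sort()` method from the talk's pandas era was later renamed `.sort_values()`, which is what's used here.

```python
import pandas as pd

temp = pd.DataFrame({
    "title":  ["A", "A", "B", "B", "B"],
    "rating": [5, 5, 4, 4, 4],
})

# groupby('title') yields one sub-frame per title; mean() collapses each
# group down to a single averaged row.
mean_rating = temp.groupby("title").mean()

# Sort highest average first and peek at the top.
top = mean_rating.sort_values("rating", ascending=False)
print(top.head())
```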
So we need to do some filtering on our movies. And so let's go back to our ratings, and instead of taking just the mean of the rating, let's take the count of the rating too, right? So we can get a mean and how many there were, and then we'll filter with that. So here I'm gonna do that again. I'm gonna group by again, but instead of just taking the mean, I'm gonna use the agg method and grab the mean and the count. And I can add functions to this list all I want; in this case, I just have two. And now I'm gonna sort it and look at the head again. I get the same list, and yes, I can see that Smashing Time actually had two five-star ratings and The Gate of Heavenly Peace had three, but everybody else just had a single one. And so ideally we wanna throw out the low-count items. Again, pretty easy to do in pandas. Here I can say: I'm gonna grab my mean rating, I'm gonna grab the count column, and I'm gonna say if the count is greater than 1,000, put that in a variable, mask. So mask, and here I think I've got some. Mask is a pandas Series. It's got 210 items that were greater than 1,000 in it. And you can see what it looks like: if I just look at the top of my mask, it has my index, short title, and a true/false value. So, for example, a million ducks did not have 1,000 ratings. Neither did 'night, Mother. Now we might wanna double-check these results. Okay, so now what I can do is apply the mask to my ix method. So remember, ix lets me select out rows. So I'm saying: just give me the rows that have greater than 1,000 ratings. And when I do that, I can see that, you know, Space Odyssey had 1,700 ratings, and I have the mean. So now I'm ready to do what I came to do, right? I wanna sort these, and see, that's just a double-check. Here's the highest-rated movies with at least 1,000 votes. So suspenseful. Here it is, ready? There we go. Okay, so this makes more sense, right?
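The agg-then-mask pattern in code, on a toy frame. The vote threshold is 2 here, standing in for the 1,000 used in the talk (and `.loc` stands in for the now-removed `.ix`):

```python
import pandas as pd

temp = pd.DataFrame({
    "title":  ["A", "A", "A", "B"],
    "rating": [5, 4, 3, 5],
})

# agg computes several summaries per group at once: mean AND count.
stats = temp.groupby("title")["rating"].agg(["mean", "count"])

# Boolean mask: keep only titles with enough votes.
mask = stats["count"] >= 2
popular = stats.loc[mask].sort_values("mean", ascending=False)
print(popular)
```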
The Shawshank Redemption had 2,227 votes, and it was the highest average-rated movie with 4.55 stars. Okay, so that's a pretty basic operation that you might imagine doing to your own data set, where you filter, you do some summing, some group bys, that kind of thing. I've run into this; I actually stumbled upon pandas when I was trying to solve a problem on our supercomputer. We've got like 1,300 nodes and it's collecting node data, and I wanted to group by node and pivot a column, and I was actually trying to do it in a dictionary for a while, and that gets to be really tedious. So I went looking and found pandas a few years ago, and to say it's changed my life is strong, but it's been a great addition to my tool set. Okay, so we've done this. Let's do, yeah, we're in good shape. We're gonna do a few more minutes on pandas and then we're gonna do some parallel. We'll probably squeeze in Spark at the end if we have time. Let's look at movies by gender to see which ones we like the most, and that will allow us to use a different kind of pivot table. So I mentioned when we use pivot, it's great for just reshaping a table, but what we wanna do now is have movies as the rows, and then on the columns we wanna have mean rating by gender. So we want two columns now, and so when we pivot that, we actually need to do some group by operations as well, and you could do this as a group by step and then a pivot if you want, but pandas gives you a helper function called pivot_table that does it kind of in one line. So I take my original data set... pd right here, I'm sorry, I misspoke, pd is the name of the library. Data is my original data set, there it is. I apply the pivot_table method and I say, here's what I wanna do: I wanna pivot on rating, so rating is gonna go inside my new table.
My rows are gonna be the title, my columns are gonna be the gender, and I have to provide it an aggregate function, because it can't just pivot without aggregating some values, right? We've got too many ratings per movie otherwise. So I give it the mean, and when I do that, I'm just gonna look at the first few rows: I get my title and then I get my gender columns, right? That did a pivot and an aggregation for me. If you can't remember, like me, sort of what pivot_table takes, you can always just type the name of the method in the notebook with a question mark, and it'll take you right to the documentation, so you can figure it out without going to Google. Anyway, there we go, come on. Okay, so now that we have that, we can sort by these different columns, right? So here I'm gonna take the mean ratings, sort by male, and look at the top 10. That's not quite right; here we go. I haven't applied the mask yet. So these ratings had movies that only had a single count, right? And so what I'm doing here is saying: redefine mean ratings to be just the movies that actually had more than 1,000 ratings. So it takes my table with male and female columns and it just selects the rows for those movies which had 1,000. Okay, and then when I sort by male, I can see that we like things like The Godfather and Raiders of the Lost Ark, things that you might suspect. And if I sort by female, we have a different list. But there's some commonalities, right? Like you can see Shawshank Redemption's in there. And The Usual Suspects is shared by both. And so we could keep going with this, but we only have 25 minutes left. So I'm gonna show you one more pandas trick that's really cool. Let's say we wanna create a new column that shows the difference in ratings. So it takes those columns and just subtracts the male and female values. It's very easy to do; we just basically write it down. We say the mean ratings diff column is the male column minus the female column, and I'm gonna take the absolute value in this case.
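A miniature of the pivot_table call, with made-up ratings. (The talk's pandas took `rows=`/`cols=` keywords; current pandas spells them `index=`/`columns=`, as below.)

```python
import pandas as pd

data = pd.DataFrame({
    "title":  ["A", "A", "B", "B"],
    "gender": ["F", "M", "F", "M"],
    "rating": [5, 3, 4, 4],
})

# pivot_table = group-by + pivot in one call; aggfunc resolves the
# duplicate (title, gender) cells by averaging them.
mean_ratings = data.pivot_table(values="rating", index="title",
                                columns="gender", aggfunc="mean")
print(mean_ratings.head())
```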
And so now we can see which movies we disagree on the most and agree on the most. So here's, for example, the ones where men tend to like them quite a bit more than women. That would be Animal House. This is where we, I guess, differ the most. So Mary Poppins: a little bit more popular with women. Reservoir Dogs: more popular with men. And let's look at, here's agreement. So we all like Jerry Maguire about the same, which isn't to say that we liked it. Good Will Hunting, notice, is actually a pretty highly rated movie that we agreed on. So here was the idea of date night: why not take this? And again, you can do this kind of in private, but I had to share this. You take this list and you say, I'm gonna look for the highest-rated movies that are most similar, and then that's the one I'm gonna rent, right? And just in the interest of time, I'm gonna roll through it to the answer because, man, I'm psyched to get to parallel computing. But you could do things like, well, let's look at the distribution of differences and just pick movies that are very similar. So I might say here my threshold's 0.1 on the difference, right? And then I could look at the ratings and say, let's pick highly rated movies, like something above, you know, 4.25. So here's the masks; I applied them in pandas fairly quickly and easily. And so, the highest-rated movies with the least amount of difference: North by Northwest, Rear Window, The Usual Suspects, and The Shawshank Redemption. So there's your practical take-home list today. It's not quite Friday, but you'll be ready when it is. So that's pandas. Let's stop there. Any questions about what we did in pandas? There's actually so much to pandas; you could spend a day or two on it. There's time series; there's hierarchical indexing on rows and columns, which is very useful. So anyway, I hope that at least gets you excited to check it out. All right, for the next section, you can run the parallel examples in your notebook.
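The date-night logic, sketched end to end on a tiny invented gender table. The thresholds here (0.15 disagreement, 4.25 mean rating) are illustrative, not the ones from the notebook:

```python
import pandas as pd

# Stand-in for the per-gender mean-ratings table (values invented).
mean_ratings = pd.DataFrame(
    {"F": [4.5, 3.0, 4.4], "M": [4.4, 4.0, 4.5]},
    index=["Shawshank", "Reservoir Dogs", "Rear Window"],
)

# New column: absolute male/female disagreement -- just write it down.
mean_ratings["diff"] = (mean_ratings["M"] - mean_ratings["F"]).abs()

# Date-night pick: small disagreement AND high overall rating.
close = mean_ratings["diff"] <= 0.15
good = mean_ratings[["F", "M"]].mean(axis=1) >= 4.25
picks = mean_ratings[close & good]
print(picks.index.tolist())
```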
I'm gonna switch over to Janus because I get more cores per node, and I think that's nice. And I'm actually gonna fire up a Spark notebook just in case. This is also just to show you that launching remote notebooks, once you have it automated, is just a script on your command line. This one, for example, logs me into Janus, fires up a notebook, and creates a tunnel for me so I don't have to do that every time. Having that automated makes you much more likely to actually use the notebook. So let's give this a second. I know a lot of you use the beach cluster; if you want some help getting this working, I'm happy to do that, just come talk to me. Okay, so I've got a notebook running and I've tunneled it back to localhost 999, so I'm gonna go to that localhost. Now I'm on Janus, and I've already cloned this repository over here, so I'm gonna go to the parallel example. When you run these, it'll look different, because you have four cores on your nodes and Janus has 12. And of course, if I wanted to use multiple nodes, we could do that too. Okay, so this is the quick and impatient version, if you just wanna see some parallel code in action. The first function I have is called work, and you can imagine it being something much bigger, like running your simulation or doing something that matters. This one just sleeps for x seconds, records the start and end time, and then returns the process ID that ran the work function, the amount of time it slept, and the start and end times. It's just a stand-in function so we can see what's going on, since I don't have anything real to do. So I have a function; now I'm gonna create a list of job times. Here I'm saying sleep for 0.1 to 0.3 seconds. I'm not sleeping very long, because I just wanna show the parallel capability, not really do something realistic.
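A minimal version of that work function and job list might look like this (the exact field names in the returned dictionary are an assumption; the notebook's version may differ in detail):

```python
import os
import random
import time

def work(sleep_time):
    """Stand-in for a real simulation: sleep, then report who ran it and when."""
    start = time.time()
    time.sleep(sleep_time)
    end = time.time()
    return {'pid': os.getpid(), 'sleep': sleep_time,
            'start': start, 'end': end}

# 50 jobs with random durations between 0.1 and 0.3 seconds.
job_times = [random.uniform(0.1, 0.3) for _ in range(50)]
```

Because `work` takes a single argument and returns a single value, it drops straight into `map`, which is what makes it easy to parallelize later.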
In reality, you might be running things that take minutes to hours, and you wanna parallelize those. So I'm gonna create 50 of those job times and do the serial map first. We've already seen this, but one more time: I'm saying, here are my job times, I wanna apply my work function to them and return the results using a map. If you set your code up this way, you'll see it's very easy to parallelize. This is gonna take a while because I'm doing it in order, just walking through my array and sleeping, and there, it's done. So that took a little while. Now, the multiprocessing library is built into standard Python. You don't have to download anything, you don't have to install anything, and it's like five changes and you're running in parallel. Here are the changes. I import the library, so import multiprocessing. In this case I grab the CPU count and print it out, and it's gonna say 12 because I'm on Janus. Actually, I'm gonna clear these cells, because it's more exciting when you don't see the answer. Okay, so I've got 12 cores, right? Then I create a pool of workers; that's really all the multiprocessing syntax, I guess. I create a pool of workers, and then the only change I have to make to my actual code is to put a pool-dot in front of my map function, right? If I do that, it's smart enough to know that I've got a pool of workers I wanna map these independent tasks onto. So it spreads them out, lets them do the work, gives me the results back, and I'm done before I know it, right? And we'll look at a plot in a few minutes where you can actually see what it looks like when it's running.
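Here's the serial-to-parallel change written out. The shorter sleeps are just to keep the example fast; the point is that only the `pool.map` line differs from the serial version:

```python
import multiprocessing
import os
import time

def work(sleep_time):
    """Toy task: sleep, then report which process ran it and when."""
    start = time.time()
    time.sleep(sleep_time)
    return {'pid': os.getpid(), 'sleep': sleep_time,
            'start': start, 'end': time.time()}

if __name__ == '__main__':
    job_times = [0.1] * 8
    print(multiprocessing.cpu_count())   # 12 on Janus, 4 on the Amazon nodes

    # Serial version: the plain built-in map.
    serial = list(map(work, job_times))

    # Parallel version: the only real change is pool.map instead of map.
    with multiprocessing.Pool() as pool:
        results = pool.map(work, job_times)
```

The `__main__` guard matters on platforms where multiprocessing spawns fresh interpreters; on a Linux node it is technically optional but still good practice.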
The other option is IPython Parallel, and the advantage, I think, is that it's either hard or impossible to get multiprocessing to work across nodes. I say hard because I haven't verified that it doesn't work; I've heard of people doing it, I just haven't been able to. So I tend to use multiprocessing when I'm on one big node and I just wanna quickly parallelize something without thinking about it, but if you have multiple nodes and you wanna use cores across them, that's when you might want IPython Parallel. The way that works is much like the pool: you create a client that attaches to your workers running remotely. If I do this, I get an error, because I haven't started the workers yet. Multiprocessing launches the workers for me automatically, kind of behind the scenes; with IPython Parallel, I need to do it myself. So I go to my notebook and I go to the Clusters tab. You can do, like, two, and you'll be running in the default profile; in fact, you probably only have one profile, though you can obviously create more. I'm gonna start 12 workers because I know I have 12 cores; you should start four workers if you wanna try this. Now that I've started the workers, I can go back to my notebook and create that client, because starting the cluster launched a controller and some workers, and I can attach to that set of processes. Just like with a pool of workers, I can look to see how many IDs I have on my client, and it tells me I have 12. There are a few more steps involved, but this can run across many nodes. So now, I guess this is the load-balanced view; there are a couple of different views in IPython Parallel. I create a reference to the load-balanced view, and then again I make a couple of small changes to my map command. Instead of calling map, I call the load-balanced view's map, and one of the differences is that what comes back is an asynchronous result. So results will be an asynchronous list.
All that really means, if you wanna use it, is that you put a .get() call at the end of your map, and it will return the list to you. So I'm gonna go ahead and run that. Okay, and this broke: it says time is not defined. If you use IPython Parallel, you have to do a few other things to ensure that the workers, which in this case are running the work function, can actually see all the modules. I'm gonna scroll up for a second and just show you that each worker gets the work function, but it doesn't get the imports. So I have a couple of options. I can move those import statements into the work function, and it will then be able to see those modules. Or I can do something like this, where I take a direct view of my client, which means I've accessed every single engine, and use the sync_imports command to import time and os, which are needed by my work function, onto each engine. Once I do that, I can run my code and it's done. So again, this is kind of a lot of information, but the goal isn't really to take one tool home; the goal is to get a preview of all the cool things happening with Python and then, and it might be different for everybody, to explore the things that apply to your workflow. That's what I'm hoping, at least. I realize there are a lot of details in some of these commands. These notes will be online, at least for as long as I'm at CU, so you can go back and look at them if you run into similar problems. And I think those are two of the main problems you'll see with IPython Parallel. Oh, so if you go to the bottom web address and you just type in one "http" instead of two, yep, that will take you to this page here, and the top is the stuff that we're going over, plus some stuff that we probably won't have time to get to. So you can read through those if you'd like. There's also a notebooks link right down here; let me just blow this up.
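Both fixes can be sketched like this. Option 1 runs anywhere; option 2 is left as comments because it needs engines already started from the Clusters tab, and it follows the standard ipyparallel pattern (take the details as a sketch, not a tested recipe):

```python
# Option 1: put the imports inside the function itself, so a remote engine
# that receives only this function can still find time and os.
def work(sleep_time):
    import os
    import time
    start = time.time()
    time.sleep(sleep_time)
    return {'pid': os.getpid(), 'start': start, 'end': time.time()}

# Option 2 (sketch only -- requires a running IPython Parallel cluster):
# from ipyparallel import Client
# rc = Client()
# dview = rc[:]                    # direct view: every engine
# with dview.sync_imports():       # run these imports on each engine
#     import time
#     import os
# lview = rc.load_balanced_view()
# results = lview.map(work, job_times).get()   # .get() waits for the async result
```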
If you click on the notebooks link, it goes to the repository where all the code is, and you can just download these and run them locally if you'd like, or however you wanna do it. And then there's also a link to the meetup group, and I'll just show that now. All these examples come from things that we do at CU on a roughly weekly basis during the semester. So if you wanna learn more about NumPy or Matplotlib or any of this stuff, you can go there and explore if you'd like. Okay, so let's do a little bit more. Let's look at what this map function really looks like in parallel. I'm gonna load a little plotting utility, and here I'm gonna go back to my pool of workers. This is the multiprocessing example: I map my job times onto the work function, and what I get back is a dictionary for each item in my work times. Then I can compute when each one started and stopped, and create a plot to see what it looks like. So it's gonna run, and there we go. You can see these boxes are varying sizes, because I started with random work times between 0.1 and 0.3 seconds, and you can see that multiprocessing has done a good job of packing the jobs on and spreading them out over my resource. That's 12 rows on Janus; you might see four rows if you're running it on Amazon. Okay, and then we can do the same thing for, I'm sorry, for IPython Parallel, and it does a very similar thing, where it packs each job across the resource. So that's an excellent question, and I don't know the answer in terms of multiprocessing, but it's a good segue to this link at the bottom: there are different schedulers, and I think that gets at your question.
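The data behind that timeline plot is easy to compute from the dictionaries the work function returns. This sketch skips the actual plotting and just groups each job's (start, end) span by the worker that ran it, which is exactly what the plot draws as one row of boxes per worker:

```python
import multiprocessing
import time

def work(sleep_time):
    # Imports inside the function so it's self-contained for any worker.
    import os
    import time
    start = time.time()
    time.sleep(sleep_time)
    return {'pid': os.getpid(), 'start': start, 'end': time.time()}

if __name__ == '__main__':
    t0 = time.time()
    with multiprocessing.Pool(4) as pool:
        results = pool.map(work, [0.05] * 12)

    # One row per worker process: the spans it executed, relative to t0.
    rows = {}
    for r in results:
        rows.setdefault(r['pid'], []).append((r['start'] - t0, r['end'] - t0))
    for pid, spans in sorted(rows.items()):
        print(pid, [(round(s, 2), round(e, 2)) for s, e in spans])
```

With 4 pool workers and 12 jobs, you should see roughly three spans per row, packed back to back.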
There are different ways you can schedule if you have a bunch of jobs and you don't know how long they're gonna take. You could, for example, take the total amount of work, hand a portion to each core up front, and just let them do the work. Or, if you think there's a large degree of variation, you might hand one job out at a time, and when a worker finishes, you give it another one to do; that's more of a dynamic schedule. So here's a static schedule, from code I generated on 20 cores, where everybody gets exactly, I don't know, maybe 10 jobs in this case. When there's a large variation in the amount of time each job takes to finish, a static schedule tends to be less efficient than a dynamic schedule. And here's the same workload scheduled in a dynamic way that's able to load balance a little better. So, to probably answer your question: I don't know that multiprocessing is doing a static schedule necessarily, and there are different types of dynamic schedules as well, and whichever it uses may have either fit the data better or just been more efficient. Okay, it's 12:20. I wanna show you one cool thing. In terms of parallelism, IPython Parallel is pretty amazing in the sense that it's fault tolerant and it's elastic. What I mean by that is you can lose an engine: say you're running 12 engines on Janus and for some reason your code just kills one. It doesn't stop the whole process; in fact, IPython Parallel will just say, oh, I've lost an engine, there were jobs on that engine, I'll reschedule those jobs on the resources I still have. It's also fault tolerant in the sense that you can tell IPython Parallel whether a job failed or not, and if it failed, it will retry it, a certain number of times that you specify, on a different engine.
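The static-versus-dynamic tradeoff is easy to simulate. This toy model (my own illustration, not the talk's code) deals jobs out round-robin for the static case and greedily assigns each job to the next free worker for the dynamic case, then compares the finish time of the slowest worker:

```python
import random

def makespan_static(job_times, n_workers):
    """Static schedule: deal the jobs out round-robin up front,
    then each worker runs through its fixed pile."""
    piles = [job_times[i::n_workers] for i in range(n_workers)]
    return max(sum(pile) for pile in piles)

def makespan_dynamic(job_times, n_workers):
    """Dynamic schedule: hand out one job at a time to whichever
    worker frees up first (greedy load balancing)."""
    finish = [0.0] * n_workers
    for t in job_times:
        i = finish.index(min(finish))   # next worker to become free
        finish[i] += t
    return max(finish)

# Highly variable job times are where dynamic scheduling tends to win.
random.seed(1)
jobs = [random.expovariate(1.0) for _ in range(200)]
print('static: ', round(makespan_static(jobs, 20), 2))
print('dynamic:', round(makespan_dynamic(jobs, 20), 2))
```

When all jobs take the same time, the two schedules tie; the gap opens up as the variation grows.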
So here's the same workload we just looked at, but I had each job fail, like, 3% of the time, just randomly, and then at some point after a couple of minutes I logged in and killed three running engines. You can see these big white gaps at the bottom here; those are just places where my IPython engines dropped off of my resource. And what's amazing about this is you can't do it with MPI, which is, historically, sort of the standard way to do distributed computing. With MPI, if you lose a node, your whole job quits; if a task fails, you might lose your whole job. So IPython Parallel has been really great for making things run in parallel that you can't make run with MPI: independent simulations that just sometimes fail, or sometimes run nodes out of memory. If you have that situation, you should think of IPython Parallel for some of your parallelism. All right, we have eight minutes left. I think that's probably not enough time to really get into Spark, or scikit-learn for that matter, so I might start concluding here. Okay, here's what I thought for conclusions. I hope you got a bit of a feel for what it's like to work in the notebook without necessarily typing in Python code. It's just a great interface for working with data and doing exploratory analysis, where you can see your whole flow, and it's very easy to share the results. I didn't go into how you do that, but it's trivial to create an HTML file, or share a notebook, or convert it to LaTeX. So it's a great way to work on things and to collaborate on things, and I actually like it as a development environment for certain kinds of Python work. So I hope you enjoyed that, and I hope you learned a little something about what the new landscape of Python can really do. We have data frames with pandas. We have ways to parallelize our code that are fairly straightforward if you use the map function.
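The retry behavior can be illustrated in plain Python. This `with_retries` helper is a hypothetical stand-in for the idea, not IPython Parallel's actual API (there, you configure a retries setting on the view); `flaky_work` mimics the demo's roughly-3%-random-failure workload:

```python
import random

def with_retries(fn, arg, max_retries=3):
    """Re-run a flaky task up to max_retries extra times before giving up --
    the idea behind IPython Parallel rescheduling a failed job on
    another engine, written out as a plain-Python illustration."""
    for attempt in range(max_retries + 1):
        try:
            return fn(arg)
        except Exception:
            if attempt == max_retries:
                raise

def flaky_work(x):
    # Simulate the demo's workload: fail randomly about 3% of the time.
    if random.random() < 0.03:
        raise RuntimeError('engine died')
    return x * 2

random.seed(0)
results = [with_retries(flaky_work, t) for t in range(100)]
print(len(results), 'jobs completed despite random failures')
```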
And we didn't get to Spark, but I think I've said enough that if you have some really large data sets, you might consider it. It's amazing to do iterative programming from a web browser on gigabytes of data in a reasonable amount of time. I just don't know any other language that lets you do that, other than Scala in the Scala notebook. And then, finally, my hope is that, while in just two hours it's hard to take away anything too specific, you feel inspired to maybe convert part of your workflow to Python, or maybe you saw something in Python that you enjoyed. It's a growing ecosystem, and it's just been killing it the last few years. So I hope you got to see some of the things that I've enjoyed using. I wish we'd had another hour or two; I'm sure that would have been challenging in some ways, but anyway, that's all I'm gonna talk about today. If you have any questions, you can send me an email or we can chat at lunch or something like that. So thanks for coming. Oh, for Amazon, yeah.