Good morning, everybody. It's a big honor to announce our next keynote speaker, Travis Oliphant. He's a good friend of mine. If you don't know him, he's the author of NumPy. NumPy is probably the scientific Python package — pretty much everybody who's in the scientific arena in Python uses NumPy, and he's the one who wrote it. He's also the CEO of Continuum Analytics, the company behind Anaconda and a lot of other open source packages, so he is really supporting the Python community. And he's also a founding member of NumFOCUS, the non-profit behind PyData. Speaking of PyData, there's a conference starting today, actually this afternoon with tutorials. So if you're interested in Python and PyData, the very topic Travis is going to talk about, there will be a lot of talks and tutorials in much more depth about it. You're encouraged to join the PyData conference — there are still tickets available, so please go ahead and buy tickets. And now I would like to give the floor to Travis, talking about Python and PyData — the past, present and future of this field. Please give a big hand to Travis.

Thank you very much. I think we're having a little technical difficulty here — it seems to be a common pattern. Does it show on that display? It does. Okay. Do you have a picture now? We're on VGA and it was mirrored. There we are. There we are. You can see a very messy desktop and a bouncing beach ball. Excellent. Maybe it'll come back to me soon — I don't know, it's still not alive; I don't have control of my screen, it's a bouncing beach ball. I don't know why — you were swapping things out while I was doing things, I think. Hopefully that'll work. Are we there? Okay, thank you. All right, so you all had time to catch up on your email and now you can give me your full attention, right? That's good. That last-minute email — you got a chance to send it off. I understand.
I like to check my email during talks, too. I'll try to keep your attention so you don't get too bored. I want to talk a little about me — not too much, because I mostly want to talk about the technologies I've been involved with. But just to give you a little background, since I know not everybody here is familiar with the NumPy stack and with SciPy: I'm basically a scientist by training. My roots are in satellite scatterometry — I used to measure wind speed over the ocean with satellite scatterometers. That's really what got me into large-scale data analysis: you basically had backscatter from the ocean surface coming down from big satellites. The data came on big tape drives, and we used a VAX/VMS machine. It was really awesome — it had a floating point format that was different from IEEE 754. That's where I started, and we used to make really nice pictures. I did some Perl, I did some MATLAB, and I used a lot of C in order to produce those kinds of slides. When I did my PhD program at the Mayo Clinic, I got into a different kind of wave — waves I was making inside of people. Basically, we'd wiggle people with speakers. We'd shake them, and when you shake people, waves propagate inside, and with either MRI or ultrasound you can actually see those waves propagating. From that you end up with a big inverse problem. It's that equation there, which I like to scare people with — I scared my committee with it too. It was fun. It's not that hard of an equation, just a simple linear equation; many of you here are pretty good at those. But my goal was to invert it, and to invert it I had to find the derivatives of five-dimensional data. So here I was with a very large cube of data. It was too big to fit in memory for MATLAB — the MATLAB double; there wasn't a MATLAB float at the time. And I really liked working at a high level.
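That inverse problem is exactly the kind of thing today's array libraries make pleasant. As a rough illustration — not the actual thesis code; the array shape and the use of NumPy's gradient function are my own stand-ins — here's what "find the derivatives of a big data cube, at a high level, in single precision" looks like:

```python
import numpy as np

# Hypothetical stand-in for the wave data: a small 4-D float32 volume
# (x, y, z, time) rather than the full five-dimensional cube from the talk.
rng = np.random.default_rng(0)
data = rng.standard_normal((8, 8, 8, 16)).astype(np.float32)

# np.gradient returns one central-difference derivative per axis,
# each the same shape as the input -- no pointers, no memory leaks.
derivs = np.gradient(data)

print(len(derivs), derivs[0].shape)   # one derivative array per axis
```

The point is the level of abstraction: one call replaces the hand-written C loops, and float32 halves the memory footprint compared to doubles.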
I could program in C, but when I was thinking about my data problem I didn't want to be programming in C, because I'd have to think about pointers and arithmetic and figure out where my memory leaks were. I really liked that high level. So I searched around and I found Python, and the rest is history, basically. I started to do a lot with Python. I did finish my PhD, although it was somewhat delayed. This is just to give you some context for Python's origins and where I started to use it. Guido released it — there was arguably a little before this, but I know 0.9.0 came out in February 1991. I was not a Python user then; I really came to the scene in '97. I used version 1.4 of Python — a great version, actually. In fact, if you'd like to try the 1.x series, you can make a new environment and install the 1.x version of Python just for fun, and see how it worked in a test environment. I started using it in '97, and I have that little gold highlight on 1994 — keep that in mind, because it's an important date with respect to data analysis in Python. I like to show this slide because this is really what got me going in writing C extensions for Python. If I've done anything, I've written a whole bunch of C extensions for Python, much to the chagrin of the PyPy crowd — I'm probably their number one enemy, because I've written so many C extensions that make their job harder in moving people off of the CPython platform. But I'm not the only one who does it. A lot of people write C extensions for Python, because it's so easy to and because Python is so extensible. I show this slide because of Michael Miller — he released a package called TableIO, and that's how I learned to program C extensions. I grabbed his package, studied the source, and then also read an essay by Guido about reference counting.
You've got to figure out how reference counting works for extensions. Fortunately, if you write Python with higher-level tools, you don't have to do that anymore — but if you really want to get to the root of the code, it's still down there. This opened my eyes to the power of open source. I could basically look at the code, understand it, learn something — I learned a tremendous amount by reading that code. And then I started to experiment with my own modules. In 1998 came my first extension module for Python, called NumPyIO. It's embedded in other parts of the stack these days — in fact, there are better ways to do it now — but that was my very first module, in 1998. That's when my career as a scientist pivoted — let's say pivoted, that's the right word — into tools for scientists. So in 1998 and 1999 I started to get really addicted; I got really caught by that bug. I think there's a chemical compound behind addiction to open source — I think it's somewhat related to our addiction to Facebook; I think it's connected. Back then, I started releasing wrappers in 1998: first FFTW, then the cephes special-function module, then stats, which came with help from Gary Strangman, who had put out something in 1998. Recently I went back and looked at the mailing list from 1999. It's really nice to go back and look at the history of what you said in the past and cringe a little at how naive you were — we all make mistakes — but also to see how motivated and excited you were about something new. And I was very excited about Python in 1999. I said, hey, we could use Python to build a data analysis environment. We could do all of our calculation code — all of the things I loved about using a high-level language like MATLAB — I could do that in Python.
So that year — going back through the mailing list — just about every month I was saying, hey, here's a new package I just made. And of course it wasn't very pretty, and the web page for it was very ugly. I'm still not a very good web designer; I can put up content, but not really pretty pictures. But back in the day nobody else had pretty pictures either, so it was okay — I just had a really good website. And I made a bunch of releases. Then a guy named Pearu came along and said, hey, it's really silly of you to be hand-writing Fortran wrappers to Multipack and all the libraries on Netlib — I'm going to write a tool to do that. That's when I first learned the difference between me and a real computer scientist. Me, I'm like, I'll do this manually because I just want to get it done. A real computer scientist says, we have to automate this — in fact, you know you're a real computer scientist when you spend more time automating than it would have taken to do it manually. But that wasn't the case here: f2py was a tremendous tool, and we worked together basically from the last part of June 1999 to the end of that year. So 1999 — that was Multipack, as it was called at the time. I put it out on the web and people started to download it, and I started to get to know people from all over the world. Pearu Peterson, the gentleman who wrote f2py, is from Estonia — I know he likes to water-ski in the lakes there. It was just a tremendous rush to be able to coordinate and communicate with people all over the world, see them use your stuff, contribute back, and make it better. It was an amazing thing, and it was my first taste of open source community. And I've seen that grow and grow and grow ever since. You could use Python for data analysis way back in 2000. And while I wrote NumPy, NumPy came from a history that had started in 1994.
And here in 2000, I could use it to publish pictures in my thesis. This is using a Python interface to something called DISLIN. Anybody here use DISLIN? Anybody? DISLIN is actually a pretty sophisticated tool, although a little tedious to use, but I was able to publish all the pictures in my thesis with it — I think there were 200 different images in my thesis, all done with DISLIN and Python. Then in 2001, Eric Jones contacted me, and together with Pearu Peterson we pulled things together and spent a whole bunch of time building for Windows — a whole bunch of time essentially going through building a package — and we came out with what we called the SciPy library, but it really was the SciPy distribution. I didn't realize it at the time, but that's really what it was: a collection of all these tools together with a single installer, so you could get everything up and running quickly. A lot of people contributed to getting SciPy out the door, but it was a lot of work just pulling everything together. I was not a Windows programmer at the time. I've since made my peace with the Windows platform and it's fine — there are some interesting things about it, actually; in the parallel space there are some really interesting things about it. But as I said before, NumPy really inherited a long legacy of great minds who had tried to build a numeric array object in Python. Jim Fulton — he's embarrassed when I show this, because it was just a Python matrix object — but it was real. It was part of the early discussion, and it caused Jim Hugunin, as a graduate student at MIT, to get really excited and to procrastinate his graduation in order to write Numeric, which became the foundation. That's why I came to Python — because Numeric existed. So I'm really grateful for that.
Then in 2001, some features were desired for Numeric, and Perry Greenfield and Todd Miller at the Space Telescope Science Institute — the folks who run Hubble and process the Hubble images — needed some changes. Most particularly, they needed to be able to memory-map files and support more diverse data, and so they were rewriting Numeric as numarray. At that time, it turned out I didn't have a class to teach — it was really kind of a nice confluence of events. Nobody signed up for the class that was scheduled — well, one person signed up. I guess I might have been too intimidating; it was smart, actually, not to sign up for that class. But I ended up without a class to teach, and I probably should have been publishing papers in order to keep my position as an academic professor. But at the time I saw numarray, I saw SciPy, which was built on Numeric, and I saw modules getting built for numarray — I saw this split happening in the community. It was already so nascent, and people were still struggling just to support one thing, and I felt I had to do something about it. It was just a really strong feeling: somebody has to do something about this, and I don't know who's going to, because nobody else knows the code base very well — and I had time. So I did it. I said, okay, throw caution to the wind, and dove in. What I thought would be about a three-month project turned into an 18-month project to put the first version of NumPy out there. It took a while for the community to get excited about it and for more contributors to come, but 2007 arrived and it really started to take off, and lots more people joined. There's still room for people, though, especially at the low level: the number of people who can understand the C API of Python and help maintain the NumPy code base, which is in C, is shrinking, right, and that becomes a challenge.
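The memory-mapping capability that drove numarray lives on in NumPy today as np.memmap. A small sketch — the file name and shape here are invented for illustration, standing in for a telescope-sized data dump:

```python
import os
import tempfile
import numpy as np

# Write a small binary file standing in for an image dump on disk.
path = os.path.join(tempfile.mkdtemp(), "frames.dat")
np.arange(12, dtype=np.float64).tofile(path)

# Map the file as a 3x4 array: data is paged in from disk on demand
# instead of being read fully into memory -- the point of numarray's
# memory-mapping support for datasets larger than RAM.
mm = np.memmap(path, dtype=np.float64, mode="r", shape=(3, 4))
total = float(mm.sum())   # 0 + 1 + ... + 11 = 66.0
```

The memmap behaves like any other ndarray in computations, which is what makes it practical for arrays that would never fit in memory.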
But NumPy and SciPy are both now very impressive community efforts, and I'm really grateful for that, because they wouldn't be what they are without all the people contributing and making them possible. I don't know exactly how many NumPy users there are. I estimate maybe three million, on the basis of hits to a web page and download numbers, but it's always hard to tell — nobody writes home, nobody sends a postcard, so you never really know how many users of NumPy you have. I wanted to pause a little and, maybe to motivate some of you who I know are building your own software packages and communities, talk about the things I've learned and what it takes to do this — because if Python is going to have a role in big data analytics, it's going to take the work and effort of a lot of people, just as it has to date. One of the most important things, I think, is to recognize that it is hard work initially. It's not easy, and usually it's quite lonely when you start off on a new venture. Initially nobody really believes in your idea except you — and that's really the way it should be. Others will need some proof that this is actually going to work before they dive in. When I said, hey, I'm going to merge numarray and Numeric and do it on the Numeric code base, a lot of people said, that's great — go do that, that'll be fine. It was really only when they started seeing results that they said, oh, this is actually going to work; okay, now we'll dive in — and very gratefully they did. But that's how it works. You're just going to have to dive in and do some things, maybe with a small team, maybe you and one other person. And in fact, the more complicated the thing you're doing is, the more lonely it's going to be, because fewer people will understand enough to be able to make the trade-offs and help you. So SciPy was another example.
I started SciPy, and it took a while before we got more and more help. Pearu joined basically at the end of the year. I procrastinated my PhD by at least a year to create the beginnings of SciPy — and don't tell my wife, right? She's in the room, so I shouldn't have said that. I was a poor, struggling graduate student with three kids; we were making $18,000 a year in Minnesota. And we did great, we did great — but I know she was thinking, when are you going to finish, honey, so we can actually have a job? Pearu Peterson put in tremendous work to create f2py and scipy.linalg. I was just flabbergasted by the amount of work he put in. One of my great anecdotes about Pearu: when I sent out my Multipack, he basically submitted a makefile. It's still an incomprehensible makefile — I have no idea what it says — but it was amazing. It built everything, including, I think, making the coffee in the morning. He put in a tremendous amount of work, and it takes that kind of work to get things off the ground. It really does — you can't just show up and hope that things happen; you really have to put in the work. The other thing I think is important is that you do what's right. In other words, you've got to put aside thoughts of, oh, I'm going to make a lot of money, or I'm going to be able to do something really cool. Doing what's right means you have your information, your knowledge, your semantic environment — the things you know about — and those things come together for you to feel, hey, this is what I think is right. You're the only one who has that feeling, and you need to own it and do something that's right. Timing is everything. Sometimes you're the right person for the job; sometimes it's the right time for it to happen. It's hard to know when the right time is, but when the time is right you have to have urgency. It can't wait — it won't wait. People are impatient; if things don't move in a certain direction, they
often do something else after a little bit of time. You have to strive for excellence, and it really takes work, and you have to give the best you have. It'll never be enough, but give your best anyway. I love this quote from Mother Teresa — there's actually a whole series of these aphorisms: the good you do today will often be forgotten; do good anyway. It's absolutely true, and it's essential — one of those life qualities. Another key to success is that while it's initially lonely and you're doing things on your own, you have to build a community. You have to get others involved, and that means interacting with people, and that means having your ideas put up and shot down. It basically means you'll need to help other people. Other people have different ideas than you; you have to give up some of your hard-won ideas, you have to give up some of your ego. Someone will point out how badly you suck — and you probably do, in some places. Listen to that. That's the only way to make progress: for people in the community to come together and have some empathy towards each other. It does expose you — to treat other people the right way you have to care about them, and that exposes you, because you can get hurt. But you do it anyway, and you keep moving forward, and it's a great thing. We could sit on this topic much longer — there are a lot of lessons to be learned about building a community, and I'd love to talk with some of you about your lessons in how you've built communities. Then patience, and a little bit of luck. Good things take time; it doesn't happen right away. NumPy took a long time to get real adoption. 2007 was sort of two years afterwards, and it really was 2009 before I started seeing a lot of adoption of NumPy — so, four years later. And the right factors have to come together. One of those factors, for example, in the NumPy and SciPy communities was GitHub. For a long time things just sort of crept along, and then the ease with which you could
contribute jumped, and all of a sudden we got more contributors. All right, so that's a little bit of history of what I did in the past with NumPy and SciPy. So what is this NumPy anyway? Perhaps a show of hands: who uses NumPy or knows what it is? Okay, a fair number of you — excellent. The Europeans are always smarter; I ask that question in the US and I don't quite get the same response. Maybe half of the room here — and everybody at least has seen it now; of course you were all free to answer otherwise, but that's another beauty of the European audience. What NumPy is, essentially, is an array-oriented extension: it's got an array object and fast operations on that array object. It's fairly simple at its core, actually. Here are just simple examples: you can build a two-dimensional array and do some computations on it, and you can sum along different axes; or you can have a three-dimensional array, an n-dimensional array, and operate over it quickly. By quickly we mean you hand it over to a pre-compiled loop, a pre-compiled engine, that does the computation — so that's not happening in Python. When you work with NumPy arrays you can release the GIL — you are releasing that global interpreter lock — and you don't have the same problems that happen if you're not using a NumPy array. Here's a diagram of what it looks like as a Python object. Each element of the array has to be exactly the same type — that's the one basic restriction of a NumPy array. But those bytes can be a lot of different things: a pointer to a Python object, or a structure with an int, a float, and ten bytes of string. It can be Unicode — Unicode is always UTF-32 in NumPy. And then there are these array-scalar things that will also creep up and bite you on occasion if you're trying to understand what comes out of a NumPy array. So, fairly straightforward. About three years ago now, somebody asked me to come up with a Zen of NumPy. If you don't
know it, "import this" gives you Tim Peters' aphorisms, which are very, very nice. This is definitely not at that same caliber, but it gives a little flavor of how I think about NumPy: strided is better than scattered; contiguous is better than strided; descriptive is better than imperative; array-oriented is better than object-oriented (perhaps a little debatable — it's a fun one); broadcasting is a great idea; vectorized is better than an explicit loop, unless it's too complicated — then you can use Cython or Numba; and think in higher dimensions to solve your problems. Real quick, a kind of example of array-oriented computing — I like to show this. It's a little bit of a cheesy example, but Fibonacci numbers are so common in Python that we like to show them. Here's the Python implementation — one bad one, one better one in terms of performance. Now compare the Python approach to Fibonacci with what an array-oriented guy who uses the NumPy stack does: I reach for the closed-form solution, find the roots of the discrete difference equation, and just compute the value for any n. So that's a vectorized computation: in fib2a I just generate a vector of numbers n, and then I calculate with the roots r1 and r2 — I can use the roots command to take the roots of the polynomial, calculate the powers of those roots, and subtract. Those are array expressions; I don't see any for loops there, but for loops are happening under the covers. That's the concept of array-oriented computing: gathering your data and doing high-level computations on it all at once. And if you're really clever, you understand that Fibonacci is basically the output of an unstable filter, and I can use the linear filter tool in SciPy to generate at least the first part of the Fibonacci sequence. Those in the room who have done this will also understand that I'll have overflow if I use the machine's floating point like I'm doing with SciPy and NumPy, but you can get faster performance. That's one of the benefits of
array-oriented computing: you immediately, typically, get faster performance. That's usually what people reach for, and why they reach for array-oriented computing — to get the performance they're looking for. There are other reasons to do it, however. APL was really the father of array-oriented languages; it's been around since 1964, but the hieroglyphics of APL are still being decoded — I haven't found the Rosetta Stone yet to understand what people actually said in all those wonderful array-oriented codes. Just kidding — actually, I know some people who can read APL. And there were other, more English-like versions of that same concept that brought in a lot of the same ideas, and NumPy is a descendant of APL. All right, another simple idea of array-oriented computing is to gather your data together, whereas a lot of object-oriented approaches end up scattering your memory all over the place. If you gather it all together and make objects essentially rows in a table of attributes, you can do column-oriented processing: your data is all together and your modern processors can scream through it. Array-oriented computing is perfectly suited and matched to the vector hardware of today — the multi-core, the multi-CPU — so whenever you can do that, you get the added benefit of actually being able to take advantage of hardware that's otherwise not exposed very well. I'll move on — I'm going to skip this example, just put it up briefly so you get a feel for the kind of code. This is something I did once; it was awesome. Somebody tweeted, here's a problem, and I said, oh, I think I can solve that — rather than spend time on company stuff, I had a good time just playing with this problem, and this is what I came up with to find a circle in this roughly circle-like image. So that's the kind of thing you can do. Now, NumPy has had a story in data analytics for a long time because of structured arrays. I said briefly before that every element of a NumPy array can be an
arbitrary structure — an integer, a float, whatever — and so I can think of a one-dimensional NumPy array as a table, like an Excel table, and it's a really nice mapping. However, it's an array of structures, which sometimes isn't the optimal data structure when you're trying to, say, add new columns quickly or do computations down the columns. So even though it works, it's not as flexible as we'd like, and pandas has emerged as basically the generic structure-of-arrays: it has pointers to different arrays under the covers, it sits on top of NumPy, and it provides much more user-friendly tools for people doing data analysis. Whereas in the past, when you were using NumPy to do data analysis, you might have had to write five to ten lines of code, with pandas it's one line or a method call, and it's quite a bit simpler. A lot of people have come to the Python data community because of pandas — this list basically comes from a user of pandas, "why I love pandas," and I modified it slightly. So currently, today, in big data analytics with Python, these are the basic key libraries — I might be missing a few, but these are the basic ones: NumPy, SciPy, pandas, Matplotlib, IPython... the list goes on; it becomes quite a stack. So when you're sitting there saying, I want to use Python for data analysis, you have to pull a fair bit of stuff together to make that happen. That's really why we created Anaconda and conda. You may be interested to note that whereas a lot of people have been using R for data science, Python is rapidly reaching an equal footing with R for data analytics. This is a recent survey done by O'Reilly — they surveyed the people who attended Strata in 2012 and 2013. This was a revelation to O'Reilly as well, I think, because they'd been really searching for people to write books on Python for data analysis. Wes McKinney's book was successful, and that really opened a lot of people's eyes as well — like, hey, there really is a market
here; we should publish books. I've been asked to write a lot of books — I have no time to write any books so far; I've handed that off to a few other people. Maybe we'll get a book out of Continuum. We also see articles like this. I don't want to get into language wars — I think we can actually work together with the R community — but it just goes to show that it becomes a choice: you can do pretty much everything you need to do in Python. Python is growing as the top language in schools. Many of you have seen this, but I'd like to show it — we're at a Python conference; we should celebrate the fact that Python is being used in ever more places. In US schools, at the top universities, Python is now the number one introductory language being taught. Some will say that's how languages go to die, so maybe it's not such good news: we're about to get a whole lot of people using Python, and I have no idea what they'll be doing. But I trust that our community is vibrant enough and robust enough to welcome them in, train them up — actually unlearn the things they might have learned wrong in school — and help move the community forward. So, I know I'm running out of time — I have plenty of slides, plenty of things to talk about — but I wanted to talk a little about why I think Python is fantastic for technical computing. As I said, I was a domain expert, a data scientist, coming to Python, and the reasons I had for it are the same as they are today. One: the syntax — it gets out of your way. I don't have to learn jargon and concepts; it basically leverages my English-language brain centers. Perhaps there'll be a language that leverages Mandarin brain centers in the future — I won't benefit from it, but others will; this one leverages those Latin-character centers. Whitespace — I love whitespace, the fact that it conveys intention. And I'll tell you why: my field of view is limited. I have limited horizontal and limited vertical, and I have to understand something in that limited space. So if I'm
using that space up with braces and brackets and things that are unnecessary, it's just waste for me. It's also why, if I have long, long dotted paths for things and my variable name takes up the whole screen, I'm in trouble — I'm not a big fan of that either. Complex numbers were built in early; overloadable operators were built in early. This is a mistake Java made. For the scientists: scientists need complex numbers. E to the j-omega-t is the reason electrical engineers have jobs, and then immediately there's the FFT. You've got to have one complex number type, or you'll have 200 of them and nobody will be able to agree on what it should be. Just enough language support for arrays — the brackets, the ability to have commas immediately create tuples, so you don't have to have that funny indexing. These things were added at a critical time, and I definitely have to thank the Konrad Hinsens, the Paul Duboises, the Jim Hugunins, who worked with the Python devs — Guido and others — to make sure these were added to the language at an early time. It's been fantastic. Occasional programmers can understand it. Occasional programmers are people like I was, who have to solve this five-dimensional differential equation and don't want to spend time chasing pointers, but need to be able to see code, read it, and understand it — and Haskell or Clojure is too much to put in my head and remember. So Python works perfectly in that space. And I used to say that packaging was a problem with Python. I no longer say that, because packaging is awesome with conda. Conda makes your packaging problem go away, and it's fantastic. We get that feedback from users all the time — I'm not just saying it because we put it out; I'm saying it because we get that feedback from people, and I use it and I love it. It solves exactly the problem I've seen of not being able to get everything installed easily and quickly. So, lots of great things about Python: batteries included, general purpose, supports multiple programming styles — all these
things you know. But a critical one is that it has critical mass — because you could have the ideal language, but without enough people using it you'd be stuck; you could not build a community. That's the hard thing, and it's a bit of a chaotic question when that will happen — I can't give you answers; it's one of those emergent phenomena. Now, there are things I don't like about Python, and we could all probably rag on it together, but some of these are being addressed. I would love to see anonymous blocks you could send around — really for deferred evaluation, the most common use case for me. I would love to be able to use slice syntax outside of brackets — please, just let slice syntax create a slice object; I use that a lot as an array-oriented programming guy. The CPython runtime: the GIL, global variables inside, the lack of dynamic compilation — there's some work to be done there, and that's a really hard one. I would love to see some language extensibility. I've seen a lot of very creative uses of import hooks to have kinds of DSLs imported into Python — you can extend Python as you like with the import statement. And it can be hard using a general-purpose language, because the devs of that language don't understand your use case — I'll have a story about that in a little bit, about PEP 3118. So, NumPy: I like it, it's good, it's got a lot of good things, but it's got a lot of problems too. The dtype system — the data type system, which essentially allows structured arrays — is too limiting and difficult to extend. It grew out of Numeric's data descriptor that was there at the beginning and extended it just far enough, but it needs to be overhauled. The immediate mode: creating huge temporaries all the time when you have an equation to evaluate. It's almost an in-memory database — really close; if you're using SQLite, you can also use NumPy for the same purposes — but it doesn't really evaluate some of the operations the way you would like. Lots
of unoptimized parts lots of embarrassingly unoptimized parts actually, if you start giving the code you'll say who's the idiot that wrote this hopefully the blame's not me anymore because somebody's changed the comments or something the code base is organic and hard to extend I think as I reflect on back to the history I think one of the most important pieces of work that I did in 2005-2006 was actually sit down with Guido I remember flying to San Mateo where he was working and said I'm going to have lunch with Guido and say how do we get NumPy into Python and I saw this little professor and think I can go do this, he's a very nice guy very accommodating so I went there with Paul de Wa and we talked about what can we do to get NumPy into numeric or the new numeric at the time we were calling it into Python and he cautioned about if you get into Python it won't be able to be updated very quickly there's some downsides too we definitely want the structure of the NumPy rate in Python so I spent time writing this PEP to really extend the buffer protocol now raise your hand if you know what PEP 3-1-18 is or know what the buffer protocol is I should see a fewer hands because this is one of those underbellies of Python it's kind of on the lowest level but it's really the ability for arbitrary objects to share data and that's what the buffer protocol allowed that initially but it only allowed you to share a single point of data and no metadata about the data you couldn't share that it was really an array it had this kind of data type in it so the extended buffer protocol was really all about getting more metadata around the pointer to memory that was being shared it really makes possible a heterogeneous world of powerful array like objects so it isn't really necessary for there be to only one NumPy there really could be a lot of array like objects that share memory and they could really operate independently and coexist I think adding multiple dispatch the language would 
actually improve that; better, a multiple-dispatch library would make it heaven. At that point you just need PEP 3118 plus multiple dispatch, and we don't need a single array library like NumPy. So that's how I think of the future of NumPy and the future of the world: a heterogeneous world, a world with a lot of things speaking the buffer protocol. The buffer protocol exposes this idea — someday maybe I'll give a talk about it. I don't always feel qualified to; I'm a scientist by training, a computer scientist and software developer by effort and trade, learning from other people. But there's something real here: it's the dual of encapsulation. The buffer protocol exposes the data and its types; it inverts the idea of having data with methods attached to it. The data is exposed, you describe it, and then you throw code past it. I'll talk a little bit about what I think that allows us to do in the future. So what of the future? What is Python's role? We've talked about today, and a little bit about the past; what's going to happen next? Well, I've watched the Star Trek episodes, so I think I know the future and can tell you what's going to happen. Really, I'll just describe what I'd like to see and some of the principles that I think will guide the future — we can't ever tell what will happen. One of the things we've always been faced with is the idea that data has mass. Data is growing faster than the speed of light can carry it from one point to another, and that means you're going to want to have data sit where it is. That's true whether it's in a GPU, in a memory cache, or on a cluster somewhere: you don't want to be pulling data around. What that means is that a lot of our systems built around the idea of encapsulation and serialization are actually wrong; they don't work very well when we think about how we're going to manage this. There are blogs about this as well; it's a well-known observation — data gravity, somebody even coined a term for it. You can even talk about a formula for how data attracts other data and compute. I don't know whether that's useful, but I do think it is useful to think differently, in a relativistic sense. Normally we think of ourselves on a platform while data moves past us: we serialize the data into our objects, do our little computation, and serialize it on its way. But when data has mass, that's really expensive. It turns out our computations are really simple; our machines can do the computation rapidly and wait all day long for the data to pipe through. So how do we invert our thinking and take a data-centric perspective, where the code comes to the data and flows through the data? That's one thing I think about, and part of what's in PEP 3118, that buffer protocol, has some of the answers, I think — at least some of what I've been able to see. Some of you, I'm sure, will be able to think even more deeply and better about that, and I'd love your feedback and your inspiration. I think the future of big data in Python is fundamentally going to be heterogeneous. Whereas before we've had this notion of "here's NumPy and everybody uses NumPy," a single kind of channel, NumPy is really just a description — a protocol. I loved the talk on Tuesday by Pieter Hintjens, the ZeroMQ author, who talked about the decentralized world of the future and contracts being the most important thing. PEP 3118 is an example of that kind of protocol or contract between objects. It's a beautiful thing, an important thing, and I think that's what the future is going to be like: much more of that, rather than a single library that everything sits around. And then Python's role is going to be doing what it's always done really well.
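[Editor's note: the buffer-protocol sharing described here can be sketched with nothing but the standard library. `array.array` exports its memory through PEP 3118, and a `memoryview` picks up both the pointer and the metadata — format, shape, item size — without copying; this is a minimal illustration, not NumPy's own machinery.]

```python
import array

# Producer: a block of C doubles owned by an array.array object.
data = array.array('d', [0.0, 1.0, 2.0, 3.0])

# Consumer: a zero-copy view obtained through the PEP 3118 buffer protocol.
view = memoryview(data)

# The metadata that the *original* buffer protocol could not share:
print(view.format)    # 'd'   -> element type travels with the pointer
print(view.shape)     # (4,)  -> dimensions
print(view.itemsize)  # 8     -> bytes per element

# The memory itself is shared, not copied: writes are visible to the producer.
view[0] = 42.0
print(data[0])        # 42.0
```

Any object implementing this protocol — a NumPy array, a PIL image, an `array.array` — can hand its memory to any other without serialization, which is exactly the "heterogeneous world of array-like objects" being described.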
And that's playing this tremendous glue role, this tremendous ability to stick things together very, very quickly. That's the advantage of not having static types: you can pull things together from all sorts of places in an agile, iterative fashion — fail quickly, find solutions, and move forward. So at Continuum we basically asked: given what I learned from NumPy and SciPy and from watching them get deployed, what would I do differently? Some of that is expressed in what I described before about data having mass. The projects that encompass that reality are really three: conda, Numba, and Blaze. SciPy really was a distribution, not a library, and to do a distribution you've got to have a packager. So we had to come up with a cross-platform, completely Python-independent package manager, called conda. That's why we've done that, and all of these are open source. Numba is about making code as fast as possible. Just a little blurb on conda — I wasn't going to mention it, but yesterday after my talk somebody said, you've got to talk about conda; keep talking about conda, because we love it, it's awesome. So if you're not using Anaconda or haven't tried conda: download Miniconda — you don't have to get all of Anaconda — or you can actually just do pip install conda, then conda init, if you want. And now you can do conda-based management of your packages without even downloading anything from our site. It's completely open source, completely free. We communicate with the Python Packaging Authority, with Nick and others, to try to understand how we can integrate this even better. Blaze and Numba are our two open source projects, and they have some dependencies — especially Numba, which has this LLVM dependency you want to be able to get installed. The idea is that Blaze is motivated by generalizing this PEP 3118 to all languages and data sets.
Really creating this Python glue 2.0 that glues things together in a marvelous way. Currently, when you do data processing, if you store your data in SQL, or in HDFS, or in Postgres, that defines how you query it, right? Your query becomes some convoluted version of the query language they've created for you, and that's how you have to use it. And a lot of us in data analysis love to use NumPy or Pandas expressions, because we like the way they feel: they fit our brains, they match the way we want to think about the problem. Currently, the only way to do that is to pull the data to us in order to use those expressions. Blaze is about inverting that: creating expressions that then move to the data, in multiple ways. Numba is motivated by the desire to not make people write extensions anymore, and to let them write high-level code that is — or can be — as fast as Fortran. If that exists, then array-oriented computing can be done at full speed on modern processors with very little effort. That's the goal. I'm not going to be able to cover all these slides; they'll be posted online, and you can see them. But Blaze: its goal is to deal with data pain. Its architecture is divided into an API, deferred expressions at the heart, data adapters, and compute interpreters to run on different backends. It's a flexible architecture, so you can easily add new data adapters and new compute backends. The data descriptors — the data format approach — give you a uniform, array-oriented interface to whatever data you have: to directories of CSV files, to a SQL database, to HDF5 files, to directories of JSON files.
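[Editor's note: this is not Blaze's actual adapter API — its names and classes here are invented for illustration — but it sketches the data-adapter idea just described: different storage formats exposed behind one uniform record interface, so downstream code never changes.]

```python
import csv, io, json

class CSVAdapter:
    """Expose CSV text as a list of dict records."""
    def __init__(self, text):
        self.text = text
    def records(self):
        return list(csv.DictReader(io.StringIO(self.text)))

class JSONAdapter:
    """Expose a JSON array of objects through the same interface."""
    def __init__(self, text):
        self.text = text
    def records(self):
        return json.loads(self.text)

# The same downstream code works on either source, unmodified:
def count_rows(adapter):
    return len(adapter.records())

csv_src = CSVAdapter("name,id\nalice,1\nbob,2\n")
json_src = JSONAdapter('[{"name": "alice", "id": 1}, {"name": "bob", "id": 2}]')
print(count_rows(csv_src), count_rows(json_src))  # 2 2
```

A real system would add adapters for SQL, HDF5, and so on behind the same contract — the point is that the query logic (`count_rows`) is written once.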
Then the compute side gives you a uniform interface to DyND, which is our next-generation NumPy — a C++ library with a Python interface. Pandas; even plain Python, actually: you can run a computation on just Python lists of lists, or lists of dicts, or lists of tuples, just to check that everything is working right. Spark, PyTables — we do have support for Spark. Spark is a member of the Hadoop family; it lets you run in memory on multiple machines. I've said a lot of things about Hadoop — I don't like Hadoop, normally — but Spark and Impala I'm finally warming up to. They've saved the Hadoop ecosystem, from my perspective. Blaze expressions are deferred evaluations. You create an expression, as I'll show in an example, and that builds up a DAG, a directed acyclic graph, describing the expression. The arrays in that graph can refer to various data adapters, and then the graph gets sent to a compute backend — all separated. We've separated compute from data from code, so that you can reuse those components independently and then bring them together for an actual computation. So here's a simple example of counting web links. At the heart is what was missing from PEP 3118: a really good data description language. In fact, I remember those discussions back and forth: we argued whether the specification for data would be NumPy dtypes, or ctypes, and then Guido finally said, enough, it'll just be a string, like the struct syntax. Boom. That's the data declaration language in the buffer protocol — not quite good enough. With DataShape, we've spent a lot of time trying to figure out a data description that can encompass all kinds of data, and we'd love your feedback on it. It's a separate, independent project; it can be downloaded and installed independently of anything else, and it has parsers for the data shape language, so you can interpret it in various backends.
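[Editor's note: the deferred-evaluation idea — build a DAG first, compute later against whatever backend holds the data — can be shown in a toy form. This is not the real Blaze API; `Symbol`, `Add`, and `compute` are invented names for illustration only.]

```python
# Building an expression does no work; it just records a DAG.
class Expr:
    def __add__(self, other):
        return Add(self, other)

class Symbol(Expr):
    """A leaf: a named placeholder for data supplied later."""
    def __init__(self, name):
        self.name = name

class Add(Expr):
    """An interior node: a deferred binary operation."""
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs

def compute(expr, env):
    """Walk the DAG against concrete data; a real system would
    dispatch to different backends (Pandas, Spark, SQL...) here."""
    if isinstance(expr, Symbol):
        return env[expr.name]
    if isinstance(expr, Add):
        return compute(expr.lhs, env) + compute(expr.rhs, env)
    raise TypeError(type(expr))

x, y = Symbol('x'), Symbol('y')
expr = x + y + x                 # a DAG, not a value

# The same expression runs against different data bindings:
print(compute(expr, {'x': 1, 'y': 10}))      # 12
print(compute(expr, {'x': [1], 'y': [2]}))   # [1, 2, 1]
```

The separation is the point: the expression is pure code, the `env` dictionary is pure data, and `compute` is the piece you swap per backend.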
So you construct a table symbol — in this case a simple two-column table with a name and a node ID, which have different types; string is actually Unicode. Then you have these two table objects, and here's an expression: I'm joining the two tables together as a deferred evaluation, then doing a groupby and counting. Loading the data is what distinguishes where the data comes from: there are different versions of the loading step depending on where the data lives. But when I compute, notice I just compute against a dictionary that maps the data I got back from wherever I loaded it onto the variables used in the compute expression. That's when the computation happens; that's when you've brought together the data and the code in a compute context. The loading would be different depending on whether your data is in Spark and HDFS, or in pandas on a local disk. So you write that part to load your data, but your expression is completely separate, and you can have a very complex, pandas-like expression that is then mapped over wherever your data is stored. No longer do you have to write your code differently because your data is stored differently. Our goal is to end data silos: let your data sit where it optimally fits, where you can get the most performance, without having to change your code much to use that best-performing data store. Just write it in Blaze and it'll work beautifully. So Blaze has an ecosystem around it. We've done a lot of experimentation, a lot of exploring, trying to understand what we mean by this space. Currently DyND, libdynd, and DataShape are the key pieces, along with the Blaze library itself and its different components. DataShape is that general data description language that I think was missing from PEP 3118. I'm very excited about it; I think it should have been what NumPy dtypes were. You can use it now.
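[Editor's note: the slide with the actual Blaze code isn't reproduced in this transcript. As a hedged stand-in, here is the join/groupby/count that the example describes, written eagerly in plain Python — in Blaze the same logic would be a deferred expression shipped to whichever backend holds the tables.]

```python
from collections import Counter

# Two "tables" as lists of records, roughly the two-column shape described:
# (name, node_id) on one side, (node_id, page) on the other.
visits = [("alice", 1), ("bob", 2), ("alice", 3)]
nodes  = [(1, "home"), (2, "about"), (3, "home")]

# Join on node id, then group by page and count.
pages = dict(nodes)
joined = [(name, pages[node]) for name, node in visits]
counts = Counter(page for _, page in joined)
print(counts)  # Counter({'home': 2, 'about': 1})
```

The eager version fixes where the data lives; the deferred version keeps this logic as an expression and lets the `compute` step decide.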
We use it in Blaze, and DyND uses it as its data description language. DyND is a Python wrapper around a C++ equivalent of NumPy. The nice thing about that is you can bind it to Ruby, you can bind it to JavaScript — you can bind that C++ library to whatever you like and have that multi-dimensional array concept across the board. So it can help with the gluing again, for Python glue 2.0. All right, I think I'm out of time, so I'm going to skip Numba. I've talked about Numba quite a bit, but Numba's awesome; I love it. It's still growing, it's still pre-1.0, and we still need help — we're looking for people to help us with it. I will just mention that CUDA Python works in Numba: with Numba today you can target the GPU, if you have a GPU, very, very easily. CUDA Python comes with Numba, and we're working on making interfaces to it that are more Blaze-like, to make it much easier and less CUDA-specific. All right. So: Python has a long and fruitful history in data analytics, and it will have a long and bright future with your help. Join the PyData community and help make the world a better place. I want to dedicate my talk to Amy Oliphant, my wife — I don't know if she's here, if she made it over. Thank you for all you've done; nothing I've ever done would have been possible without you. Thank you very much. Thank you very much. You're a little bit over time, but I think we have time for two questions. Please take the microphones. Thank you — is it working? Yes. Thanks for the very enlightening talk. I had a question about PyPy and NumPy. You mentioned PyPy; they are trying really hard to re-implement NumPy as NumPyPy. You mentioned that some crusty parts of NumPy weren't very optimized. Did they contribute back, so those parts could be optimized in NumPy itself? It's really hard, because the stacks are so different.
So the code they write is quite different from what you write to make NumPy work. I am really excited about our Numba array object, though — it has a lot in common with NumPyPy's, and I finally see a way to collaborate with them. I'm really excited by that, because I always love to collaborate where I can, but sometimes it's challenging. Okay, thank you. Yeah, excellent. Hello — thanks for the awesome talk and the insights about the history of NumPy and the ecosystem around it. My question is regarding packaging. I looked at the PyPI index, and NumPy still doesn't provide wheel packages for Windows, which would be great, because on Windows compiling is not as convenient as on Linux. Are there any plans to provide pre-compiled packages for Windows? I think I've heard of plans like that — yes, I think people are talking about doing that. My attention is elsewhere — I mean, conda install numpy solves the problem, so I'm less motivated myself to worry about it — but I think there are some people trying to produce wheels. Great, I'll try it out. Excellent. Thank you very much again.