Today, what I would like to talk about is a little bit of Python's journey to data science. A lot of people today think of Python as a data science language, but it hasn't always been that way. I want to talk a little bit about how we got to where we are right now, and then give a brief survey of where Python is in the data science space, so that any of you out there who are hoping to get started have a kind of roadmap of where to start.

First, a little bit about me: my name is Jake, JakeVDP. You can find me on GitHub or Twitter or Stack Overflow or other places like that. My path has been an interesting one. I started studying physics, which led me to studying astronomy, which led me to analyzing large data sets and getting into data science and Python, and eventually that led me into software engineering, because I realized the real core of data science is good software engineering. After all that, I'm now working as a software engineer at Google, and it's been a really interesting experience; if you're curious about that, I can chat with you afterwards. In the Python world, I've worked on a lot of different open source projects over the years, including SciPy, Altair, scikit-learn, and AstroPy, and I have a couple of books that some of you might have seen before: the Python Data Science Handbook, and, if you're into real astrophysical data analysis, a graduate text on statistics and data mining in astronomy. So, lots of fun stuff.

I want to start by saying, as I mentioned, that Python has come a long way to where it is now as a data science language. Python was not designed to be a data science language, and you sometimes see vestiges of this when you approach Python with data science questions. For example, say you want to visualize some data, and you go out into the world and ask various tools how to do that. If you ask someone from the R community, they'll say: you visualize data with ggplot2. Has anybody used R and ggplot2? A few people out there. It's a phenomenal tool for data analysis, data visualization, data science. By comparison, if you ask the Python community how to visualize some data, the answer is: okay, you should use Matplotlib. Oh, unless you want something interactive, then it's Altair, or you can use Plotly, or Seaborn is good for statistical visualization, or you can use Bokeh for other visualization tasks, and, oh, HoloViews is this new thing. There are all these different answers to the same question, how do you visualize data, and all these different packages and approaches people have developed. This is sometimes a little bit infuriating, especially if you look at the Zen of Python. Has anyone ever typed import this in a Python interpreter? You get this nice Zen of Python: all these rules that you can meditate on to make your Python as effective as possible, and one of them is "there should be one, and preferably only one, obvious way to do it." In the data science world, this is just not the case, right? I gave the visualization example, but the same goes for storing data: you can use NumPy, or Pandas, or xarray, or TensorFlow arrays, or PyTorch tensors. There are all these different ways to do things, and it's a little bit infuriating when you're getting started. You say, why is Python this way?
Why can't there just be one way? It comes down to the fact that Python didn't start as a data science tool; it wasn't designed as a data science tool, or even as a numerical or statistical computing tool at all. If you rewind the clock almost 30 years and go back to the beginning of Python: this is Guido van Rossum, and he created Python basically as a teaching language, and also to bridge the gap between the shell and C. If you look at what Python was in the early years, it was designed to be a kind of alternative to Bash scripting, a friendlier way to stitch together the different things you were doing on your computer. Guido gave an interview a while ago where he talked about his expectations in developing Python, and he said: I thought we'd write small Python programs, maybe 10 lines, maybe 50, maybe 500 lines, and that would be a big one. Compare that to where we are today, where the top companies around the world have millions of lines of Python driving their products, in everything from data analysis to finance and fields in between.

So how did we go from thirty years ago, thinking Python scripts would be a dozen lines at most, to where we are today, where Python is this data science powerhouse, where 1,600 people gather in Chennai, India, to talk about how they're using Python in their work? It's an amazing story, and I like to trace it back through a few different decades, organizing it into a few eras. I'm going to pick on David Beazley right now; he's one of the later keynote speakers, but I thought I'd put his picture up here.

I think of Python in the 90s as being in the scripting era. Python was this alternative to Bash: people wanted to write Bash programs to stitch together the different workflows they were using, and Python was just a nicer tool for that. If you read what Dave wrote about 19 years ago in his paper on scientific computing with Python, he said: scientists work with a wide variety of systems ranging from simulation codes to data analysis packages to databases, visualization tools, and homegrown software, each of which presents the user with a different set of interfaces and file formats; as a result, a scientist may spend considerable amounts of time simply trying to get all these components to work together in some manner. Dave and others during that time really pushed Python as a solution for gluing these different processes together. An interesting piece of code from that era is SWIG, the Simplified Wrapper and Interface Generator. What SWIG did was give you a really nice way to take your C code, or other compiled code, and generate Python interfaces so you could use it more interactively and more intuitively. This really helped pave the way for all the sorts of things that came later and all the tools that you use today.

So that was the scripting era in the 90s: Python as an alternative to Bash. The 2000s I like to think of as the SciPy era. If the SciPy era had a motto, it would be: Python is an alternative to MATLAB. MATLAB is this powerful system that a lot of scientists and engineers have used, and still use, for data analysis, data visualization, and things like that.
In the early 2000s, a lot of people were pretty jazzed about Python as a tool for gluing together their data analysis workflows, and said: hey, why can't we use Python for the kinds of things we're using MATLAB for right now? One person who was instrumental in this is John Hunter. In 2012, shortly before he passed away, he gave a keynote at SciPy where he talked about the history of the Matplotlib project, which he created. He said that before Matplotlib he had a hodgepodge of work processes: he would have Perl scripts calling C++ routines that would dump data files, which he'd load into MATLAB. After a while he got tired of MATLAB and started using gnuplot instead. He looked at all this and said: I want Python tools that can take all these different pieces of my workflow and put them into one package.

Similarly, Travis Oliphant, who co-created NumPy and SciPy with others in the community, said: prior to Python, I used Perl for a year, then MATLAB and shell scripts and Fortran and C++ libraries. When I discovered Python, I really liked the language, but it was nascent and lacked a lot of libraries, and I felt I could add value to the world by connecting low-level libraries to high-level usage in Python. So, again, the same story: he had all these different tools he was working with, and he wanted Python to glue them together and be the one workhorse that could do all these numerical things.

Similarly, Fernando Pérez, who you might know as the creator of the IPython and Jupyter projects, was a quantum physicist at the time. He remembered looking at his desk and seeing all the language books stacked there, books on C, C++, Unix utilities, Perl, IDL, Mathematica, Make, and realizing he was probably spending more time switching between languages than getting anything done. This is what drove him to create the IPython project and to start using Python to stitch these workflows together.

If you look at the software that came out of these three folks and all the people around them working on the same things, the big packages are Matplotlib around 2002, SciPy around 2000 (I'll lump NumPy in with SciPy there), and IPython around 2001. I dug through the internet to find the original logos for each of these packages; they're pretty awesome. I love the red on yellow of Matplotlib; it's a really, really good design choice. Originally, each of these packages had components that overlapped: they all had some visualization component, some computational tools, some shell tools. Eventually, as these folks started talking to each other, they realized they had overlap and decided to start working together. That brought us to our modern situation, where Matplotlib, SciPy, and IPython have distinct use cases within the scientific analysis stack, and they're all built on NumPy, the unified array library underneath them all. So from about 2000 to 2010 this stuff percolated out, and we ended up with a really nice stack where we could do in Python the kinds of things that scientists and engineers had been doing in MATLAB, and are still doing in MATLAB. That's the SciPy era.
If you look at the key conference series, the SciPy conferences were big in that era, starting as meetups for the people developing these tools, to now, where they are the core meetups for users of the scientific Python stack.

OK, so we had the scripting era, Python as an alternative to Bash, and the SciPy era, Python as an alternative to MATLAB. Where are we in the last 10 years? I like to think of the last 10 years as the PyData era. If PyData had a motto, it would be: Python is an alternative to R. R has really developed into a powerhouse of a tool for statistical analysis in particular, and for cleaning, analyzing, interpreting, and visualizing data. A lot of folks in the Python world wanted a Python alternative to R: they really appreciated the beauty of Python and its syntax, and didn't want to have to switch languages when they were doing that kind of data analysis.

I think what kicked off the PyData era in a lot of ways was Wes McKinney and his work on the Pandas package. Pandas is interesting. Wes was working at the time as a consultant in the financial sector, where a lot of people were using R and other tools of that flavor to do data analysis. Wes was, again, one of these people who wanted to use Python: he loved the Python language and wanted to be able to use it for those kinds of applications. In the intro of his book, Python for Data Analysis, he has a really nice discussion of what led him to develop Pandas as a library for Python. He said: I had a distinct set of requirements that were not well addressed by any single tool at my disposal: data structures with labeled axes, integrated time series functionality, arithmetic operations and reductions, flexible handling of missing data, and merge and other relational operations. And I wanted to do all these things in one place, preferably in a language well suited to general-purpose software development. So again, you see this theme coming through: people who have workflows involving multiple tools, saying Python is the tool I want to use for this, and figuring out how to bring those workflows into Python in a way that helps them and helps others. And so Wes created the Pandas package.

Other key software developments in the early 2010s were scikit-learn, the machine learning library that a lot of you are familiar with; conda, the package manager that came along and solved a lot of the difficulties with packaging scientific software; and the IPython notebook and later the Jupyter project, which provide a nice compute environment for working on analyses and sharing them with your coworkers and colleagues. And there was a key conference series, the PyData conferences, which have expanded into an international set of workshops and meetings. It's been really interesting to see them grow. The PyData conference is very close to my heart because it was my first Python meeting: the first talk I gave at a Python meeting was at the first PyData workshop in March of 2012. I'd never been to a Python conference before that, and because I'd been working on the scikit-learn project, they asked me to come talk about machine learning in Python, and it was the wildest thing.
I walked into this room and there were all these people I'd heard of before, whose software I'd used: Travis Oliphant was sitting over there, and Fernando Pérez, and all these people who in my mind were these gods of the Python language. And then I was sitting in the room with them, and it turned out they're just normal people like me. It was so interesting to meet them, and I've been enjoying the Python community ever since.

So we have these three eras in which Python developed from an alternative to Bash, where you could write dozen-line scripts, all the way to now, where you can use Python as the full stack for scientific data analysis and related things. If there's any theme I want you to take from this survey, it's that people want to use Python because they think it's an intuitive, beautiful language, and they look at the tasks in front of them and figure out how they can use Python for them. That's why we're here today: because of all these individuals who decided they wanted to use Python and wanted to make the ecosystem better. Python has incorporated all these lessons learned from other tools and communities, and that's what makes it the powerhouse it is today. It's important to recognize, with this perspective, that Python is not a data science language, right? But because of all these commitments over the years and all these contributions from people, it's become a general-purpose language that can do data science well.

One way I like to think about Python is as a sort of Swiss army knife: you have this one package with lots of different tools in it. But it's different from a normal Swiss army knife, because it's a Swiss army knife that anybody can contribute to. So you have lots and lots of different tools written in Python that you can use. The strength of that is huge capability; the weakness is that it's hard to know where to start. There's no top-down governing body saying this is the package you should use for visualization; you have a half dozen people solving their own visualization problems and putting tools out there for you to use. And it's really hard to know where to start when you want to do something in Python if you've never started with it.

So for the second half of the talk, with this framework of where we've come from, I want to talk about where we are right now. If you're someone looking at Python as a potential tool for data analysis or for your own work, where should you start? I want to give you a bit of a roadmap of the types of tools that are out there and what to look at as you get into this.

First, you need to install Python and the various packages you'll use with it, and I would recommend the conda installer. Python packaging has come a long way in the last 10 years, but conda was a real turning point in my experience. Prior to 2012, I remember struggling for hours and hours to get tools like NumPy and SciPy to install on my system, and I was using Linux, where it was easy. If you were using Windows, forget about it. Has anyone ever tried compiling SciPy on Windows? It's an experience I would not wish on my worst enemy.
Fortunately, the conda team basically solved this in one swoop when they created the conda package manager, and it comes in two flavors. There's Miniconda, which is basically the smallest thing you need to get started: Python plus the conda installer, from which you can start installing the packages you need. Or, if you don't want to make decisions, you can install Anaconda, which is basically Miniconda plus the entire universe of useful Python packages. You install that, it's a few gigabytes, and you're just ready to go. I usually start with Miniconda because it's more lightweight and you can build up what you need. If you go to the conda website, you'll see there are Windows, Mac, and Linux versions, 64-bit and 32-bit, Python 2 and Python 3; you can choose where you want to start. I'd say choose Python 3 at this point. You download it, install it, and you get a nice little installer in your command line. Then you have two things installed: the conda tool, which helps you install packages, and Python itself, which lets you run Python code; these are both executables installed by conda. Now if you want to start installing packages, you say conda install numpy scipy pandas matplotlib jupyter, and it will install them on your computer, as pre-compiled binaries, whether you're on Linux or Mac or Windows. It's just slick and it works. So start with conda and install those packages.

You can also do fancier things like create environments: you can have a Python 2.7 environment next to your Python 3.7 environment with separate packages. It really makes the whole thing nice. When you activate a new environment, you have a different Python version, and when you run the python command you're getting the new Python binary from that environment. If you look at my computer and list all the environments, I tend to use conda to separate all the different projects I'm working on: I have conda environments for all the major Python versions, and then development environments for the various tools I work on, say AstroPy or SciPy or scikit-learn or Vega. Keeping these things separate means that when I try something new in SciPy, I don't break the environment I need to analyze data for my other projects. It's nice to keep things separate, and conda is a good tool for that.

As a little side note, there's also a tool called pip for installing Python packages. If you're curious about the difference between conda and pip, briefly: pip installs Python packages in any environment, while conda installs any package in a conda environment. So conda is a little broader, in that it can install things like the Fortran libraries that underlie your numerical tools, while pip is specific to Python but can work anywhere. So those are the package managers for getting started with Python.

Once you have your packages installed, you need an environment to start coding in, and I often recommend people start with Jupyter or JupyterLab. You can think of JupyterLab as a front end in which you can develop your code, execute it, and share it. If you conda install jupyterlab and then run jupyter lab from the command line, you get this nice little web UI coming up.
What Jupyter gives you is a browser-based interface where you can start executing Python code. You open a new notebook, you start entering code, and the cool thing is that the outputs of the code are embedded right there in the notebook, including visualizations. You don't have a separate window popping up; you keep all your code, all your data, and all your outputs in one place. JupyterLab is a really interesting and dynamic thing. It's built on the Jupyter notebook, which has been around for about seven years now in one form or another, but JupyterLab itself is a more full-featured interactive development environment, where you can edit text files, edit notebooks, view images, and view various file types. It gives you a really flexible way to start exploring the Python ecosystem. So that's the coding environment.

Something built on top of the Jupyter system that's really interesting is a project called Binder. You can think of Binder as a way to take Jupyter notebooks and JupyterLab projects stored somewhere like GitHub, and open and view them live without having to do anything on your own computer. It's a way of turning your Git repo into something that you, or your users, or your collaborators can execute and work with interactively. For example, my Python Data Science Handbook has a Launch Binder button; all the content of the book is on GitHub in the form of Jupyter notebooks, and with Binder you can launch it and have the interactive notebooks right there, executing code, exploring, and modifying things, without running anything on your own computer. It's a really cool system developed by the Jupyter folks. I should also mention a similar project called Google Colab. This is the project I've worked on at Google over the last year, and it's basically a Jupyter notebook on top of Google's computing infrastructure, backed by Google Drive. If you use Google Docs to store your files, Colab is a way to store your Jupyter notebooks on Google Drive and execute them in the cloud, all for free. A nice thing about it is it's zero setup: you can share with people, and you get free access to hardware like GPUs and TPUs that lets you do powerful computing without investing the money and time it takes to set something like that up locally.

So now we have our computing environment. What if you want to do some numerical computation? The real powerhouse behind most of Python's machine learning and data analysis packages is NumPy, which stands for Numerical Python. You can conda install numpy, and the core of what NumPy gives you is the ndarray object, a very flexible container for your data. You define your data and you can start doing arithmetic operations on it: here, we take the contents of an x array, multiply by two and add one, and we get the y array out through element-wise arithmetic. Notice that the loops, the handling of memory, and things like that are implicit. With NumPy you can do a lot of different numerical computation; for example, linear algebra is built in.
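As a rough sketch of what that looks like (my reconstruction, not the code from the slides), assuming NumPy is installed:

```python
import numpy as np

# Element-wise arithmetic: the loops and memory handling are implicit.
x = np.array([1, 2, 3, 4])
y = x * 2 + 1
print(y)  # [3 5 7 9]

# Built-in linear algebra, e.g. a singular value decomposition in one line.
U, s, Vt = np.linalg.svd(np.random.rand(3, 3))
```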
If you want to do something like a singular value decomposition, a fundamental operation underlying a lot of machine learning algorithms, you can do that in one line in NumPy. And if you look at higher-level tools like scikit-learn, they are using NumPy's linear algebra under the hood to provide you an API for doing machine learning in a very convenient manner. You can do other things like fast Fourier transforms and a lot of the core routines that underlie data analysis.

The key to using NumPy effectively, as you start digging into this, is something known as vectorization. If you come from a language like C or Java or Fortran, you might be used to doing these kinds of things by hand: you have an array of data, you want to operate on each value, and you might be tempted to write a for loop, where you define an array for the result, loop through each value, operate on it, and assign it to the output. But if you time this in Python, for, what is this, 10 million values, it takes about six seconds, which is phenomenally slow in the age of modern CPUs. When you start looping over data and doing repeated operations, Python is really, really slow. But when you use tools like NumPy and this vectorized approach, not only do you avoid writing those for loops explicitly, you get a hundred-x speedup just by writing more intuitive code. That's because NumPy takes these operations and pushes them from the slow Python layer down into the compiled C layer under the hood. If you want to know more about this, I have a talk from PyCon 2015, and there are a lot of other interesting resources online about vectorization and using NumPy effectively.

NumPy is good for raw arrays of data, but in the real world, data has labels, and often you don't want to access your data by index, you want to access it by column name. This is where Pandas comes in. The name Pandas comes from "panel data," and it's basically an answer to R's data frame within the Python world. What Pandas gives you is a wrapper around NumPy's functionality that lets you access sections of your data by name rather than by index. You create a data frame with a column named x and a column named y, and you can start manipulating these pieces of data by name. To create a new column called x plus two y, you just do what comes naturally and write the column plus two times the other column, and Pandas creates the new column for you. Pandas' other real strength is reading serialized data files like CSV. NumPy has some tools for this, but they're not nearly as mature or user-friendly as the Pandas versions. You can read data in one line from a CSV file, a JSON file, a database, or many other data sources. And once you have the data, you can start doing fast SQL-like grouping and aggregation and the other kinds of operations that go into more complicated data analysis. So we can group by the ID column and take the sum, and we see that the sum of all the A values is four and the sum of all the B values is six. This might seem small, but having this kind of functionality in a very efficient form is huge, and there's no really easy way to do it in NumPy by itself.
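A hedged sketch of that groupby pattern (my reconstruction, not the slide code):

```python
import pandas as pd

# A tiny frame with an ID column and a value column.
df = pd.DataFrame({'id': ['A', 'B', 'A', 'B'],
                   'value': [1, 2, 3, 4]})

# SQL-like grouping and aggregation: the sum of values per ID.
print(df.groupby('id')['value'].sum())
# id
# A    4
# B    6
```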
Okay, so you have your data analysis with NumPy and Pandas, and now you want to visualize it. Matplotlib is the core visualization tool, and it's been used for almost 20 years now. To give you an idea of how pervasive Matplotlib is: in my PhD work in astronomy, basically every research paper I ever published had all Matplotlib plots; I never used any other visualization tool to create figures for my papers. Matplotlib is really powerful and lets you do a lot of amazing things; you can create pretty much any figure and any visualization with it. What it looks like is this: you have a plot function, you pass in two arrays, and it gives you the output. Here's a sine and a cosine.

But these days there are things people want to do that Matplotlib can't, and the biggest example is interactive visualization on the web. If you click on a website and see an interactive chart where you're clicking and zooming and scrolling, Matplotlib can't really do that, so a number of packages have come along to address that need. Before I get into the interactive packages, though: the other thing Matplotlib doesn't do very well is handle named data, like in Pandas. Pandas itself actually has a wrapper on Matplotlib that lets you plot data by name rather than by index. If we take a data frame and call plot.scatter, saying we want to scatter the petal length versus the petal width, we get that scatter plot. Similarly, the Seaborn package has a pairplot function and other functionality that lets you do more specialized statistical plotting in a few lines of code. Another really interesting package built on Matplotlib is plotnine. If you're used to R's ggplot, plotnine is a really nice answer to that: it basically gives you the ggplot API within Python and outputs Matplotlib plots. You can define the aesthetic, define the geometry, define the statistics and the kind of faceting you want, and if you're an R user this should look really familiar; you get the grammatical approach to plotting, but with Matplotlib. So, a lot of tools built on Matplotlib.

Then, going into the more web-interactive visualization space, there's a tool called Bokeh, whose strength is that it's really good at visualizing large data sets interactively in the browser. You can do lots of different things with it, so again, check it out. Another tool, from a startup called Plotly, is the Plotly library, which is similar in spirit: it lets you do a lot of interesting interactive and dynamic visualization from Python, and one thing Plotly does really well is animations. If you want to create animations within your browser from Python, Plotly is a good tool for you. And then there's the tool I've been working on, my main open source project for the last couple of years, called Altair. What we wanted to do with Altair is provide a grammar-based approach to plotting, similar in spirit to ggplot but based on a somewhat different grammar of graphics, one that lets you more intuitively define complicated interactive visualizations from Python. Altair lets you do a lot of interesting things like linking plots to each other, so you interact with one plot and it updates the other plots, all in a very compact, sensible API.
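A minimal hedged sketch of what an Altair chart looks like (assuming Altair is installed and you're in a front end, such as JupyterLab or Colab, that can render its Vega-Lite output):

```python
import altair as alt
import pandas as pd

data = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                     'y': [2, 1, 4, 3, 5]})

# Grammar-based plotting: pick a mark, then encode columns to visual channels.
chart = alt.Chart(data).mark_line().encode(x='x', y='y')
chart  # renders inline in the notebook
```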
If you're interested in that, I have some talks online that you can look at, or you can look at the Altair tutorial and other Altair resources online; I'm happy to chat about it with you afterwards. And if you want to make sense of it all and you don't know where to start in the visualization world, there's a cool website called PyViz, which is essentially an effort to organize all this information about Python visualization tools.

Some interesting things have happened recently in this space around dashboarding. There's a project called Panel that's come out recently, and what it lets you do is use Python to quickly define interactive dashboards, with callbacks to the Python kernel, that you can even post on websites for users to explore without having them run the code themselves. As an example: take an Altair chart, and a simple function that makes a chart given a stock symbol; we get the stock data, select that symbol, and chart it with a line mark encoding date versus price. If we plot Apple from this data, we see Apple's stock chart. What Panel lets you do is take this make-chart function, pass it to a method that creates an interactive version of it, and embed that in an output display. It gives you a little widget that lets the user choose the input value and have the result reflected right there. And the key is that this make-chart function can do anything: it could run a machine learning model or fit some sophisticated process to your data. If you look at the Panel website, they have examples of this live on the internet; you click on an example and get a version you can play with directly. It's a dozen lines of Python code, and it makes creating interactive, browser-based explorations of data an intuitive and easy thing to do. It's a cool project, and like I say, it's relatively new; I've only heard about it in the last six months or so. It's from the same people who created Anaconda and Bokeh and some of those other projects. So check it out; it's fun to look at.

Okay, so we have our computing platform in Jupyter, we have our numerical data analysis libraries, we have our visualization. What if you want to go a little bit beyond that and do some more sophisticated analysis? SciPy is the tool you want for that. As I mentioned, it's been used for about the past 20 years for scientific analysis in Python, and it has a whole host of interesting functionality. When I was developing new algorithms to analyze astrophysical data, a lot of the approaches I used came straight out of SciPy, because it has tools for sparse matrices, interpolation, numerical integration, spatial metrics, statistical functions, optimization, linear algebra, and special mathematical functions. If you're a physicist and you like your Bessel functions, you can compute your Bessel functions with SciPy, along with Fourier transforms and similar things. Basically, anything numerical you want to do in Python, you can do with SciPy. As an example, here are a few lines of code where we create an x grid and minimize the first-order Bessel function over that grid.
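Something like this hedged sketch (my reconstruction of the idea, not the exact slide code; the starting point x0 is my illustrative choice):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize, special

x = np.linspace(0, 10, 1000)

# Minimize the first-order Bessel function J1, starting near x = 5.
res = optimize.minimize(lambda v: special.jv(1, v[0]), x0=5.0)

plt.plot(x, special.jv(1, x))    # the function itself
plt.plot(res.x, res.fun, 'or')   # the minimum we found, as a red point
plt.show()
```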
We plot the result, and we plot the minimum point found by the optimizer, that red point. A few lines of code, and you're able to do things that, if you're an old-school scientific coder used to doing this by hand, would take a lot of lines of Fortran. That's really the power of Python for data analysis.

If you want to go a little bit beyond that and do machine learning, you can use the scikit-learn package. Essentially, what scikit-learn does is take NumPy and SciPy and the lower-level tools available in there and give you a high-level API for machine learning. Here's an example. We create an array of random x points and compute a function, y equals sine of x plus some noise; we plot it and get some scattered data. Now, how might you smooth this data and find the line of best fit? With scikit-learn, it's a few lines of code. We can use something like a random forest regressor; this is just one example of the many machine learning algorithms available. Once you create your model, you fit it to your x data and your y data. (The little colon-np.newaxis indexing is just to take the data from a row array and turn it into a column array, since scikit-learn expects two-dimensional feature matrices.) Then we define the grid on which we want to evaluate the fit, predict the y values on that grid, and plot the data with the fit on top of it. That's what a simple random forest model gives for this data set. And the power of scikit-learn is that the API is regular enough that if you want to use a different model, like a support vector machine regressor, you just drop in a different model and the rest of the code stays exactly the same; it's just the Python class that defines which model you're going to use. You can see that the support vector machine has different characteristics: it doesn't conform as closely to the individual data points, it does a little more smoothing. As you dig in and learn more about these machine learning models, you get an intuition for which one is better depending on what you want to do with your data and your model. So scikit-learn is a really cool package, and it was the way I got into the Python open source world: I started contributing to it about 10 years ago, in 2009, and it's been a cool thing to be involved with.
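A hedged sketch of that fit/predict pattern (my reconstruction under the assumptions just described, not the slide code):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
x = 10 * rng.rand(100)
y = np.sin(x) + 0.2 * rng.randn(100)     # noisy sine data

model = RandomForestRegressor(n_estimators=100)
model.fit(x[:, np.newaxis], y)           # [:, np.newaxis]: row array -> column array

xgrid = np.linspace(0, 10, 500)
yfit = model.predict(xgrid[:, np.newaxis])

plt.plot(x, y, 'ok')                     # the data
plt.plot(xgrid, yfit, '-r')              # the random forest fit
plt.show()
```

Swapping in, say, sklearn.svm.SVR() for the regressor leaves every other line unchanged, which is the API regularity being described here.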
So we have all this numerical analysis: SciPy, scikit-learn. We're starting to build up a really interesting tool set. What if you want to start moving to larger data sets? There's a great tool called Dask; I've already talked to a few people today who are using it in their own work. What Dask does is provide a Python API around massively parallel execution of code, and the cool thing about this API is that it lets you do things without having to worry about the details yourself. A typical data manipulation with NumPy: we have an array, we multiply it by four, we take the minimum, and we print the result. The same thing in Dask is basically the same code, except we import dask.array, create our array, multiply it by four, take the minimum, and print the result. In this case, though, the result is not a number. The result is an abstract representation of the calculation you just described, and there are ways to visualize that representation. Basically, you take your array, split it into, say, five chunks, multiply each chunk by four, take the minimum of each chunk, and then combine those chunk-wise minimums and take their minimum to get the overall minimum value. That way, instead of doing all these multiplications and minimums in one process on one core, you're splitting the work among multiple processes. This is a little silly for multiplying an array by four and taking the minimum, but you can imagine how it generalizes to more sophisticated operations. What it means is that you can call compute on this task graph and get the result. And the power of this approach is that this task graph, this abstract representation of what you're trying to do with your data, can be sent out to an AWS cluster, or to a GPU on your machine, or to a Condor cluster made of all the desktop machines in your academic department, if you have something like that. It's a really flexible way to describe the problems you're solving and to evaluate them.
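A minimal hedged sketch of that lazy, chunked pattern (assuming Dask is installed; the array size and chunk size are my illustrative choices):

```python
import dask.array as da

# Build a 10-million-element array split into five chunks.
x = da.ones(10_000_000, chunks=2_000_000)

# This only builds a task graph; no computation happens yet.
result = (x * 4).min()

# Calling compute() executes the graph, running chunk-wise work in parallel.
print(result.compute())  # 4.0
```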
Another code-optimization tool you might have come across in the Python world is Numba. This is a really cool package, and basically what it does is take Python code and compile it down into fast LLVM-generated machine code without much extra work. A typical example: everyone likes their Fibonacci numbers, so here's a loop-based Fibonacci calculator in Python. Computing a large Fibonacci number using this loop-based approach takes about two milliseconds per loop, which might sound fast but is actually slow: the equivalent in a compiled language like Fortran or C is going to be a lot faster. What Numba lets you do is add a decorator to this function. I haven't changed anything except adding the decorator, and you get something like a 500x speedup in this sort of loop-based code. What it's actually doing is taking the bytecode of the decorated function, compiling it through LLVM, and then executing the compiled code rather than the Python code, dropping in the result for you. There have been some really interesting things done with Numba to speed up more sophisticated analysis, particularly in image manipulation and things like that. So check it out if you're interested in making your Python code faster.

Another one I should mention quickly is Cython. Cython is a way to marry C and Python. If you take that same Fibonacci generator, Cython lets you compile your Python code into C rather than LLVM. If you just run the plain Python code through the Cython compiler, you get about a 10% speedup, which isn't much. But where you can really start taking advantage of Cython is when you combine the idioms of Python and C in one script: if you declare typed variables inside your Cython code, you get the same 500x speedup as Numba, just by compiling into optimized code. I often say that Cython is my favorite language to program in, because it has all the beauty of Python and all the expressiveness of C, and it lets you do some really interesting things. Cython is a real powerhouse behind the tools you're using in the Python data science world: NumPy, SciPy, Pandas, scikit-learn, AstroPy, and a lot more use Cython at their core to make the tools you use fast.

There's so much more out there that I haven't talked about; I just wanted to give you a brief survey of the core tools you should get familiar with if you're jumping into Python data science. If you look at the tools built on top of these, there's a ton, from all different domains and all different fields: whether you're working in neuroimaging or astronomy or biology or anywhere in between, people are out there writing Python code and making Python tools for the workflows they have. And as I finish, I want to remind you that you should not think of Python as a data science language, because that's not what it was created to be. It was created to be a language that's easy to use, a language that's expressive, a language that's beautiful and fun to code in. And this is probably its greatest strength: because Python is such a great language to use, people invest time into making it work for their own use cases. Python is successful because people just like you used it to solve their problems and shared their solutions with the world.

This is what I want to leave you with. The 90s, I said, were Python's scripting era; the 2000s were the SciPy era; the 2010s have been the PyData era. We're coming into another decade, and where Python goes from here is up to the people who use it, up to the people who build the tools that other people use. So it's up to you: any of you can decide and influence where Python goes as we move forward into this next decade, and I, for one, am interested to see what comes out and where it all goes. So thanks very much. Here's my information, and I'd love to take some questions.

So, you talked about the next generation, right? Where do you think Python will be? Is it HPC, high-performance computing, or making Python faster?

Yeah, I don't know. It's hard to say where Python will be in 10 years. I don't know if anyone would have predicted, 10 years ago, where we are now, with Python as the core tool for deep learning and things like that. What I really hope is that Python can become more of a tool for the web. JavaScript is taking over because everybody has a JavaScript interpreter on their phone, or their computer, or even on their watch. I'd love for Python to make headway into the world of the web and the browser. So I don't know if I'm predicting that will happen, but I hope it does.

Still, we see Python in the field of web development generally, but we have had Fortran for the last 20 years, and we don't see Python there in HPC.

I didn't catch that, I'm sorry. Okay, next question over here.

Hello. Thanks a lot for your talk, Jake. It was really nice to learn the roadmap and how we can proceed.
I am intrigued by the tool that you in particular work on, Altair, which you were talking about. From what I understood, it's a kind of visualization tool. My question is: while you are working on it and designing it, are you also looking at it from a universal design principle? For example, as a visually impaired person, I cannot access these graphs just by looking at the screen. Are we also trying to build that kind of functionality into these applications? We have some insight into this in our talk at around 11, but I would like to know your views on that.

Yeah, thank you for that question. That's a really good point: we need to figure out how to build broad accessibility into our tools and into our analyses. One thing I will say about the Altair project is that we've thought a bit about this, and one of the strengths there is that any plot generated with Altair also carries a full description of the plot and all the data embedded with it. There's some work on the Vega-Lite library, which underlies Altair, to add more accessibility features to the Vega-Lite chart display, and I'm excited about that, because it means the visualization is no longer just a grid of pixels on a screen: it's the data set plus a full description of it in a grammar of visualization, and that offers a lot of opportunities to make these charts more accessible. It's a good reminder, though. Thanks for the question.

Hi, Jake, it was a wonderful talk. I wanted to ask about Numba. Why can't we just always use Numba? What are the best practices for using it, and in what situations won't Numba be able to give a major speedup?

Yeah, I'd say for Numba and Cython both, the time you want to use them is when you find yourself writing big loops in Python, looping over data. If there are things you can't express in terms of NumPy's vectorization primitives, then Numba and Cython are the way to go. Essentially, for loops in Python are slow, and these give you a way to speed them up.

I think this is going to be the last question.

Yeah, so, like you said, Python kept evolving with data science. I'd actually like to ask about quantum computation, which I'm working on currently, with Cirq and Qiskit and everything coming onto the scene. How do you think Python is going to evolve with those kinds of future technologies?

Yeah, quantum computation, that's a good question. Looking at the past and how we've gotten here, I imagine that people will have ways to call their quantum routines, or more sophisticated routines, and Python will start as the glue to call those: rather than writing a Bash script, you'll write a Python script. Then maybe after a while someone will come up with a higher-level API that can wrap that, and eventually you'll be writing your quantum computation scripts in Python, built on QuantPy or whatever package you end up writing for all of us to use. So thanks very much. I'll be around today and tomorrow, and I'm happy to chat and answer any other questions you have. I appreciate your time. Thanks.