Thank you, everybody. I'm Alessandro, and for many years I've been the maintainer of the TurboGears2 open source web framework. I'm the author of DukPy, which is a JavaScript interpreter for Python, and of Depot, which is a framework to manage file storage in web applications, and more recently I've been a contributor to the Apache Arrow project. Currently I work as a director of engineering at Voltron Data, and I'm the author of the "Modern Python Standard Library Cookbook" and "Crafting Test-Driven Software with Python" books. Today I'm here because I wanted to talk to you about Apache Arrow. Apache Arrow is an incredibly interesting project that is trying to shape the future of data analytics. Apache Arrow is a data interchange standard, so you can just go and read the specification that describes the format and use it in any application where you want an efficient and fast way to store data in memory or on disk. That format can be used to keep data in memory, send it over the network, or save it on disk or on any storage you have. Apache Arrow is not just a format, though: it's also an implementation of that format. It provides an I/O library to read various data files into the Arrow in-memory format. It provides a vector computation library to run operations over the data you have stored in the Arrow format. It gives you a data-frame-like library. It gives you a query engine, which is named Acero inside the Apache Arrow project. It gives you a way to manage partitioned data, if you have a dataset spread across multiple files, maybe on a network file system or even locally. And it does all that while giving you the chance to use whatever language or technology you want. So if you look at the schema shown here, the end goal of Apache Arrow is to allow you to write your code in any language you prefer. Whether you are a Java user, a Python user or a C++ user doesn't matter: you always work with the Arrow library implemented for that language and environment, the Arrow library works with the in-memory format defined by the Arrow specification, and by using that in-memory format you get access to the various storage and network services that speak the same format. So the idea is to create a lingua franca, I would say, that every data analytics piece of software can share, so that they can talk to each other without incurring additional translation costs. For this reason, given that it tries to implement a lot of very different features and capabilities, the Apache Arrow project is actually huge, and it can be scary the first time you approach it, because there are so many things inside. For new users it's frequently hard to understand where to start from and what the purpose of the Apache Arrow library itself is. The end goal of the project is to allow you to write your code anywhere, on your laptop or on your tablet, it doesn't matter, and deploy it anywhere you want: run it on your own PC, run it on a distributed production environment, and as they all speak Arrow, your code won't have to be changed and will just work flawlessly across all those environments. We are here specifically to talk about PyArrow, which is the implementation of Arrow for the Python ecosystem. It provides, I would say, 90% of what's available in Arrow itself. PyArrow is a fairly complete implementation of Arrow, probably one of the most complete implementations of Arrow.
For that reason, you can freely use it and be confident that anything documented in Arrow will be available in PyArrow too. Originally, as I was mentioning, Arrow was born as a columnar data format. So the fundamental entity in Arrow is a column of data, which in most bindings is exposed by the array class; specifically for PyArrow, it's the PyArrow Array object. If you have ever used NumPy, which I guess you probably have, PyArrow at this level is not much different from a single-dimension NumPy array. Actually, as we will see, it's possible to convert a PyArrow array to a NumPy array and vice versa without incurring any additional conversion cost. To give you some idea of what might be different between PyArrow arrays and NumPy arrays: they are very similar from the point of view of usage, but they are very different from the point of view of internals and capabilities. For example, while in both NumPy and PyArrow I'm obviously able to just create an array of integers, in the case of PyArrow I can also store complex data structures inside my array. For example, in the case where I'm storing two dictionaries inside the array, I'm not going to have references to the Python objects inside the array; I'm going to have the actual data inside the array itself, so that when I access any entry in the array I don't have to incur the cost of looking up the Python object and working at the Python level, but can directly access the raw data inside the PyArrow array without any additional overhead that Python might introduce. Also, PyArrow arrays are masked natively, so you don't have to keep a separate mask and a separate array entity, while in NumPy, for example, people usually use masked arrays, which in practice are two arrays joined together, one for the mask and one for the data itself. Also, the way that PyArrow deals with strings is much more effective than the way that NumPy deals with strings, because in the case of NumPy strings are still Python objects, so every entry in your array will be a reference to a Python object, and if you need to access it or run any computation over it, you won't be able to leverage any particular performance improvement that your CPU might provide, because you will have to work at the Python level. Instead, PyArrow arrays are much more optimized for in-memory storage and for CPU or GPU computation, especially vectorized computation.
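To make that concrete, here is a minimal sketch of creating PyArrow arrays; the values are invented for illustration, and the only assumption is that the pyarrow package is installed:

    import pyarrow as pa

    # A plain array of integers; the missing entry is tracked by the validity bitmap
    numbers = pa.array([1, 2, None, 4])

    # An array of nested data: the dictionaries are stored as raw struct data,
    # not as references to Python objects
    records = pa.array([{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}])

    print(numbers)       # prints the values, with "null" for the missing entry
    print(records.type)  # a struct type inferred from the dictionaries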
In the case of a standard array, like the one on this slide, you can see the memory format of a PyArrow array that contains unsigned integers. What gets stored by PyArrow is a buffer with all the data (one, two, three, four, five, six, seven, and so on) and a validity bitmap, the mask that tells whether each value is null or not. So you are not forced to use only None as a null value: you can have any value that might represent missing data, because if you think about something like a survey, there might be many different reasons why a data point is missing. It might be because the person never answered that question, or, for example, because they didn't want to answer that question; those might be two different values, but both of them will still count as null values. And as I was mentioning, the in-memory format for strings is far more optimized compared to the one that NumPy uses. The main difference is probably that PyArrow keeps a single contiguous buffer for strings, so if you need to do any lookup, transform the data, or run any operation on all the entries in the array, you don't have to jump across multiple different Python objects resolving references; you can just scan through the array and perform the computation you wanted, also leveraging vectorized optimizations, SIMD, and things like that. Obviously in PyArrow we are not just restricted to representing data. As I mentioned before, there is a component in PyArrow named Acero, which is a query and computation engine that provides kernels for many functions, allowing you to perform transformations, lookups, or the various kinds of operations you might be used to having, on top of the Arrow objects. So in this case, for example, I could take a PyArrow array made just of numbers, because it was the simplest one to show, and multiply all the entries in the array by two. In the future we are thinking of introducing a syntax similar to the one in NumPy, so using the multiply operator instead of explicitly calling the computation function, but for the moment those operations are available through an explicit function call, in this case pyarrow.compute.multiply. If you notice, in most cases, unless you explicitly ask for it, the data in the PyArrow array is not directly available as a list; it's an object with its own internal representation. This is because underneath, PyArrow is implemented in a fully native code base, so we are not dealing with lists or Python objects, and so the standard way entities get printed is without their content, but obviously you can always ask for their content by calling the to_string method at any time. For example, I might want to count the frequency of the values in the array, and that would be another compute operation, or I might want to know the minimum and maximum value inside an array, or anything like that. So if you are interested in all the features that the compute engine can provide, you can find them listed in the documentation under the compute page.
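As a hedged sketch of the compute calls just described (the input values here are made up for illustration):

    import pyarrow as pa
    import pyarrow.compute as pc

    arr = pa.array([4, 1, 4, 7, 4])

    # Multiply every entry by two: an explicit function call, not an operator
    doubled = pc.multiply(arr, 2)

    # Count how often each distinct value appears
    counts = pc.value_counts(arr)

    # Minimum and maximum of the array in a single pass
    extremes = pc.min_max(arr)

    print(doubled.to_pylist())   # [8, 2, 8, 14, 8]
    print(counts.to_string())
    print(extremes)              # a struct scalar with "min" and "max" fields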
Obviously, I told you that Arrow was born as a columnar format, and that implies that if I have columns I need a table where I can put those columns, so PyArrow also provides the Table object, which is a bit like a pandas DataFrame, I would say, although the comparison only goes so far, because tables are both more lightweight and don't provide the same rich set of capabilities out of the box. They are optimized for storing data: columns are stored by reference, so if you want to remove or add a column there is no cost in changing the data, you can append columns without any cost, and you can append rows without any cost of copying the data, because the data is stored in chunked arrays. If you want to append additional rows to a table, you just append more chunks to the chunked arrays, and the existing rows don't have to be reallocated, touched or modified at all. That is much more performant compared to what pandas does, which usually stores big multi-dimensional NumPy arrays for the various columns, so that if you have to extend the data in those arrays it usually involves copying all the data into a bigger memory pool. Tables provide many of the operations you would expect to be able to perform on them: for example, you can obviously take any row that you want, you can access the columns that you want, and you can add or remove columns. In practice, tables are just a set of PyArrow arrays paired with a schema, so if you look at table.schema you will see the names of the columns, obviously, and the type of each column. Knowing the types beforehand is what allows Arrow to store the data in a format perfectly optimized for each specific type: for example, an array of integers and an array of bytes are represented in memory in different ways, so that it's faster to perform compute operations on either of them. And you can easily create PyArrow tables from Python objects, pandas objects, or whatever is convenient for you, just by passing them to the table factory; in this case I created one from a list of arrays paired with names, but I could even use a dictionary where the keys are the names of the columns and the values are the arrays. Through the Acero compute engine we also have access to the most common transformation and analytics functions, like joining, filtering or aggregating the data in tables. So for example, here you can see some very simple analytics capabilities that I'm getting from Acero and applying to tables. As usual, the pyarrow.compute module is what gives you access to the compute engine, and I might have a table where I want to filter the data, say look up all the values that are equal to four, or I could come up with a filter of any complexity I want; I could perform an aggregation on top of the data, say give me the sum of all the values grouped by the keys; or I can take two tables and join them using a left join, and get back a new table with the data of both tables that I joined.
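A minimal sketch of those table operations, with invented column names and values:

    import pyarrow as pa
    import pyarrow.compute as pc

    table = pa.table({
        "keys": ["a", "a", "b", "b"],
        "values": [1, 4, 4, 10],
    })

    # Filter: keep only the rows where "values" equals four
    filtered = table.filter(pc.equal(table["values"], 4))

    # Aggregate: sum of "values" grouped by "keys"
    summed = table.group_by("keys").aggregate([("values", "sum")])

    # Join with another table on the "keys" column, using a left outer join
    labels = pa.table({"keys": ["a", "b"], "label": ["first", "second"]})
    joined = table.join(labels, keys="keys", join_type="left outer")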
The interesting feature of PyArrow, I think, is that if you are already using pandas or NumPy in your production environment, you don't have to start by replacing everything with PyArrow. The PyArrow library provides a zero-copy, no-marshalling-cost capability for transforming data from the supported sources into the PyArrow format and vice versa. So for example, if you look at the schema that I placed on the slide, you will see that one of the supported formats is pandas, and that means that I can use the to_pandas method of arrays and tables to get back a pandas object from them. In practice, the direction Arrow is trying to go is that, instead of having copy-and-convert operations for each one of the formats out there, for each one of the frameworks and libraries available in data science, it provides support for a native representation of Arrow data in all those libraries, and allows you to get data in and out of all those libraries, frameworks and storage systems without ever incurring any cost of transformation, conversion or marshalling of the data. It's not always possible, so there might be cases where you have to incur a conversion cost, but Arrow will be explicit when that happens. By default it will try to do a zero-copy conversion, and if the conversion requires a copy it will give you an error and ask you to explicitly set a value to say "yes, you can copy the data", so that it doesn't happen by chance and you don't suddenly discover six months later in production that you had a performance bottleneck you weren't even aware of. And it's fast, very fast compared to what you might be used to leveraging every day. For example, here you can find a simple case where I create an array of numbers from 0 to 5, and I then store that Python list both into a NumPy array and into a PyArrow array, so the same data created from the same exact original Python object. In one case I count all the unique values in the array and get back which are the values and what the frequency of each value is using NumPy, and that took one and a half seconds; in the second case I'm doing the same exact operation, so give me back the values and their frequencies, but using PyArrow, and in that case it only took 0.37 seconds. So you can see that the difference is significant, and the performance improvement from using PyArrow can be substantial, not only if you use PyArrow directly but even if you are pairing it with something else like pandas. So for example, if I create a pandas DataFrame from a big Python object, it might take, in this example, 82 milliseconds; but if I do the same by creating a PyArrow table and then converting that table to a DataFrame, it's actually faster. So even though PyArrow had to perform two operations in this case, it ended up being faster than pandas itself. One use case for PyArrow, for example, might be reading data: suppose you are facing a format that pandas does not yet support and Arrow does; you might use PyArrow to read the data into a PyArrow table and then convert that table to pandas. That would be a perfectly effective and well-performing solution to your problem, one that allows you to slowly introduce PyArrow into your code base without having to replace everything you are using at the moment. That's possible thanks to the fact that you can always convert back and forth between pandas and NumPy without incurring a performance cost in doing the conversion.
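To sketch what those back-and-forth conversions look like in code (the DataFrame contents here are invented, and whether a given conversion is truly zero-copy depends on the types involved):

    import numpy as np
    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})

    # pandas -> Arrow and back again
    table = pa.Table.from_pandas(df)
    df_again = table.to_pandas()

    # NumPy -> Arrow and back; zero_copy_only raises an error if a copy would be required
    arr = pa.array(np.arange(1_000_000))
    np_again = arr.to_numpy(zero_copy_only=True)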
So for example, suppose I want to read a CSV file, a fairly big one. You will see that pandas has an engine option, and one possibility is to use the default one, the engine that pandas provides internally; in that case, reading that CSV file took 13 seconds, which is a fairly big amount of time, something you will actually have to sit and wait for. But if you tell pandas to use PyArrow as the engine to read the data, it only takes 2.7 seconds, so it is far faster. And you'll see that this is an example of what Arrow is trying to do: make itself something that is natively supported in the frameworks out there. For example, pandas is able to leverage PyArrow for storing strings, because it's more effective and faster than the native format that pandas uses; it's able to leverage PyArrow to load data, because it's faster than the standard implementation that pandas provides; and there is support in GeoPandas if you need to work with geographic data, and all those kinds of things. The ecosystem of frameworks, libraries and tools that support Arrow is constantly growing every day, and the PyArrow project has proved to be a successful one: if you look at the history of the downloads over the course of time, it's becoming a key player in the data science and data analytics world. Actually, some of the libraries or tools that you use every day might be using Arrow inside and you might not even know, because they just rely on it to perform whatever operations they need.
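Coming back to the read_csv comparison from a moment ago, this is roughly what the engine switch looks like; the file name is made up, and the timings will of course depend on your data and machine:

    import pandas as pd

    # Default parser shipped with pandas
    df_default = pd.read_csv("trips.csv")

    # Same file, parsed through PyArrow instead
    df_arrow = pd.read_csv("trips.csv", engine="pyarrow")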
Okay, going further, let me introduce one very cool feature that PyArrow provides. We all know that everything is fine, great, simple and straightforward as long as you are working with a CSV file on your laptop, but what happens when you need to work with real-world data, which is big, scattered across multiple files, on a network file system, somewhere out in the cloud, or those kinds of things? Usually, things get much more complex there, because you suddenly have to work with something that orchestrates your data, you suddenly have to work with a different set of APIs, because you cannot really create a plain NumPy array out of a file on S3 or something like that; you would have to manually do all the operations necessary to make that happen. PyArrow takes that additional step forward and gives you the Dataset API. The Dataset API is an abstraction that works on top of all the other abstractions, in practice on top of tables in this case, and presents itself to you as a single big table. Well, it doesn't really use a table, it uses a Dataset class, but you can imagine it as a single huge table made of all the data you had scattered across all the files, on all the file systems where you are storing them. And obviously it provides lazy access to that data, so you don't have to load 40 gigabytes of data in memory before you can start working with it, and it provides native support for the Acero compute engine, so most of the operations you were able to run on a table you will be able to run on a dataset. For example, you can join datasets if you're looking for even bigger data, you can filter datasets, you can project datasets; in practice, you can do any operation you would expect to be able to do on top of a standard table of data. So for example, here is a schema that should let you easily understand what the Dataset API is about. In practice, your data might be stored in different file formats: it might be in Parquet, it might be in CSV, it might be in JSON, it might be in ORC, it might be in Feather, it might be in any format that Arrow supports, and it might be saved on your local file system, on Azure, on S3, on HDFS, or on any file system that Arrow supports. And that won't matter: you just create a dataset, point it at the location of your data, and your data will be available to you. You don't even have to care too much about how to load it, which files are available, how they are partitioned, or any information like that, and then you will be able to use your dataset to run operations through the query engine or through the data-frame-like API that datasets provide. For example, here I'm getting the pickups and dropoffs of taxi trips in New York City from the S3 file system, and that data is partitioned by year and month, because it's huge: you can imagine that during a single day there are tons of taxi trips in New York City, so even storing a single day of data already makes for a pretty big CSV file, and by partitioning it into year and month I can store many smaller files on the file system. All I have to do is take my dataset and point it to the location where the data is available, in this case the S3 bucket, and tell it how the data is partitioned: tell it that the first partition key is the year and the second partition key is the month. And that's it, that's all I needed to have my data available to work with. So for example, if I ask for the count of all the entries that are available there, I get back a count that includes the data from all the files, not just the first one of them, all the entries stored in all the files that I partitioned, and in this case I had the data partitioned across more than a hundred different files. I can also peek at the data by looking at the first five entries, and so on. So it's not much different from a plain, normal table, with the difference that you are working with hundreds of files stored on a remote file system.
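A hedged sketch of what that looks like in code; the bucket path, the Parquet format and the year/month partition keys are illustrative stand-ins, not the exact ones from the talk, and reading from S3 assumes the usual credentials are available:

    import pyarrow.dataset as ds

    # Nothing is loaded eagerly: the dataset just discovers the partitioned files
    dataset = ds.dataset(
        "s3://my-bucket/nyc-taxi/",
        format="parquet",
        partitioning=["year", "month"],
    )

    print(dataset.count_rows())   # counts rows across every file in the dataset
    print(dataset.head(5))        # peek at the first five rows

    # Materializes only the partitions that match the filter
    table = dataset.to_table(filter=ds.field("year") == 2019)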
So, we saw how Arrow provides capabilities for performing analytics using arrays, tables and datasets, and how the Acero compute engine allows you to run operations on those entities. But there is much more in the Arrow ecosystem: you have many tools that support Arrow natively and that will allow you to create very complex applications with minimal effort. For example, we have Arrow Flight, which is a protocol for exchanging data between networked, distributed systems: if the client and the server both talk Flight, they can pass the data around without incurring any marshalling and unmarshalling cost, which makes your data transfers very fast compared to what is usually the standard. You can use Arrow to read many different formats, and new formats are added and supported by Arrow every month, so it's a constantly growing library of readers and writers for the most commonly used formats in data science. You can leverage ADBC, a new piece that was added to the Arrow ecosystem, which allows you to directly query database management systems without incurring any marshalling and unmarshalling cost, or even the cost of advancing cursors over the rows when you read the data, so it's a much more optimized protocol compared to ODBC or JDBC. In practice, if you embrace Arrow, you have access to a whole ecosystem of libraries, tools and frameworks that speak Arrow. This is a quick list of the most common things that came to my mind that currently support Arrow natively: you can get Arrow data out of Parquet files, out of a Cassandra database, out of HBase, out of Spark; Spark is able to use Arrow to exchange data across the nodes and to store the data on your storage. You can use Impala, you can use DuckDB to query the data by relying on Arrow, or you can use Ibis, which we had a talk on, I think yesterday morning; it's a great project, you should look at it if you don't already know it, and Ibis natively supports Arrow and the Acero compute engine. So that's it. I tried to give you an overall idea of what Arrow is about and what it can do for you, especially through the PyArrow library, which is the Python access to the Arrow world. But if you want to know more, you can go and have a look at the documentation, especially the getting started section, which will give you a quick introduction to PyArrow, and you can look at the PyArrow Cookbook, which is a fairly complete set of examples that you can just copy and paste into your code to do the operations you are looking for. If you have any questions, I think we have 10 more minutes available for them, and I'm here. Well, that was a great talk, and if anyone has questions, please come up to the mic here. Hi Alessandro, great talk. I'm a contributor to Polars, which is an Arrow-based data frame library, and one of the really common questions we get is: historically you've had the same in-memory representation for both your data frame library and scikit-learn, your basic machine learning library, and now you've got Arrow for your data frame library. Do you expect that machine learning libraries will also move to Arrow, or do you think that having just something like a NumPy array will be the best way forward for those libraries? That's an interesting question. There is some work on supporting Arrow in machine learning libraries at the moment; for example, I know that there is work going on around PyTorch and Arrow. I think there is not a single answer: it's not something like a project that is moved by a single entity, it's a joint effort across many different open source maintainers, companies, and things like that, so it's really more a matter of getting together on the mailing list or somewhere like that, making a proposal, and discussing it with everyone else. So it's very hard for me to predict the future. I will say it makes sense if we are able to speak the same language, but there might be reasons I'm not aware of, because I'm not a machine learning expert, that make it harder for machine learning libraries to leverage Arrow, and in that case what I would like to see is for people to raise those problems so that we can extend the Arrow format. For example, recently it came to our attention that Arrow is missing some data types that are available in Velox, which is an open source project built by Meta, and what we are probably going to do is just extend the Arrow standard, the Arrow specification, by adding support for those types, so that people at Meta can start leveraging Arrow for Velox. Yeah, briefly, the best kind of response I've heard is that machine learning libraries are really about doing linear algebra, and that's a different kind of use case than data frames, so you might just see these two kinds of representations needed to get the best of both worlds. Yeah, thanks. Kind of a similar, related question: what are the roadblocks for teams that you've seen moving over to PyArrow, if they're using Spark nowadays, or if they're using pandas? Is there anything that people commonly come across?
That's an interesting question. I think that, first of all, you are not forced to move to PyArrow, in the sense that many libraries and frameworks are moving to PyArrow themselves, so you don't have to do that work. But if you want to start leveraging PyArrow directly and benefit from its performance and capabilities, I would personally start with some specific functions or features where you see that PyArrow is immediately providing value. In those cases you might be able to just convert the data between PyArrow and your system, and might be able to do that without incurring the conversion cost, so your code can freely leverage both libraries and both systems. Maybe you have pandas as your overall library, but there are some computations that you implement in PyArrow because they are faster; that's something perfectly doable, and I think it's a wise step forward in the direction of slowly rolling out PyArrow into an existing project. Good, yeah, sounds exciting, look forward to it, thanks. Thank you. Yeah, thanks for the talk. I love PyArrow and I use it with PySpark, and they have these pandas UDFs that interoperate, but is it a missed opportunity that they expose it as pandas directly, because in essence it's a PyArrow backend that they use? Actually, another question is whether, in that sense, PyArrow has the goal to also provide this: I mean, you have this compute engine, do you have the idea of having, like pandas has, a layer on top of your data representation, like apply, a user-defined-function kind of abstraction, the one we now exploit in those pandas UDFs to apply our own functionality on the rows, let's say? That's actually a great question, and it's something that I forgot to mention in the presentation. Arrow actually does have support for user-defined functions. I will say they still have a long way to go in terms of being what we are looking for, but if at the moment you are concerned with using them as you would use pandas, so implement a Python function and invoke it from the compute engine, that's something that is already doable. It's not as immediately straightforward as it would be with pandas, where you just invoke apply and pass the function; in the case of PyArrow you actually have to register the function with a name, with a signature, and then the callable that actually implements the function. But, I mean, it's still one line of code, just three arguments instead of one.
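As a hedged sketch of what that registration looks like in recent PyArrow versions (the function name and types here are invented, and the exact argument list has varied across releases, so it may not match the "three arguments" mentioned above exactly):

    import pyarrow as pa
    import pyarrow.compute as pc

    # The user-defined function receives a context object plus one Arrow array per declared input
    def double_values(ctx, values):
        return pc.multiply(values, 2)

    # Register it with a name, some documentation, an input signature and an output type
    pc.register_scalar_function(
        double_values,
        "double_values",
        {"summary": "double every value", "description": "Multiplies each input value by two."},
        {"values": pa.int64()},
        pa.int64(),
    )

    # Once registered, the function can be invoked by name through the compute engine
    result = pc.call_function("double_values", [pa.array([1, 2, 3])])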
Yeah, but it might change the way you present these features, because right now you really show it as a sort of interoperability library, while it might actually be a unified compute engine, right? That's where I was going, in the sense that the reason I said we're not yet at the state where we would like UDFs to be is that there is a lot of work we still have to do to make it possible to ship user-defined functions around. For example, suppose that you implemented the user-defined function on your machine, but you are going to run the query on an external cluster out there: that cluster will need a way to get your function implementation onto the cluster and run it, and that's something that at the moment we are not yet ready for, but it's something we are working towards, and specifically we are working in close contact with the Substrait specification and project. Substrait, if you don't know about it, is in practice something similar to what Arrow does for data representation, but Substrait tries to do it for query representation: something like a standardization of SQL that is not human readable, it's byte based, so it's something for exchanging queries across computation nodes, and it's trying to become the de facto standard for exchanging queries between clients and servers, servers and servers, and things like that. We are trying to work on having a way to exchange user-defined functions as part of the query that invokes them. Thanks. It seems like it's really cool, but my question would be: when should I use PyArrow and when should I still use pandas? My answer would probably be: when you face a bottleneck. The main reason people approach PyArrow is because they find that pandas might be too slow for what they are trying to do, and maybe optimizing pandas to become faster would require an effort that they don't want to invest, because it might be days, weeks or even more, maybe rewriting that computation directly in Cython or even C. By using PyArrow you might be able to just replace one or two, or I don't know, three, four, five or however many lines of code with the equivalent in PyArrow, and get the same exact result five or ten times faster, and that might be a reason why you slowly start introducing PyArrow into an existing project. Okay, thank you. Thanks, great talk. So, you've said lots of nice things about PyArrow: are there any, you know, blockers or reasons why I wouldn't want to use PyArrow? Am I forced to answer? No, just joking. Yes, obviously, like with any library or framework out there, it's not all happy things. For example, one of the things where we are investing a lot of time recently is improving the PyArrow documentation, because it's a project that grows very quickly, with a lot of developers throwing features and capabilities into it, and the documentation has been a bit behind from the point of view of being easy to consume and read. So, for example, the reference is very complete, and for many functions you have examples and things like that, but if you are trying to start using a new feature that you don't know, there might be a lack of tutorials or introductions to that feature. And that's something that I really value: when users open issues about those problems, because obviously for us, as developers and constant users of PyArrow, everything is obvious, because we wrote it, but for external users it might not be as simple, and they can help us understand where there is room for improvement. All right, let's give a warm round of applause to Alessandro.