So, I would like to talk to you about two technologies that I think have the potential to shape the future of data engineering, and those two technologies are Apache Arrow and Substrait. Apache Arrow, for those of you who don't know it, was born as an in-memory format to represent data in a way that is understood by all libraries and languages — a polyglot, language-agnostic format for sharing data between different software. Over the years it evolved from just a format into a library for working with that format. It now supports data interchange, modifying the in-memory data, transmitting that format over the network, saving it to files and loading it back — any kind of IO — and it provides a built-in compute engine that works directly on top of that format, along with all kinds of other capabilities. So to begin, I would like to introduce Apache Arrow a little, because it is a way for us to start collaborating across very different languages. In the past, when you wanted to share data between different libraries or environments — say, share your pandas DataFrames with Spark or something like that — that software had to go through a process of copying the data, converting it into a format it understood, and only then letting you work on it. What Apache Arrow tries to do instead is let all that software share the same data without any cost of copying or converting, because they all understand the Arrow format. So the foundation of Apache Arrow data is, obviously, arrays. I hope nobody got offended by the previous slide.
So the foundation of the data we can represent in Arrow is, obviously, arrays, because most of the time in data engineering and data science you care about columns of data: you want to run an operation over a set of multiple rows. For us Python users, that is represented as a PyArrow array. At this layer, PyArrow is not too different from what you might already be used to with NumPy. We can perform the standard kinds of operations, like creating an array, and that array can contain any kind of data — numbers, strings, or structures. And you might be asking: why would I ever use PyArrow, given that many libraries already understand the NumPy format? Well, because NumPy was born without thinking too much about interoperability of that data at the time. The format is not very optimized for transmission over networks, or for keeping the data fast to work with in memory for every kind of data. One typical case where NumPy does not perform very well is when you want to work with strings, or with structured objects, or with missing data. In all those cases, NumPy does not provide a built-in implementation with an optimized workflow; it mostly relies on Python. So if you store a string, it will be a Python object, and that means that when we want to work over the data in the array we will be slower, because instead of a contiguous region of memory we just have pointers to Python objects — so we cannot really implement any high-performance algorithm that works on them in a vectorized way. With PyArrow — or better, with the Arrow format — everything is stored in contiguous arrays of memory. Arrow tries to give that as a general guarantee for any kind of data, and that includes strings. Below, for example, you can see the way strings are stored.
That means that if I want to look for a string in the array, I can just perform a linear scan over the data; if I want to perform a transformation, I can apply a vectorized function to all the strings, and those kinds of things. And also, differently from NumPy, Arrow provides built-in support for missing data. If a row of your data is not available, you can flag it as missing through a validity bitmap that Arrow keeps. You might say: well, I could just store zero or something like that to signal a missing value. Yes, you could, but then what happens if zero is valid data? Or what happens if you want to know why the data is missing? A very common case when working with surveys is that you have missing data, but there are multiple reasons why it might be missing: the person did not answer the question, or the person answered "I don't know", or the person answered "I prefer not to answer". By having both the values array and the validity bitmap, we can store both "yes, it's missing" and, as the value, the reason it is missing — the person did not answer, or something like that. Then, obviously, you will say: but I don't usually want to work with just one column; I usually want to work with a more complex set of data and understand the relations between the different columns. Great — that's why Arrow also provides support for columnar data, mostly in the form of tables, but also record batches and other ways to manage tabular data depending on what you need to do. As a user you will usually end up using tables, because those are the most similar to what you can do with a pandas DataFrame or the other libraries you might be used to. The foundation of a table is not the array itself; it's actually a chunked array.
That's because you might be appending more data, extending the columns, or performing those kinds of operations, and the fact that we use a chunked array means those operations are essentially zero cost: we don't have to reallocate the whole array, we just append a chunk at the end. Here is a simple example of a table. In general you create a table from a set of columns, assigning a schema to the table, and the schema is nothing more than names and types — and the types are already known by the arrays themselves. On top of tables, you could be wondering: great, I have arrays, I have tables — why should I use them? Well, first of all because, as we said, if you represent your data as Arrow arrays, Arrow tables, or Arrow record batches, you will be able to share the data across different libraries without incurring any conversion cost. And then, in case you don't want to share the data with something else but just want to run some simple exploration of new data, Arrow implements a built-in compute engine based on the Arrow format itself. Now, Arrow also allows us to store data on disk. Usually people rely on the Parquet reader, which is very optimized for loading data from Parquet into the Arrow format. But if you want, for various reasons, there is also the native Arrow on-disk format, which allows you to do things like memory-mapping your arrays from disk, because the on-disk format is identical to the in-memory format. There is no cost of loading or conversion, which is why it is usually very fast. But it will be bigger: Parquet implements a whole set of compression algorithms, dictionary encoding, and other forms of more effective and efficient storage.
But thanks to the fact that you can memory-map the data on disk, the Arrow native format allows you to easily work with data that is bigger than your memory, because you just let the kernel swap pages in and out of the memory-mapped file. And we have seen that once we load data, we have it in the Arrow format, which allows us to share that data with many different libraries. You might be wondering: why should I even share the data with different libraries? Well, one very simple use case that gives you an idea of why this is helpful involves pandas — pandas, for those who don't know, is based on Apache Arrow for various use cases since version 2.0. But before version 2.0, if you wanted to load a Parquet file into pandas, it was faster to load it with Arrow and then convert it to a pandas DataFrame. It was an order of magnitude faster, so not just a small improvement. That's because the Arrow Parquet reader was more optimized for loading data into the Arrow format, and then there was no cost of converting that data from Arrow to pandas, because pandas understood the Arrow format natively. Well, in reality there were some conversion costs: if you had strings — as we have seen, pandas relies on NumPy, and NumPy does not really have great support for strings — or certain other kinds of columns, they would require conversion. But most basic types, like numerical types, incurred no conversion cost at all. And then we have seen that we have a compute engine in Arrow. So if I want to perform transformations or analysis on top of the Arrow data itself, yes, I can ship the data to pandas and perform all my transformations in pandas, or I can just run the same transformations in Arrow itself using the Acero compute engine, which ships together with Arrow. Generally the Acero compute engine is faster than pandas for most use cases, because it's implemented natively in C++ with vectorized operations.
While most compute functions in pandas are optimized because they are implemented in Cython, they are generally not as fast as the Acero operations. And then we have seen that Arrow is not just a format for disk and memory, but also for networking: if you want to ship data across the network, you can rely on Arrow Flight, a protocol to share data across networks. Here is a brief example of what you can do if you use the built-in compute engine to perform operations on a table. Generally you have the basic operations, like group-by aggregations, filtering, or joins — pandas has a much richer set of operations — but as I told you, it's generally very fast. In this case, for example, we can see finding the unique values in an array, which is a very common kind of operation. If you want to know how many occurrences of a value you have in an array — which might mean how many people gave a specific answer to a question in my survey, going back to the survey example — you run a unique count. Doing that in NumPy takes about one and a half seconds; doing the same exact operation over the same exact data in Arrow takes about 0.4 seconds. So it's usually three to four times faster than running the same operation in NumPy. And I told you a little bit about Flight. Flight is a protocol we can use to ship data, and you can easily implement a Flight server and a Flight client by relying on the classes built into PyArrow itself. Flight is mostly based on gRPC: it's a binary-only protocol that uses gRPC for the metadata and the Arrow format for the data itself. The reason Flight is important is that, like most of Arrow, it is optimized for performance — for shipping data across different libraries with very little overhead. And when I say very little overhead, I mean it.
In the case of shipping something like five million records, or a billion records, over an ODBC driver, you can see that the difference is huge. For five million records, ODBC takes something like 60 seconds; doing the same exact thing with Arrow Flight takes something like five seconds — I don't remember the exact value, but that's the order of magnitude. And as the data gets bigger and bigger, the difference increases. The reason is that there is never any conversion cost: the data goes into the network layer without any conversion, the data comes out of the network layer into the memory of the receiving end without any conversion, and the data is stored in a very compact and optimized form throughout. So we have seen that with Arrow we can load data from disk very fast, we can manipulate it very fast, and we can ship it across the network very fast. In practice we have all the building blocks we might need to work with data. When working with data, there are two things we have to do: one is having a way to represent and manipulate the data itself; the other is being able to express the manipulations, the queries, that we want to run on that data. We know that many libraries support Arrow data: they can understand any data you give them in the Arrow format, and they can give you back data in the Arrow format so you can forward it to something else. But that would be incomplete if we were only able to ship the data without being able to tell the library what to do with it. That's the reason why, alongside the whole Arrow stack, we also have support for Substrait. Substrait is a language that represents query plans themselves: as much as Arrow is a lingua franca for sharing data across different software, Substrait is a lingua franca for sharing queries across different software.
So it's a way to implement an API-independent, language-agnostic form of queries. Imagine something like SQL, but where there are no dialects: everyone understands the same exact queries. Imagine something like SQL, but binary — optimized for computation, optimized for software instead of humans. And imagine something like SQL that is decomposable, so I can take one query and split it into sub-pieces. Why is that important? Imagine a world where you send a query to a central compute engine, and that engine splits the query into smaller pieces and sends, for example, the filtering part down to where the data lives. For example, some people are thinking of implementing support for basic Substrait filters directly in disk firmware. That would mean that when I need to load data from a file, I just take the filter part of the Substrait query and forward it to the disk firmware, and the firmware will already give me back only the pieces of the file containing the rows matching that filter. Then I can forward those rows to the compute engine, which runs the rest of the query. Or, for example, if there is a query that performs a machine-learning operation and I want to use Spark ML for that specific part, that specific function, I can ship the data with Arrow and that piece of the query with Substrait to Spark, and it will process that part of the query over my data. And all this is essentially free, because we never have to convert the data or the query into any other format: they understand Substrait queries and Arrow data natively. Well, I used Spark as an example because it was easy, but Substrait support in Spark is something they are working on right now. There is an Intel-backed project named Gluten, built on Spark, which supports Substrait, and the work of implementing Substrait is now happening in the core repository of Spark itself.
So in the future we will be able to just use Spark with Substrait. And when I have those queries — when I have Substrait and I want to ship those queries to different software — how can I do that? Well, that's where Flight SQL comes into play. Flight SQL is a version of Flight optimized for shipping queries, and for shipping the results of those queries back from the compute engine. So, for example, you could use any driver that speaks Flight SQL to ship the query in Substrait format — or in SQL if you want; Flight SQL obviously, as you can guess from the name, also supports SQL. And once you ship that query, the endpoint, the receiver, will answer back with Arrow data. That means we get both the benefits of Substrait and the benefits of Arrow in terms of performance and compatibility of the data. And to make it easier to work with Flight SQL — because Flight SQL is in practice quite low level, it's at the network level; yes, you could use it, it's not super hard, but it's also not very convenient — ADBC was born: Arrow Database Connectivity. ADBC is in practice able to connect to any database that supports Arrow data, and especially those that support Flight SQL. At the moment you can connect with ADBC to Postgres, to DuckDB, to SQLite, and many more. The reason this is important is that if you go through ADBC, you benefit from all the performance improvements you get when connecting through an Arrow-native format. In practice you get the same performance you would get using Flight SQL, as we have seen. Well, it might not be exactly on par, because there might be cases where the Postgres protocol is less performant than Flight itself, but for general cases I think we can say you will get as much performance improvement as you could expect from Flight. So, if you are wondering about the main difference between Flight SQL and ADBC: both are Arrow-native protocols.
ADBC can use Flight SQL under the hood if you want, but ADBC is general: as a user you don't know, and you don't care, how you are connecting to the database — ADBC abstracts all of that for you — while Flight SQL has to be handled specifically for each database. If you use Flight SQL directly, you need to speak the exact dialect of that database. And not just in the queries you send — the query dialect is something you have to handle even when using ADBC — but also in terms of how the database ships back the metadata, how the database ships back the pointer to the query result. None of that is abstracted for you by Flight SQL; as I told you, it's a fairly database-specific layer, lower level than ADBC. But if you use ADBC, you in practice get the convenience you are used to from ODBC or JDBC, with the performance of an Arrow-native system. So, to recap what we talked about so far: we have a way to connect and send queries to our compute engine. As a Python user, the most common way is probably to use PyArrow itself, which is tightly coupled with the Acero compute engine. If I send a query to run over my data, PyArrow only supports Acero, so it will run that query on Acero. The benefit is that you don't pay any query-conversion cost, and you don't pay any cost of shipping the query and the data back and forth: PyArrow will directly share, let's say, the memory pointer to the data with Acero, Acero will perform your query and share the result back with PyArrow. That means it can be very effective for some cases, but it also means it won't be easy to scale, because it's a local engine that runs on your own system — not something you can deploy remotely across, say, 20 nodes. The reason I'm mentioning it is that it is frequently a very good way to work when exploring data.
You have the data in the Arrow format somehow — maybe because you loaded it from disk, maybe because you got it back from another application — and you can directly use PyArrow on your machine to view the data and perform basic manipulation: filtering it, aggregating it, trying to understand the shape of that data. But when working in a more production-ready environment, you probably want to use something like Ibis. For the people who don't know it, Ibis is a front-end to perform common analytics on top of different compute engines, or backends. So, for example, if I have data in Postgres, I can use Ibis to connect to Postgres, design my query, write it in Ibis, send the query to Postgres, and get back the data. Ibis natively supports getting data back in the Arrow format or in the pandas format — so if you are used to pandas, you can just get your answers back as the pandas DataFrames you already know. And Ibis obviously also supports pandas as a backend, so on the data you get back, you can perform further transformations using Ibis. That is convenient because it allows you to write a code base that is agnostic to your backend. Suppose that one day you are executing your queries on Postgres, and after a year of running on Postgres you hit limitations and cannot further improve your application. You can then switch to something like DuckDB, and you won't have to rewrite a single line of code, because both of them are supported by Ibis. And how does Ibis communicate with those backends? By using SQL, or Substrait for the backends that understand it. That means the queries you write in Ibis will be compiled to Substrait or SQL and shipped to the backend. How do they get shipped?
Well, we are in the Arrow ecosystem, so they can get shipped using ADBC, they can get shipped using Flight SQL, or — in the case of backends that don't support any of the Arrow-native protocols — with a dedicated custom backend that Ibis usually provides. Inside Ibis there are many backends, I think more than 10 of them. Once you have your data in Arrow format and your queries in Substrait format, you can send them to any compute engine compatible with those two. So, for example, using Ibis I could ship the query to Acero itself if I want, or I can ship that query to DuckDB, or to Meta's Velox, or to cuDF, or to DataFusion. That means I am able to choose whichever solution works best for my use case. Once those compute engines receive the query, they will be able to execute it on any data supported by Arrow itself. So, for example, you can run the query directly on your Parquet files, and you can have your Parquet files stored in any place supported by the Arrow filesystem layer, like S3. So, we have seen that we have many building blocks that, if you think about it, are in practice all the pieces you need to implement a database for common cases. So, one little example I want to show you — it's just a game, not something serious that I'm suggesting you use in production, but it gives you a very concrete sense of what can be done using Arrow, Substrait, and all the libraries that are part of the ecosystem. What you can see here is a small database that I've written in about 50 lines of code. For the database, we are going to use Parquet as the storage format for our tables, Substrait as the representation of our queries, SQL as the language that you, as a user, use to define your queries, and Arrow tables as the format in which we get back the results of our queries.
So, well, the first few lines of code are obviously very simple: they are just about setting up a way for the end user to pass the query. We just accept the query as the first argument of our program, of our app. The next step is to understand which tables are available — on which files we can run the query, and what their schema is. So we scan for every Parquet file and read its schema. Why do we want to do that before we even have an actual query to run? Because to compile the query into something the compute engine understands, you will usually need the schema: the compute engine needs to know if a column is an integer or a float. That way, if you have a less-than filter in your query, it knows whether to use the implementation of less-than for floats or for integers. Because, as you can imagine, implementing something like less-than for integers is very simple, because they are exact; implementing it for floats requires a bit more work, because the value you wrote might not compile to the same exact value the compute engine has in memory, even though you wrote the same number. An approximation can occur when converting from the string to the actual in-memory format, which might mean the data in your Parquet file is slightly different from the float literal in your query, due to the approximation that is inherent in floats. Once we have our query as the argument and our tables, what we can do is parse the SQL, and we can do that using Ibis. As I told you, Ibis is a tool to represent queries, and by virtue of being dedicated to representing queries, it also provides a parser for SQL, because in some cases you might not want to write a query in the Ibis format itself.
If you have ever used SQLAlchemy, the Ibis query language is very similar; but maybe you have an old query in a legacy system and you don't want to convert it to the Ibis format as the first step. You first want to set everything up, and once you see that the system is working, you can start slowly converting your queries to Ibis — that's the reason Ibis has native support for SQL. So I can run a SQL query directly by passing the string to Ibis, and for that reason there is also a compiler from Ibis to Substrait. Once we have parsed the query with Ibis, we can tell ibis-substrait to compile the parsed query to Substrait, and once we have the Substrait form of the query, we can run it on the Acero compute engine, which we get for free in PyArrow itself. The only two things we need to provide to Acero are the query itself and a table provider. The table provider is just a lookup function: when Acero finds the name of a table in the query, it allows Acero to find the actual PyArrow table that matches that name. So it's just a lookup map from names to PyArrow tables. And in this case, the way I do that is very straightforward: I just look for a Parquet file with the same name as the table. So in your query you write the name of the Parquet file instead of a table name, and I read that file. Here you can see that I'm actually reading the data only for the tables you mention in your query. I'm not going to read everything into memory and then run the query on top of it; we only read into memory the individual tables we need. Once we have sent our query to Acero, we get back the answer in Arrow format. That means the answer will be another table, with only the columns and rows we specified in our SQL query. And in this case I also convert the table to pandas, just for the purpose of printing it in a way people are more used to.
So, to see this in action, given a few Parquet files that I had: the first case is a simple query that finds every person living in Canada, asks for their first name, surname, and title, and limits the results to five, because I just want to explore what the data looks like. If I run this query against the little code base we have just seen, it will obviously answer with that data — a preview of the data in table format. But I can also run more complex things. Suppose I want a gender breakdown of the people in my database: I could write a SQL query that does a group-by with a count to find the occurrences of each gender among the people living in Canada. That's it. Those are two very simple queries, but the purpose was to show you how, with just 50 lines of code, we were actually able to implement a working SQL database. The next step, if you want, might be to implement network support — decoupling the client and the server using Flight SQL, for example. And at that point you would have a whole stack based on Arrow that never involves any cost of converting the data. And if I want, I can easily replace those three lines of code that run the query with something different, like Meta's Velox. For the people who don't know it, Velox is a very optimized compute engine that Meta built for their own analytics and has released as open source, and it is going to be the foundation of Presto, Meta's query system, in the future. So that's all. I hope I gave you a good overview of everything you can find in the Arrow ecosystem. There is obviously much more than just the things I've shown you, and I hope you will go out and experiment with it, because more and more libraries are transitioning to the Arrow format and to Substrait so that they can share data with each other — for example, you can run XGBoost on top of your Arrow data.
You can run many different libraries, or you can use GeoArrow to represent geographic data. The ecosystem is growing, so this is the right time to get started. Questions?

Amazing. Thank you so much. Wonderful. So, do we have any questions from the audience?

Thank you for the talk. I know some people who are running analytics on Amazon Athena, which lets you store files in various formats on S3 and then run SQL queries and so on. The formats supported include ORC, Avro, and Parquet, which seem to be Apache projects. Do you know if there is any work to add the Arrow format — or is the Parquet format close enough that it would be simple for Amazon to add? Or, alternatively, is there a path for experimenting with Arrow or the Ibis stack to try running analytics this way instead of Athena?

Well, there are solutions that do the same things you would do with Athena on other systems. DuckDB is a good example: it's a database designed for running analytics on top of Parquet files. And the Arrow library itself provides support for many formats: as you mentioned, we have Parquet, we have ORC, and we are growing support for things like Iceberg. So we are trying to cover all the formats people might have their data in. Obviously, it's a different world: it means moving from a cloud system to something you might have to manage yourself. But I think there are cases where companies are starting to provide cloud-based solutions — I think the DuckDB people recently founded a company, MotherDuck, and I believe their plan is to provide cloud environments for DuckDB. So yes, in theory, if you have your files on S3 and they are in Parquet format, you don't even need to touch your Athena deployment: you can just take a piece of software that supports Arrow, and it will connect to S3 and run the queries on top of your Parquet files.
So you can start experimenting with an Arrow solution in parallel to your current Athena deployment.

Okay, thank you.

Wonderful, thank you. So that's all the time we had for Q&A, I'm so sorry. You can find the speaker around on the Discord — just grab him and ask your other questions. Also, we had a request for your slides, because the talk was really good, to be uploaded to the Discord. So there we go. Thank you so much again, everyone. Thank you.