The Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

Today we're happy to have Wes McKinney, co-founder of Voltron Data. Rather than me talking about Wes's background, we decided to reach out to an old friend of Wes and have him send a video. Unfortunately this person can't be here today because of other obligations, but they sent a quick video that describes Wes's background for everyone.

What's going on! Wes McKinney: co-founder of Voltron Data, creator of Apache Arrow, creator of Python pandas, and an N64 GoldenEye speedrun world-record holder in the late '90s. Now that's great s***. That's the greatest s*** on Earth. I'm happy for you. Enjoy your day, and many more to come — cheers on this cold old evening.

Right, so with that, Wes, go ahead and share your screen and by all means get started. And for anybody in the audience, if you have questions for Wes as he gives the talk, please feel free to unmute yourself, say who you are, and ask your question at any time. All right, Wes, thank you for being here.

Awesome. Thanks, Andy, for having me. I like giving these update talks to this technical audience, since it's often led to productive collaborations after the fact. And it is true that I did spend a major portion of my youth doing GoldenEye 007 speedruns in the late '90s, but that was a long time ago now. It does come up now and then. Chris was very impressed by that. You're a good friend. Yes, yes. Awesome.

Okay. So we're coming right up on six years of Apache Arrow development. I assume that almost everyone on the call already knows what this project is, but I'll give a quick recap, since many people watching the video online may not already be Arrow experts.

I was at Cloudera when I helped create the project. This was in 2015, and it was clear at that time that there was a confluence of different factors leading to the need to develop a standardized in-memory data representation to tie systems together. I was coming at the problem from the Python data science world, and I had a lot of frustration with getting access to data so that I could work with it in pandas or use it in Python. I landed at Cloudera in 2014 with the mission of providing a Python programming API for things like Impala and Spark and other big data systems, and found myself in this kind of quagmire of: how do we even connect these systems together? How do we get data out? How do we run user-defined functions without paying a huge penalty?

At that time, and prior to that, there was this feeling of: oh, Python's not a serious language, so people are willing to pay a lot of overhead for the convenience of using Python. If you compare the Hadoop streaming API — you could write MapReduce jobs in Python instead of programming them in Java — but if you cared about saving compute costs, why would you ever do that? And nobody really cared, because, oh, it's Python, it doesn't matter that it costs a lot, it's not a "real" language. But obviously, to paraphrase the Dude, new stuff has come to light.
And languages like Python have become a lot more important. I think we'll continue to see a lot of programming-language diversity, so creating an environment where you can connect together programming languages, data processing engines, and database systems with minimal overhead is going to be critical to having a better, more efficient future.

We founded a company this year called Voltron Data, and we're building Arrow-powered computing systems. I'll give a few words about the company and how it came together at the end of the talk.

So we've been building this project. It started out as a specification for an in-memory format, and then we developed a protocol for serializing and transmitting that memory format across processes. Over the last several years we've essentially created an expansive developer toolbox that solves all manner of problems — first-order and second-order problems that arise out of building applications that use the Arrow format. We've taken a really component-based, modular approach to building integrations with formats, serialization tools, compression, memory management, file systems — any problem that you would bump into when you go to build a real system. So rather than leaving developers to cobble together random open source projects to solve these problems, we wanted to provide a batteries-included development platform for building fast analytic systems.

The Arrow format was designed for processing efficiency; it wasn't designed as a storage format. It was intended to address the language-interoperability and system-connectivity challenges that we perceived in the 2015 era, and thankfully a lot of people have shown up to say, yes, we agree this is a problem worth solving. It's really great that we've built a thriving developer community around one robust solution to the problem rather than half a dozen incompatible, competing standards that do the same thing.

One way I think about the project, and how I describe it to some people, is that we're trying to do for data analytics software what LLVM did for compiler infrastructure. Rather than having a completely vertically integrated system, we have modular components that solve different problems up and down the stack. So you as a system builder can choose which parts of the platform to use — which programming language, which pieces of software. If you want to do embedded query processing in Rust, we've got a thing for that. If you need to compile array expressions for projections and filters, we've got a compiler you can take, and you can use the compiled expressions in a system that's written in Java. Dremio arranged to do exactly that, because they needed to accelerate their operators — in Dremio, in Java — using LLVM. So you aren't beholden to take on all of the things that exist in the project.
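As a rough illustration of what that batteries-included toolbox looks like in practice, here is a small sketch using the Python Arrow libraries; the file names are hypothetical, but the filesystem layer, the CSV reader, the Parquet writer, and the compression codec all ship in the one package.

```python
import pyarrow.csv as csv
import pyarrow.parquet as pq
import pyarrow.fs as fs

# Read a CSV through the filesystem abstraction, then write it back out as
# Snappy-compressed Parquet. "events.csv" and "events.parquet" are made-up names.
local = fs.LocalFileSystem()
with local.open_input_stream("events.csv") as f:
    table = csv.read_csv(f)

pq.write_table(table, "events.parquet", compression="snappy")
print(table.schema)
```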
And there are projects using Arrow that are only using the protocol standards, like the C interface, so I'll talk a little bit about how that works and how we're seeing real-world database management systems like DuckDB adopting the C interface, so that we can have systems that depend on each other and connect together while sharing no code at all aside from a couple of simple C data structures. We're aiming toward a world where we can increasingly rely on being able to plug systems together. The systems may not be Arrow-native themselves, but you have the option of connecting via Arrow, whether that's sending a single chunk of data or a stream of data — either in-process, for example at a C FFI boundary, or between processes through shared memory, over a socket, or over something like gRPC.

We've been really busy over the last five, almost six years, so here is a brief summary of the history of the project. We announced the project at the beginning of 2016 and started making releases at the end of 2016 — it took us a while to have a piece of software that we could release. Since then we've gotten to a roughly quarterly cadence of major releases. We declared Arrow — the memory format and the binary protocol for interprocess communication — stable in the middle of 2020, so that was more than four years into the process. We also moved to a semantic versioning system for the libraries, so the protocol now has a separate versioning scheme: we're still on the 1.0 protocol version, but the library version numbers are going up more quickly, roughly one major version per quarter. It's a little bit confusing, maybe, but the 6.0 library version tracks the 1.0 protocol version, and you can reason about protocol stability between versions of the libraries: if I have the 6.0 version of the C++ library and the 1.0 version of the Java library, you can have confidence that you have backwards and forwards compatibility at the protocol level. The libraries can be different versions and it's no problem.

That's a little bit of where we're at. Now for some of the work that's been going on lately in the project and things that happened recently in the 6.0 release. There's been a significant amount of labor invested in the C++ and Rust projects to provide embeddable query execution components. In Rust there is a complete query execution system, including a SQL parser, planner, and execution engine. In C++ we're building modular relational operators for query processing, but we are not building a SQL front end. We've been working with DuckDB around that, and it's quite likely that, to the extent that what we have in C++ gets a SQL front end, it will go through DuckDB itself. We're discussing how, technically speaking, we will make that connection happen so that we can also preserve the loose coupling that we desire between these different software components.
Of course, building relational operators is a lot of work, so we've been busy building things like hash aggregations, hash joins, and the different operators you would need to implement the standard database benchmarks — TPC-H, eventually TPC-DS. We have near-complete support for TPC-H in C++, whereas in Rust there's been complete TPC-H coverage in DataFusion for quite a while.

The Arrow community is not a monoculture; there's no central governing body deciding the roadmap for the project. So in a sense the Rust developers are acting on their own, building systems that are top-to-bottom Rust-native. For example, folks from InfluxData working on the next-generation execution engine for InfluxDB have been contributing heavily to DataFusion, because DataFusion forms the core of their next-generation IOx query engine — they're moving from Go to Rust. From a development standpoint there isn't a great deal of day-to-day collaboration between the C++ developers, which include a lot of people on my team, and the Rust developers, although we do work on things like integration testing to make sure that systems built in either of these languages can connect to each other, and we've verified that the protocols are compatible.

Go ahead. Yeah — should I think of DataFusion as an Influx project that doesn't fall under the umbrella of Arrow?

DataFusion is part of Apache Arrow. It was started by Andy Grove, who is now at NVIDIA — it's part of his day job now, but he was working at a different company when he donated DataFusion to the Arrow project. Then, organically, InfluxDB decided they were going to build their next-generation query engine in Rust, and rather than build something from scratch they said: we want it to be based on Arrow, we want to take advantage of all the good things Arrow gives us, and this query engine is a good starting point. So they decided to collaborate and build together rather than building something full-stack that's InfluxDB-only. They probably made their development process more complex by introducing a dependency between Rust crates, but at the same time there are developers from all these other companies working on DataFusion, so I guess you have to believe that in some cases open source can make you more productive.

But this means there's an execution kernel written in C++, and at the same time a whole other one being written in Rust that basically repeats the same features. How should I think about that?

Yeah — everything's modular, and I think it would be an awesome outcome if DataFusion becomes the main analytics tool, the thing that powers analytics in Rust. And I see every reason why it will be: when people are doing analytics — data-frame-type processing or SQL processing — in any kind of Rust application that needs to do embedded query processing, they want memory safety, and Rust developers really want everything to be written in Rust. It's kind of like Julia, in a sense.
So if analytics in Rust becomes effectively Arrow-native from day one, that's not a bad thing.

One cool thing we've been working on closely with the DuckDB folks is building near-zero-copy integration — it says "zero copy" here, but I would describe it as very nearly zero copy; I know the DuckDB folks are on the line, so maybe it's something — anyway, it's neither here nor there. One of the cool things we've achieved is being able to run DuckDB on top of either memory-resident datasets or dynamically generated, streaming datasets. On the Arrow side we have all of these different ways to get access to data — it could be making an RPC call to some server which streams the data to you over gRPC — and you can attach that stream of data you're getting over gRPC to DuckDB and run SQL on it. So we have all these different places where data can come from, and we can compose them with a high-performance query engine like DuckDB with effectively zero overhead.

We've exposed these capabilities in R using the dplyr API. You can have a dataset that originates from the Arrow libraries — for example, data in Parquet files; DuckDB also knows how to read Parquet files, but if there's some data that Arrow knows how to read and DuckDB doesn't, you can grab a kind of streaming handle to that data, pipe it into DuckDB, and write your query, which could be in SQL or could use the dplyr API, and let DuckDB do its thing. We're making as similar an API as possible available in Python. And we connect these libraries together without sharing any code — I'll explain in the slides how that connection takes place so that you can get this zero-copy, data-level composability without any code sharing. Very cool stuff.

Now some very quick words about the Arrow columnar format. It's principally designed for runtime use — you can put it on disk, but it isn't designed to be a storage format. That being said, I will express the perhaps controversial view that it's time to design a better file format than Parquet: one that is a lot faster to decode, and perhaps designed with more affinity with Arrow in mind. A lot of people on this call, including myself, have spent a lot of blood, sweat, and tears dealing with the Parquet format — so it's something we should think about. In the meantime, Arrow is intended to be used as a companion technology to columnar storage formats like Parquet and ORC.

The format has support for flat and nested schemas, and the data is organized in a way that facilitates SIMD processing. One feature that was important to a lot of us when we started the project was making sure we could accommodate both scan and random-access workloads. There are certainly data structures out there that are primarily used in a scan-based context, but if you need to access a cell in the middle of a large Arrow data structure, you can do that with a constant-time guarantee.
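Coming back to the DuckDB composition described above, a minimal sketch of what that looks like from Python — assuming the duckdb and pyarrow packages, with made-up table contents — is:

```python
import pyarrow as pa
import duckdb

# An Arrow table that could just as well have arrived from a Parquet scan or a
# gRPC/Flight stream; DuckDB only ever sees Arrow data.
cities = pa.table({
    "name": ["nyc", "pgh", "sf"],
    "population": [8_400_000, 300_000, 870_000],
})

con = duckdb.connect()
con.register("cities", cities)  # expose the Arrow table to SQL without copying its buffers
result = con.execute(
    "SELECT name FROM cities WHERE population > 500000"
).fetch_arrow_table()           # the result comes back as Arrow as well
print(result.column("name"))
```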
I see Dominic has his hand up.

One question: you mentioned that you've dealt with a lot of blood, sweat, and tears with Parquet. What specifically is the issue with Parquet, and how could Arrow as an on-disk format solve it?

The decoding of Parquet files is relatively complex. There are several layers of encoding in the file that are designed to make the file as small as possible. This reflected the reality on the ground in 2011, when the format was designed, which was that a lot of data processing workloads were bottlenecked on the performance and latency of spinning-disk hard drives. So you go through dictionary compression, then run-length encoding of null and non-null data, then additional general-purpose compression on top of data pages, combined with Thrift and rather messy metadata — it's complex. Overall it's a little bit of a Frankenstein file format. It's effectively the best we've got in terms of widely adopted technology — it's Spark's preferred file format, so it's definitely not going anywhere — but it's pretty computationally expensive to decode. I see Hannes is on the line; I know he has suffered greatly, so if you have any thoughts or comments about that...

I'm assuming you have the same issues with Iceberg and the CarbonData formats as well — as you said, they were designed ten years ago.

Well, Iceberg is built largely on Parquet. It's a metadata and planning layer on top of Parquet, and it uses Parquet files itself to store the metadata — so if you've got millions of files, reading the metadata and getting all the manifests of the dataset in Iceberg involves reading a single Parquet file, as is my understanding.

Actually, Adam, do you want to make your comment quickly? Sorry — I was just going to say that modern Iceberg, the latest versions, are file-format agnostic now.

Okay, so what is your prediction regarding the loss of compression? Because if you have Arrow as the storage format, it eliminates a lot of reformatting of the data every time we do I/O, and we're willing to compromise on compression a bit to get that. What's your prediction — is it going to be 10% worse compression, or 100%, or 3x?

I'm not sure, to be honest. But consider that overall bandwidth to storage will continue to get faster and faster, and I think we will see more and more storage co-processor support — if any of you have seen the work going on in Carlos Maltzahn's group on Ceph, pushing down Arrow processing into Ceph. So I think we're going to see — it's almost like two steps forward, one step back, I don't know what the right metaphor is — we worked really hard to completely decouple processing from storage, but I think the future will be some hybrid of the two.
So, doing some amount of lightweight processing in the storage layer. And if you have 100-gigabit or 400-gigabit networking, actually moving the data to the node that needs to process it is not the problem — rather than being I/O-bound like it was in the past, it becomes more compute-bound.

Yeah, but the point here is that in modern systems, more and more, we are really working on data that is cached in the cluster, and with 100-gigabit networks going to 400-gigabit, the I/O bottleneck is gone. We still have to spend all this CPU time converting the format from one to the other. And compression is important, because we're using expensive NVMe and Optane and all that stuff — compression is important.

Right. Agreed. So I would hope — and this is something that, as an open source community, we should work on together — to develop a compression scheme that delivers good compression ratios on Arrow data and can be decompressed a lot more quickly than Parquet. I don't know what the compression ratio would be. If on average the files are twice as big as Parquet files, perhaps that's an acceptable trade-off — I don't know that I'm the right judge of what's acceptable, and I'd be interested to know what is; every system may see things differently. Anyway, I'd love to talk more about this offline, so feel free to reach out.

This is something we might be looking into. Let's take this offline. Yep.

All right, I'll go through this quickly and skip over the Arrow metadata. We have data types; the metadata brings semantic meaning to the physical memory layouts. We have a number of built-in memory layouts, which have been relatively unchanged since the beginning of the project. An interesting area — I'm about to start a discussion on the Arrow mailing list — is incorporating new or augmented columnar layouts, taking into consideration innovations that have occurred in other columnar processing systems. For example, DuckDB, Velox, and Umbra all use a common string-view data type with inline small strings, which has numerous benefits, so I think that's something we should look at adding to Arrow more formally, as well as adding run-length encoding and constant arrays. There's also some work that enables reuse of data in nested types and rearranging data in a nested array without doing any copying, which can be pretty beneficial in some workloads. I worry about columnar systems specializing because they have performance needs, so that we end up with more fragmentation at the data level than is desirable. So, some things to think about — if you're interested, I'd be happy to have your feedback.

So that I don't run out of time: Arrow has a binary protocol for interprocess communication, and it is streaming in nature. The schema is negotiated up front, along with dictionaries. We allow dictionaries to be replaced or appended to mid-stream, so if you have a dictionary that's evolving, you can send a delta after you've already sent a number of record batches of data. On the receiver side, when a receiver gets a payload of Arrow data, it can construct data structures that just hold memory addresses into the binary blobs — and that's what enables the much-heralded zero-copy property.
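To make that streaming protocol concrete, here is a small sketch using pyarrow's IPC module; the schema and values are invented, and the same bytes could just as well cross a socket or shared memory instead of an in-memory buffer.

```python
import pyarrow as pa
import pyarrow.ipc as ipc

schema = pa.schema([("id", pa.int64()), ("tag", pa.string())])

# Sender side: the schema goes down the stream first, then record batches follow.
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, schema) as writer:
    writer.write_batch(pa.record_batch(
        [pa.array([1, 2], pa.int64()), pa.array(["a", "b"], pa.string())], schema=schema))
    writer.write_batch(pa.record_batch(
        [pa.array([3], pa.int64()), pa.array(["c"], pa.string())], schema=schema))

# Receiver side: batches are reconstructed by pointing into the payload buffers,
# with no per-value parsing.
with ipc.open_stream(sink.getvalue()) as reader:
    for batch in reader:
        print(batch.num_rows, batch.to_pydict())
```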
Over the last couple of years we've been developing Flight, a kind of out-of-the-box RPC framework built on top of gRPC, intended to make it easier to build Arrow-native data services. We've dealt with the particular details of taking our binary protocol and marrying it with streaming puts and gets in gRPC. We've done some low-level optimizations in gRPC — the data is embedded in protocol buffers, and, depending on who you ask, you could view what we do either as hacks or as elegant intercepting of serialization to avoid unnecessary memory copies — but it works fairly well.

One thing we're looking at in Flight that may be of interest to this group is developing alternative data transport layers, in particular things like libfabric and UCX. Flight performs really well on 10- and 40-gigabit Ethernet, but if you have 100-gigabit or 400-gigabit networking, TCP is not going to be your best friend. So we'd like to continue to use gRPC as the control plane while having an alternative data plane for faster data movement.

Flight is designed for parallel data access, so you can have a get request where the data is sharded across multiple nodes. Rather than the data coming through a single coordinator endpoint, when you make a query to a Flight service you get back a listing of endpoints to query, and depending on the topology of the service, all the data might come from one node or from an arbitrary collection of nodes. So you might make one get request or ten get requests, depending on how the service is arranged on the other side.

Something we've been doing in the last year or so with Flight is thinking about the fact that we have Arrow, this nice columnar format, with all these different systems that support it — but what about slow database protocols, or slow database interfaces, like ODBC and JDBC? Wouldn't it be nice if we could sidestep those middleware APIs, the marshaling of data into the ODBC or JDBC API? We call this effort Flight SQL: defining middleware data structures to expose full ODBC-like semantics over the Flight interface, so that you execute a SQL query, pull the result set back directly in Arrow format, and thereby get an alternative to ODBC, for example, in database systems. These chicken-and-egg problems are hard, because now we have to see which database systems or database vendors might be first to implement Flight SQL, but hopefully we can start a new trend there.
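On the client side, the endpoint-listing flow looks roughly like this with pyarrow.flight; the server URI and the command payload here are placeholders for whatever a real Flight service defines.

```python
import pyarrow.flight as flight

client = flight.FlightClient("grpc://data-service.example.com:8815")   # hypothetical service
descriptor = flight.FlightDescriptor.for_command(b"SELECT * FROM events")

# The service replies with one or more endpoints; each can live on a different
# node, so the fetches below could run in parallel.
info = client.get_flight_info(descriptor)
tables = []
for endpoint in info.endpoints:
    reader = client.do_get(endpoint.ticket)
    tables.append(reader.read_all())   # each stream arrives as Arrow record batches

print(sum(t.num_rows for t in tables))
```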
Another pretty new standard for connectivity, which we've been using to great benefit in DuckDB, is the C data interface. This enables two systems to exchange either a single blob of Arrow data or a stream — an iterator — of Arrow data, using a set of simple C structs. You can copy the C structs and put them in your application, your system. You have to write some code to populate the structs, but you aren't required to take on any library dependency. Both DuckDB and Meta's (Facebook's) new project Velox implement this C API, which enables them to connect in-process at C function call sites with pretty low overhead.

These are what the structures look like. There's a spec that describes what needs to go in the format field, how data types are encoded, and that sort of thing. If you look at the DuckDB implementation, it's quite compact and tidy, and I think it provides a good reference point for others implementing this in their systems, since you can see it used in a real database system for connectivity. We now have support for the protocol in, I think, four different programming-language libraries in the Arrow project. So if you have an application that uses any of those libraries, it can produce and consume the C interface and then tap into these different query engines — or things like the Gandiva LLVM expression compiler, which you could drop into your application while keeping the details of the compiler relatively encapsulated. So there are pretty interesting possibilities there.

Another problem we're spending some energy on of late is the problem of programming interfaces to query engines. In the past you would see a lot of vertical integration between execution engines and the front-end query interface, whether that's more of a data-frame-like API or a SQL API, and there are all these different middleware libraries which enable different language interfaces to talk to different computing engines. In the same way that we've worked within Arrow to develop a standard for data connectivity between systems, we'd also like to provide a middleware standard for connecting programming interfaces to query engines. As I said earlier, there are multiple Arrow-native or Arrow-compatible query engines, and we would like to enable users to use their preferred programming language or programming interface to access them without being as aware of the particular details of what's going on under the hood — to enable better interchangeability of engines. Certain details, like data loading and DDL-type things, we may not be able to make go away completely, but at least for the query part — specifying an operation to run, whether in SQL or something else — if we can have some measure of standardization and interchangeability, that would be really nice.

So there's a parallel effort — not within Apache Arrow, but a separate open-source initiative — called Substrait that we've been working on, and the DuckDB folks have been collaborating as well, working on an implementation. Jacques Nadeau, who was one of the creators of Arrow with me and with whom I've worked closely for years, has been working on and driving this.
The idea is to have a portable, language-agnostic logical query plan. You could generate it by parsing a SQL query — there's a Calcite-to-Substrait converter that I believe Jacques has been working on — so that's one way to generate Substrait, which is a set of protocol buffers. But we're also building integrations with data frame APIs in languages like Python and R, so that we enable alternative, non-SQL-based interfaces. If a system knows how to accept and execute Substrait queries, we sidestep the whole problem of which SQL dialect to use and whether we have to carry around a SQL parser and analyzer to be able to do anything. Substrait is another thing to implement, but it's lower level than SQL — it's at the level of logical query plans. So we're going for a world of different APIs going through Substrait to talk to different back ends.

All right, I know I'm running long and we want to leave a little time for questions, but there's a lot of work happening on query processing for Arrow. There's an expanding network — and this slide is not comprehensive — of query engines that can speak Arrow in some capacity. There are systems built Arrow-native from the ground up, which includes projects like Dremio and DataFusion. There are projects that support Arrow at little to no overhead through the C interface, which includes DuckDB and Velox. And then there are other database systems — I need to figure out all the names that should go into this box — that support either importing or exporting Arrow in some capacity. Arrow has been used, for example, in both Snowflake and BigQuery as a medium for getting data out into client APIs at higher speeds, because then on the client side you only have one thing to convert from to wherever else you need the data — whether that's something Arrow-native, or you need the data in pandas and there's already a pre-built Arrow-to-pandas converter.

We have been doing work in C++ to build modular, reusable relational operators — you can see this work ongoing in the Arrow code base. These are things like aggregations, joins, sorts, projections, filters: the building blocks of a relational query engine. We have a growing collection of array functions which can be used for interpreted expression evaluation. The intent with all of this is to provide a batteries-included toolbox for building data processing engines. So not only do we have the lower level of the stack — the scanning of data with the dataset interfaces, like how to read Parquet files, how to address large directories of Parquet files in S3, and how to deal with schema normalization, that sort of thing — but we're now building the middle processing layer, such that if you want to build a system that can read and write datasets and do some amount of analytics on them, all of that is available out of the box in the Arrow project. I have an example here of running some queries using the C++ API, but I don't think I have time to cover it, so I'll post the slides and you can take a look.
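For a rough sense of the same dataset and compute layers from the Python bindings, here is a sketch; the directory path and column names are made up.

```python
import pyarrow.dataset as ds
import pyarrow.compute as pc

# A directory of Parquet files; an S3 URI plus a filesystem object works the
# same way. Paths and columns here are hypothetical.
dataset = ds.dataset("data/events/", format="parquet")

# The projection and filter are pushed into the scan rather than applied
# after the fact.
table = dataset.to_table(
    columns=["user_id", "amount"],
    filter=ds.field("amount") > 100,
)
print(pc.sum(table.column("amount")))
```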
But I did want to say that we've been very active about exposing query processing capabilities in R, and in the near future in Python. So if you are an R user and familiar with dplyr, you can write queries with dplyr and address datasets that exist in a variety of different storage schemes — remote file systems, local file systems, different file formats. You can compose a dataset as one step of the process and then query it using the standard dplyr API; or, if you insert the to_duckdb() function after one of these pipes, it will delegate the query execution to DuckDB, still using the standard dplyr API. So that's very nice stuff coming up in the near future in the project.

We're continuing to do a lot of work on the query execution side of things. In particular, we're looking to bring the same level of programmability and data-frame-like interfaces to Python. On the non-SQL side, that's going through the Ibis project, which is another project I started years ago at Cloudera and which has developed a life of its own in the intervening six years. Given that we have an expression compiler, I think we will also look at things like just-in-time compilation of hot paths: rather than doing interpreted expression evaluation everywhere, if you have the LLVM runtime built and Gandiva available, you can turn it on and use it to compile expressions and cache them. We haven't deeply studied the performance differences or when it makes sense to use the Gandiva compiler, but that's something we'll need to research to determine.

For folks on this call who have never seen the Ibis project in Python: basically, it's like dplyr for Python. It's a single user interface that can talk to many different database back ends, so it has existing SQL back ends for Postgres, ClickHouse, Impala, BigQuery — the Google Cloud people have been building and maintaining the BigQuery interface and using it for a bunch of Google Cloud SDK stuff. It's a pretty nice tool and provides a very clean relational-algebra API for Python, with type validation and strong typing. It will bark at you if you build a relational algebra expression that has some kind of problem with it — whether you're trying to apply an expression on a relation where it can't resolve a field, or it determines some other invalidity in an expression that you've written.
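For a flavor of that Ibis API, here is a sketch assuming the SQLite backend and a pre-existing table; the database file, table, and columns are invented, and the method names follow recent Ibis releases.

```python
import ibis

# Any backend works the same way; SQLite is used here only because it needs no server.
con = ibis.sqlite.connect("geo.db")          # hypothetical database file
cities = con.table("cities")

expr = (
    cities.filter(cities.population > 500_000)
          .group_by("country")
          .aggregate(total_pop=cities.population.sum())
)
# The expression is type-checked as it is built; unresolved columns or invalid
# operations raise errors before anything is sent to the backend.
print(expr.execute())
```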
So, on the business side, how are we paying for all of this development work? For a number of years I worked on Arrow in a nonprofit capacity, funded by RStudio — out of their interest in having Arrow-powered analytics available in R and better connectivity with languages other than R — and supported as well by Two Sigma and a number of other sponsors. That was great. We spun out of RStudio to form Ursa Computing in 2020 and raised some money from venture capitalists. And earlier this year we found an opportunity to join forces with pioneers from the GPU analytics ecosystem: BlazingSQL, and leadership from the RAPIDS organization at NVIDIA, who had built Arrow-native computing for CUDA. So we are working to build a unified computing foundation that is hardware-optimized, Arrow-native, and works well across user programming languages. The open-source, community-driven mission of Ursa Labs will continue on — Ursa Labs is now Voltron Labs — and we are maintaining a dedicated open-source team whose mission is to continue to grow and support the development of the Apache Arrow ecosystem, the "Arrow cinematic universe," as we think of it now that the project and its ecosystem continue to grow.

Well, thanks for listening. Hopefully there were some interesting things here; I imagine there are things that would be good for us to dig into offline, so if anything was thought-provoking, feel free to reach out — you can send me a DM on Twitter, and I'm sure anyone on this call can find my email address easily by looking at GitHub. We're also hiring, so if you're interested in working on any of these problems, I'm happy to talk with you about that as well.

Okay, awesome, Wes. I will clap my hands for everyone else. All right, I'm sure there are a lot of questions. Raise your hand if you want and we'll call on you, or unmute yourself and fire away.

Okay. All right, so I guess my question is sort of a technical question. Everything you laid out I think makes sense. My question is: if your vision is successful and Arrow becomes the protocol everyone speaks, and everyone is using the different building blocks you're developing, which compose together into, essentially, an analytical data system — then what's the future of database systems? What is the distinguishing characteristic or feature that would make one system better than another, if it's all Arrow underneath the covers? Is it just the UI that's different? You're also obviously missing a query optimizer, which I don't think is something you want to pursue — you'd go to Calcite — so is it the cost modeling? Is that how you see the future? Or do you think there's still something someone could add, having built an entire system using Arrow's toolkit, other than the optimizer and the UI?

I mean, we're aiming for composability and modularity. So things like the SQL front end and the query optimizer — to the extent possible, we would like those things to be modular as well. Of course, the optimizer needs a lot of information about the data itself and about the characteristics of the query engine, so it's easy to say "modular query optimizer, great." Calcite has done a pretty good job of providing that to applications, so I think that's a good model for what success may look like. But to your point about what's the distinguishing feature of database systems —
I think ultimately, from the user standpoint, you want things to be simple in the sense that you can choose the programming model that makes sense for you, whether that's working in more of a data-frame-like interface or continuing to work with SQL. Everyone's always grumbling about how much they hate SQL, and there's really been no answer to "what comes after SQL" — I'm sure we'll still be writing SQL in 30 years. But if we create an environment where new query languages can be developed in a way that's modular and interchangeable, that doesn't seem like such a bad thing.

I think another aspect is that by having more standardization, at least on the protocol and the interfaces between systems, as with Arrow, system builders can focus more on cost-effectiveness. Rather than having these walled gardens where people are using your system and are kind of stuck — where migrating to another system comes at a very high people cost — when processing engines are more modular, you can interchange or make upgrades at the processing-engine level, and that reduces your computing costs, in the same way that I buy a new computer and my stuff runs faster. So essentially, some level of decoupling of the user interface from the execution engine — from what's going on under the hood — gets us into an environment where the industry is more aligned: rather than trying to build walled gardens and build a moat around your walled garden, it becomes, okay, how can we reduce the carbon footprint of the amount of data we're processing? Because that's a huge problem.

I think your comment about whether the next SQL could come out of this is actually very interesting.

I don't think I said that — I don't think SQL is going anywhere.

I don't think so either. But when you think about everyone who's tried to replace SQL, it's always been attached to the bespoke database system they built for it — the object-oriented databases in the late '80s, the XML guys, the Mongos. Everyone tried to say, here's my new language plus my new database system. But if you can just say, hey, I'm using the Arrow stuff and it accelerates how quickly I can have a database system, then you can focus on the UI/UX part, which could be an enhancement over or replacement for SQL. I think that's actually very interesting; I had not thought about it that way. That's really good. Okay, anybody else — any questions or comments? Deepak?

Now that you're building operators, do you have any thoughts on testing them — how do you permute them and ensure they're correct in different plans? Any thoughts on building the test framework?

I haven't personally been doing a lot of the development work on them lately, but I believe a lot of the test cases are there — some randomly generated tests as well as hand-coded ones. I think it would be good to move to something beyond hand-coded tests.
If you look at DuckDB, they went through the same process: for a long while the tests were hand-coded — I think they had the most unit tests for correctness — and they've moved, as far as I know (correct me if I'm wrong), almost entirely to interpreted test cases, which makes it much easier to generate and write tons and tons of tests. I think we will need to move in that direction as well, just to make it easier to write and generate lots of tests. And we've got the DuckDB integration, so we've got a system that's really rigorous about correctness that we can use as a check. Thanks.

Hey, Dominic here. You're ultimately also focusing on GPU support — what's the story for Arrow and GPUs?

Well, you're familiar with RAPIDS and cuDF, which is built on Arrow. There are a couple of minor deviations from Arrow — for example, Booleans are not bit-packed on GPUs, because it makes sense to blow them up to bytes for GPU reasons — but by and large we consider it an Arrow-based system. We're interested in seeing robust GPU support for Arrow into the future. The question is the form factor for delivering that — the programming API, the packaging and configuration of libraries and tools. If you use cuDF, it has its own Python library and set of pandas-like APIs, and I think we would be interested in bringing that under a common programming interface. So, like the Ibis programming model or the dplyr programming model: you can switch between, okay, I'll send this query over to ClickHouse here, or run it embedded in DuckDB here, and you're not having to rewrite your code based on where the query is running and what query engine is running it.

I look forward to the day when we can do that with browsers.

Well, you've got DuckDB compiled to WASM, running in the browser and reading Parquet files, so we're basically already there. I think the ecosystem has to become more mature, and things have to become something you can rely on in the browser and have available ubiquitously. But wouldn't it be cool if Chrome shipped with a query engine built in? There's a CSV file here — okay, let me run some SQL on that — and you don't have to think about it. I think that would be nice.

Yeah, what I'm thinking is: if you could memory-map directly from, let's say, an R process to — oh, I see.
Yeah — then you can visualize the data there, so that if I'm in Jupyter, for instance, I don't even have to copy data into the browser; I can just reference it directly.

I see. That would be cool. Yeah, that sounds like enough to give a Chrome security dev an aneurysm, but I'm sure we could figure something out.

Okay, my last question before we go: is there anything you've found — ways people have used Arrow — that you found surprising or unexpected? Or is it just database people, and everyone plays it really safe at this point?

I think one thing — I didn't know what to expect, but it wasn't obvious to me that people would build bespoke storage systems using Arrow directly. Arrow has been adopted in Hugging Face, if anyone's familiar with it — it's like an ML framework — and so people have been building proper storage systems based on Arrow. There's no reason not to do that, but it's somewhat off-label use; we certainly aren't encouraging people to start archiving data in Arrow format and storing it in S3 buckets.