So let's get started. Hello, everyone. I'm really pleased to be here with you to talk about Rust, Python, and open source, three topics I find really interesting. Let me introduce myself first. I'm Florian Valeye, a staff data engineer at Back Market. Back Market is a global marketplace for refurbished devices such as smartphones and laptops. In addition to my role at Back Market, I'm also a maintainer of delta-rs, a Delta library for the Delta Lake ecosystem. Before getting into Python, Rust, and the combination of both, I'd like to introduce a bit what Delta Lake and delta-rs are. Delta Lake is an open source technology under the Linux Foundation. It was created by Databricks, the creators of Apache Spark, and it aims to bring reliability, performance, and quality to data lakes. You can think about it as a protocol with two main concepts. The first one is a scalable transaction log, and the second one is scalable storage. The transaction log keeps track of all the data files, which are written as Parquet in the scalable storage. It can be used on several cloud providers, such as AWS S3 or GCP with Cloud Storage, for example. Bringing reliability to data lakes was a way to improve the scalability of data and metadata management, but also to break down the walls between data warehouses and data lakes. It provides features like Z-ordering to optimize the data layout and improve read performance for analytical purposes. Delta Lake was really JVM-oriented: it was focused on the Java ecosystem, because it was written on top of Apache Spark, which is in Scala. That's why, to use Delta Lake, you had to be running an Apache Spark cluster and use Delta Lake on top. It was really focused on the Delta protocol.
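To make the two concepts concrete, here is a rough illustration (a deliberately simplified sketch, not the full Delta protocol): each commit under the `_delta_log` directory is a file of newline-delimited JSON actions, and the `add` actions reference the Parquet data files that make up the table at that version.

```python
import json

# Simplified sketch of one Delta transaction-log commit: newline-delimited
# JSON "actions". The "add" actions point at the Parquet files that form
# the current table snapshot. (Real commits carry many more fields.)
commit_00000 = "\n".join([
    json.dumps({"metaData": {"id": "example-table", "format": {"provider": "parquet"}}}),
    json.dumps({"add": {"path": "part-00000.parquet", "size": 1024, "dataChange": True}}),
    json.dumps({"add": {"path": "part-00001.parquet", "size": 2048, "dataChange": True}}),
])

def active_files(commit: str) -> list[str]:
    """Collect the Parquet paths referenced by 'add' actions in one commit."""
    paths = []
    for line in commit.splitlines():
        action = json.loads(line)
        if "add" in action:
            paths.append(action["add"]["path"])
    return paths

print(active_files(commit_00000))  # ['part-00000.parquet', 'part-00001.parquet']
```

A reader of the log can therefore reconstruct which Parquet files belong to any table version without ever scanning the storage itself, which is what makes the metadata-only access described next possible.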
And this design choice led to multiple connectors in the Java ecosystem, around the common Hadoop ecosystem, like Hive and the different libraries you can find there. Of course, when you need to run a cluster, sometimes you don't have billions of rows; you need only a small portion of a Delta table. And maybe you want to trigger some actions that do not require bootstrapping a big cluster with a lot of expensive resources. That's why, at Back Market, our first need was to access a Delta table only through the metadata it provides and to trigger some actions in Python. We were not really using the cluster, and we didn't want to have to spin one up. We also have event-based scenarios where we would like to use only small resources, like an AWS Lambda, that could handle the processing part. Or you can think about small use cases, like a data science topic where you just want to read, access, and explore a small portion of your Delta tables. And I can't deny that Rust was also part of it: delta-rs is Rust, and Rust is an excellent choice for data processing because it's really reliable and performant. It was also a way for us to open a gate to incorporate more data processing tools like Polars or Arrow, to provide a good way to create Python bindings, and to open a new gateway to other data ecosystem libraries like the AWS SDK for pandas or DuckDB, for instance. And right now, Databricks and the open source communities are thinking about re-implementing the Delta kernel as a core library, so that connecting a lot of libraries doesn't mean reinventing the wheel. delta-rs was built organically, without all those communities, and we had to re-implement a lot of the protocol, so a lot of re-engineering.
So right now, they are thinking about providing a Delta Kernel in Rust, but also in Java, to make the development process easier and to enable more connectors. And I would say we were not the only ones; a lot of different Python libraries are doing the same. You can think about pydantic-core: Pydantic v2 was re-implemented by rewriting the whole core in Rust, and they got something like a 17-times-faster library thanks to this. You can think about Polars, the blazingly fast DataFrame library, which improves exploration on that side as well. And there are also Arrow and DataFusion, which are really helpful for managing in-memory data. So it's really powerful when you need to collect, process, and analyze data sets. And for the fifth consecutive year, Rust has won the hearts of developers by being voted the most loved programming language. A bit of history: delta-rs was written in Rust from the start, created by Scribd in 2020, and Back Market joined the open source effort to provide and improve the Python bindings. It was really important to have well-defined, well-documented Python bindings to improve connectivity with other libraries. The layer was already there, but we joined the effort to improve it. And right after releasing the first version with the improved Python bindings, we saw a significant increase in the number of downloads on PyPI. It was really impressive, because a lot of different Python data processing tools integrate delta-rs through the Python bindings, and they really increased the usage of delta-rs. Right now we are more than 100 contributors, and a lot of contribution on delta-rs happens in both Python and Rust. So, the best of both worlds, Rust and Python: how to combine them. But first, I would like to talk about why not C or C++.
Mozilla, with the employee Graydon Hoare and others, created Rust to mitigate the common challenges we face when programming in low-level native languages like C: null pointers, buffer overflows, and a lot of data-race conditions. Rust was really the answer to these common issues, but without compromising developer experience and productivity. That's why Rust is the best of both worlds in this area: it provides high-level APIs while producing efficient and reliable code at runtime. It was also made by developers, for developers: if you look at Cargo, the package manager, it is really well structured and really helpful for building Rust crates. The Servo web engine implementation was done by the Mozilla Foundation, and right now about 10% of Firefox is written in Rust; Servo was another web engine for the browser. It means there is a lot of potential in this area. And when we talk about Rust here, we don't only talk about the Rust core; we have to think about the bindings as well. In Rust you can manage memory safely, but you can face memory management issues again when you need to create bindings between two languages: you have two different models of memory management, and when you release or allocate memory in one layer, you need to think about how it is managed in the other layer. Rust is really a plus here, because it provides a lot of binding libraries to help you manage this kind of situation, while at the same time letting you develop reliable software in Rust directly. Coming back to delta-rs, you see the first layer is made with Rust. Basically, we use Parquet: all the data is written in Parquet, so we fetch and collect everything and read it.
Then we convert it to Arrow, which is column-oriented and in-memory. It's really efficient because you have a zero-copy mechanism, and there is no serialization and deserialization when you process your data. We also use Tokio, an asynchronous runtime library, to improve the developer experience, so there are no blocking calls you have to make. So you have the performance, you have safe and reliable software with crates on the Rust layer, and you can run this crate wherever you want on your system. And it's not only about performance: with this Rust core, Scribd reported saving two orders of magnitude in cost just by incorporating the Rust Delta Lake crate into their ingestion layer. So it was really helpful for them to have this library. We also integrate object_store, a crate that unifies the API calls across different cloud providers. It was really helpful to have one single place, and a consistent API experience, when dealing with different cloud providers; so it was really helpful to integrate object_store into the Rust code base. And then we get to Python: how can we combine Python and Rust? We use PyO3 and Maturin; the combination of the two helps you create the glue between the languages. PyO3 is the Rust dependency you use to create the Python extension, and Maturin, on the other side, helps you define which Rust code to use and how to build the underlying bindings. So you have two layers: the first one is the PyPI package, which you can install from your Python ecosystem, and on the other side you can download the crate from crates.io if you would like just the Rust library. So you have the best combination of Python and Rust.
On the last point, when you have other dependencies you can, for example, add pandas as a simple additional layer in the Python bindings, and you can plug it into different other data processing tools. So it was simple to integrate Delta Lake and to use it in other data processing solutions. Let's now focus on how to create the Python bindings. It's really simple. You just have to install Rust on your system, and a virtual environment to isolate all the Python dependencies. The first step is just to install Maturin in your environment. The next step is to scaffold the project: you run `maturin new my_project`, where my_project is the name of the project, and you will have to choose between the different bindings you would like to use for your Python-Rust application. Our focus is PyO3, because it's really helpful when you need to design an API with a lot of constraints and a lot of complexity, and it really helps you define the glue between the languages. But you can also think about cffi or rust-cpython. rust-cpython is not really active at the moment, and cffi is more complicated if you need to do more of the memory management between the two yourself, but it can be used easily from Python if your tool is really simple, like a simple command-line tool. UniFFI is the library of the future, I would say, because it will provide a way to integrate more languages with a common definition, not only for Python but for other languages as well. In the next step, upon generation, you will see your project, and the project structure will be separated into two different code bases. The first one is Rust, with the Cargo.toml file for Cargo, the package manager, defining the project. On the other part, we have Python. So you have a clear distinction between your Python application and your Rust application. It's really a best practice to separate the files like this, to make sure you have two different layers.
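The generated layout looks roughly like this (a sketch of the typical Maturin scaffold; exact files vary with the Maturin version and the bindings you choose):

```
my_project/
├── Cargo.toml        # Rust package definition (crate type, PyO3 dependency)
├── pyproject.toml    # Python packaging, with maturin as the build backend
└── src/
    └── lib.rs        # the Rust code exposed to Python via PyO3
```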
And you will be able to plug them together in Python, or simply use the Rust application on one side and the Python application on the other. So let's take a look at the Cargo.toml file. You have the project defined, and you see two kinds of instructions. The first one is the crate type, which is `cdylib`, a C dynamic library. It follows the C convention, so you get a shared library on macOS or Linux; you can think about a DLL on Windows. It creates the glue for you, so you can make calls from Python to Rust. It's really behind the scenes, and it's generated automatically by Maturin. In the features section, you can activate different PyO3 features. You can think about activating abi3: if you would like full compatibility across Python versions, for example full compatibility down to Python 3.7, you can just add that feature and your Rust-built software will be compatible. In lib.rs, I created a few examples of how you can create your bindings and define your Python class. This one is really simple; you have three different instructions. The first is a macro for the class: you define a class in Rust, really simple, with two attributes, a and b, for this example. Along with the Rust type, you can also define the methods: that's just `#[pymethods]` on top, and you will be able to define the constructor and the different definitions for your object. And at the end you have the `#[pymodule]` instruction, which will be your module: you will be able to create a Python module that you can import in your Python environment. This kind of instruction is really simple, and you can see it's a macro-definition style in Rust; it generates the glue for you behind the scenes, which we will see later. This one is really focused on creating a Python structure you can use, but you can also call Python code.
That one covers calls from the Python code base into Rust, but you can also call different Python structures or methods from Rust if you want. On the Python side, we see a pyproject.toml. You have the structure defining the build backend as Maturin for you. It's really simple: Maturin deals with the cdylib for you and makes the glue behind the scenes. In the tool section, there is just an instruction to link to the Python package defined in the project; it just says, I would like to have my library in this folder. Really simple. The example was just for testing purposes, but you can see that you create a new Python file and you just have to import my_project, which is the name defined by your module, and you can use your Python-Rust class. So it's really simple to import everything, thanks to the library created by Maturin behind the scenes. As an example, I just create one instance of the structure and sum the different attributes, and the result is three. If you would like to improve this API, you can think about defining your own Python class on top, to provide more usability: you can wrap the Rust structure inside it and provide documentation to make it simple for everyone to use. Just to sum up and unfold the different phases: the first one is to generate the project; you have everything you saw before, with the structure and everything defined, and you can work on the Rust application or on the Python application and make the glue. The next step is to launch `maturin develop`: it builds the crate for development purposes and generates a shared library. In my case it was macOS, so it was not a Linux-style shared object.
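The pyproject.toml that Maturin generates looks roughly like this (a sketch; exact fields and version bounds depend on your Maturin version, and `python-source` is only needed if you keep a pure-Python layer alongside the Rust crate):

```toml
[build-system]
requires = ["maturin>=1.0,<2.0"]
build-backend = "maturin"

[project]
name = "my_project"
requires-python = ">=3.7"

[tool.maturin]
# Where the pure-Python side of the package lives, next to the Rust crate.
python-source = "python"
```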
And this one can be used to test, to launch, and to have the import working well, so you can define your tests, improve the development process in CI, or test your Python project. In the next step, after generating this library and configuring everything in the virtual environment, you can also launch the Python example. At the end we see three, which was one plus two, and the main definition is in example.py. If you would like to go further, you may want to build the crate, meaning you want to build the wheel with everything embedded in it. That's why you can focus on running `maturin build` and `maturin publish`: it will build the wheel for you, and you can publish it to PyPI if you want. In this example, again, we see that it creates a wheel dedicated to my environment, so a specific version of Python. So in your CI process, keep in mind to have multiple builds on different operating systems, to be sure to create wheels for a wide variety of machines and operating systems. If you want to take a look at what is generated behind the scenes, you can install cargo-expand, an additional cargo subcommand, and you will see the code generated behind the scenes. I really like to develop, but this generated code is not really beautiful to write by hand. That's why it's really efficient to have this automatic way of generating the code with the macros: everything is defined for you, and we don't want to write that kind of structure ourselves. So, we've talked about Python and Rust. Don't get caught up in a Python-versus-Rust showdown, but basically, if you think about one use case, reading a one-billion-row data set: you can use pandas with a simple CSV format, and it takes three minutes; if you use delta-rs with a Delta table and all its features, you can reach two seconds.
So it's not only about Python versus Rust; it's also the Delta table format and the Z-order optimization. But you can think about a lot of use cases where you need quick access to a small portion of the data and you don't want to wait for a cluster to spin up. Basically, my learnings through this journey: when you think about creating a Python-Rust library, think about whether it is CPU- or memory-intensive, and you can implement everything in Rust. But you have to be aware of the Global Interpreter Lock, the GIL, and how it affects the process. You can also improve things by making it fully asynchronous: you can use PyO3 with asyncio and be able to call the Rust code base without blocking. So that can be one area of improvement. Also, you don't have to think too much about the GIL when you share objects between Python and Rust, but if you'd like to keep memory alive or to call the Python code base from Rust, you need to think about acquiring the GIL. Otherwise you can have memory leaks or other issues you need to look at closely; for simple use cases it works, but if you have use cases with more intensive memory management, you need to use smart pointers, which is another topic. For concurrent access to data sets, Arrow and PyArrow really have the libraries you want when you deal with a lot of data processing solutions: it's in-memory, it's efficient, you don't have to serialize and deserialize, and you can stay in the same process. It was really a plus for us, for example, because we have Arrow on the Rust side; we use it, we get the schema, and with PyArrow on the Python side we are able to read it and use it in Python libraries. So it was really convenient for us to benefit from the two libraries. For the Rust part, ownership is key.
It's more about the language, but when you are designing something, ownership, and taking care of where the variables are, where the memory is, and who owns it, is really nice and really convenient when you deal with a lot of processing on big data sets. On the data ecosystem and the friendly API: I think the best combination is to have a very well-documented and well-structured Python class, with Sphinx documentation for example, which is really helpful, and it will provide a gateway to the Rust code base. But think also about the wheel and its usage: provide builds for multiple operating systems, good documentation, and you can also offer optional dependencies. An optional dependency means: I need only this part of the application on the Python side, but I leverage the corresponding Rust feature behind the scenes, so you make sure to ship only the binaries you want for your use case. If I take one example, it would be AWS: I have the optional dependency in Python, I can activate an optional feature in Rust, and when I'm downloading the application I get only what I need for AWS and not GCP, for instance. So you can think about both sides when you are developing. And the last point: the Rust community is really great. My first contribution was to the Python bindings, I started to implement quite a lot in this area, and I ended up working mostly in Rust because I fell in love with the language. The community is really welcoming and benevolent with you, because the learning curve is pretty hard and it can take time to understand and manage the concept of ownership or the different libraries, but it was really nice for me as a first experience with Rust. And that's it. We welcome any contributions, so don't hesitate to reach out to us if you want to improve the Python bindings, the Rust, or any documentation. You have the Delta Lake Slack channel, and you can also follow the news on the LinkedIn page or directly contribute to the project. Thank you very much.
Do you have any questions?

Question: Thank you very much for this interesting presentation. One question: you said that Delta was sponsored or developed by Databricks at first, and you're not associated with Databricks in any way. How easy was it for you to join the effort and become a partner of the library?

Answer: Yeah, good question, thanks. My first contribution was on delta-rs, which was created by Scribd, so my first entry point was Scribd and their engineering team. They were really nice and had created some libraries. We joined the effort, and right after, we were joined by Databricks teams, especially one developer advocate, Denny Lee. They gave me everything in terms of ownership and flexibility by providing all the keys to Delta Lake. I was really surprised, because you get the full power of it, and I started to contribute more and more in this area. And yeah, the combination of the two is really great, because you feel like you are part of the open source community and you feel really welcomed. So it was really nice as an experience. And they continue to do the same with Delta Kernel: they create meetings with us and they onboard more and more companies to help define the future, defining the boundary between Delta Kernel and delta-rs.