So it's my great pleasure to introduce Paul, who's going to talk to us a little bit about building Python extensions with Rust. I first met Paul when I was working at Bloomberg, but he's now a software engineer over at Google. He's also a maintainer of dateutil and setuptools, and a core developer, so he's ticking all the boxes today. So it's a great pleasure. Thank you very much, Paul. Is this on? Now it's better? Okay, great. All right, so I almost forgot to start my timer. All right, so today I'm going to be talking about building Python extensions in Rust. This is the result of some work I did because I was interested in maybe writing some Rust backends for dateutil or other projects I had, and I went down a rabbit hole where I ended up contributing a huge number of bindings to PyO3, which is kind of how it goes in open source. But I'm very excited about the possibility of getting a lot of Rust extensions out there in the community, though you'll see it's not going to be all rosy; I'm not in the Rust Evangelism Strike Force. So I think we've probably all heard that Python is a great glue language, because you can write high-level APIs that are very easy for people to use, but Python itself does have a lot of overhead, right? It takes a lot of the worrying out of the hands of the programmer, but at the cost of doing that worrying itself at runtime, which is why Python is great for tasks like system orchestration, or for calling into compiled languages like C or C++. And in fact, a lot of the great libraries you know about are really just glue around other fast, compiled languages: NumPy, SciPy, Matplotlib, these are all calling down into C or C++ or Fortran, but they're giving you that great, powerful API.
So just to dive into what it means to write an extension, here's an implementation of the sieve of Eratosthenes, which is an algorithm for finding all the prime numbers less than or equal to a given number. The way it works is you make a list of all the numbers up to that number, then you start with two and cross off all of its multiples, and you keep doing that for each prime you find until you've gotten to the end of the list, and what's left is the list of all the prime numbers. And you can see this is a pretty short implementation, it's pretty easy, and it works for at least 5 and 20. And this is what it looks like if you write it against the C API. It's obviously much longer, but that's not really the important part. What's more important is that it involves a lot of details that Python doesn't make you think about. So for example, I'm manually managing some exception handling, here and here. I'm also manually managing the memory, and I have to manage the reference counts of all the Python objects I create. So this is a lot of complexity, and it didn't really all fit on one slide. You can see this is actually quite a terse way of programming; I have this for loop on one line, because on a smaller screen than this, that's all that fits on one slide if you want readable text. So why would you do anything like this? Why would you subject yourself to it? Because it comes with a lot of performance benefits. If we compare the Python version I wrote to the C version, calculating all the primes up to 100,000, the C version takes 526 microseconds and the Python version takes 18.6 milliseconds. So that's about a 35 times speedup, and once you get past the overhead of just calling the function itself, you're looking at something like a 40 or 50 times speedup in the hot loop.
But it has the downsides I mentioned, right? With C, you have to manually manage all your memory. You have to allocate it, you have to free it, you have to do all the reference counting, otherwise you get memory leaks. So you have to keep track of which calls return a new reference and decrement them correctly: you get segfaults if you decrement too much, and memory leaks if you increment too much. And it's not a memory-safe language. An array in C is really just a pointer to the beginning of the array, and you index off of it, so you have to keep track of how big the array is yourself. In fact, when I was first implementing this, I made exactly that little mistake. To save a little space and conceptual time, I had the sieve start from 2 and go to the end, so its length is n minus 1. But you're so used to iterating from 0 to n that I accidentally wrote i equals 0, go while i is less than n, which runs past the end of the array. And when it reached the end, it wasn't an error, it wasn't a segfault; it just read some random piece of memory, turned it into a Python object, and returned it to me. And I was like, well, that's weird. 87 is not a prime number. Oh, that's a really bad example. But in any case, it's not the next prime number after 5. So this is something I wish I were protected from, and it is something I'm protected from in Python. So where can we look next? Well, this is kind of what Rust is all about. It's a systems programming language that is memory safe. This doesn't come up so much in this talk, but it also pays a lot of attention to concurrency: its model is designed to prevent data races and to make you think about these things. But it's also high performance, because all these safety features are enforced at compile time, not at runtime like they are in Python.
And also, I think it's an appealing community for Pythonistas, because we're used to this world, which is a great and inclusive community with a huge package ecosystem where you can just pip install whatever you want. Rust has a similar thing: a package manager called Cargo, where you can just declare dependencies, run cargo build, and so on. So I think that would be very appealing to people in the Python community. I don't have time to go into exactly why Rust is a great language, but I want to highlight one concept just to give you a sense of how differently you approach programming in Rust versus Python. Rust cares a lot about ownership, about what owns resources, and this determines when resources are allowed to be cleaned up. It lets you avoid having a garbage collector or reference counting. In general, exactly one variable can be bound to a resource and own it, and then many things can have read-only access to it, or one other thing can have mutable access to it, but you can't have multiple writers at the same time. So here, in this function, I assign this vector to v: it allocates some resources, binds them to v, and they're owned by v. When v goes out of scope, those resources can be cleaned up. You don't have to pay attention to when it gets freed; the compiler is doing that for you. It's also going to enforce these ownership rules. So here, I assign the vector to v, and then I pass it to another function, which wants to take ownership of it. And that's fine: the vector moves into the parameter, which I've also called v, in the take_ownership function.
And then, because it's owned by the v in that take_ownership function, when that goes out of scope at the end of the function, the resources are cleaned up. If you tried to use them after that, it would be a use-after-free. And in fact, that's what I'm doing here: I try to access it, and it's an error. But the compiler is going to catch this. It's going to use its borrow checker, and it's going to say, hey, you moved this here, but then you tried to use it. And it's actually a very informative message. I just explained in words what's happening, and I had comments on the slide, but I didn't really need the comments, because the compiler generated all the explanations I manually annotated it with. It doesn't always work as cleanly and nicely as this, but generally speaking, the compiler will tell you what it's trying to do. It's quite good about that. Okay, so that's essentially as much as I'm going to tell you about Rust. For the rest of this, let's just assume you're all perfect, advanced Rust programmers, so I'm not going to explain anything else. This is the "draw a circle, then draw the rest of the owl" approach, right? So here's my implementation of the sieve in Rust. This part is just pure Rust; it has nothing to do with Python, and in fact we'll reuse it later. And then this part takes that and translates it into Python. You can imagine that if this were a function in a library, it would be quite easy to write Python wrappers for it. This is using PyO3. PyO3 is a library that provides bindings to the C API in Rust. It uses something called a procedural macro, which you can think of as being like a decorator: it does some transformations on this function to make sure all the reference counts happen and all the types are set correctly. And what I've done here is, first, I just create this vector in Rust and return it, and then this part says PyList::new.
And that's going to construct a Python list. It's the same as creating and allocating a list in the C API and doing all that work, but the lifetime of this list is managed by the lifetime of this Py object. And you can see this has comparable speed to C. So I've written this function, I import it, and it does what we expect it to do: there are no 87s. And it's also pretty much just as fast. If we run this in pure Python, we get 23 milliseconds; in C, we get about 700 microseconds; and in Rust, we get about 700 microseconds. So it's much tighter. And it's also memory safe, right? But how does that work? Because it's binding to C, and C itself is not memory safe. The way it works is that PyO3 operates in two layers: an FFI layer, which is a whole bunch of unsafe code, and a safe Rust layer on top of it. The FFI layer looks like this; this is from the datetime bindings that I wrote. In the C API, there's this struct called PyDateTime_CAPI, and the right side is its definition in C. I basically have to copy this and say: give me the exact same memory layout, but in Rust. I declare that these are C functions, with C integers and things like that. So I have all the functions, and then I declare all the data structures; this is the same data structure as a datetime in Python. And then I have to re-implement all the little macros, because you can't just use C preprocessor macros from Rust. All of that gets wrapped in the safe Rust layer, which is the layer you'd actually use when writing your Rust extensions. We have this little marker that says: everything in here is unsafe. That sounds like kind of a cheat, where you can just say, oh, do all this unsafe stuff. But it's still memory safe.
But actually, this is a very useful feature, in the sense that if you're trying to audit this code for memory safety bugs, the places you're going to look are the blocks surrounded by unsafe. As long as those are localized, you know you can ignore 80% of the code when you're hunting for memory safety bugs. So we have this unsafe code, which calls the FFI layer, and it returns something whose lifetime is managed by Rust and which has all the right memory safety properties, and that's what you use. So then you have this PyDate type, and a trait called PyDateAccess, which gives you safe Rust functions for accessing all those fields. And if you want to make a module, it's quite easy. You have a function, and you decorate it with this pyfunction attribute, which will wrap it up as a Python function. Then, when you want to make an importable module, you call this PyModule thing, which basically just initializes the module; I add all the functions, and those are what gets exposed to Python. So here I have this seconds_before function. It tells you what the date was some number of seconds before, and I can pass it an integer and it gives me the right answer. But you'll notice that I've declared the argument as i64, which is a Rust type, not some Python 64-bit integer, right? Yet here I passed it a Python object, which is an integer, because PyO3 is doing all the right conversions behind the scenes in these procedural macros. So it's quite convenient: the end user doesn't have to think about this. And you can even manage exceptions. You'll notice I don't have anything checking for NULLs or anything like that; I just return this PyResult.
And that's what's returned from the constructor, because the constructor can raise errors. If something would raise an error here, instead of panicking or segfaulting or anything like that, it just returns the error result, and that percolates out to Python and raises the correct exception. You can also make classes. Rust has a different model of inheritance; well, it doesn't really have inheritance. It separates the data in your classes from the implementation. So we do a similar thing: we have the data portion, which we declare as a Python class, and then in the implementation block we declare the Python methods. So this is the constructor for this Point, and this is some function I'm going to call norm. A Point is just x and y, and we want to know the norm of that vector. We add it to our module, and then I can import it just fine, and it works: I can create one, and I can calculate that the norm of (3, 4) is 5. So perfect: you can create classes, you can create modules. That's the C API approach. There's another approach, which is to create FFI bindings. So here's another way you can do something similar: I use the exact same Rust implementation, but now, instead of wrapping it in something that creates Python objects, I create a C FFI. The idea is that the output of C compilers is the lingua franca that all languages speak, so we'll just generate something compatible with that. And this is some super unsafe thing. It's an extern "C" unsafe function, and it's public; it's like, oh no, don't ever write this. And then it's like, oh, check out this mutable pointer. That's very scary. And then this is even worse, right? I have this vector, and I convert it into a mutable pointer, so I just have the data it represents. And then I call mem::forget, which means: Rust, please stop managing this memory.
You've allocated it and everything, but now I'm going to hand it to someone else to manage. So it's fairly scary. But the big advantage is that when I hand this off, I can hand it to Python, I can hand it to Ruby, I can hand it to JavaScript, I can hand it to whatever, because all those languages also speak C. But I also have to hand over this deallocate_vec function, right? Because I, Rust, am the only one who knows how to allocate and deallocate these vectors, and I'm handing out memory that eventually needs to be deallocated. So I end up passing this function, plus another function, to let the other side handle the memory. Then we pass it on to what's called Milksnake. Milksnake is a library that came out of Sentry, and it's for writing Python bindings to C FFI libraries. And this goes the other way: anything that has a C FFI output, Milksnake can bind to. Under the hood, it uses the library CFFI, which is the more bare-bones version of this. The way it works is that now I allocate some of these C types in Python, I pull out this C array of C types, and I get some contiguous chunk of memory. Then I have to convert all that into Python objects, which I do with this list comprehension. When that's done, I've copied all the memory, so I call that deallocate_vec, and I do it in a try/finally so that even if something fails, I don't leak the memory. And then I return the list, and we're done interacting with Rust. So this is fairly compact; it's not so bad. And it comes out at about the same speed: if I run this in C, I get about 800 microseconds; in Rust, it's about 700 microseconds. This is very variable, so I wouldn't say Rust is faster than C because of this. And with Milksnake, I'm getting something like 1.2 milliseconds.
So there's a little more overhead in this particular example. That's not always going to be the case, but essentially these are all comparable in speed, and it's really about convenience. So far I've just been saying, here are all the different ways we can do this. But one question is: should we even do this? Is there maybe a better way? Probably the best contender is Cython. Cython is super convenient when you're writing wrappers for C or C++, or when you're just trying to get a bit of a speed boost. You write it essentially like Python, and behind the scenes it all gets compiled down to C. Even if you're not doing anything fancy, like using C++ vector types and so on, you still get a decent speedup. And here, with this simple implementation that doesn't bring in any Rust or any instability or different package managers, I still go from 23 milliseconds to 3 milliseconds. So that's almost a 10 times speedup from a fairly simple implementation, compared to C, which is maybe about three times faster still. But maybe you need that extra factor of three, maybe you don't. So what should you choose out of the things we've talked about? Unfortunately, I don't think there is one answer. I can't just say "yes, PyO3" or "never use any of these, they're all terrible." What I can do is give you some trade-offs. In terms of speed, looking at this chart, I would call all of these roughly comparable. If you're looking for really good speed, Milksnake, PyO3, and C are probably going to be faster than the Cython that I write. But I have some caveats. One is that I didn't pick a function that's a representative task for everything you'd generally do as a programmer; I picked something that demonstrates how you have to deallocate vectors when you use the FFI approach.
So that's not exactly a representative task; if you're doing something that doesn't involve that, these numbers could be very different. The other thing is that I spent some time optimizing these, but I'm not a super-expert Cython optimizer, because most of the time, if I need a little bit of speed, just using Cython naively is enough. So take that with a grain of salt. I would definitely say all of these will give you some speed boost, and I have found that the Rust code I naively churn out turns out a little faster than the Cython I've written. But if you do decide to use Rust, should you use the FFI or the C API? Well, the pros and cons are these. If you're using the FFI, you get that portable interface we talked about. It also has a smaller Rust dependency: PyO3 brings in all these bindings to the C API, while with the FFI you have almost no dependency; it's basically just whatever Rust you write. It can also be faster with PyPy for certain kinds of interfaces, because, and I don't know the details of this, PyPy's just-in-time compiler can make better optimizations when it's going through CFFI than when it's binding to the C API, because of the interaction layer in between. The downsides are that it brings in a runtime dependency on Milksnake and CFFI, which you may not like. There's no support for Python-specific types, so for me, if I wanted to use datetimes, I'd have to write all the code that creates a Python datetime out of some C memory; if you're just returning lists of integers and such, it's not a big deal. And as you saw, I had to do the memory management in Python, and I also had to write a lot of unsafe Rust directly in my functions, which I didn't really have to do with PyO3. With the C API approach, it's basically all safe Rust, there are no runtime dependencies, and you're working directly with Python objects. And it has some other nice things, because it knows it's doing Python for you.
But the downside is that I've had to change these slides several times, because the API is fairly unstable. I think it still requires nightly Rust, which I haven't found to be a big burden, but some people can't use nightly Rust or don't want to. And there are certain places where they just haven't thought much about optimizing for speed yet. So I think a lot of these cons will hopefully go away in the future, but what I would say now is: if you want something that's going to be in production, like, tomorrow, I'm not sure I would choose either of these Rust options. Maybe I would choose the FFI, especially if you're using PyPy. But if you're willing to deal with these kinds of issues and try to help the ecosystem, it may not be a bad idea to invest in the C API. Another thing you have to think about is the task you're trying to do. If you're just trying to write a wrapper for some library that's already written in Rust, the FFI approach is probably going to be one of the faster ways to do it, and it lets you write one wrapper that can hit a whole bunch of different languages, whereas with the C API approach, the main advantage you get is that you don't have the runtime dependency. Okay, so I've said that there are drawbacks to all of these, and I think there are some pretty significant opportunities for improvement here. So this is my call to action. For the FFI approach, one of the big problems is that I had to write all this crazy stuff by hand. But you could write a procedural macro that generates it from the inputs, just like in the PyO3 case, except not Python-specific. I think that could work, if you're ambitious and want to try it. And if you're not really big into the Rust side of it, I think there are also a lot of improvements that could be made on the Python side.
Since this is probably a fairly common pattern, where you have to allocate and deallocate things, why not have a library with wrappers, so you can just decorate your Python functions with something that says: I understand this is a Rust vector used in this specific way, and here's how I turn it into a list. And then there's PyO3. This is a screenshot I took yesterday, and you can see the latest commit was 23 minutes ago, so this is actually a very actively developed project. They're very welcoming of contributions, and if you ping me (my GitHub handle is there), I will review your pull request for sure. So this is something that can use a lot of improvement, and it's rapidly improving. For the first two ideas, you'd have to take ownership and create your own library, and if that appeals to you, there are definitely opportunities there. And if you're really just trying to get your feet wet, contributing to PyO3 would be a very valuable thing to do, and it's a much lower burden on you, I guess. All right, well, that's my time. So thank you for coming to my talk. Thank you very much, Paul. We don't have time for questions in this talk, but I'm sure Paul will be available to chat after the talk.