Welcome to another talk. Sophia Davis is going to be talking about peeking into Python's C API. So, we're all here because we like Python the programming language. Today I'm going to talk a little about Python the C program underlying that programming language, by walking through how I learned the basics of making a C library callable from Python code and vice versa. Here's a screenshot of the first time I segfaulted the Python REPL. I'm Sophia. I'm an American software developer, currently in Amsterdam. In the summer of 2014, I attended the Recurse Center, which is kind of like a writers' workshop for programmers in New York City, and this talk comes out of a very down-the-rabbit-hole kind of project that I worked on while I was there. The code, and soon a link to the slides, will be on my GitHub. So, let's get started. This is the story of how I shaved a yak, because if you find yourself breaking out the Python C API docs, you probably started out with a different problem: one that you thought you could solve using tools provided by an existing C code base. For me, that code base was a hash table implementation. This is probably review for most of you, but in brief, a hash table is a very powerful data structure for storing key-value pairs, that is, for associating keys with values such that every key maps to one value. We Python people tend to call them dictionaries. They're so powerful because they're very efficient: no matter how many key-value pairs you have in a hash table, the average time complexity of adding a key-value pair, looking up the value associated with a key, or removing a key-value pair is constant, so O(1). How does it achieve this amazing performance? Well, under the hood, a hash table is just an array, and we're going to call each index in this array a bucket. Every key-value pair gets put into one of these buckets. And how do we know which pair goes in which bucket?
Well, that's where the "hash" of "hash table" comes in. A hash function is just a mapping from any arbitrary input to a fixed set of values, like the set of all positive integers. When we want to put a key and a value in our hash table, we pass the key through a hash function to convert it to an integer, and then we use that number modulo the size of the array to determine which bucket the key-value pair should go in. Lookup and removal work similarly: we calculate the hash of the key, go to the bucket associated with that hash, and either look up the value or remove the key-value pair. Here's a picture, thanks to Wikipedia, of a phone book stored as a hash table. We calculate the hash value of each person's name and use that number to determine the bucket in the array to put the phone number entry. But what happens if the hash values of two keys result in them being put in the same bucket? That's called a collision, and there are a couple of ways of dealing with it, but one way is just to store a linked list at every bucket in the array, so every item that gets assigned to that bucket just gets tacked onto the linked list. Again, in the Wikipedia example, we're using a hash function that results in John Smith and Sandra Dee both being assigned to the same index, 152, so we've just started a list containing both entries. But if lots of items end up in the same buckets, then our hash table starts to look like just a bunch of linked lists, and the performance of a linked list is not nearly as good as that of a hash table, especially when looking up or removing items. A lookup or removal on a linked list in the average case involves traversing the list, which is an O(n) operation. And as we add more and more items to our hash table, it's inevitable that more and more entries will end up in the same buckets. Consider a hash table with an underlying array of length one.
No matter what hash function you use, all items are going to be stored in that one and only bucket, and that's rapidly going to become a long linked list. So in order to keep our average performance constant, we'll occasionally increase the size of the underlying array and redistribute our keys. Then, provided that we're using a decent hash function, the number of collisions should decrease, because we're spreading out the same number of keys as before among more buckets. And how do we know when to resize? Well, if we keep track of the number of items currently in the hash table compared to the length of the underlying array, then we should resize when the proportion of items to length reaches a certain threshold. We'll call this the maximum load proportion. So we've talked about three variable properties of hash tables: the size of the underlying array, the hash function, and the maximum load proportion. All three can affect the performance of your hash table. For example, the initial size helps determine how often you might need to resize, and resizing is a costly operation. The hash function impacts how many collisions you might have, and more complicated hash functions take longer to evaluate. The maximum load proportion plays a role in how long those linked lists might get before you resize, et cetera. So in order to explore how these affect performance, I wrote my own hash table implementation, and it enabled the user to choose the maximum load proportion and the initial size of the underlying array. My library provided functions to initialize a table with the given properties; to add, look up, and remove key-value pairs of integer, float, and string type; and finally to free the memory malloc'd to store the data structure: the array, the linked lists, all your string data, whatever. I also wanted to explore how different hash functions would affect performance.
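Everything up to this point, the bucket array, collision chaining, and load-factor resizing, can be condensed into a short sketch. This is my minimal illustration in Python, not the talk's C library; the names (`ChainedHashTable`, `max_load`, the growth factor of 2) are my own choices.

```python
class ChainedHashTable(object):
    """Toy chained hash table: an array of buckets, each a list of pairs."""

    def __init__(self, size=4, max_load=0.75):
        self.buckets = [[] for _ in range(size)]
        self.count = 0
        self.max_load = max_load       # the "maximum load proportion"

    def _index(self, key):
        # hash the key, then take it modulo the array length to pick a bucket
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:               # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))    # empty bucket or collision: chain it on
        self.count += 1
        if float(self.count) / len(self.buckets) > self.max_load:
            self._resize()

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def _resize(self):
        # double the array and redistribute: each key's bucket can change,
        # because its hash is now taken modulo the new, larger length
        old, self.buckets = self.buckets, [[] for _ in range(2 * len(self.buckets))]
        for chain in old:
            for key, value in chain:
                self.buckets[self._index(key)].append((key, value))

t = ChainedHashTable()
for i in range(8):
    t.put(i, i * i)
print(len(t.buckets), t.get(5))   # -> 16 25 (resized twice from 4 buckets)
```

Integer keys hash to themselves here, which makes the resize behavior deterministic: the load crosses 0.75 after the fourth and seventh inserts, so the array grows from 4 to 8 to 16 buckets.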
So this is the signature of the add function in my C implementation. It accepts a hash argument, because my idea was that the user should do their own hashing of the keys, and then they'd just pass that hash value in when adding, looking up, or removing an entry, and my library would find the appropriate bucket for that key-value pair based on the passed-in hash. If the user chose not to pass in a hash with their key, then my library just used this hand-rolled hash function: if it's an integer, use the integer; if it's a float, round it down and use that integer; if it's a string, just use the length of the string. That was inspired by the hash function that, no joke, an early version of PHP used to store function names in the symbol table. Good times. So it's basically an awful hash function. Next, I set off to do some hardcore bit shifting and string manipulation in C to experiment with writing my own hash functions. Just joking. If I were going to experiment, I'd rather do it in Python. Wouldn't it be nice if I could write cool hash functions in Python and then just call them from my C hash table code? After all, Python is so nice and easy to write, and I'm a lot faster at writing Python than I am at writing C. But under the hood, Python is actually just a really big, complicated C program that processes those strings of whitespace-sensitive code that we write. And thankfully, there's a well-documented API for bridging the gap between Python the programming language and Python the C program. Using this API is as easy as including a single line (`#include <Python.h>`) in your C file, and then the magic begins. So my goal is to call a hash function that's written in pure Python from inside my C hash table library. A little disclaimer: the C API did change substantially between Python 2 and Python 3, and all the code in my talk is Python 2.
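For reference, the hand-rolled default hash described a moment ago can be sketched in a few lines. This is my Python rendering of the rules stated above (integer as-is, float rounded down, string length), not the talk's actual C code.

```python
import math

def naive_hash(key):
    """The deliberately awful default hash: PHP-symbol-table energy."""
    if isinstance(key, float):
        return int(math.floor(key))   # float: round it down, use that integer
    if isinstance(key, int):
        return key                    # integer: use the integer itself
    if isinstance(key, str):
        return len(key)               # string: just its length (!)
    raise TypeError("unsupported key type")

print(naive_hash(42), naive_hash(3.9), naive_hash("hello"))  # -> 42 3 5
```

The string case is why it's awful: every string of the same length collides, so `"hello"` and `"world"` land in the same bucket.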
So, I started by wrapping everything I needed to use a hash table, inside my hash table library, in a struct. I've got a pointer to the actual data structure, and some of those other properties associated with hash tables, like the current load and the initial size. I also have this PyObject pointer to hash_func. At this point, I had the data that I wanted in my Python type, but I needed to implement the API telling Python how to manage objects of this type. It starts with this PyObject_HEAD thing, which is a macro imported with the Python header. It expands to the bare minimum that you need to create a Python object, which is a reference count, which I chose to ignore at the time, and a pointer to this PyTypeObject, which is just a struct of function pointers defining how Python should manage objects of the hash table type. That's things like the class name; how to print and make a string representation of your object; how to initialize, delete, and free the memory allocated to hold objects. I'll come back to these a little later, and there are a lot more that I left out. So at this point I had my basic type defined, but I needed some way to use this type from within Python code, so I created a module to contain the hash table type. In order to initialize a module, you need to write a PyMODINIT_FUNC function whose name is "init" plus your module name, so mine's inithashtable. When a Python program imports a module for the first time, this is the function that's run. Again, I've left some stuff out, but of note are this line, which initializes the type and fills in more of that PyTypeObject; here, where we initialize the module; and this line, which adds my new type to the module dictionary so we can actually instantiate new objects via the class name.
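As a rough analogy for those PyTypeObject slots, here's how they line up with the dunder methods of a pure-Python class. This is an illustration of the correspondence, not the C API itself, and the class is a hypothetical stand-in for the talk's hash table type.

```python
class HashTable(object):                  # tp_name ~ the class name
    def __init__(self, initial_size=8):   # tp_init ~ __init__
        self.initial_size = initial_size

    def __repr__(self):                   # tp_repr ~ the string representation
        return "<HashTable size=%d>" % self.initial_size

    def __del__(self):                    # roughly tp_dealloc: runs when the
        pass                              # reference count drops to zero

print(repr(HashTable(16)))  # -> <HashTable size=16>
```

In a C extension you fill those slots in by hand with function pointers; in pure Python the interpreter wires the dunder methods into the type's slots for you.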
Packaging: there are a couple of different ways to package Python modules, but one simple way is just to write a setup.py file telling Python the name of your module and what C files your module needs. This is the entire contents of my setup.py file. When you run `python setup.py build`, it creates a build subdirectory and puts a compiled file containing your extension there, one that can be dynamically loaded into a Python program. On Unix, this is a shared object file; I work on a Mac, so my module was named hashtable.so. On Windows, this would be a DLL with a .pyd extension. So if you start up a Python interpreter or run a Python program in the same directory as that .so file, then you can type `import hashtable` and do hash table stuff from Python. So that was cool, except my program kept segfaulting, and I was forced to look at a section of the API docs that I had kind of been ignoring, which was the section on reference counting. One reason why Python is so nice is that it's a pretty high-level language, and it handles a lot of things for the programmer, for example, memory management. When you use data in a Python program, Python takes care of dealing with the OS to ensure that the data is stored in memory. However, if Python only ever added to your program's memory, eventually the program would run out of memory, so it needs to know when it can remove data once that data isn't being used anymore. Python the C program uses a method called reference counting to know when it can safely free objects. That means it keeps track of the number of other things referring to a given object, and when that reference count drops to zero, Python cleans up the unneeded object by calling the deallocation function that's defined for its type. So here are two tools that can help us understand reference counts a little bit.
From the sys module, we have the getrefcount function, and from the gc (garbage collection) module, there's getreferrers, which returns a list of all the things that currently own references to an object. Here I've written a function, show_ref_counts, and all it does is find the objects that own references to its argument. It prints out how many there are; optionally it can call itself a number of times, and it can print out extra details about exactly which objects own those references. So let's look at how this works. In a Python shell, we start by importing the tools that we need, which are the sys and gc modules, and also that function I just showed you, which I saved in a file called refcounts. We'll start by instantiating a new object and seeing what the reference count is on that object: two. All right, and if we assign another variable name to that object and look at the reference count again, it's three. Cool. So what exactly is referring to this object? We'll use that getreferrers function. We've got this dict here with the two variable names that we just made, and they're both referring to our object at some memory location. So what exactly is that dict? Well, it's the local namespace. So there it is again, cool. Now, what happens to the reference count on an object if we pass it as an argument to a function? We'll use that function that I just showed you. First we'll call it, passing in our object, and have it run just once, and we'll show the details about the referrers this time. We've still got that local namespace, but there's something new: a frame object that now also owns a reference to our object. And if we call it again, this time we'll have it call itself recursively a bunch of times, and we'll turn off the overwhelming debug output.
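The gist of that REPL session, condensed into a runnable script. Exact counts are CPython implementation details (getrefcount reports one extra for its own temporary argument reference), so this sketch only asserts relative changes, and it adds a weakref at the end to show deallocation happening the instant the count hits zero.

```python
import sys
import weakref

obj = object()
base = sys.getrefcount(obj)          # includes getrefcount's own temporary ref

alias = obj                          # a second name bound in the namespace
assert sys.getrefcount(obj) == base + 1

def show_ref_counts(an_object):
    # the call frame owns a reference to its argument while it runs,
    # so the count observed here is higher than outside the call
    return sys.getrefcount(an_object)

assert show_ref_counts(obj) > base   # the frame added at least one reference

# The flip side: when the last reference goes away, CPython deallocates
# immediately. A weak reference doesn't keep its target alive, so it goes
# dead the moment the refcount drops to zero.
class Thing(object):
    pass

t = Thing()
gone = weakref.ref(t)
del t                                # drop the last strong reference
assert gone() is None                # the object has been deallocated
print("refcount demo passed")
```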
So we see that each time we call the function, the reference count increases by one, and if we were to look at the details of the referrers, we'd see another frame object being added to the referrers with each call. Cool. So if you're going to write a C extension and work with Python objects, then first you need to signal to Python when your program starts working with a certain object, by triggering Python to increase the reference count on that object by one, thereby keeping it in memory while you need it. Otherwise, if the reference count drops to zero, then Python will free the object, and when your program tries to access that object, which now lives in a piece of memory that has been released back to the OS, your program will crash, hopefully, or just do weird things. You also need to take care of telling Python when you're done working with a certain object, by decrementing its reference count by one. If you don't do your part in decrementing that reference count, it can never decrease to zero, and the object will never be cleaned from memory, so that's a memory leak. The Python API provides two macros to communicate when you're starting to work with an object and when you're finished with it. Calling Py_INCREF on an object increases its reference count by one, and calling Py_DECREF decreases its reference count by one and also, if the reference count has reached zero, triggers a call to the deallocation function for the type. So what happens when you forget to Py_INCREF an object that you need to work on? Remember that PyTypeObject struct of function pointers that defined the Python API for my type? This is what I had defined as the deallocation function for Python to call when the reference count of a hash table object reaches zero. It does two important things.
The first is some printf debugging, so we can see when it's being called, and it also calls the free function that I defined for my object via the PyTypeObject struct. That free function also has some nice printfs, and then calls the free_table function provided by my initial C program to free the memory malloc'd for that hash table. Initially, the set method on hash table objects returned the hash table object itself; see here, it returns self. Now, there are a lot of rules and exceptions about which situations make the caller versus the callee responsible for Py_INCREFing arguments, and I barely scratched the surface. However, I think I caused a problem here, because if a C function returns a reference to an object, like self, then that reference must be owned by the function, i.e. the object must have been Py_INCREFed inside the function. But I had left that out. So let's see how this affects my program. First, we'll use that setup.py script to build our module. Okay, compiler output, great. I've got another window open here to the build subdirectory, so there's our .so file. And if we start up the Python REPL, then we can import hashtable, my module, instantiate a new hash table object, and start setting some values. Not very creative, but: pi, rounded down. So remember that the set method right now returns the hash table object itself, so this is the string representation of the wrapper being returned to the Python REPL. Each square bracket represents one of our buckets, and each star represents an item in the linked list at that bucket. So let's set some more values: two maps to four. All right, so, uh-oh, those are the printfs that I put inside my cleanup functions.
We didn't tell Python to increase the reference count on that hash table object, and all the other referrers must have released their references to it, so its reference count dropped to zero and the cleanup functions were called. So if we try to do anything else with our hash table object, like set another key-value pair, then Python just blows up. So, let's add that Py_INCREF back in there and look at the demo again. I just ran the build step, so we'll import the new version of my module, instantiate a new object, and start setting some values. We'll start with pi again, and there's a star representing pi, great. Two stars, looking good. No segfaults so far. We'll keep going with this squares thing. It resized, that's great. We've still got three items, four items. We'll try another one, a string, because strings are evil, and it resized again, so looking good, no segfaults. I think that solved that problem, at least. But the other type of mistake you can make is forgetting to call Py_DECREF when you're done with an object, and I also ran into that issue. So, remember that my Python hash table type struct here contains a pointer to another Python object, namely the hash function used to hash keys. Here's a snippet from the initialization function for hash tables. We do some other stuff, but I set the object's hash_func attribute either to the hash function passed in by the user when the object was initialized, or to Python's built-in hash function as a default. I also increased the reference count on this hash function object, because I need to tell Python: hey, I'm going to be working with this function object for a while, please don't clean it up. So, we say that each hash table object owns a reference to the hash function object.
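That "owning a reference" idea can be demonstrated from plain Python, too. Here's a sketch (the names `Table` and `do_hash_table_stuff` are mine, mirroring the demo) that watches the reference count on the built-in hash function while an object holds it in an attribute:

```python
import sys

class Table(object):
    def __init__(self, hash_func=hash):
        self.hash_func = hash_func   # this object now owns a reference

def do_hash_table_stuff():
    t = Table(hash)                  # t holds an extra reference to hash...
    return sys.getrefcount(hash)     # ...observable while t is alive

base = sys.getrefcount(hash)
during = do_hash_table_stuff()
after = sys.getrefcount(hash)        # t went out of scope, reference released

print(base, during, after)           # "during" is one higher than the others
```

In a C extension you have to do the increment and decrement yourself with Py_INCREF and Py_DECREF; in pure Python, the attribute assignment and the object's deallocation handle both ends automatically.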
Conversely, in the deallocation function for my hash table type, I tell Python to decrement the reference count on that hash function object. So here's a simple demo. We're going to look at the reference count on Python's built-in hash function. I have this function here, do_hash_table_stuff, and all it does is initialize a hash table using the built-in hash function as the hash function, so our hash table object will own a reference to the built-in function object, hash. It prints out the reference count on that function object, and then this program just calls do_hash_table_stuff a couple of times. So let's look at how the reference count on the built-in hash function changes. I've run the build step, and we're going to run the program. Initially, we start out with just three references to the built-in hash function object. Each time we enter do_hash_table_stuff and instantiate a new hash table that owns a reference to the built-in hash function, the reference count on that function object increases by one, to four. Each time do_hash_table_stuff completes, the hash table that was initialized inside of it goes out of scope, so the reference count on the hash table drops to zero, which triggers the deallocation function for hash tables, this function, which triggers a Py_DECREF on the built-in hash function object. So, after calling do_hash_table_stuff a couple of times, we still just have a reference count of three on that built-in hash function object. But let's say that we had forgotten to decrease that reference count, and run the demo again. I just ran the build step. So this is without that Py_DECREF. Initially the reference count is still three up there, and each time we enter do_hash_table_stuff, we instantiate a new hash table, which owns a reference to the built-in hash function object.
The reference count on that function object increases by one each time, to four, to five, to six, because each time do_hash_table_stuff completes, its hash table goes out of scope, the reference count on the hash table drops to zero, and its deallocation function is called, but nowhere did we release the reference that we owned to the built-in hash function. So, after calling do_hash_table_stuff a couple of times, the reference count on the built-in hash function object has increased from three to six, even though the objects that owned those last three references have themselves been freed. So this is a memory leak. Those three extra references were owned by objects that Python has now cleaned up; they no longer exist. We've lost our opportunity to signal to Python that those references aren't needed anymore. The reference count can never drop to zero, so Python will never remove that function object from memory. Now, we're talking about the built-in hash function here, so it's not like we really want it removed from memory, but imagine a more memory-intensive object, and a longer-running program that created tons of these objects that could never be cleaned up. Eventually, this type of error will cause a problem. So I went back and added that Py_DECREF line, rebuilt my program, and after all that, I finally had a module that worked, well, well enough. So, I wrote my very own Python hash function. If the item to hash is an integer or a float, then we do this one thing I found on Stack Overflow, and otherwise we're clearly hashing a string, so I did this other thing I found on Stack Overflow. I've also included a print statement so we can see when this function is called; it's prefixed by the word Python, because this is Python code. And I also went back and added more print statements to my C wrapper module and the original C library, and those are also prefixed by their origin.
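The talk doesn't show the "awesome" hash function in full, so here's a guess at its shape: one Stack-Overflow-style trick for numbers and a classic string hash for everything else. The specific formulas (a Knuth-style multiplicative hash and djb2) are my stand-ins, not the talk's actual code.

```python
def awesome_hash(item):
    """A plausible pure-Python hash: numbers one way, strings another."""
    if isinstance(item, (int, float)):
        # spread numeric keys out instead of using them directly
        return int(item * 2654435761) % (2 ** 32)
    # otherwise treat it as a string: djb2 (h = h * 33 + c)
    h = 5381
    for ch in str(item):
        h = (h * 33 + ord(ch)) % (2 ** 32)
    return h

print(awesome_hash(4), awesome_hash("pie"))
```

Unlike the length-of-string default, this distributes same-length strings into different buckets, which is the whole point of swapping the hash function out.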
So now let's look at how the program works. I've started an IPython REPL here with that awesome hash function loaded, and we can import the module. We can see some print statements that I put inside that inithashtable function. So I can instantiate a new hash table object, and as a hash function, we're going to use that awesome Python hash function that I just wrote. Via the C wrapper module, we see that it's the underlying C library that's actually doing the heavy lifting of malloc'ing space for my data structure and getting everything set up. So we can look at that new object: it's empty. Great. So we'll set some values; I was on a roll with squares here. We see the C module's set function there executing the Python code that I wrote in the REPL. It's calling into Python. That's so cool. And then it's also dealing with the underlying C library, which actually does the hard work of adding that item into the right linked list and so on. So we'll set some more values. Cool, still works. Strings always break everything, but hey, look, the underlying C library knew that it needed to resize. So that was done, and it's bigger. Great. Also, bonus: it works as a hash table. We can look up values. We can remove the key-value pair associated with pi, and now if we try to look up the value again, the underlying library looks for the key-value pair, can't find it, and the module takes care of returning None to the REPL. When I finally quit the REPL, we again see the deallocation function being called from the C module, but it's the underlying library that's actually doing the heavy lifting of walking through all my data structures and freeing things.
So I've got the Python REPL calling the functions that I wrote in my C module, and that C code executing the hash function that I wrote in Python in the REPL, and it's all just working together, and I thought it was pretty cool. So, if there are any questions... are there any questions?

Q: Well, first, thanks for sharing. It looks great. Could you clarify, is this a typo? At some point you had some Py_DECREF macros, and in other code I think I saw Py_XDECREF. Is that a typo, or are those different types of references?

A: They're two different macros. Py_XDECREF doesn't blow up if the pointer is NULL, whereas the plain Py_DECREF only works if you actually have a pointer there.

Q: Would it be hard to make syntax like a native Python dictionary's, where you can initialize values with brackets and so on?

A: I'm not sure, but I know that all the dunder methods are related to C functions, so that would probably work; you'd just add it to that PyTypeObject struct of function pointers. But I'm not sure. Anybody else?

Q: Thanks for the talk, very well structured, I enjoyed it. The answer to this question is probably no, but did you get a chance to experiment with any other ways to do this?

A: Let me go back to my yak slide. No, I just shaved it.

Q: I thought no, because I've been exploring this, and there are like five or six different ways to do it. This is one of the ways, which is why I was interested in the talk, but this is a very good example, and I do actually like this approach as well.

Host: So maybe you have time for one more. Okay, nobody's home. Thank you very much again.
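A footnote on the bracket-syntax question above: in Python, `d[key]` and `d[key] = value` dispatch to the dunder methods `__getitem__` and `__setitem__`, which at the C level correspond to the mp_subscript and mp_ass_subscript slots of the type. A minimal pure-Python sketch (the `BracketTable` class is hypothetical, just to show the dispatch):

```python
class BracketTable(object):
    def __init__(self):
        self._data = {}

    def __setitem__(self, key, value):   # called for: t[key] = value
        self._data[key] = value

    def __getitem__(self, key):          # called for: t[key]
        return self._data.get(key)       # missing keys come back as None

t = BracketTable()
t["pie"] = 3
print(t["pie"])  # -> 3
```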