So, the next speaker will talk about the implementation of Python data structures, list and dict. It's Flavia. Please welcome Flavia.

Can everyone hear me? Cool. Well, hi everyone, and thanks for being here. It's quite impressive, actually. Today I'm going to talk about data structures in Python and how CPython implements lists and dictionaries. Before we get started: I'm a software engineer working for Yelp, a company that connects people with great local businesses.

First, if you take a look at this tiny, bad, and not really real-life example, you can see that whether we add an item to a list that has zero elements, one element, or a thousand elements, the time it takes to add a new element is pretty much constant. The same goes for dictionaries. So we'll focus on how CPython does that. We're going to look at CPython 3.6, which is the latest version as of today.

Lists and dictionaries are used a lot. They seem pretty easy to use, but they actually have lots of hidden and really cool ideas behind them, so we'll try to dig into them today. They're made up of lots of lines of code and lots of comments. If you want to know more about them, at the end of this presentation you can go to the Python website and look at the code. It's really well commented and, at times, very well written.

So let's get started with lists. A list is basically a sequence of values, which in Python are objects, indexed starting from zero. Lists provide constant-time insertion (on average, when appending) and random access, and linear-time deletion. So how does CPython do that? Internally, lists use something called a vector, which is an over-allocated array. So even though your list may have, say, five items, it's possible that it actually has space for eight.
The invariant that the vector must comply with is that the actual length of the list has to be lower than or equal to its capacity.

There are a few ways to create a list, and I'm sure most of you know all of them. When you create an empty list, its size is zero and the capacity of the vector is also zero. When you create a list like [0, 1, 2, 3, 4], where the size is known when you declare it, the size is five and the capacity is also five.

Now let's play with an example. Let's say that during your free time you want to help connect people with great local businesses, and you want to put these businesses into categories. So let's create a new empty list of categories and add a new category to it, using the append method. What's really happening is this: we have an empty list, and whenever we want to append something, we call a function called resize with the current size of the list plus one. After that, we set the last value to be, for this example, food.

What resize does is take a new size as input and check whether that new size falls between half of the capacity and the capacity itself, to make sure we comply with the invariant. If that's the case, we do nothing: we have enough space to add the new item, so we just say, okay, you can go on. But if this condition isn't met, we need to compute a new capacity for the list. The pattern this capacity follows is this one: the new size divided by eight, plus the new size, plus a small constant. Once we have the new capacity, we call realloc to, well, reallocate the list with the new capacity. This capacity pattern might seem quite weird. It goes 0, 4, 8, 16, et cetera.
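To make the resize rule concrete, here is a minimal Python sketch of it. It is only an illustration of the rule as described in the talk: the function names are made up, and the small constants (3 below nine items, 6 otherwise) are an assumption based on the 3.6-era C source rather than something stated on the slides.

```python
def new_capacity(new_size):
    """Capacity picked when a resize to new_size forces a reallocation.

    Assumed 3.6-era rule: new_size / 8, plus new_size, plus a small constant.
    """
    if new_size == 0:
        return 0
    return (new_size >> 3) + (3 if new_size < 9 else 6) + new_size


def resize(capacity, new_size):
    """Return the capacity of the vector after resizing to new_size."""
    # Enough room, and not more than half empty: keep the current allocation.
    if capacity >= new_size >= (capacity >> 1):
        return capacity
    return new_capacity(new_size)


def capacities_while_appending(n):
    """List the distinct capacities seen while appending n items to an empty list."""
    capacity, seen = 0, [0]
    for size in range(n):
        capacity = resize(capacity, size + 1)  # append calls resize(size + 1)
        if capacity != seen[-1]:
            seen.append(capacity)
    return seen


print(capacities_while_appending(20))
```

Under these assumptions, five appends to an empty list leave it with a capacity of eight, which is the difference from a five-item literal that the talk points out.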
What we can see is that, because of the new-size-divided-by-eight term, the growth rate is about 12.5% once the list goes beyond a certain number of items. This 12.5% is a trade-off between space and time. We want constant-time insertion on average, but we also don't want to waste too much memory. So that's just a trade-off that has been made by the Python developers, and it seems to work.

One thing to note about append. Let's say you create a list whose content you know when you declare it. Its size would be five for this one, and its capacity would be five. If instead you create an empty list and call append five times, then due to the growth pattern the size would also be five, but the capacity would be eight. So even though the two lists contain the same items, they're not the same, and they don't have the same memory footprint.

Now, what if you want to remove an item from a list? Let's say you have these four items in your list and you want to remove one of them. One of the ways to do it is to call the pop method. You can give it an index, which is the index of the item you want to remove. There are two cases. The first is when you want to remove the last item in the list, and the other is when you want to remove an item that's not the last one.

If you want to remove the last one, it's pretty easy. You just call the resize function that we mentioned before and give it the size minus one, and it takes care of reallocating if it has to. But if the item you want to remove isn't the last one, we have to move all the items that come after it: we shift them one slot to the left. You could use slicing to do it. It's exactly the same thing.
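The two cases can be sketched in pure Python. This is only an illustration of the shifting idea on a regular list; pop_at is a hypothetical helper, not how the built-in pop is written.

```python
def pop_at(items, index=-1):
    """Remove and return items[index], shifting later items one slot left."""
    if index < 0:
        index += len(items)
    if not 0 <= index < len(items):
        raise IndexError("pop index out of range")
    removed = items[index]
    # Copy every later item one slot to the left (a no-op for the last item).
    for i in range(index, len(items) - 1):
        items[i] = items[i + 1]
    # The last slot now holds a duplicate (or the removed item): drop it.
    del items[-1]
    return removed


categories = ["food", "tacos", "bar", "dentist"]
print(pop_at(categories, 1), categories)

# The slicing form mentioned in the talk has the same effect on the list.
same_with_slicing = ["food", "tacos", "bar", "dentist"]
del same_with_slicing[1:2]
print(same_with_slicing)
```

The loop is what makes removal from the middle linear in the number of items after the index, while removing the last item skips it entirely.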
What happens under the hood is that a call to memmove is made, which just copies a chunk of memory from a source to a destination, and then resize is called with the size minus one.

So let's go through an example. We have this list and we want to remove tacos. Who would want to do that, anyway? We have size equals four and capacity equals four. We pop the item at index one. First, we call memmove. It shifts bar and dentist one slot to the left, so it looks like this. But now we have two dentists, and the size is still four and the capacity is still four. What happens next is the call to resize. Resize takes care of removing the last entry, so dentist. Now we don't have two dentists anymore. The size is three. Three is between two and four, so the invariant is respected. And so we can just finish: the list has size three and has room for one more item in the future.

A couple of miscellaneous things about lists. First, if you want to use a list as a queue, using append to enqueue and pop(0) to dequeue, don't. CPython has a collection called deque (collections.deque) that's exactly what you want, with constant-time insertion and deletion at both ends.

Second, slicing is really powerful. I won't get into the details, but these are some of the things you can do with it. It's pretty powerful and almost nobody uses it, so maybe that's useful to know.

Next is the reference reuse scheme. Whenever you create a new list, you need to create a new reference to it, and whenever you delete a list, you need to delete its reference. This is done using malloc and free, and calling malloc and free takes some amount of time. So to speed this up, CPython keeps a list of free references that have already been created. Whenever you delete a list, you don't remove the reference. You keep it. And if you want to create another list, you reuse this reference.
So here you can see that b and d have the same reference. That's because d reused b's reference. This works for up to 80 references. After that, it's just the normal malloc and free flow. So that's it for lists.

Now let's move on to dictionaries. Dictionaries are meant to store key-value pairs. Here are some ways to create new dictionaries.

There are quite a lot of use cases for dictionaries, and maybe you wouldn't think about all of them. The first one is keyword arguments: when you pass keyword arguments to a function or method, each key gets written about once and read about once. Next is class methods, which are also stored in a dictionary. Then attributes and global variables, and built-ins. Then there's uniquification, when you want to remove duplicates from a list, for instance, or count things. And any other use: you do whatever you want, you write keys, you read them, and you can sometimes have deletions too.

A bit of history about dictionaries. The implementation has changed quite a lot over time. There's a shiny new implementation that's inspired by the implementation of PyPy's dictionaries. They are ordered: whenever you call keys, values, or items, you get them in the order in which you inserted them. They are also memory efficient, and they try to share keys when possible. I won't have time to get into many of the details, but there's a PEP for it. It introduced split tables and combined tables. Today I'm going to focus on combined tables, but the two share pretty much a lot of things in common.

So what do dictionaries give us? Average constant-time insertion, lookup, and deletion. So everything's super fast. How can you do that using arrays? Additionally, we have keys that are Python objects, while arrays have indices that are integers. So we need to find a way to transform objects into integers. And that's where hashing comes in handy.
Basically, a hash function is a function that's used to map data of arbitrary size to data of fixed size. In Python, the data of arbitrary size is an object, and the data of fixed size is a 32- or 64-bit integer, depending on the architecture. We'll suppose today that it's a 64-bit integer.

We could talk about hashing for hours. In short, almost all immutable objects in Python can be hashed. There's a hash built-in that gives you the hash of an object, and basically it's an integer. If you hash 42, this gives you an integer that I've represented here as a bit string, so it's just 64 zeros and ones. The same goes for 1.61, which gives you this number. And if you hash a string like "Never gonna give", this gives you this other number. You don't have to know deeply how hash functions work to just use them.

One thing to note about hash functions is that the input space is of arbitrary size and the output space is fixed, so there might be collisions. For example, minus 1 and minus 2 have the same hash in CPython. So having the same hash doesn't mean being the same value.

Another thing to note is that similar values often have dissimilar hashes. If you speak English and say hello, the hash is going to be this number. But if you speak German and say hallo, the hash is completely different, even though they're pretty similar objects.

And one thing you must have is that hashing the same value again and again gives you the same hash: hash functions have to be deterministic. So if you hash hello once, twice, a thousand times, it should always give you the same number.

So now we have a way to transform dictionary keys into hashes and then into indices of arrays. But can we actually represent dictionaries as arrays? The answer turns out to be yes.
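These properties are easy to poke at with the hash built-in. The snippet below is a small sketch that assumes CPython; as_bits is a hypothetical helper that prints a hash as a 64-character bit string, similar to the representation used on the slides.

```python
def as_bits(value, width=64):
    """Render hash(value) as a fixed-width string of zeros and ones."""
    return format(hash(value) & ((1 << width) - 1), "0{}b".format(width))


# In CPython, small non-negative integers hash to themselves.
print(hash(42), as_bits(42))

# A collision between two different values: in CPython, hash(-1) is -2.
print(hash(-1), hash(-2), -1 == -2)

# Hashing the same value twice in the same process gives the same number.
# (String hashes can differ between two runs of the interpreter, so only
# compare them within one process.)
print(hash("hello") == hash("hello"))

# Similar strings are not expected to have similar hashes.
print(as_bits("hello"))
print(as_bits("hallo"))
```

Nothing here depends on how the hash functions work internally, which is the point made in the talk: you can use them without knowing the details.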
In CPython 3.6, using combined tables, a dictionary is actually two arrays: one array of indices and one array of entries.

Let's get back to another toy example. Let's say you want to connect people with great local businesses, you have categories, and you want to know how many businesses you have in each category. So you create an empty categories dictionary. What really happens is that CPython creates these two new tables. The first one is the table of indices, and the second one is the table of entries. Each element in the entries table is a structure containing the hash of the key, the key itself, and its value. Each element in the indices table, the one on the left, just contains an integer that maps to an entry in the entries table. You can see that on the left I've represented the indices in base 10 and base 2. It's going to help us in the following slides.

When you create an empty dictionary, the indices table gets created with an initial size of 8. So even though your dictionary is empty, it takes some space.

Now we have a key, and we want its index in the indices array. The indices array is of size 8 at first. To get the index in the indices table for a key x, we hash the key, which gives a 64-bit number, and we take the last three bits of it. Why three? Because we have eight slots, and to index eight slots we can use three bits, counting from 0 to 7.

So let's add an entry to our dictionary. We want to add food. Hashing it gives us this number. We take the last three bits, which turn out to be 0, 0, 0. What happens is that we add a new entry to the entries table: we put the hash, we put the key, we put the value. And in the indices table, we go to index 0, 0, 0 and say, OK, this maps to entry index 0. Now let's move on and add tacos and bar.
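Here is a minimal sketch of the two tables and of the "last three bits" step. It is an illustration only: the EMPTY marker, the function names, and the counts are assumptions, and the hashes are passed in by hand so the bit patterns match the ones in the talk (real string hashes would land in other slots).

```python
EMPTY = -1  # hypothetical marker for an unused slot in the indices table


def make_tables(size=8):
    """Create an indices table of the given size and an empty entries table."""
    return [EMPTY] * size, []


def insert_no_collision(indices, entries, key, value, key_hash):
    """Add an entry, assuming the slot for key_hash is free."""
    # Keep only the last bits of the hash: for size 8 that is hash & 0b111.
    slot = key_hash & (len(indices) - 1)
    if indices[slot] != EMPTY:
        raise RuntimeError("slot already taken; collisions are not handled here")
    indices[slot] = len(entries)            # slot -> position in entries
    entries.append((key_hash, key, value))  # entries stay in insertion order
    return slot


indices, entries = make_tables()
insert_no_collision(indices, entries, "food", 4, key_hash=0b11000)
insert_no_collision(indices, entries, "tacos", 2, key_hash=0b01001)
insert_no_collision(indices, entries, "bar", 3, key_hash=0b10101)
print(indices)
print(entries)
```

Because entries are only ever appended, iterating over the entries table yields the keys in insertion order, which is where the ordering mentioned earlier comes from.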
Let's say the last three bits of the hash of tacos turn out to be 0, 0, 1, and for bar it's going to be 1, 0, 1. So let's add tacos. We add a new entry. We go into the indices table at 0, 0, 1, and this maps to entry index 1. Now we add the third entry. For bar, the hash is 1, 0, 1, which is 5. So we go into the indices table and say, OK, 1, 0, 1 maps to entry index 2, which is bar.

Now let's say we want another category, so let's add dentists. Its hash turns out to be 0, 0, 1. We go into the indices table at 0, 0, 1. We want to add dentists, but there's already something there. There's already tacos, and we want to keep tacos. So we need to find a way to solve this problem.

What's happening, actually, is that the hashes collided. There are several methods to resolve this. CPython uses open addressing. How open addressing works is basically that you take your index, or your hash, and you compute a new one following a certain rule. The rule used in CPython is five times the index, plus one, modulo the size of the dictionary, which is 8 at first. What's cool about this rule is that it traverses every integer in the range from 0 to size minus 1. It doesn't have to be in order like 0, 1, 2, 3. It can be like 4, 2, 0. As long as we see all the integers between 0 and size minus 1, we're fine. The actual implementation is a bit more sophisticated than this, but for the sake of simplicity we can use this rule. It works really well for now.

So let's go back to our example. Inserting dentists, we have our hash colliding at 0, 0, 1, which in base 10 is 1. Five times 1 plus 1, modulo 8, is 6, which in base 2 gives 1, 1, 0. So now we go into the indices table. 1, 1, 0 is free, so we can add our new entry. So everyone's happy. We have both tacos and dentists, and we can just keep going.
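The simplified jump rule can be checked in a few lines. This sketch uses the rule exactly as stated in the talk (five times the index, plus one, modulo the size); probe_sequence is a hypothetical name, and as mentioned, the real implementation is more sophisticated.

```python
def probe_sequence(start, size=8):
    """Yield the slots visited from start, following (5 * i + 1) % size."""
    slot = start
    for _ in range(size):
        yield slot
        slot = (5 * slot + 1) % size


# Starting from the colliding slot 1 (0b001), the next slot is 6 (0b110).
print(list(probe_sequence(1)))

# Every slot from 0 to size - 1 is visited exactly once, in a scrambled order.
print(list(probe_sequence(0)))
print(sorted(probe_sequence(0, size=16)) == list(range(16)))
```

Visiting every slot exactly once is what guarantees that an insertion finds a free slot whenever one exists.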
So now, what if we want to look something up in the dictionary? How many food items do I have? We look at the hash of food, which turns out to be 0, 0, 0. We go into the indices table, and 0, 0, 0 gives us entry index 0. So we go to the entries table, look at index 0, and oh, we have food. So that's our answer, and that's it.

Now what if we want to know more about dentists? We take the hash of dentists, which is 0, 0, 1. We go into our indices table at 0, 0, 1. This maps to entry index 1. We go to entry index 1 and, oh, tacos isn't dentists. So that's a problem. What it means is that the hash and the key don't match. So maybe we jumped when we inserted it: the hash might have collided, and we might have ended up in another slot. So let's just follow the rule. Five times the index plus one gives us 1, 1, 0, which maps to entry index 3. And that's dentists. So we're happy, we have our dentists.

Now what if we want to know more about music? We don't have music in our dictionary. So how does CPython figure that out? We take a look at the hash of music, which turns out to be 0, 0, 0. We go into our table, and 0, 0, 0 gives us food. Food isn't music, so let's just jump. The next one is 0, 0, 1: tacos still isn't music. So let's keep jumping. Maybe we jump twice, or maybe more. The next one is 1, 1, 0. It gives us dentists, still not music. Let's keep jumping. The next one is 1, 1, 1. And well, there's no entry mapped to 1, 1, 1. That means that we've never seen music. So we can raise an exception, a KeyError, and there's no music in our dictionary.

Next, what if we want to delete an item from a dictionary? Let's say we want to remove tacos. Again, who would do that? Its hash is 0, 0, 1. Can we just say, OK, let's remove the entry index and remove the content in the entries table, and that's it? Actually, we can't do this, because otherwise we couldn't access dentists anymore.
Remember, when we inserted dentists, we jumped from tacos. So if there's no tacos anymore, we can't know where dentists is located. To solve this, instead of removing the entry index, we keep the entry index, remove the content of the entry, and leave a dummy key. The dummy key is just there to say, OK, there used to be something here, it's not there anymore, so if you stumble upon me, just keep jumping. And so now we can still access dentists.

A few caveats about the current state. Eight slots are fine, but that's rarely enough. If you add entries to the dictionary, your tables get fuller and fuller, and so the lookups get slower. Dummy keys make this even slower, because they take up an entry for basically nothing, since we deleted the item. So what we want to do is resize the dictionary from time to time. We need at least one empty slot, because otherwise we could keep jumping over and over, infinitely. CPython goes even further and defines a usable fraction of the dictionary, which is two thirds of the size. So initially, with a size of 8, we can have up to five entries in our dictionary.

If we take a look at the current state, the length is 3, because we have food, bar, and dentists. The size is 8, because we can have up to 8 indices. And the remaining usable count is 1, because we have four items in the entries table: dummy keys take up some space. So let's keep moving, add a new category, and delete one category. This gives us this state. The length is 3, the size is 8, and the remaining usable count is 0.

So what if we now want to add another category? We don't have space anymore, so we have to resize the dictionary. The way it's done is that we have another resize function, specific to dictionaries, that takes a minimum size as input.
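The insertion, lookup, and deletion steps walked through so far, including the dummy key, can be put together in a small toy class. This is a simplified sketch that follows the talk's description, not CPython's C code: ToyDict, EMPTY, and DUMMY are made-up names, the values are invented counts, hashes are passed in by hand to reproduce the bit patterns from the example, and there is no resizing.

```python
EMPTY = -1        # unused slot in the indices table (illustrative marker)
DUMMY = object()  # left in the entries table where an item was deleted


class ToyDict:
    def __init__(self, size=8):
        self.indices = [EMPTY] * size
        self.entries = []

    def _probe(self, key_hash):
        """Visit slots starting at the last bits of the hash, then jump."""
        size = len(self.indices)
        slot = key_hash & (size - 1)
        for _ in range(size):
            yield slot
            slot = (5 * slot + 1) % size

    def _find(self, key, key_hash):
        for slot in self._probe(key_hash):
            entry_index = self.indices[slot]
            if entry_index == EMPTY:
                break  # an empty slot means the key was never inserted
            entry = self.entries[entry_index]
            # A dummy entry never matches: just keep jumping.
            if entry is not DUMMY and entry[0] == key_hash and entry[1] == key:
                return entry_index
        raise KeyError(key)

    def insert(self, key, value, key_hash):
        for slot in self._probe(key_hash):
            entry_index = self.indices[slot]
            if entry_index == EMPTY:
                self.indices[slot] = len(self.entries)
                self.entries.append((key_hash, key, value))
                return
            entry = self.entries[entry_index]
            if entry is not DUMMY and entry[0] == key_hash and entry[1] == key:
                self.entries[entry_index] = (key_hash, key, value)
                return
        raise RuntimeError("no free slot; a real dictionary would have resized")

    def lookup(self, key, key_hash):
        return self.entries[self._find(key, key_hash)][2]

    def delete(self, key, key_hash):
        # Keep the index slot, replace the entry content with the dummy.
        self.entries[self._find(key, key_hash)] = DUMMY


d = ToyDict()
d.insert("food", 4, 0b000)
d.insert("tacos", 2, 0b001)
d.insert("bar", 3, 0b101)
d.insert("dentists", 1, 0b001)   # collides with tacos, lands in slot 0b110
d.delete("tacos", 0b001)
print(d.lookup("dentists", 0b001))
```

Looking up a key that was never inserted, with hash 0b000, follows the same path as in the talk: food, the dummy left by tacos, dentists, then an empty slot, at which point KeyError is raised.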
The minimum size passed to the dictionary resize function is computed as twice the length of the dictionary plus half its current size. So we have this minimum size that we want our dictionary to have. We then compute the actual new size, which is the next power of 2 above this minimum size. Why the next power of 2? Because we want to be able to truncate the hashes, so take the last 3 bits, the last 4 bits, the last 8 bits, and so on (the number of bits itself doesn't have to be a power of 2). We create a new empty dictionary with the new size. For each of the entries that we have in our dictionary, we insert it into the new dictionary. And then we remove the old ones.

With our example of length 3 and size 8, this gives us a minimum size of 10, so the new size is going to be 16. Instead of taking the last 3 bits into account, we're going to take the last 4 bits into account. And so now we have a larger array, so we can fit more items in. We also have more free slots, so lookups and insertions are going to be faster, because we're less likely to have collisions and to have to jump. And we don't have dummy keys anymore once we've rebuilt the dictionary. If you remember, dummy keys are inserted only when we delete items from a dictionary, and now we have a brand new dictionary, so there are no dummy keys taking up space and time for nothing. On the other hand, well, the dictionary takes more space. CPython just trades space for speed, so it's a trade-off again.

A couple of miscellanea about dictionaries. They have the same reference reuse scheme as lists, so we can save up to 80 references. And if you want to know more about split tables: I said that combined tables are made of two arrays, one of indices and one of entries. In split tables, we have three arrays: one of indices, one of entries, and one of values.
All the dictionaries that share the same keys have the same indices and entries tables, and each has its own values table. So all the keys are in the entries table, all the values are in the values tables, and the dictionaries all share the entries table. If you want to know more about it, PEP 412 is really well written and explains everything you need to know.

And, well, that's it. You can find the slides at the first link, and here are all the references that I used to make this presentation. I really encourage you to take a look at the CPython 3.6 code. Even though it's C and it might seem hard to understand at first glance, it's really well documented and really easy to read, so just give it a look. Thanks for listening, and I also want to thank all the FOSDEM staff for organizing everything, so...