Hi there, yes, my name is Konstantin. I'd like to start with a question: please raise your hand if you had heard about succinct data structures before this conference. Okay, great. And how many of you like to play with data, play with data structures? Maybe this is interesting to you. So for me, "succinct data structures" and "Python" in one sentence sounded like a ridiculous thing just a few years ago, but thanks to CFFI, Python's advantages, pybind11, and C++, we have this talk today. But let me start with a quick introduction, telling you who I am and how I ended up experimenting with these data structures. I work for Qrator Labs, a networking company known for mitigating denial-of-service attacks, distributed denial-of-service attacks, without ever showing captchas. To do that, we have two layers of things going on. One is online streaming algorithms doing the filtering, and the second one, where I do my job, is processing data afterwards: experimenting, finding patterns and so on to improve our filters. And I found that succinct data structures can be useful here. So what are they? The most famous succinct data structure is the bit vector. It's just a usual vector, but with Boolean values. In most programming languages a single Boolean value will take at least one byte in memory, often even more. But actually, in that one byte we could store eight different Boolean values. Let's look at how this plays out in Python. Here we have 120 bytes for an array of just seven elements, zeros and ones. Actually, it's not as bad as it sounds, because there is an overhead for the list itself: if we subtract the size of an empty list, we still have 56 bytes, and that's reasonable, because each element takes eight bytes, one machine word on a 64-bit operating system.
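The measurement described above can be reproduced with `sys.getsizeof`; exact numbers vary by CPython version and platform, so this sketch only checks the 64-bit pointer overhead, not the speaker's specific figures:

```python
import sys

bits = [0, 1, 0, 1, 1, 0, 1]           # seven Boolean values stored as a Python list
list_size = sys.getsizeof(bits)         # list header plus one 8-byte pointer per element
empty_size = sys.getsizeof([])          # overhead of the empty list object itself
per_element = (list_size - empty_size) / len(bits)

# On a 64-bit CPython build each element costs at least one 8-byte pointer,
# even though a single bit would be enough to store it.
print(list_size, empty_size, per_element)
```

Note that this only counts the pointers inside the list; the small `int` objects they point to are shared and cached by CPython, which is why the per-element cost shown here is "only" eight bytes.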
But actually, we can force Python to use one byte per element, using a bytearray or something like `array` with type code `'b'`. Still, that's not good enough, right? So we can use bit operations, shifts, masks and so on, to pack all our values into just one byte, and a similar process to unpack them. That works, but it looks like it would be faster in C or maybe in assembly, because for a not-so-small array it would be painful to address gigabytes of data in this manner. And still: can we use even less space for this data? Generally speaking, theoretically, if we don't know the distribution and have to assume the bits are random or uniformly distributed, then no. But let's consider this example: we take the previous pattern and repeat it a million times. It looks like we could simply compress it with zlib or something similar. But to access the last element of the resulting compressed array, we have to decompress everything, and depending on what we use for compression, we may also have to store the whole decompressed result in memory, which doesn't help us. But there are succinct data structures, and they can do exactly that: they give us access to some elements or all of them, and support certain operations on the array, without decompressing it. So they are quite useful. Where can succinct data structures be used? Trees, basically; that's where I use them. You build a tree, and I will show you one, where you have lots of elements but not so many leaves, or rather not so many non-leaf nodes, sorry. These kinds of trees appear, for example, in machine learning as a compact feature representation, and in pattern discovery. There are lots of applications: DNA analysis, indexing in databases, and so on.
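The pack-and-unpack process described above can be sketched in pure Python; this is the naive reference version, not the library's implementation, and as the talk notes, doing this per element in Python is exactly the slow part a C or C++ backend avoids:

```python
def pack_bits(bools):
    """Pack a sequence of Booleans into a bytearray, 8 values per byte."""
    out = bytearray((len(bools) + 7) // 8)
    for i, b in enumerate(bools):
        if b:
            out[i // 8] |= 1 << (i % 8)      # set bit (i mod 8) of byte (i div 8)
    return out

def get_bit(packed, i):
    """Read Boolean value number i back out of the packed bytes."""
    return (packed[i // 8] >> (i % 8)) & 1

data = [0, 1, 0, 1, 1, 0, 1]
packed = pack_bits(data)
print(len(packed))                            # 1 -- all seven values fit in one byte
print([get_bit(packed, i) for i in range(7)]) # round-trips to the original values
```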
So I took the SDSL library by Simon Gog and other contributors. It's a heavily compile-time-parameterized C++ library, so we can't simply use it from Python with CFFI, and even with pybind11 you would have to spend a lot of time wrapping it. But it is backed by lots of publications, and it has lots of things we can use. And it's GPLv3. So what I did is port a subset of this library to Python. Of course it's interactive: you get docstrings and completion in IPython and so on. And I release the GIL where possible. Has anyone here faced, or done, generating C or C++ or maybe assembly code from Python? Yeah, quite a few. But what happens here is some sort of the reverse: we are using C++, manually instantiating lots of templates and exporting them to Python. So basically we are using C++ templates to generate a Python module. Let's start exploring the library. The first, probably biggest, and most interesting part for beginners is integer vectors. They are ordinary integer vectors; maybe not all of the API is copied from Python lists, but they behave more or less as you would expect. But they have a bit-compress function that will compress the vector, in this example, down to three bits per element. So this not-so-large array will have a very small footprint in memory. There are two kinds of these arrays in the library: one dynamic class where you can use bit compression, and quite a few others that can't. You can use the latter if you know beforehand how many bits you need per element, and they have about one byte less memory overhead. So here is an example of using that. You may know that you can use a single flat array to represent a tree: the index into the array is a kind of ID for the node, and the value at that index is the parent ID for the node. So if you don't have many leaves but have lots of near-root nodes, you will never have large values here.
And this means that you can have a very compact representation for this tree. Okay. But what if we would like to give up mutability, and maybe find some benefit in that? Let's take an example of ten million values, an array with a repeating pattern. It's 80 megabytes in Python, and the few bytes of list overhead are just invisible here. It will be the same size in my library without bit compression, but of course with bit compression we can have a little less, around 25 megabytes. But we can go further and apply variable-length codes on the deltas, and we get an array that is much, much smaller. And we can still access every element. And you can construct it faster if you build it from a PySDSL integer vector. Well, I tend to collect all the classes in tuples and dictionaries, so you may have noticed the integer vectors were in a dictionary, and these classes are too. So you can experiment with them: just take all the classes from some category, give them your data, measure how much memory they take, measure how long operations take, and so on. So integer vectors compressed with variable-length codes are another option. Let's take a sparse array: almost all elements are zero except one, but that one is large. It has such a large value that the array can't be bit-compressed, because it would still require 64 bits per element. Here we can create a so-called VLC array that will take just one and a half megabytes, and you still have element access and so on. And again, I have these classes collected in a category, so you can do some kind of benchmark or whatever. I would like to stress that different kinds of data will be best served by different implementations, different kinds of arrays. So just test it, look at what's best for you. And there are many different kinds.
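The sparse-array example above can be sanity-checked with back-of-the-envelope arithmetic. This is only an estimate of payload bits under an Elias-gamma-style variable-length code, not the actual encoding or overhead of the library's VLC vector:

```python
# A sparse array: ten million zeros except one very large value.
n = 10_000_000
values = [0] * n
values[123] = 2**63                 # a single value that needs 64 bits

# Fixed-width "bit compression" must use the width of the largest element
# for every element, so one outlier forces 64 bits everywhere:
fixed_bits = n * max(values).bit_length()

# A variable-length code charges each element roughly by its own magnitude
# (about 2*bit_length + 1 bits for an Elias-gamma-style code):
vlc_bits = sum(2 * v.bit_length() + 1 for v in values)

print(fixed_bits // 8 // 2**20)     # 76 -- roughly 76 MiB at a fixed width
print(vlc_bits // 8)                # about 1.25 million bytes, in the same
                                    # ballpark as the figure from the talk
```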
I can't cover all of them, but you can find them in these containers, and you can find docstrings for them, for your experiments; all kinds of compressed integer arrays are there for your pleasure. Bit vectors are the next big part. It's a really big part. Of course, a bit vector of width one is just a usual integer vector, and of course there are ways to compress it, different from what integer vectors usually do, because we know these are just bits. And there is not one way, but lots of them, grouped, so you can again just test them against your data and so on. There are quite a few variants here to play with. And again, just test them on your data and see what's best for you. The next part of the library is really part of the bit vector story, because it provides support structures for operations on these bit vectors. There are two very important operations on bit vectors in succinct data structures: they are called rank and select. Rank is a generalization of the popcount operation. Who knows what popcount is? Yeah, it's a way to count all the set bits in a word or an array; rank generalizes it. It allows you to count how many bits are set in a prefix of the array, and it allows you to count not just ones but zeros, and other patterns. And select is the reverse operation: it shows you at which position the rank changes. So if you know the rank and would like to find at which position that bit was set, you can find it with select. Those operations are not free, but for immutable bit vectors the cost is already paid, both in memory and in the time spent building the structure. For a plain bit vector, you have to construct the support structure manually. Here we have a few options: there are two types of support classes, and they will work better on different kinds of data. When you have constructed the support object, you can just call it to see the result. So again, you can run different benchmarks looking for the best way to fit your data.
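Rank and select as described above can be pinned down with a naive O(n) reference implementation; the whole point of the succinct support structures is to answer the same queries in constant time with only a small space overhead:

```python
def rank1(bits, i):
    """rank: number of set bits in bits[0:i] (a generalized popcount)."""
    return sum(bits[:i])

def select1(bits, k):
    """select: position of the k-th set bit (1-based), the inverse of rank."""
    seen = 0
    for pos, b in enumerate(bits):
        seen += b
        if seen == k:
            return pos
    raise ValueError("fewer than k set bits")

bits = [0, 1, 1, 0, 1, 0, 0, 1]
print(rank1(bits, 4))               # 2 -- two set bits strictly before position 4
print(select1(bits, 3))             # 4 -- the 3rd set bit sits at position 4

# The two operations invert each other: ranking up to the selected position
# gives back k-1 (the counted bit itself is excluded by the half-open prefix).
assert rank1(bits, select1(bits, 3)) == 2
```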
But rank and select support on mutable vectors only works with patterns of length one, meaning ones or zeros. With immutable bit vectors, we can have patterns of length two: '10', '11', and so on, four of them in total. And of course, you can build different kinds of operations on top of that. OK. Another option is the interleaved bit vector classes. These interleave the rank and select support information between the data blocks. In this case, it is cheaper to create such an array than a plain array plus two support structures. But you may still want to try both to see what's best for you and for your data. OK. And the next part is wavelet trees. It's probably the largest part of the library for now, but that's going to change in the future. A wavelet tree is a tree that has bit vectors at its nodes and recursively breaks the original sequence into subparts. But actually, you don't have to know that to use the library, so let me show some examples. Here we can open a file, the README file from the wrapper, and just extract the first line, more or less the same way you would with an array. Most of the familiar Python features are available. But then you can do things faster than you can in plain Python. You can, for example, count the number of lines, which means finding all the newline elements. In Python, you would usually have to iterate, maybe map, maybe filter, and so on. Here you have something that will work, not three or four times, but maybe a thousand times faster. One more example: let's find at which line the first equals sign appears. First we use a select operation to find the symbol, and then we find how many newline symbols came before it. Of course, rank and select have a slightly different meaning here, but it should be obvious; if it's not, just look at the docstrings. They are there for your help.
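The two wavelet-tree queries described above have direct plain-Python equivalents on a string, which makes the semantics clear; the sample text here is made up, and the real structure answers these in logarithmic rather than linear time:

```python
text = "title = PySDSL\n# a wrapper\nversion = 0.1\n"

# Count lines: the number of newline symbols -- what a wavelet tree answers
# with a single rank query over the whole sequence.
n_lines = text.count("\n")

# Line of the first '=': "select" the first '=' to get its position, then
# "rank" the newlines that appear before that position.
pos = text.index("=")               # select of the 1st '='
line = text.count("\n", 0, pos)     # rank of '\n' in the prefix before pos
print(n_lines, line)                # 3 lines; the first '=' is on line 0
```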
As I said, there are quite a few wavelet trees available. There are two types of them: some are for bytes, and the others are for integers. Probably it would be better to split them into separate containers; maybe that's going to happen. But let me go to the last part: compressed suffix arrays. Suffix arrays are sorted arrays that contain all the suffixes of your data. But again, you don't actually have to know what they are to use them. They provide three operations, well, many more, but you can use, for example, these three. One of them is extract: you take a sequence (it shouldn't contain zeros, by the way, when you construct the structure), and you can extract the original sequence back out of the suffix array. Then you can count elements: you can count one element, or you can count a pattern, it makes no big difference. And you can find all the occurrences of some pattern, and all of that happens really fast. There are not so many suffix array classes for now, but I think that's going to change, because the suffix arrays based on wavelet trees could actually take different wavelet trees as a base, and there are other parameters that could be exported to Python in the future. So, about the future: what I would like to see from this library is more dynamic compilation, I think. For now, when I work on the library, I have to add all the parameter combinations manually, maybe using C++ template features to multiply them. But it's still a limiting and time-consuming operation in terms of compilation. If you had an option to just write a small snippet of C++ code, run it through some magic, and get a new Python module exposing that class with all the features exported to Python, that would be great. Well, there are lots of things that could be improved, and maybe they will be. But what happens depends. It depends on you, on your activity. I just published this library on GitHub.
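The count and locate operations mentioned above can be demonstrated with a naive, uncompressed suffix array in pure Python. This is only a reference for the semantics: it builds every suffix explicitly and binary-searches them, whereas the real structure is compressed and never materializes the suffixes (the `"\xff"` sentinel assumes the pattern contains no characters at or above that code point):

```python
from bisect import bisect_left, bisect_right

def suffix_array(s):
    """Start positions of all suffixes of s, sorted by the suffixes themselves."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def locate(s, sa, pattern):
    """Start positions of every occurrence of pattern, via binary search
    over the sorted suffixes (every occurrence is a prefix of some suffix)."""
    suffixes = [s[i:] for i in sa]          # already in sorted order
    lo = bisect_left(suffixes, pattern)
    hi = bisect_right(suffixes, pattern + "\xff")
    return sorted(sa[lo:hi])

s = "abracadabra"
sa = suffix_array(s)
print(locate(s, sa, "abra"))    # [0, 7] -- every occurrence, really fast in SDSL
print(len(locate(s, sa, "a")))  # 5 -- counting is just the size of that range
```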
It became public only a few days ago. I'll show the link one more time. For now, I'm the only person in this repository, so it would be a great help if you joined me there. In any case, any feature request will be prioritized if you are contributing something. Here's the link. Thank you. Now I think it's time for questions.

[Host] Thank you, Konstantin. So, any questions? Don't be shy, please come here.

[Audience] I was just wondering whether all these data structures have some support for serialization and deserialization, so one could, for example, send them over the wire, over the network, and reconstruct them on the other side?

[Konstantin] It's not thoroughly tested, but there should be pickle support.

[Audience] Thank you.

[Host] I have another question here.

[Audience] Hi, thanks for the talk. I was just wondering if you benchmarked your library against things like NumPy and the various packages around PyPI for prefix and suffix arrays, because there are things like packbits in NumPy for encoding bit arrays into byte arrays, and there are various implementations of directed acyclic word graphs, like DAWG and GADDAG, used by people doing genetics and word games. They already wrap C code and release the GIL. So I was just wondering if you had.

[Konstantin] Well, not quite, but actually, since the library is built on top of a well-benchmarked C++ library and most of the code runs without changes, it should be pretty fast. But I know for sure there is some overhead contributed by pybind11 that may slow some things down, unfortunately. If you have limited use cases, you can of course use, for example, the bitarray module, which will probably cover most plain bit-array needs. But if you would like to construct some unusual data structure, then this library is probably the only option you have.
Because, well, those operations are exported, and I would like to extend the core of the library; it's quite unique, I think. And as for NumPy's packbits, I think it doesn't have this feature of packing to three or fewer bits, only down to whole bytes, so it's not quite the same thing.

[Audience] OK, thank you. I'll check yours out.

[Host] Any other questions? OK, so once again, thank you.