Francesc Alted will talk about out-of-core columnar databases. He is the creator of PyTables, a developer of Blaze and a performance enthusiast. Give him a warm welcome, please.

So thank you very much, Oliver, for the introduction. In my talk today, I am going to introduce you to out-of-core columnar datasets. In particular, I will be introducing bcolz, which is a new data container that supports in-memory and on-disk columnar, chunked, compressed data. bcolz may seem like a strange name, but you can think of it as "big columnar", and the final LZ stands for the Lempel-Ziv codecs, which bcolz uses a lot internally.

Okay, so just a bit about me. I am the creator of tools like PyTables, Blosc and now bcolz, and I am a long-term maintainer of numexpr, which is a package for evaluating NumPy expressions very quickly. I am an experienced developer and trainer in Python, with almost 15 years of experience coding full-time in Python. I love high-performance computing and storage as well, and I am also available for consulting.

So what? We have yet another data container, right? Well, in my opinion, we are bound to live in a world of wildly different data containers; the NoSQL movement is an example of that, and we have a wide range of different databases and data containers even in Python. Why? Mainly because of the increasing gap between CPU and memory speeds. If you understand this fact, you will understand why all of this is so important. From the evolution of CPUs it is clear that they are getting much faster than memory, and this gap means the CPU is mostly doing nothing most of the time, which has a huge effect on how you access your data containers. If you want more details, you can see my article "Why Modern CPUs Are Starving and What Can Be Done About It".

So why columnar? Well, when you are querying tabular data, only the data of interest is accessed. That basically means less input/output is required, which is very important when you are trying to get maximum speed. Let me show you an example. Suppose we have an in-memory row-wise table, the typical structured array in NumPy, stored row after row. If you are doing a query and the column of interest is, say, the second one, an int32 column, then due to how computers work with memory you are not accessing only that column, but also the bytes next to it; this is for architectural reasons. So if the table is in memory, you are not bringing to the CPU just N rows multiplied by 4 bytes; you are bringing N multiplied by 64 bytes into the caches, because 64 bytes is the typical cache-line size in modern CPUs. That is, we are bringing 16 times (64/4) more data than is strictly necessary. In the column-wise approach, where data belonging to the same column is stored sequentially, you bring into the cache only the exact amount of information that you need. This is the rationale behind why column-wise tables are interesting.
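To make the cache-line argument concrete, here is a minimal sketch in plain NumPy; the field names and sizes are made up for illustration:

    import numpy as np

    N = 1000000

    # Row-wise: a structured array interleaves all the fields of a row,
    # so scanning one int32 column drags whole 64-byte cache lines full
    # of neighbouring fields into the CPU caches.
    rows = np.zeros(N, dtype=[('f0', 'i4'), ('f1', 'i4'), ('f2', 'f8')])

    # Column-wise: one contiguous array per column; scanning 'f1' reads
    # exactly N * 4 bytes and nothing else.
    cols = {name: np.zeros(N, dt)
            for name, dt in [('f0', 'i4'), ('f1', 'i4'), ('f2', 'f8')]}

    # Same logical query, very different memory traffic:
    hits_rowwise = (rows['f1'] > 0).sum()   # touches ~16 bytes per row
    hits_colwise = (cols['f1'] > 0).sum()   # touches 4 bytes per row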
Now, why chunking? Chunking means that you store your data in different chunks, not in a monolithic container, and that means more difficulty handling the data, right? So why bother? Well, the fact is that chunking allows efficient enlarging and shrinking of your datasets, and it also makes on-the-fly compression possible. Let me give you an example. When we want to append data to a NumPy container, we need to do a malloc in a new location, then copy the original array there, and finally copy the data to append at the end of the new area. This is extremely inefficient, because of this gap between the CPU and memory. The way to append data in bcolz is different, because bcolz is chunked: if we want to append data, we only have to compress it (bcolz containers are compressed by default), and we don't need the additional copy, because basically what we are doing is adding the new chunk or chunks to the existing list. Okay? So it's very efficient.
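A small sketch of the two append strategies, assuming bcolz is installed (the array sizes and the clevel=5 compression level are arbitrary choices for illustration):

    import numpy as np
    import bcolz

    # NumPy: np.append() mallocs a new array and copies both the old
    # data and the new data, so every append costs O(len(a)).
    a = np.arange(1000000)
    a = np.append(a, np.arange(10))

    # bcolz: the new data is compressed (per the container's cparams)
    # into fresh chunks that are added to the chunk list; the existing
    # chunks are never copied.
    ca = bcolz.carray(np.arange(1000000),
                      cparams=bcolz.cparams(clevel=5))
    ca.append(np.arange(10))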
And finally, why compression? Well, the first reason for compression is that more data can be stored in the same amount of media. If your dataset is compressible and, let's say, you can reach a compression ratio of 3x, then you can store three times more data using the same resources, which is great. But this is not the only reason. Another reason is that if you deal with compressed datasets, in memory or on disk, and you have to do computations, which typically execute in the CPU cache, you will need to transfer less information if your data is compressed. Okay? And that can be a huge advantage: if the time to transmit the compressed data from memory or disk to the cache, plus the decompression time, is less than the time it takes the original uncompressed dataset to be transferred to the cache, then we can accelerate computations as well. This is the second goal, and for that we need an extremely fast compressor.

Blosc is one of these compressors. The goal of Blosc is to bring in data much faster than a memcpy memory copy can. Here's an example where memcpy reaches a speed of 7 gigabytes per second, while Blosc can reach 35 gigabytes per second. So Blosc is interesting to use in bcolz; in fact, it is part of bcolz.

So, goals and implementation. One important thing, an essential thing I would say, about bcolz is that it is driven by the "keep it simple, stupid" principle, in the sense that we don't want to put a lot of functionality on top of it; we just want a very simple container with very simple iterators on top of it. So what is bcolz exactly? As I said before, it's a columnar, chunked, compressed data container for Python. It offers two flavors of containers: the first one is carray and the other is ctable. It uses the powerful Blosc compression library for on-the-fly compression and decompression, and it's written 100% in Python, plus Cython for accelerating the interesting parts. So, for example, the carray container, one of the flavors of bcolz, is just a multi-dimensional data container for homogeneous data. It's basically the same concept as NumPy, but all the data is split into chunks, to allow this easy appending and also to allow compression. The ctable object is basically a dictionary of carrays. It's very simple, okay? But as you can see, the chunks follow the column order, so queries performed on several columns will fetch only the necessary information. Also, adding and removing columns is very cheap, because it's just a matter of inserting and deleting entries in a Python dictionary.

Now, persistency: carray and ctable can live not only in memory, but also on disk. For doing that, the format that has been chosen by default is heavily based on Bloscpack, which is a library for compressing large datasets that Valentin Haenel has been working on for the past years; tomorrow and Sunday he will be giving a talk at the PyData conference. The goal of bcolz here is to allow every operation to be executed entirely on disk, and that means that all the operations you can do with objects in memory can also be done on disk. So you can handle very large datasets that cannot fit in memory, doing appends or even queries on disk.

Okay, so the way to do analytics with bcolz is, as I said before, simple, because bcolz strives to be simple. bcolz is basically a data container with some iterators on top of it. There are two flavors of iterators, iter and where, where being the way to filter data, for example. And there are blocked versions of the iterators, where instead of receiving one single element you receive a block of elements, because in general it's much more efficient to receive and work with blocks. On top of that, the idea is that you use itertools, from the standard Python library, over these building blocks; or, if you need more machinery, you can use the excellent pytoolz and cytoolz packages in order to apply maps, filters, group-bys, sort-bys, reduce-bys, joins, whatever, on top of that. This is the philosophy of bcolz.
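As a minimal sketch of this philosophy, assuming bcolz is installed (the 'mytable' directory name and the columns are made up for illustration):

    import numpy as np
    from itertools import islice
    import bcolz

    # An on-disk ctable: rootdir stores the chunks on the filesystem in
    # the Bloscpack-based format; operations work the same as in memory.
    N = 1000000
    ct = bcolz.ctable([np.arange(N), np.random.rand(N)],
                      names=['idx', 'val'], rootdir='mytable', mode='w')
    ct.flush()

    # Reopen later without loading the data into memory.
    ct2 = bcolz.open('mytable')

    # `where` filters with a numexpr expression and yields namedtuples;
    # plain itertools (or pytoolz/cytoolz) builds pipelines on top.
    hits = ct2.where('(val > 0.99) & (idx < 500000)', outcols=['idx'])
    first_ten = [row.idx for row in islice(hits, 10)]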
Also, I recently implemented interfaces to other containers, because if you cannot create bcolz containers from existing data containers, then you are lost. So I created interfaces with the most important packages when you are talking about big data. By default, bcolz has always been based on NumPy, but there is also support for PyTables: you can do indexed queries using PyTables, for example, or just store bcolz containers and produce HDF5 files with that. You can also import and export data frames very easily from pandas, which gives you access to all of its backends as well.

Okay, so let me finish my talk with some benchmarks with real data. In particular, I will be using the MovieLens dataset, and you can find all the materials for the plots that I am going to show in this repository. So let me show you the notebook. Basically, what I did is a notebook; this is the notebook that you can find in the repo, with all the parsing, processing and everything, and here are the results. So you can go to this repository and reproduce the results by yourself, if you'd like to. Reproducibility is very important, as you know. The MovieLens dataset is basically people rating movies; a group of people collected these ratings and created different datasets. There are three interesting datasets: one with 100,000 ratings, one with one million, and one with ten million. The numbers that I am going to show are for the biggest one, the ten million ratings.

So this is the way to query the MovieLens dataset. Typically, what I am doing here is using pandas for reading the CSV files and then producing a huge data frame containing all the information from the data files. The way to query in pandas, in recent versions, is to use the .query() method, which allows you to query the data frame in this simple way. Then, with bcolz's ctable.fromdataframe(), I import the data frame and create a new container, which is a bcolz ctable. This ctable container is then queried through the where iterator, as I said before, and you can pass exactly the same query as in pandas; in fact, these queries use numexpr behind the scenes, so they are very fast. And then you tell the iterator that you are interested just in the user ID field for the query.
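Here is a minimal sketch of that flow, with a toy frame standing in for the MovieLens data (the column names and the query itself are hypothetical; the real code lives in the notebook mentioned above):

    import pandas as pd
    import bcolz

    # Toy stand-in for the MovieLens ratings frame.
    df = pd.DataFrame({'user_id': [1, 2, 3, 4],
                       'movie_id': [10, 20, 10, 30],
                       'rating':   [5, 3, 5, 1]})

    # pandas: .query() evaluates the expression via numexpr.
    users_pd = df.query('(rating == 5) & (movie_id == 10)')['user_id']

    # bcolz: import the frame into a ctable, then run the very same
    # expression through the `where` iterator, fetching only user_id.
    ct = bcolz.ctable.fromdataframe(df)
    users_bc = [r.user_id
                for r in ct.where('(rating == 5) & (movie_id == 10)',
                                  outcols=['user_id'])]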
So here we have a view of the sizes of the datasets. It turns out that this dataset is highly compressible. We can see that pandas takes a bit more than a gigabyte and a half, and the bcolz container for the same data frame is in fact a bit larger without compression. But if you apply compression, the size of the dataset is reduced to less than 100 megabytes. That's a factor of almost 20 times, which is very interesting. But perhaps the most interesting thing is the query times. You know pandas; it's extremely fine-tuned for getting high performance in queries. In fact, the pandas data frame is column-oriented, a column-wise container in memory as well, so it's a perfect match for a comparison. The time it takes pandas to do this query is a little bit more than half a second, and for bcolz without compression the time is maybe 60% less, or something like that. The most compelling thing, in my opinion, is that when you do the same query using the compressed container, it takes less time than with the uncompressed container. This is essentially because the time it takes to bring the compressed data into the CPU is much less than the time it takes to bring the data in uncompressed. The last one, the upper bar, shows bcolz on disk; using compression, it is a little bit slower than the in-memory case, but still faster than pandas. This is probably due to the fact that, although the bcolz container is stored on disk, the operating system has probably already cached it in memory, so it just has a little more overhead because of the file system. But the speed is very nice as well.

This has not always been the case, though. For example, when I re-ran the benchmark on a laptop which is three years old, the MacBook Air that I am using for this presentation, pandas is the fastest; bcolz is a little bit slower, and the compressed container adds an overhead. This is because Blosc is not as efficient on all architectures; I mean, new CPUs are very fast compared with older ones. And the speedup that we are seeing here on my newer Linux box, compared with my older laptop, is the kind of speedup we are going to see more and more in the future. So compression will be very important in the future, in my opinion.

So let me finish with some status and an overview of bcolz. I released version 0.7.0 this week, so you should check it out. We are focused on refining the API and tweaking knobs to make things even faster. We are probably not interested in developing new features for now, but just in making the containers, and also the iterators, much faster. Also, we need to address better integration with Blosc. I am in contact with Valentin in order to implement what we call superchunks. Right now, every chunk is a file on the file system when you are using persistency, and when you have a lot of chunks, that means you are wasting a lot of inodes. So the idea is to tie different chunks together into these superchunks in order to avoid this overhead. And the main goal of bcolz is to demonstrate that compression can help performance even for in-memory data containers. That's very important because, I mean, I created Blosc about five years ago, and although my perception was that compression would help in this area, it is only now, five years later, that I am starting to see actual results with real data showing that this promise is fulfilled.

So we would like you to tell us about your experience. If you are using bcolz, tell us about your scenario; if you are not getting the expected speedup or compression ratio, please tell us. You can write to the mailing list, or you can always send bugs and patches; please file them in the GitHub repository. You can have a look at the manual, which is online at bcolz.blosc.org, and at the Bloscpack format that bcolz uses by default. The whole Blosc ecosystem lives at blosc.org. So thank you, and if you have any questions, I will be glad to answer.