Good evening. As Miran said, I would like to share my views on how to store and analyze large data silos, taking into account how modern computer architectures are evolving. First, a few words about me. I am a physicist by training and a computer scientist by passion, and I believe in open source; the proof is that I have spent a large part of my life doing open source development. The project I have invested the most in is probably PyTables, where I spent almost 10 years. My current pet projects are Blosc and bcolz, and I am going to talk quite extensively about the latter during my talk. So, why open source? Well, in my opinion, there is a big duality between dreams and reality: many times we, the programmers, think that some things can be improved, and the hard part is finding the time to implement them. I share the opinion with many others that the art is in the execution of an idea, not in the idea itself, because there is not much value left in an idea alone. Open source allowed me to implement my own ideas, and it is a nice way to fulfill yourself while helping others as well. So, first of all, I am going to introduce the need for speed, because the goal is to analyze as much data as possible using the resources that you already have. Then I will talk about new trends in computer hardware, because I think that understanding the evolution of computer hardware is very important for designing your data structures and data containers. And I will finish by showing you bcolz, which is just one example of a data container for large datasets that follows the principles of these forthcoming computer architectures. Okay, so why do we need speed? Let me remind you of the main strengths of Python.
I think one of the most important things about Python is its rich ecosystem of data-oriented libraries; most of you will know NumPy, pandas, scikit-learn and many others. Python also has a reputation for being slow, but probably most of you know that it is feasible to identify the bottlenecks in your programs and then write C extensions to reach C performance, using excellent tools like Cython, SWIG, or f2py. But for me in particular, the most important thing about Python is interactivity: the ability to interact with your data and see the results of your filters and queries almost in real time. That is the key thing about Python, for me. Of course, if you want to handle big amounts of data interactively, you need speed; otherwise this is a no-go. But designing code for storage performance depends very much on the computer architecture, and that will be the main point of my talk today. In my opinion, existing Python libraries need to invest more effort in getting the most out of existing and future computer architectures. Also, let me be clear about the scope of my talk: I am not going to talk about how to store and analyze data on big clusters, or farms of clusters. In my opinion, that is not exactly the niche of Python. The real workhorse of Python is being able to work on big servers, maybe, but mostly on laptops. A lot of people are using Python on their own laptops, and my goal is to help them work with more data using laptops or big servers. But optimizing for laptops or servers does not mean this is a trivial task, because modern laptops and servers are very, very complex beasts.
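As a minimal illustration of the "find the bottleneck first" workflow mentioned above, here is a hedged sketch using the standard-library profiler; `slow_sum` is just a made-up hotspot, the kind of loop you would later rewrite with Cython or a C extension:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately naive pure-Python loop: a typical hotspot
    # that profiling would point you at before optimizing.
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)  # the top entries identify where the time goes
print(stream.getvalue())
```

Only once the profile confirms where the time is spent does it pay off to reach for Cython, SWIG, or f2py on that specific function.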
We have to understand how the architecture is designed, how memory is accessed, how the different caches work, and many other things. So let's have a look at current architectures and see how they should be leveraged in order to design new data structures. The new trends in computer architecture are mainly driven by nanotechnology. I think it is very interesting to see how Richard Feynman predicted the nanotechnology explosion more than 50 years ago, so I recommend you check out that talk. Anyway, I think the most important thing about memory architecture nowadays is the difference in the evolution of memory access time versus CPU cycle time. We know that CPUs are getting faster and faster; in fact, their speed grows almost exponentially, following Moore's law. In contrast, memory speed is increasing very, very slowly, and this is creating a big gap, a big mismatch, between CPU speed and memory speed. This is a key driver in the evolution of architectures. If we look at that evolution, we can see that in the 80s the memory architecture of computers was very simple: just a couple of layers, main memory and the mechanical disk. Then in the 90s and 2000s, vendors realized this mismatch between memory and CPU speed, and they started to introduce two additional levels of cache in the CPUs. Nowadays, in this decade, it is usual to have up to six layers of memory. This is a big paradigm change, and programming for a machine from the 2010s is not the same as programming for a machine from the 80s. So in order to understand how we can adapt better to the new architectures, it is important to know the difference between reference time and transmission time.
Let me explain. When the CPU asks for a block of data in memory, the time from the CPU request until the memory starts transmitting the data is called the reference time; others call it latency as well. The time from when the transmission starts until it ends is called the transmission time. The thing is, if you have a big mismatch between reference time and transmission time, you are not accessing your data optimally. The interesting idea is that reference time and transmission time should be of roughly the same order. But of course, not all storage layers are created equal. For example, main memory has a reference time of typically 100 nanoseconds, and we can transfer up to one kilobyte in that amount of time. For solid state disks, where the reference time is around 10 microseconds, we can transfer up to four kilobytes in the same time. And for mechanical disks, the reference time is typically around 10 milliseconds, and in that time we can transfer up to one megabyte. So the slower the medium, the larger the block that should be transmitted in order to optimize the memory access. Again, this has profound implications on how you access storage, as we will see soon. Let me finish this part with some trends in storage. The clear thing is that, as we have seen, the gap between memory and permanent storage (hard disks) is large and still growing. To fill this gap, vendors are not just creating SSD devices with the same interfaces as typical hard disks; they are starting to put solid state memory on buses like PCI. New protocols and specifications are also being introduced in order to bring all this solid state memory to laptops.
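These figures lend themselves to a small back-of-the-envelope calculation: a balanced block size is roughly latency times bandwidth, so each layer's numbers imply a bandwidth. The values below are the illustrative order-of-magnitude figures from the talk, not measurements:

```python
# Rough figures from the talk: the slower the medium, the larger the
# block you must move so that transfer time matches the latency.
LAYERS = {
    # name: (reference time / latency in seconds, balanced block in bytes)
    "RAM": (100e-9, 1 * 1024),
    "SSD": (10e-6, 4 * 1024),
    "HDD": (10e-3, 1 * 1024 * 1024),
}

def implied_bandwidth(latency_s, block_bytes):
    """Bandwidth at which transferring `block_bytes` takes `latency_s`."""
    return block_bytes / latency_s

for name, (lat, block) in LAYERS.items():
    bw = implied_bandwidth(lat, block)
    print(f"{name}: latency {lat:.0e} s, balanced block {block} B, "
          f"implied bandwidth {bw / 1e6:.0f} MB/s")
```

Running this shows why a 1 MB read unit that is perfectly reasonable for a spinning disk would be absurd as a cache-line-sized unit for RAM: the balanced block grows by three orders of magnitude as the medium slows down.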
So in your own laptop you will be able to access solid state memory at PCI speeds, which is very different from accessing a solid state disk via the traditional SATA bus. The trend in CPUs is that we are going to see more cores and wider vectors for SIMD (single instruction, multiple data), and we are already seeing integration of GPUs and CPUs on the same die. These are the trends we should keep in mind when designing our new data containers. So what I am going to do is show you an example implementation of data containers that leverages these new computer architectures: bcolz. It is a library that provides data containers that can be used in a similar way to NumPy, pandas, dynd and others. In bcolz, data storage is chunked, not contiguous, and chunks can be compressed. There are two flavors: carray, which is meant to host homogeneous types and multidimensional data, and ctable, for heterogeneous types stored in a columnar way. I am going to skip some slides because I am a little short of time; don't worry, the important thing I want to transmit is the consequences of using these containers. So I am not going to explain in detail the difference between contiguous and chunked storage. The important thing is that chunking is nice because it allows efficient enlarging and shrinking, it makes compression feasible, and in addition the chunk size can be adapted to the storage layer. Remember that depending on the storage layer you are going to use, the chunk size should be different, right? So chunked storage allows you to fine-tune the chunk size for your own needs. It has other advantages too: appending is much faster, since you don't need a copy when doing an append operation on a bcolz container, and less memory travels to the CPU.
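As a rough illustration of why chunking enables cheap appends and per-chunk compression, here is a minimal, stdlib-only sketch of a chunked, compressed container. This is a toy, not the bcolz implementation (bcolz is written in C and uses Blosc), but it shows the same mechanics: appends only touch the last partial chunk, and the chunk size is a tunable knob you can match to the storage layer:

```python
import zlib

class ChunkedArray:
    """Toy chunked, compressed byte container (stdlib-only sketch)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size   # bytes per chunk, tunable per layer
        self.chunks = []               # sealed, independently compressed chunks
        self.tail = bytearray()        # uncompressed partial chunk

    def append(self, data: bytes):
        # Appending never copies the whole container: only the tail grows,
        # and full chunks are sealed (compressed) as they fill up.
        self.tail.extend(data)
        while len(self.tail) >= self.chunk_size:
            full = bytes(self.tail[:self.chunk_size])
            self.tail = bytearray(self.tail[self.chunk_size:])
            self.chunks.append(zlib.compress(full))

    def tobytes(self) -> bytes:
        out = b"".join(zlib.decompress(c) for c in self.chunks)
        return out + bytes(self.tail)

    @property
    def nbytes(self):   # logical (uncompressed) size
        return len(self.chunks) * self.chunk_size + len(self.tail)

    @property
    def cbytes(self):   # bytes actually stored
        return sum(len(c) for c in self.chunks) + len(self.tail)
```

For example, appending 10,000 repetitive bytes seals two 4 KiB chunks that compress to a few dozen bytes each, while the remaining 1,808 bytes sit uncompressed in the tail awaiting more appends.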
Also, the table container implemented in bcolz is columnar. Columnar means that the data in each column are adjacent in memory. In a table stored in a row-wise fashion, if you are interested in just one column, say an int32 column, you are going to transfer much more data into the CPU than necessary, just because of architectural reasons; this is how computers work right now. In a column-wise table, if you are interested in just one column, you grab only that column and transfer it into the cache. So that also means less memory travels to the CPU. Also, why compression? Well, the first thing is that it allows you to store more data, either in memory or on disk. But another goal is this: if your data is compressed in memory or on disk, you can transfer the compressed data into the cache and decompress it there, and the sum of the transmission time and the decompression time could, in some situations, be less than the time it takes to transmit the original uncompressed dataset into the cache. That is the goal of Blosc, which is the compressor that bcolz uses behind the scenes. Blosc's goal is to be faster than a memcpy. It uses a series of techniques that I am not going to describe, but basically it leverages new architectures. In this case, for example, we can see Blosc decompressing up to five times faster than a memcpy. I am not going to describe how Blosc works; there are other talks about that. The main place to use Blosc is basically to accelerate input/output, not only for mechanical disks, but especially for solid state disks and main memory.
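The row-wise versus column-wise difference can be made concrete with the standard library alone. In this sketch the table schema (an int32 `id`, a float64 `value`, an int32 `flag`) is hypothetical; the point is simply how many bytes must stream through the cache to read one column in each layout:

```python
import struct

# Hypothetical table: 1000 records of (id: int32, value: float64, flag: int32).
records = [(i, i * 0.5, i % 2) for i in range(1000)]
ROW_FMT = "<idi"   # one packed row: 4 + 8 + 4 = 16 bytes, no padding

# Row-wise layout: all fields of each record interleaved in one buffer.
row_store = b"".join(struct.pack(ROW_FMT, *r) for r in records)

# Columnar layout: one contiguous buffer per column.
col_id = b"".join(struct.pack("<i", r[0]) for r in records)
col_value = b"".join(struct.pack("<d", r[1]) for r in records)
col_flag = b"".join(struct.pack("<i", r[2]) for r in records)

# Reading just the int32 `id` column from the row-wise layout forces
# every byte of every record through the cache (16 bytes per record),
# while the columnar layout moves only the column itself (4 bytes).
print(len(row_store), len(col_id))
```

On top of the smaller transfer, a homogeneous column like `col_id` tends to compress better than interleaved rows, which is part of why columnar storage and compression combine so well in a ctable-style container.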
Blosc is a library made in C, and it is widely used; for example, it is being used in OpenVDB, a library for producing 3D animation movies that is maintained by DreamWorks and used in Houdini. There are already a series of projects using bcolz: for example Visualfabriq's bquery, which is meant to produce out-of-core group-bys on disk, not just in memory, because bcolz supports containers both on disk and in memory. Continuum's Blaze is also using bcolz, and Quantopian are very excited about using it too. I am going to skip the plots where people show how bcolz can beat MongoDB or HDF5 for their own use cases. And I am going to close the talk by saying that there is probably a data container out there that already fits your needs. So my advice is always to check the existing libraries and choose the one that fits your needs. Sometimes you can be surprised: depending on the data structure you are using, you can get much more performance, not because of the algorithm, but because of the data structure or data container. Also, you should pay attention to hardware and software trends and make informed decisions about your current development, which, by the way, will be deployed in the future. It is important to be conscious of the new computer architectures, because your application is going to run on them. And finally, in my opinion, compression is a useful feature not only to store more data, but also to process data faster under the right conditions. So let me conclude my talk with my own version of a quote by Isaac Asimov, of whom I was a huge fan when I was a teenager: it is change, continuing change, inevitable change, that is the dominant factor in computer science.
And in my opinion, no sensible decision can be made any longer without taking into account not only the computer as it is now, but the computer as it will be. So, thank you very much. Questions?

Q: There were some graphs comparing with MongoDB, and there are other compressors out there, for example Snappy; I haven't heard much about Blosc before. If there are comparisons for similar use cases tried by other people, what are the advantages of the technologies you presented?

A: So you mean that MongoDB is using Snappy, right? Yes, for the WiredTiger storage engine, for example, they use Snappy or zlib; Snappy is faster for compressing. RocksDB also uses these compressors for many things. That's a good question. As I said before, bcolz uses Blosc behind the scenes, and when I said Blosc is a compressor, that was an oversimplification: Blosc is actually a meta-compressor, so it can use different compressors internally. In particular, it can use Snappy behind the scenes, it can use zlib, and LZ4, which is kind of the new trend in compression because it is very fast and compresses very well too, and it also has support for BloscLZ. So you have a range of compressors that you can use to fine-tune for your applications.

Q: Maybe a silly question. I have just been to a talk on Numba, and they claim to speed up NumPy and things like that. Does bcolz work with Numba?

A: Yes. bcolz is only providing the data layer, the data structure; on top of that it provides very little machinery.
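The meta-compressor idea can be sketched with the standard library alone. Since Blosc itself may not be installed, this analogy swaps stdlib codecs (zlib, bz2, lzma) behind one interface, the same way Blosc selects among BloscLZ, LZ4, Snappy or zlib behind the scenes; the function names here are illustrative, not Blosc's API:

```python
import bz2
import lzma
import zlib

# One uniform interface over several interchangeable codecs,
# mirroring how a meta-compressor lets you pick a backend per use case.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

def compress(data: bytes, cname: str = "zlib") -> bytes:
    packer, _ = CODECS[cname]
    return packer(data)

def decompress(data: bytes, cname: str = "zlib") -> bytes:
    _, unpacker = CODECS[cname]
    return unpacker(data)

payload = b"some highly repetitive payload " * 100
for name in CODECS:
    packed = compress(payload, name)
    assert decompress(packed, name) == payload
    print(f"{name}: {len(payload)} -> {len(packed)} bytes")
```

The practical upshot is the one from the answer above: by changing a single codec name you can trade compression ratio for speed without touching the rest of your storage code.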
It just provides a few functions, for example sum, but very little. So the idea is to use bcolz for the data, and on top of that you can put Numba, for example, for doing computations. You can also put Dask on top, which is a way to do computations in parallel as well. bcolz provides a generator interface so that other layers on top can leverage it. You are not bound to use bcolz's machinery for doing computation; it only provides the storage layer, so to say.

Q: A related question: can you use pandas with bcolz as the storage engine?

A: Sorry, can you repeat?

Q: Can you use pandas, with all the pandas API, and still have bcolz as the storage engine?

A: As the storage engine, yes, exactly; that is another possible application. For example, I have seen some references by Jeff, I don't remember his last name, a core maintainer of pandas: he is looking at whether pandas can support different backends, like SQL databases or HDF5, and bcolz could be another backend for pandas itself.

Q: So it could be, but it isn't now?

A: No. To my knowledge, there is no bcolz backend for pandas yet, but it could be done.

Moderator: So this was everything; thank you very much. Just to let you know, this was the last talk in this room. There will be lightning talks at a quarter past five, and that's everything. Please go into the Guidebook app on your phones and rate the talks you attended.

A: Before leaving, a quick reminder: I will be giving a tutorial on Wednesday, where I will talk more about all these data containers and do comparisons between bcolz, pandas, NumPy, and different storage layers. If you are interested, please join us. Thank you.