Hello everyone, I hope you all had a great lunch. We just had an amazing lightning talk session, and after that a keynote talk, and it was so great to hear about the present and future of Python and where the community is headed. After all of this, I hope you are relaxed and had a good break. We have a very special talk lined up for you from our platinum sponsor, the D. E. Shaw group. We have Arvid with us, and I would like to add Arvid to the stream right now. We still have a couple of minutes to go.

Hey Arvid, can you hear me now?

I can hear you. Can you hear me?

Yeah, I can hear you clearly. Feel free to introduce yourself. We still have a couple of minutes before we start, so if you want, you can go on with the introduction and tell us more about the D. E. Shaw group.

Okay, I actually have a slide for that, so maybe I'll do that now. I'm Arvid Bessen. I work at the D. E. Shaw group, in our New York office; we also have a big office in Hyderabad. I'll get to what the talk is about in a moment, but first, the D. E. Shaw group. I have this one slide here that sums it up: we are a global investment and technology development firm, known as a pioneer in quantitative investing, and we've been doing that for 30 years. What does quantitative investing mean? We use a combination of quantitative and qualitative tools to uncover independent, hard-to-find sources of return across global public and private markets. In case you're wondering what that means: we take a lot of data (that's the quantitative part), run some algorithms on it that tell us where to invest, and we do that across global public and private markets. Public markets in this case are stock markets; private markets would be...

I guess half a minute left, so I'm just going to circulate the news around. If you are on the Delhi stage right now, feel free to reach out to your friends on Zulip, tweet about it, and ask everyone to join the Delhi stage; we're going to start the session. And it's about time, so I'm going to share my screen.

Yeah, you have a big screen right now.

Okay, let's get started. This talk is about array time travel with Versioned HDF5. It's an open-source project; you can find the code on GitHub, link right here. It was created by the D. E. Shaw group, which is where I work. My name is Arvid Bessen. Let's get started.

Now, I already showed this slide: the D. E. Shaw group is a global investment and technology firm. What I'm talking about today, Versioned HDF5, was done in conjunction with a firm called Quansight. Quansight is a data science and analytics consulting firm specializing in open source software. They are committed to building the open source data economy by connecting the people and organizations who participate in creating value from data. D. E. Shaw collaborates with Quansight on numerous other open source projects, not just Versioned HDF5, including, for example, Numba, Dask, and Jupyter. You can Google them; they're all really cool Python projects, and there might even be a talk about some of them today.

All right, so what are we going to talk about today? First, let's get a handle on the problem that we're trying to tackle. When we do computations at D. E. Shaw, we have data that comes in, and then we have this sort of flow graph of how the data flows through the system.
We compute something on the data, and eventually you get some result back. The problem is that the data gets updated every day. Simple example: stock prices. You get new stock prices every day when the stock market trades; in fact, you get stock prices every second, or every millisecond. How do you handle data that gets updated while you're running computations on it? That's the main focus of this talk.

The way we're going to tackle it is to first take a step back and look at HDF5, which is a file format for numerical data, and h5py, which is a Python layer on top of it that exposes that numerical data as NumPy arrays. Then we're going to build on top of h5py: we have this library called Versioned HDF5 (that's the one in the title of the talk) that wraps h5py in a way that allows us to track the evolution of data. And finally, at the end, Versioned HDF5 will hopefully allow us to solve our data flow evolution, or updating, problem.

All right, let's look at an example of a data flow graph. I hope you can see my mouse. Data flow basically means you have some data that you get from somewhere: you get your stock prices, or your temperatures from various places around the world, or your latency measurements from all the computers that you own, or your web page impressions. All this data you get from somewhere, and then you want to compute something on it. You combine data one and data two, you compute something, and you get a result, which I call intermediate result one, that you might want to store on disk as well. Then you compute a little bit more on that, and eventually, after many steps, after combining many datasets, you get your final result. I call that res.h5 here; .h5 is the HDF5 file extension, but you could store it anywhere. It could be any file format, or a database. You have roughly the same procedure either way: data comes in, you compute intermediate results, on the intermediate results you compute more, and eventually you get some final result.

Now, what happens when data gets updated? If you look at your computation as a graph like this, you can see that if, say, data one gets updated, I only need to rerun computation one, which means I need to rerun computation four, which means I need to rerun computation five, and then I get my new result. I don't have to run computation two or computation three. That's the nice part about thinking about your computation as a flow graph where the data flows through it.
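That selective rerun is easy to express in code. Here is a minimal sketch (my own illustration, not code from the talk, with hypothetical node names) of walking the dependency graph to find what has to be recomputed after an update:

```python
# Minimal sketch (not from the talk): selective recomputation in a data
# flow graph. Node names and edges are hypothetical stand-ins for the slide.
deps = {
    "comp1": ["data1", "data2"],
    "comp2": ["data3"],
    "comp3": ["data4"],
    "comp4": ["comp1", "data4"],
    "comp5": ["comp4", "comp3"],  # the final result
}

def stale(updated, deps):
    """Every computation downstream of an updated input must rerun."""
    dirty = set(updated)
    changed = True
    while changed:
        changed = False
        for node, inputs in deps.items():
            if node not in dirty and any(i in dirty for i in inputs):
                dirty.add(node)
                changed = True
    return dirty - set(updated)

# If only data1 changed, rerun comp1, comp4, comp5; not comp2 or comp3.
print(sorted(stale({"data1"}, deps)))  # ['comp1', 'comp4', 'comp5']
```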
But what are the problems when you get these updates? There are two problems that I want to talk about today. The first one is: how do we achieve consistent results if the data changes during the computation? Let's go back to the previous slide. Say data one gets updated; that means we run computation one, which means we get intermediate result one, which means we run computation four. Computation four reads data four, so what are its dependencies, basically? Its dependencies are data one, two, three, and four. But let's say that while we're running computation four, somebody updates data four: you get new data from your satellites, or your mobile phone network, or whatever you're measuring. If data four gets updated, that means you also need to recompute this part of the graph: computation three, intermediate result three, and then we arrive at computation five. Computation five combines this result with this result. But if you look closely, you see that data four comes in through two paths. Computation four could have used an old version of data four, while this other path used the new version of data four. So we will get inconsistent results, because data four is not used at the same point in time consistently.

All right, that's problem one: if data changes, we get inconsistent results. The other problem we're trying to tackle is this. Let's say you run your computation, and you run it again the next day, and again the day after that. Then somebody comes to you and says: the numbers you produced two days ago are wrong. How did that happen? We lost some money because you suggested the wrong trades (in our example; that's what we do at D. E. Shaw). How do you explain results if the stored data has already changed? It's gone; we can't easily reconstruct the data.

Well, the solution is well known in the computer science literature: it's a temporal database. Unfortunately, that's not something you can just go to Google and buy. There are very few commercial vendors that will sell you a temporal database; it's very much still a research topic. Microsoft SQL Server is probably the closest, but they don't have a true temporal database either.

So what does a temporal database do? A temporal database records how data changes through time. And what's time? Well, there are actually two times, and they are normally called valid time and transaction time. Valid time is when a fact became true in the real world. Think about a time series: if you have temperature measurements (I don't know where you all are, but where I am it's now 18 degrees Celsius, yesterday it was 16 degrees, and tomorrow it will be 15 degrees), you can look at the temperature and see how it changes through time. That's the time in the real world. But you also record those times and their temperatures in the database, or in your file, or in whatever your store is, at a certain time as well, and normally there is a lag. You have some data that you measure right now, which says that at this point in time it is this many degrees out there, but it arrives in the database a little bit later.

So let's look at the example. We say the temperature at 16:02:45 is 1, and then it takes a little while to get from the sensor over the internet to the database, and at 16:02:51 we record in the database: yes, the temperature was 1. Then we get another data point: at 16:03:20 the temperature is now 3, and again it takes a little while to get into the database. This is the valid time axis; it tells you the time in the real world: the temperature at 16:02:45 was 1, the temperature at 16:03:20 was 3. And this other axis is when things landed in the database. Now, if I query my database and ask what the temperature was at, let's say, 16:03:00, it will give me the 1 back. If I ask what the temperature was at 16:03:30, it will give me the 3 back. But I can also combine that with a transaction time in my query. I can ask: what was the temperature at 16:03:30, but what did you know about it before 16:03:00, for example? The reason we do this combined query is that there are situations like this one, where a datum comes in late: at 16:03:12 the temperature was 2, so the value in the real world changed from 1 to 2 to 3, but this was only recorded in the database at a much later time.

This kind of database is called a bitemporal database, because it has two time axes: it tracks when a data point is true in the real world, and when it entered the database. That's what we want. We want to record how our data changes, and most of our data has a time axis: this is the stock price at this time, this is the temperature at this time.
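To make the two time axes concrete, here is a small sketch (my own illustration, not the talk's code) of the temperature example as (valid time, transaction time, value) records. The talk only gives the clock times, so the date and the later transaction times are assumptions:

```python
# Bitemporal query sketch; the date and the 16:03:27 / 16:04:00
# transaction times are made up for illustration.
from datetime import datetime

D = lambda h, m, s: datetime(2020, 10, 3, h, m, s)

# (valid_time, transaction_time, temperature)
records = [
    (D(16, 2, 45), D(16, 2, 51), 1),
    (D(16, 3, 20), D(16, 3, 27), 3),
    (D(16, 3, 12), D(16, 4, 0), 2),   # late arrival: recorded much later
]

def temperature(valid_at, known_at):
    """Latest value valid at `valid_at`, among records that had already
    landed in the database by `known_at`."""
    seen = [r for r in records if r[1] <= known_at and r[0] <= valid_at]
    return max(seen, key=lambda r: r[0])[2] if seen else None

print(temperature(D(16, 3, 30), D(16, 3, 0)))  # 1: only the first record was known
print(temperature(D(16, 3, 30), D(16, 5, 0)))  # 3: all records known, 16:03:20 is latest
```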
Okay, so this is what we want. What do we have? What we have is NumPy arrays. We're using Python, we're using NumPy, and we store those NumPy arrays on disk. Why are we using NumPy arrays? Well, they're great: they're fast, they have indexed access, our developers love them because they're very familiar, and for homogeneous data, say all integers, all floats, or all dates, they're very efficient.

There are a couple of options for storing NumPy arrays on disk; I've listed some here. We are using h5py, which is a wrapper around HDF5. HDF5 is an open-source file format with an associated high-performance library that's used in high-performance computing, and h5py exposes that high-performance file format as NumPy arrays.

Cool, so we have arrays. How do we use them? It's relatively simple. You open a file, for writing in this case, and then you can store any NumPy array you want in that file. In fact, you can store more than one NumPy array in the file by giving each a name. So I create a so-called dataset "foo" with this NumPy array. Then I can read that "foo" dataset back out, I can slice it, and I can print the resulting NumPy array. I can also create nested datasets: h5py has a notion of groups, which are like folders, so in the "bar" group I create two datasets, "baz" and "boo", and I can store all my NumPy arrays in there.
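In code, those slides look roughly like this: a sketch using the standard h5py API, with the talk's example names "foo", "bar", "baz", and "boo":

```python
# Sketch of the h5py usage just described.
import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    f.create_dataset("foo", data=np.arange(10))   # one named array per dataset
    bar = f.create_group("bar")                   # groups work like folders
    bar.create_dataset("baz", data=np.ones(5))
    bar.create_dataset("boo", data=np.zeros(5))

with h5py.File("example.h5", "r") as f:
    print(f["foo"][2:5])      # slicing reads back part of the array
    print(f["bar/baz"][:])    # nested access via the group path
```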
Cool, so we have arrays covered. How do we version them? Well, here's the trick; this is the feature of HDF5 that makes our versioning work. HDF5 does not have to store NumPy arrays in one contiguous block, the way they are stored in memory. NumPy arrays, or arrays in general, are contiguous blocks of memory: if we have a million ints, they're all right next to each other in memory. What you can do with HDF5 is break this up. In this case I have a two-dimensional array, and I can break it up into these little chunks here, all the same size, and HDF5 stores each of those chunks individually. Why is that better? It's better because if I don't want to read the entire array, if I just want to access a subset, then I only need to look at which chunks contain that subset, and I read only those into memory, only the chunks I actually care about.

Okay: chunks and versions. How do we combine them? Well, we want to version an array. Versioning means we want to record the transaction time, and we need to be able to handle appends, inserts, updates, and deletes. Appends mean new data goes at the end. Inserts mean you insert something in the middle, and everything after it shifts to the right. Updates mean some data changes, but the array itself stays the same size. And deletes mean some elements disappear, and things shift to the left. As you probably know from experience with arrays, just in general when you're doing computations with them, inserts and deletes are expensive, because the old data moves to the right or to the left, so you need to update a lot of things after the point that you modified. Appends and updates are cheap, because you only touch new data. It will turn out that the same is true for our Versioned HDF5.

So let's see how we do versioning on HDF5. We use the idea of the chunks that we just talked about. HDF5 has a feature called virtual datasets, which says: I can create a dataset that doesn't have any data actually stored on disk; it just maps its chunks to the chunks of some other dataset. Let's say we create our version zero, and we create some raw data, which contains the actual data. You have chunks, in this case from 0 to 1024, from 1024 to 2048, and so on, and we just map chunk one to chunk one, chunk two to chunk two, chunk three to chunk three. We haven't gained much; all we have is a cheap virtual dataset. But here's the benefit. If we write to chunk one and chunk two, what we can do is write modified copies of them; I call those chunk 1.1 and chunk 2.1. Then I can create a new version, say version one, whose virtual dataset points chunk one to chunk 1.1 and chunk two to chunk 2.1, while chunk three still points to the old data. So we have two virtual datasets, v0 and v1. They both point into this raw dataset, and they're both accessible all the time, because the raw dataset is immutable: we never change anything.

Cool. So, in summary, how do we do versioning? We create a virtual dataset for each version, and each version is a view on the raw chunks.

Okay, so how do we use this? Here's a little demo. Same syntax as before: we open a file with h5py. Then we wrap this file handle in what we call a VersionedHDF5File. Now we can create version zero, and we do that with a context where we say stage_version and pass in the name. I called it "v0", but it could be called anything you want, and then inside this block you can create your dataset and assign whatever you want. You can see what I did there: this dataset is the increasing numbers from zero through ten.
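In code, the demo looks roughly like this. VersionedHDF5File and stage_version are the library's real entry points; the file name, the dataset name "mydata", and the v1 modification are my own stand-ins for the slides, and the timestamp lookup at the end is meant to return the latest version as of that time:

```python
# Sketch of the demo: staging and reading versions with versioned-hdf5.
from datetime import datetime, timezone
import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

with h5py.File("data.h5", "w") as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version("v0") as sv:              # changes accumulate here...
        sv.create_dataset("mydata", data=np.arange(10))
    # ...and are written to the file when the with block exits

with h5py.File("data.h5", "r+") as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version("v1") as sv:
        sv["mydata"][0] = 42                        # modify in the new version

with h5py.File("data.h5", "r") as f:
    vf = VersionedHDF5File(f)
    print(vf[vf.current_version]["mydata"][:])      # the current data (v1)
    print(vf["v0"]["mydata"][:])                    # the old version, by name
    print(vf[datetime.now(timezone.utc)]["mydata"][:])  # or by timestamp
```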
So how do I create the next version? Very similar: open the file, wrap the file in a VersionedHDF5File, create the context for version "v1", and manipulate your dataset inside that context. All the changes you make within that context are accumulated; once you exit the with block, they are saved to the file.

So let's see, moment of truth: can we get the old data and the new data? Yes, we can. We open the file, we wrap the file handle in a VersionedHDF5File, and we can get the current data by saying vf[vf.current_version], then the dataset name, and slicing all of it; and indeed, this is the current version that we wrote on the previous slide. We could also say vf["v1"], or we could access things by timestamp; that's what I'm doing to get the old versions. You can get an old version either by name (I said vf["v0"], and that's what's printed here) or by some timestamp, and that's printed here as well. So you can get both versions, and if you write many more versions, you can get all of them too.

Okay, one thing that we need to talk about is how we reuse chunks. If you move things around, if you insert or delete, how do we reuse chunks? How do we figure out which virtual datasets can point to which raw chunks? The way we do that is with content hashes, using SHA-256. It's very similar to what git does in the git version control system: you look at each file, you hash the contents, and you check whether there's already an entry for that hash; if there is, then you can reuse it, because it didn't change. How does that look for us? Well, we create a dataset, so we have a virtual dataset v0 (the original version) mapping to a raw dataset. I picked a chunk size of just three, because otherwise it doesn't fit on the slide; normally, of course, you would pick something much bigger. So in this case the chunk size is explicitly specified as three: one chunk, another chunk; this is the raw dataset. But we also keep a map of content hashes, SHA-256 hashes, mapping to slices of the raw data. If I then modify my dataset, let's say I insert 7, 7, 7 at position three, my version one now looks like this. You can see the arrows: zero, one, two, then the next chunk is 7, 7, 7, followed by three, four; the order got switched around. This hash map allows us to discover that we can reuse data, that we can reuse a chunk. That's a neat trick that we exploit. Yes, hash collisions exist, but they're extremely unlikely, the same way git has to worry about collisions, and we will handle that in a future version, hopefully soon.

Okay, what's the performance of this? Remember the slide where I said that inserts and deletes are bad, and appends and updates are good? If you do mostly appends and mostly updates, the performance penalty over just writing the file straight is not that big. If you do a lot of operations where data moves around, you pay a big performance penalty; that can be 10 to 20 times slower than straight HDF5. There's a lot more information in the blog post linked here at the bottom.
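Going back to the chunk-reuse idea for a moment: here is a minimal sketch (my own illustration, not the library's implementation) of deduplicating chunks by SHA-256 hash, using the slide's chunk size of three and the 7, 7, 7 insert:

```python
# Chunk deduplication sketch (illustrative, not versioned-hdf5's code).
import hashlib
import numpy as np

raw_chunks = []     # the append-only raw data
seen = {}           # SHA-256 digest -> index into raw_chunks

def store_chunk(chunk):
    """Append a chunk unless an identical one is already stored."""
    digest = hashlib.sha256(np.ascontiguousarray(chunk).tobytes()).hexdigest()
    if digest not in seen:
        seen[digest] = len(raw_chunks)
        raw_chunks.append(chunk.copy())
    return seen[digest]

v0 = [store_chunk(c) for c in np.split(np.arange(6), 2)]   # chunks [0 1 2], [3 4 5]
v1_data = np.insert(np.arange(6), 3, [7, 7, 7])            # 0 1 2 7 7 7 3 4 5
v1 = [store_chunk(c) for c in np.split(v1_data, 3)]

print(v0, v1)   # [0, 1] [0, 2, 1]: both old chunks are reused, only [7 7 7] is new
```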
So, almost done. We have versioned arrays, but what did we want? We wanted bitemporal data; bitemporal tables, in fact. Well, what's a table? A table, as you probably know from experience with pandas DataFrames or similar libraries, is just a collection of columns, and the columns are just arrays. So let's build our table out of column arrays. Then, when the table changes, when we update something, we can do that using transactions: we modify all the arrays that are in the table in one version. We update the table to version one by updating all three arrays in the table simultaneously; they all need to grow to size six, and then we write the new data. And then you get versioned tables. In my example I picked something that actually turns out to be bitemporal: I have a valid time in here, this is the stock price for Google, and this is the stock price for Apple, and you can see how they evolve through time. You can look at version zero and version one, and you can see them simultaneously.
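As a sketch, that table update could look like this. The versioned-hdf5 calls are real, but the tickers, the prices, and the resize-then-assign pattern are my own illustration of "grow all three columns in one staged version":

```python
# Table-as-columns sketch: one staged version updates every column together.
import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

with h5py.File("prices.h5", "w") as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version("v0") as sv:
        sv.create_dataset("date", data=np.arange(3))             # valid-time column
        sv.create_dataset("GOOG", data=np.array([1510., 1512., 1508.]))
        sv.create_dataset("AAPL", data=np.array([115., 116., 114.]))
    # one transaction: all three column arrays grow to size six together
    with vf.stage_version("v1") as sv:
        for name, new in [("date", [3, 4, 5]),
                          ("GOOG", [1511., 1513., 1516.]),
                          ("AAPL", [117., 118., 119.])]:
            ds = sv[name]
            ds.resize((6,))
            ds[3:] = new
```

Because the unchanged chunks are shared through the raw dataset, reading vf["v0"] afterwards still gives the original three-row table.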
All right, did we solve our problem? I think we did. We were worried about getting consistent computations if the data changes while we are computing. You can solve that by picking a transaction time, call it T, at the start of the computation, and then querying all the data with that T. Can we explain results if stored data has changed? Yes, we can, because all the data in all the old versions is still there; we just need to pick the right transaction time and query at that prior transaction time.

All right, that's it. Versioned HDF5 can version NumPy arrays. It's a drop-in replacement, feature complete, high performance, and open source; you should try it out and contribute. If you have any questions, I'm happy to take them now. Okay, let's see how this works. Where do I see the questions?

Hey Arvid, that was great. We have questions from the audience. Number one: interesting, it's like using append-only logs; basically, you are indexing the datasets rather than scanning complete arrays, right?

Yes, the append-only log is a very good comparison. Sometimes you have a transaction log where you append all the things that you modify. It's a little bit different from a transaction log in a database; a transaction log in a database will also say what you did, not just store the content. And yes, what we do is use this sort of indexing: you have this virtual dataset that maps all the indexes of the virtual dataset to the raw data, the log or the raw data that we then store. And because the indexing is O(1), it's cheap: I just need to say, okay, I want index 5000, and that maps to, let's say, index 3000, from the virtual to the raw dataset. Getting that is cheap and fast.

Okay, so we have a second question: does using HDF5 make sense with SSDs? I think it does. To be honest, I've never used it with anything other than SSDs. The way it works internally (and I'm not an expert in HDF5; I work two abstraction layers higher than that) is that HDF5 uses a B-tree architecture: you have all those chunks, and it builds a B-tree on the chunks, so you can insert in the middle and all that, and it knows which chunks to load from where. I think it actually works much better with SSDs, because with spinning disks you would need to make sure you don't seek. That's the problem with spinning disks: you always need to read as much data as you can while moving around the disk. With SSDs you can afford those seeks, where you follow the B-tree: here's the data, here's the other data, here's the chunk that I want to load.

Got it. Just one last question: what are the typical chunk sizes? Typical chunk size: normally we try to tune for the operating system's buffer size. I think most operating systems have one- or two-megabyte buffers; we use it on Linux. So you try to choose your chunk size accordingly, depending on the data type: if it's a floating-point number, a double, that's eight bytes, so you would pick 2^20 / 8 = 131,072 elements as your chunk size, and if it's a 32-bit integer, you would pick 2^20 / 4 = 262,144. That's what you're trying to do.

All the attendees will be present on Zulip, so feel free to head to the Delhi stream; you'll find 2020-stage-delhi, so head there and feel free to ask any questions you would like. Arvid will be present there, and if you want any other resources regarding the session we just had, feel free to ask. So thanks a lot, and closing session... oh, sorry, closing remarks?

No, that's it. I'm heading over to that Delhi stage right now, so ask me any questions there. Thank you.

Thank you.