Yeah, this is the presentation, and as I said, there's a link in Slack to a Google Slides version of the same thing, so you can follow along with that rather than the shared screen if you prefer; it's up to you. So, I've been hosting for 10 minutes now; now I'll introduce myself. My name's Mike Smith. I'm a software developer at the European Molecular Biology Laboratory, and I've worked for an organization called the German Network for Bioinformatics Infrastructure. So I'm really a software developer, and maybe something of a data scientist, but I'm a programmer by trade rather than a scientist.

The topic of this talk is working with on-disk data formats. I was really excited when Charlotte and Michael approached me to talk about this; it's not something I've ever seen on the program of a single-cell workshop before. But I also wondered whether anybody else would be excited about it. You're excited about the methods: how to interpret your data, how to analyze it, how to draw biological conclusions from it. Maybe you've never really thought about how we should store that data on disk; it's a bit more of a dry topic. So I'm hoping that with this slide I can at least motivate why you might be interested in this session, if you're not automatically interested.

Perhaps you really do just want to use software: you want to know how to analyze things, and you don't care about this particular layer. In that case, maybe this session helps you understand some of the options you see in the software you use. If it offers to save in a sparse format, or makes you choose between an HDF5 file and something else, then hopefully this high-level overview explains a little about why you might want to make those choices, how to make them, and why they matter to you.

Another thing you might encounter is that you get hold of a file in a particular format but don't really know how it was generated or where it came from. Maybe it has an unusual file extension and you don't know how to open it. It could come from a collaborator, or from supplemental material where the methods section is less than useless and doesn't explain how it was produced. Hopefully a little overview of what we do, and of the tools you can use to look inside a file, will help you make sense of that kind of thing.

I also want to emphasize that sometimes you don't even know you're using on-disk data. There are tools that work completely seamlessly with particular file formats on disk, and as a user you might never realize you're using them. That's great: it's seamless, it works really nicely. But you might also encounter unusual errors that reference things you weren't expecting, and having some idea of how this works might be useful going forward.

And finally, if instead of being a purely biological user you're a methods developer, you want to dig into these data sets: to write new statistical methodology for them, or to pull out information that nobody else has looked at but that you know the scientific equipment produces.
Then it's really useful to understand these file formats, and if you care about performance, how well and how quickly things run, then a really good understanding of how these formats work is important. So if you fall into that category, it pays to have a solid understanding of how the data sit on disk.

To emphasize those last two points, about sometimes not knowing you're using on-disk data, and about methods development, here's a little schematic of one of the software stacks within the Bioconductor project. You've already seen SingleCellExperiment objects, and maybe you've used some of the tools that sit at that end. But underpinning a SingleCellExperiment there could be an HDF5Array object, which interacts with a package, which interacts with a C library, which in turn interacts with a file on your disk: an HDF5 file. There are multiple authors of the packages within this stack, and if you fall into the category at the top, whether as a user or a developer, you probably don't know who to talk to at the bottom, or necessarily even care that you're basing your work on all of these layers; you just want it to work. But I wanted to emphasize that you can be relying on these on-disk mechanisms in ways you aren't necessarily expecting, and hopefully this introduces the topic.

Before we really talk about on-disk storage, let's talk about general strategies for representing count matrices. With single-cell data it's all about the count matrix; you know that, and you probably know it better than me, since you work with this more frequently than I do. The count matrix is the big bulk of the data in a single-cell experiment, certainly in most cases, and it has one very special property: sparsity. Sparsity pervades everything to do with single-cell analysis. It affects the biological interpretation, which you again know better than I do: is a zero truly zero expression, or just a technical artifact? It affects the statistical analysis, as in all the previous lectures about how you model things, which comes back to the same question of whether a gene is truly absent or whether the zero is an artifact of the experimental technique.

Those are the analytical sides, but sparsity also affects the software we manipulate these data with. I think there was a question on the first day about sparse matrices, so at least some people are aware they exist; we'll look at them in more depth over the next few slides. Within software you can choose whether to represent things as sparse or dense matrices, and that choice matters precisely because these count matrices are exceedingly sparse, so there are real gains to be had by exploiting that. And finally, sparsity influences how we can store and transfer things: we can exploit it and make choices about how we access the data, or move it around, that are more efficient than the completely naive case.
So we're really going to focus on the count matrix, at least for the first part of this talk, and on strategies for representing it. We're not talking about on-disk storage just yet; this is about general ways of representing matrices.

The obvious one is the 2-D grid. Everybody's seen this, and you'll understand why I've coloured it in a few slides' time. This is a moderately sparse matrix (I rather wish I'd made it sparser, as you'll see in four slides' time), with rows and columns and values, quite a lot of which are zeros. It's pretty small for single-cell data, I think you'd agree, but it works for our little cartoon. And this is the standard dense matrix: if you create a matrix in R, or in Python or any other language, this is pretty much how it will be represented; there really will be an entry for every element.

It takes up a certain amount of memory, so let's think about how much. Our cartoon only has five rows, but a typical human single-cell experiment has more like 30,000 rows for 30,000 genes. It might differ if you're working with multiple organisms or counting transcripts rather than genes, but let's take 30,000 as an approximation. Each element stored is an integer, which is four bytes, so one column needs 120 kilobytes. That's nothing: we don't even think about whether a computer can allocate that any more, and we probably didn't 25 years ago either.

But in a dense matrix those requirements scale linearly with the number of columns. With 100 columns we're at 12 megabytes of memory, which is also fine; that's far less than your laptop has, and every tab you open in Chrome consumes considerably more. Jump up another couple of orders of magnitude to 10,000 cells, though, and suddenly we need 1.2 gigabytes, and we're starting to reach the point where this is more challenging to manage. We've been creating instances in Renku with four gigabytes of RAM; this would just about work there, but in a language like R it's pretty easy to make a copy of something, and suddenly you're at nearly two and a half gigabytes. With anything else going on on your machine, it might be tough to analyze 10,000 cells, which is not a particularly big experiment. I skipped a row on the slide here, but at 100,000 columns you'd need 12 gigabytes: we couldn't do that in our Renku instances, and I think most of us would struggle on our laptops. I have 16 gigabytes of RAM, so I could read it in, but I couldn't really do much with it, at least not naively. And for the largest experiments coming out now, a million cells, you suddenly need 120 gigabytes of RAM. Nobody's desktop has that; you might have one or two machines in your institute that do, but you're probably competing with other researchers for those resources.
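To make that arithmetic concrete, here is the same scaling as a couple of lines of R (4 bytes per integer element and 30,000 genes, as on the slide):

```r
## bytes needed for a dense integer count matrix: 30,000 genes x N cells
genes <- 30000
bytes_per_cell <- genes * 4                  # 120 kB per column

cells <- c(1, 100, 1e4, 1e5, 1e6)
data.frame(cells     = cells,
           gigabytes = cells * bytes_per_cell / 1e9)
```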
So it starts to get into the realm where doing this in memory is simply infeasible, at least with this naive dense matrix strategy.

One thing we can consider instead, and it's done pretty routinely, is some sort of sparse representation. The most obvious approach is to throw away all the zeros. That's what the vector at the top here has done (ignore the colouring for now): it stores just the non-zero elements of our matrix. To be able to recapitulate the original matrix, we also store the indices of where those values came from. For instance, and we're using zero-based notation here, which makes the language tricky, the first two index elements reference the first row and the first column, where the value is 10, and the same goes for the rest. From these three vectors we can recapitulate our matrix. This isn't a great example, because we actually use more values across the three vectors than there are in the whole matrix; the sparser the matrix, the better this compression works.

But we can do better than this relatively simply. You'll notice that the column indices contain lots of repeating elements; those repeats match the colours, which match the columns (there are three different ways of encoding the same data here). All that replication isn't actually needed, and we can be slightly more efficient without making things much more complicated. Instead of storing the actual column index for every value, referring back to the original matrix, we can make this vector point into the row-index vector, recording which element of the row indices is the first one for each column. In this case the yellow block tells us that the first element belongs to column one; the next entry tells us to jump to the third element for the start of the next column; and so it continues. We also need one value at the end, marking one past the very last element of the vector, so you end up with one more element in the column pointers than there are columns in the matrix. (I'll come to Leon's question in just a second, because I had to look this up yesterday myself to make this example.)

So, Leon's question: how does this handle empty columns? The answer is that, depending on where they appear, you essentially repeat the value in the column pointers, and nothing appears in the row indices for that column. There are probably other ways around it, but if a column is completely empty, its pointer is simply the same as the previous one; if the very first column were empty, for instance, its entry would just stay at zero. Let me write an answer for you in Slack as well. But essentially, you end up with repeated values in the column pointers; don't worry, it handles empty columns fine.

Does this make sense to everyone? Put your hand up if it does, and then you can all clear them again. I hope it does, because I'm going to ask you to recapitulate this in the exercises.
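This column-pointer layout is exactly what R's Matrix package uses in its dgCMatrix class, so you can poke at a tiny example yourself, including an empty column (the values here are made up):

```r
library(Matrix)

## 3 x 3 matrix whose second column is completely empty
m <- matrix(c(10, 0, 0,   # column 1
               0, 0, 0,   # column 2
               0, 0, 7),  # column 3
            nrow = 3)
sm <- Matrix(m, sparse = TRUE)   # a dgCMatrix, i.e. the column-pointer layout

sm@x   # values:          10 7
sm@i   # row indices:     0 2        (zero-based)
sm@p   # column pointers: 0 1 1 2    (the repeated 1 marks the empty column)
```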
So we can go through it again later, but I wanted to show that this sparse representation exists. I haven't picked a particularly good example: the dense matrix stores 25 integers, while our sparse version uses 14 values plus 14 row indices, 28 so far, plus the 6 column pointers, 34 in total. So we've actually made it worse. But hopefully you can see that the more zeros there are, the shorter those two vectors get, because their length equals the number of non-zero elements: with only one non-zero element, each would have length one. So the sparser the matrix, the more compressed this representation becomes.

That's summarized on the next slide. In the worst case, where the matrix isn't sparse at all, we end up with three vectors whose combined length is three times the number of elements in the matrix; three times the storage would be a terrible outcome you'd never want. In the best case it's pretty much equivalent to storing one row of the matrix. There's probably a cleverer scheme for the very best case, something like storing a zero plus a repeat count, but naively it's equivalent to one row, which is pretty good compression. We'll focus on this in one of the exercises and try to recreate the algorithm I've just talked you through. I wanted to explain it because it's a recurring theme in these on-disk data representations.

And that brings me to the on-disk part. Everything so far has been theoretical, about how to represent matrices, but at some point you want to save or share these counts: with collaborators, or moving them from the machine that produced them to wherever the analysis happens. So you need to think about how they'll be stored. And, as in our million-cell example, sometimes the data are simply too big ever to be read into memory: you want to work on them on your laptop, but you don't have 120 gigabytes of RAM. So there need to be strategies both for moving the data around and for letting people jump in and access subsets of it, in a way that's feasible on a laptop rather than a supercomputer.

This sounds like a relatively straightforward problem, except that there's no consensus in the field on how to do it. If you've seen data from more than one manufacturer, or run it through multiple analysis tools, you'll have encountered this; even from a single manufacturer you can get the data in different formats. A few examples. MEX files, which stands for Market Exchange format, come from 10x's Cell Ranger software. They use this sparse representation as three columns in a text file, which is then gzipped and given to you.
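If you're handed one of these, the Matrix package can read it directly. A minimal sketch, assuming the usual Cell Ranger file names (features.tsv was called genes.tsv in older versions; adjust to whatever you actually have):

```r
library(Matrix)

## matrix.mtx.gz holds the (row, column, value) triplets;
## the two .tsv files carry the row and column names
counts   <- readMM(gzfile("matrix.mtx.gz"))
features <- read.delim(gzfile("features.tsv.gz"), header = FALSE)

rownames(counts) <- features$V1                           # gene IDs
colnames(counts) <- readLines(gzfile("barcodes.tsv.gz"))  # cell barcodes
```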
But from the same software you can also output an HDF5 file, which uses the same sparse format, storing the same three vectors, just in a completely different file format. Then on day one I think you used alevin, which has its own output format; the file is actually called quants_mat.gz, I believe, and it uses its own representation of this sparse three-vector idea again, different from both of the above.

Then there's the alternative strategy. If you've ever encountered a Loom file: it does the same kind of job, moving count matrices around along with some extra data, but Loom actually mandates that the matrix is dense. It doesn't use this sparse representation at all; it relies on other techniques, which we'll talk about shortly, for making that efficient. And then something like the SingleCellExperiment, which you've encountered and which was on one of the early slides: that's Bioconductor's object for representing single-cell data, and it can use, among other things, HDF5 files that are either dense or sparse, and the same representations in memory as well, so you can be very flexible within that framework. But what all this means is that there's no clear-cut choice from the community about the best way of moving this data around or manipulating it.

We've also reached a point where I struggle to use the right language. I talk about HDF5 files, and that's a particular file format, but we also talk about, say, Loom files as being a file format, and they are HDF5 files. I tried to find out whether there's proper computer-science terminology for when something is a file format versus a file schema or similar; there doesn't seem to be a consensus, so we just call everything a file format. But the distinction matters. There are low-level, very general-purpose file formats: the text file in the first example on this slide counts, since you can use a text file for pretty much anything, and the same applies to HDF5, a fairly general-purpose format you could use for all sorts of things. People then build on top of those and say: this is a specific layout of that file type for moving, in our case, count matrices. If you take a text file and mandate three columns representing the values, the row indices and the column indices, that's effectively a new file type. The same goes for Loom files: they're one specific, well-structured and hopefully well-documented layout of an HDF5 file. So sometimes you need a bit more language to say that something is not just an HDF5 file, but an HDF5 file with a particular internal structure. And as we've seen, different manufacturers and projects produce HDF5 files with different internal structures, to the point where they represent the matrices in completely different ways.

So hopefully that's an overview of how we store these matrices. For the rest of the talk we'll focus almost exclusively on HDF5. I'm the author of a package that reads these files in R, which is why the focus is on it.
HDF5 also probably has the biggest market share in single-cell data sets, but there are plenty of other projects and efforts exploring other file formats, and they all share the same underlying concepts as to why they're better than just dumping everything into a text file and zipping it up. So with that, I'll introduce the HDF5 format. But first, I think there's a question from Simon; maybe you'd like to elaborate on it? Where is the question: I can see the chat, but is it in Slack?

"That's all right, I can ask it. I was just wondering: following your logic, you could also get rid of one of the two index vectors if you used linear indices into the matrix. It's a little bit more compact. You don't need both rows and columns; you could have only one of the two."

Sorry, let me go back to the right slide. So, yes, you're right: you could treat this as a single linear index, although I guess you'd still need to know the number of rows and columns. But you could run into limitations. For those who aren't following the gaps in our conversation: if you use a single linear index, you can easily run out of positions you can reference. If the matrix is big enough, you can't use a standard integer to address every element, whereas with separate row and column indices each index stays small, so the matrix can be much larger. And as Simon suggested, you could instead store the differences between successive indices, which would hopefully stay below the largest integer you can represent. That's not the strategy that's normally used, though, which presumably means there's a good reason why not. I wonder whether it affects how easily you can get information out, because to find any particular element you'd always have to start at the top-left corner and accumulate the differences the whole way through. I don't know for certain that that's why, but I can say it's not the standard: if you create a dgCMatrix with the Matrix package in R, for instance, that's not one of the representations it uses. Does that go any way towards answering your question?

"Yes, I was wondering exactly this: why is that not an option or a typical strategy?"

I think the performance would be really bad if you needed to access something near the end, say the bottom-right corner of our matrix here, because you'd have to keep working through all the differences to find that particular element. That would be my suggestion as to why, anyway.

"Yeah, thanks."
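To put numbers on the overflow concern, here is a hypothetical sketch; this is just the arithmetic behind the argument, not how dgCMatrix actually stores anything:

```r
## a single linear index (column-major, zero-based) would address
## element (i, j) of an nrow x ncol matrix as i + j * nrow
nrow <- 30000    # genes
ncol <- 1e6      # cells
nrow * ncol              # 3e10 positions to address...
.Machine$integer.max     # ...but a 32-bit integer stops at ~2.1e9

## separate row and column indices never exceed max(nrow, ncol),
## so each fits comfortably in an ordinary integer
```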
OK, so now I'll introduce the HDF5 file format. You've probably seen these files at some point, because a lot of tools use them; Charlotte introduced the AnnData format yesterday, which is one particular example I didn't actually mention on the previous slide, and it uses HDF5 to represent its data.

The name stands for Hierarchical Data Format, and unsurprisingly it's version five. The hierarchical part means that inside the file you can have a tree structure, just like on a standard file system. There are two basic elements: groups and datasets. You can pretty much think of groups as folders; as the name suggests, they group things together. Datasets are like files: they're the end points, and they can't have things nested inside them. You can build arbitrarily complex structures: as many layers of nesting as you like, links between groups, and so on. A lot of the time we don't exploit that, possibly because it makes things too complicated, but one neat use is to bundle all the different aspects of a particular experiment or data set together into a single file. We'll see a few examples as we work through, and we saw some in the AnnData file from Charlotte yesterday, where there was the raw data alongside embeddings people had computed; I think someone asked where X_pca had come from. Those are later additions to the analysis, but they're all available in the same file. So you can use groups to collect things together: things done on the same day, or the pre-processed data, then the post-processed data, then summarized data stored later still. You run into the same questions as when organizing files on a hard disk, and you can make similarly arbitrary choices, but it does let you store and collect lots of data together in a single file.

I'd also call the format self-describing, because every element within the file, including the top-level element, can have metadata attached. You can store things that relate to the datasets or the groups: the time something was run, the date, the name of the person who ran it. At the top level you could store a file format version number, so somebody writing software against it later can check whether they're looking at version one or version two of your layout. And everything can be named, though of course it's only as good as the naming quality, just like all documentation: you can give things bad names that explain nothing to somebody later, or sensible names that let people understand what the data or structure is. HDF5 gives you all of this flexibility within its framework.
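Since we'll be working in R later, here's a minimal sketch of what building that kind of hierarchy looks like with the rhdf5 package (the file, group, dataset and attribute names are all made up for illustration):

```r
library(rhdf5)

f <- tempfile(fileext = ".h5")
h5createFile(f)

h5createGroup(f, "raw")                     # groups behave like folders
h5write(matrix(0L, 5, 5), f, "raw/counts")  # datasets are the end points

## any element, including the file itself, can carry metadata attributes
fid <- H5Fopen(f)
h5writeAttribute("1.0.0", fid, "file_version")
H5Fclose(fid)

h5ls(f)   # browse the hierarchy
```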
To show a little example of that, here are two screenshots taken from 10x Cell Ranger's own help pages. I know they're from slightly different versions of the software, so they don't match up exactly. On the left, the top-level element is a folder; inside it there's another folder named after the genome, and inside that a load of files. That represents a count matrix from their Cell Ranger software, in the MEX exchange format I talked about before, which is basically a text file holding the sparse representation. If you look inside it, there's a little header, and then three columns: the column indices, the row indices, and the values for each cell of the matrix. I see there's a hand raised, yes?

"Sorry, it's a bit confusing now, because in a sense this file format doesn't tell you anything about how the file is actually stored on disk. I mean, here you could have matrix.mtx be some kind of arbitrary encoding that nobody other than myself could read."

Yep, absolutely. You're relying on somebody somewhere having defined what an MTX file is, and on whoever writes the file obeying the rules that have been specified. And you could easily mess it up: miss out a column, leave a blank space, and depending on how good the software that later reads it is, that might be handled well or badly. It is essentially just a text file, and you could do whatever you want with it.

The point I was really making here is that there are three files. This one is the actual data, in our sparse format, and then there are two extra things: the barcodes, which become the column names, and the genes, which become the row names. Together these can be used to recapitulate the in-memory object we're normally used to working with in an R session.

On the right-hand side is an HDF5 file that does pretty much the same: one file instead of a folder, but storing very similar information, some of it with useful names you can easily match up. For instance, there's a dataset called barcodes, which has the same content as that TSV file. Then we have the matrix with our three vectors, under slightly different names here: a dataset called data, which holds the values for each cell; indices, which are the row indices, the second long vector we saw; and indptr, the index pointers, which I called column pointers in my little cartoon, the third and slightly shorter vector. These represent the same things as the files on the left. There's also an element called shape, which tells us the actual dimensions of the matrix. And then the counterpart of the remaining file, the genes one: slightly newer versions of Cell Ranger give you a bit more information, stored in this features group with extra datasets underneath. That includes the genome version, which on the left was encoded in a folder name; it's probably better to have it explicitly written in a place called genome than to rely on nobody renaming the folder.
Then there's the gene ID, an Ensembl gene ID, and the actual gene name, which depends on what annotation you're using, that kind of thing. What I really wanted to show is that you can get data out of a piece of software like Cell Ranger and it's essentially the same data in two different formats: either one single HDF5 file, or a bunch of files stored in a folder, and the hierarchical structure of HDF5 lets you recapitulate that folder layout.

To hammer the point home a little more: there's a piece of software from the HDF Group called HDFView. If you install it, you can double-click on HDF5 files and browse around inside them. The left-hand panel here shows exactly the same hierarchical structure; you can open the groups, look at the dimensions of the datasets, and so on. We won't use it today, we'll use some software tools to do the same job, but I always have it available so I can double-click on these files and poke around; sometimes that's easier than a command-line tool.

Everything so far has been very much about HDF5, and it's true that most other strategies for storing things on disk don't have this hierarchical nature. But I don't want to dissuade you from other tools. HDF5 is relatively old now, and there are lots of efforts to come up with other, perhaps more efficient or more modern, ways of storing data, and they share many of the same properties. The examples we'll work through are specific to HDF5, but the concepts and theory behind them apply to plenty of other file formats, like TileDB or Apache Parquet. If you've ever tried those, they're trying to do similar things with slightly different technology.

So what features are you looking for in an on-disk storage format? Efficient access to subsets of the data: we said that sometimes you can't read the whole thing into memory, so you want to jump into a large data set and pull out just the bits you want, without spending time reading things you're not interested in at all. Compression is also useful, whether through a clever representation like our sparse matrices or something more naive, essentially a zip file, though preferably a bit more sophisticated; we'll explore these top two with a few examples in a minute. Another advantage is that you can store heterogeneous data, so text or images alongside the raw numbers, which is helpful for metadata and later analysis. You also want platform independence: the people you share data with may be on Windows, Linux or Mac, and it's useful to move files around without worrying about that. And ideally a format with interfaces in many languages: we've been switching pretty seamlessly between Python and R in this course, and HDF5 certainly supports that.
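To tie this back to the Cell Ranger example: here's a sketch of reading that style of HDF5 file into R and rebuilding the sparse matrix from the three vectors. The file name is hypothetical; the dataset paths are the ones we just saw on the slide:

```r
library(rhdf5)
library(Matrix)

h5 <- "filtered_feature_bc_matrix.h5"   # hypothetical Cell Ranger output
h5ls(h5)                                # always worth inspecting first

## the three sparse vectors map directly onto a dgCMatrix
counts <- sparseMatrix(
  i      = as.integer(h5read(h5, "matrix/indices")),  # zero-based row indices
  p      = as.integer(h5read(h5, "matrix/indptr")),   # column pointers
  x      = as.numeric(h5read(h5, "matrix/data")),     # non-zero values
  dims   = h5read(h5, "matrix/shape"),
  index1 = FALSE                                      # indices are zero-based
)
colnames(counts) <- as.character(h5read(h5, "matrix/barcodes"))
```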
We're only going to work in R in our session, but if you're a Python programmer, almost everything we do has an analogous piece of software in Python that can do the same tasks.

One major drawback of all of these strategies, HDF5 in particular, is that they're complex libraries: you have to install something extra on your system to access the files and to interface with them from other tools, which can make things harder for users. They're not just text files; you can't hand them to somebody who double-clicks and has them just open. You have to install the HDF5 software, and it has to be available whenever you use something like the R package to interface with the files. But hopefully this convinces you that the trade-offs and benefits are worth that little setup headache.

Now let's consider a single data set. I've talked about the benefits of all this hierarchy, but we'll set that aside and think about a single 2-D dense count matrix: no clever representations, just a naive 2-D matrix. How is it stored on disk, and how does HDF5 deliver those benefits of efficient subsetting and access?

First consider how you'd save a dense matrix normally, without anything extra from HDF5. Our example is a 20 x 20 matrix; you don't need to go and count it, just trust me. Conceptually it has two dimensions, but on disk it is stored as a single long stream of bytes: there's no dimensionality on your disk. Say it's stored row-wise: it starts in this corner, the first row is written out, then the next, and the next, one after another in a long linear fashion.

What this means is that it's pretty fast to read a single row out of this structure: you find the starting position and do one read operation for however many elements you want. Here we'd find the 21st element, the first element of the second row, and read 20 elements in one go; that's actually pretty quick. So this isn't a terrible layout if all you want to read is individual rows; that's our subsetting operation here. But it's terrible if you want individual columns. Because the data are laid out in one long line, to read a single column you jump to an element, read it, then jump a whole row, another 20 elements, to reach the next value in that column, read that one, jump again, and again, and piece it all back together. You end up doing 20 separate read operations, each of which has to seek to the right place in the file before reading. That's really inefficient: with this layout, reading columns is a really bad access pattern. So if you just dump a matrix into a binary file somewhere, it's really fast to traverse in one direction and really slow in the other, which is not optimal at all.
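The seek arithmetic behind that is easy to write down; a sketch assuming the row-major layout above, zero-based indices and 4-byte integers:

```r
## byte offset of element (i, j) in a row-major 20 x 20 matrix
offset <- function(i, j, ncol = 20, bytes = 4) (i * ncol + j) * bytes

offset(1, 0:19)   # row 2: twenty consecutive positions -> one read
offset(0:19, 1)   # column 2: positions 80 bytes apart  -> twenty seeks
```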
The strategy HDF5 uses to mitigate this is that datasets are not stored as one big contiguous block of data; they're split into chunks. I've coloured these just to highlight where they are; don't read anything into the blue ones being related to each other. So instead of one big 20 x 20 matrix, we now have 16 smaller 5 x 5 matrices. And although they still look like one big block on the slide, they're not stored like that: each chunk is stored completely separately on disk. The HDF5 file and the infrastructure around it keep track of where these chunks live and where they sit relative to each other in the overall shape of the data. As a user or a programmer you don't need to keep track of any of that; HDF5 does it in the background, and you still interact with the data as one big matrix, even though in reality it's stored in this distributed way.

The advantage of this strategy is that you only need to read the chunks that are actually necessary for the subset you're interested in. For example, if you only want this one particular element, say the third row, second column, the only piece you need to read is this one chunk; you can ignore the rest of the file entirely. That's certainly more efficient memory-wise, since you don't read everything in and then take the subset. Perhaps it's not quite as good as the contiguous layout where you could jump straight to one element, but remember that column access was a really bad pattern before. Consider it here: to read the first two columns we do four seek operations, to find the first element of each of these four chunks, and then read just those four chunks. In this particular example we'd read a quarter of the data with only four seeks, which is much more effective than the previous layout. And as the matrix grows, if you still only want a small subset, the benefits get bigger and bigger: if there were 100 columns of chunks, we'd still only need to read the first four to get this data out. This is the primary reason HDF5 is so useful for grabbing subsets of data and working with them efficiently, in both speed and memory.

Up to now I've drawn square chunks in a square matrix, but they don't have to be square: chunks can be basically whatever regular shape you want. In this example the chunks contain entire rows, several rows at a time, without splitting any columns. Equally, you could switch it around, make them narrow, and chunk things up into columns, a totally different on-disk layout; yet all of these are 20 x 20 matrices that look exactly the same programmatically.
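In rhdf5 the chunk shape is fixed when the dataset is created; a minimal sketch (the file and dataset names are arbitrary):

```r
library(rhdf5)

f <- tempfile(fileext = ".h5")
h5createFile(f)

## a 20 x 20 integer dataset stored as 5 x 5 chunks
h5createDataset(f, "counts", dims = c(20, 20),
                storage.mode = "integer", chunk = c(5, 5))
h5write(matrix(0L, 20, 20), f, "counts")

## a subset read only touches the chunks that overlap it:
## the first two columns here hit only the four left-hand chunks
h5read(f, "counts", index = list(1:20, 1:2))
```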
So you might be wondering: why would you do this, and how would I pick the chunk dimensions if I cared? The reason to think about it, whether you're writing data or reading it, is that the size and shape of the chunks can really affect how efficiently you can get data out, and if you know how you're going to access the data, you can tailor the chunk shape to match those access patterns. We'll see examples of this in the exercises. If you knew you only ever wanted to read columns, a layout where chunks contain whole columns and very few rows would be a really good strategy; likewise for rows. If you don't know, or... OK, yes, sorry, I think there's a hand up.

"Sorry, I have a question. In the example where a whole chunk is a column: is that bad if you want to access the diagonal of the matrix, because you'd have to use all the chunks?"

Yes, exactly. If you wanted the diagonal of that matrix, you'd end up reading the entire matrix (we'll see a few examples like this), and it would not be a very effective strategy. Which brings me to my third point: if you don't know your access pattern, or it's mixed and you actually want both rows and columns, then either square chunks, or something proportional to the aspect ratio of the matrix, is probably a decent compromise. There's definitely a compromise to be made, and it really does come down to this: if you know in advance, you can tailor the layout. The same goes for whether you'll be pulling out large regions versus individual points from a data set; that also affects the best strategy. Does that answer the question? Great. And Simon?

"Yes, I was wondering how this chunking strategy adapts to sparse matrices, because then you don't know a priori the number of elements when you start to populate the matrix. How do they deal with that?"

My very last slide touches briefly on sparse matrices, so I'll come back to that right at the end, if that's OK.

Then the final two bullet points here. There might be a temptation to want big chunks, because there's an overhead to finding where the chunks are. But the caveat is that you have to read an entire chunk even for a single element, so you don't want them too big. The most extreme case would be making the whole file one chunk, and then you'd read the whole file every time, which is really not effective. So then maybe you think: I'll make them small. But there's that overhead of finding and keeping track of them in the file, so you don't want too many of them either, or the time spent locating chunks outweighs the time saved over reading slightly fewer, larger ones. There's always a trade-off, and to be honest, most people just end up with some kind of middle ground, which is fine.
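To get a feel for how much the chunk shape matters, here's a sketch you could run yourself, timing the same column read against two layouts (the names and sizes are made up for illustration):

```r
library(rhdf5)

f <- tempfile(fileext = ".h5")
h5createFile(f)
x <- matrix(rpois(2000 * 2000, 1), nrow = 2000)

## identical data, two different chunk layouts
h5createDataset(f, "row_chunks", dims = dim(x),
                storage.mode = "integer", chunk = c(1, 2000))  # whole rows
h5createDataset(f, "col_chunks", dims = dim(x),
                storage.mode = "integer", chunk = c(2000, 1))  # whole columns
h5write(x, f, "row_chunks")
h5write(x, f, "col_chunks")

## reading one column touches every row-chunk, but just one column-chunk
system.time(h5read(f, "row_chunks", index = list(NULL, 1)))
system.time(h5read(f, "col_chunks", index = list(NULL, 1)))
```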
But I did want to emphasize that this is a useful thing to be aware of: we'll see in the examples that it can make a huge difference to relatively trivial operations, if you know in advance what your access pattern is going to be.

The second part of this: in addition to being stored in different places, chunks can also be processed by what HDF5's parlance calls filters. These are things that manipulate the data in some way, and they're almost always used for compression. You take your chunk from memory and pass it through steps that make it smaller. The shuffle filter rearranges the bytes so that, roughly speaking, the zero bytes end up at one end and the ones at the other; it's a bit more subtle than that, but that's essentially what it's trying to do. Then we compress with gzip, the standard, relatively naive compression algorithm, before storing the chunk on disk. When you read the file back, the same operations run in the opposite direction. You can insert other filters here, but I'd mostly shy away from that: shuffle and gzip ship with HDF5 itself, and if you use anything else that's specific to your data, however good it looks, you can't guarantee your users will have it on their system to read the file again. I sometimes get questions from users who try to use rhdf5 to read an HDF5 file and are told they can't, because whoever wrote the file used a filter they don't have installed. So you can explore those options, but for now I'd suggest sticking with the defaults that are available on almost everybody's system.

It's this compression that makes it OK to still store dense matrices. Up to now I introduced the sparse layout and then spent a long time explaining why HDF5 doesn't necessarily need it: the compression works pretty well even on data this sparse stored densely. With lots of zeros in there, an algorithm like gzip does a pretty good job. For one of our data sets, something like 150 gigabytes in memory comes to about three and a half gigabytes on disk, that kind of difference. So it's at least an order of magnitude, though obviously it depends on several factors: how sparse the data are, and what level of compression you use; even gzip alone has options ranging from very fast to very slow but more effective.

The effectiveness of the compression is also relative to the size of the chunks you pick, which is another part of the balance. If your chunks are all of size one, they won't compress at all: you can't make a single value smaller than it already is. If you have one massive chunk that is the whole file, that actually compresses best of all, but it throws away all the fancy subsetting and random-access benefits we had before.
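All of these knobs, the chunk shape, the filters and the compression level, are set when a dataset is created. A sketch of what that looks like in rhdf5 (the filter and shuffle arguments are as in recent versions of the package; the names and sizes are made up):

```r
library(rhdf5)

f <- tempfile(fileext = ".h5")
h5createFile(f)

## gzip level trades write speed (1 = fast) against size (9 = smallest)
h5createDataset(f, "counts", dims = c(30000, 10000),
                storage.mode = "integer",
                chunk = c(1000, 100),
                filter = "GZIP", shuffle = TRUE, level = 6)
```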
So, as I say, a single chunk for the whole file is probably not a strategy I'd ever recommend, unless your data are very small. But this compression is the reason we can get away with using a dense matrix even in an HDF5 file.

That isn't to say the benefits don't apply if you use an HDF5 file to store the three vectors we saw before. In that case the three vectors of our sparse representation are each distinct datasets. We've been talking about 2-D matrices; these are one-dimensional vectors, but they can still be chunked. This time the colouring marks the chunks, which again can be whatever size you choose, and you still get the same benefits of chunking and compression. The compression won't be as effective as when there are lots of zeros, but it can still do something, so you still end up with something smaller than storing the three vectors raw. And the chunking means, even if the algorithms working in the background need to be slightly more complex, that you can still jump in and read out the elements you want without reading the rest of each vector. So it doesn't work quite as effectively, but you still get these benefits, and that's why people do put sparse representations inside HDF5 files. We'll see examples of this as we go on.

So, the summary. Hopefully you've seen that there are people thinking about this problem, but there's no clear consensus in the field on how to store single-cell data sets. Should they be one folder with lots of different files, or one file? What representation should be used: dense matrices or sparse? Should you gzip them, or use some other compression algorithm that's bespoke to your particular software? There isn't a consensus, and that's challenging; you've probably already encountered it. At some point I suspect we'll settle on something, but we haven't got there yet, after quite a few years of people working on this. HDF5 has the biggest market share, across several pieces of software, but each of those produces a differently structured HDF5 file. So even if you're aware of that, it's still useful to know the original source of an HDF5 file, because the internal layout of the data will differ.

And if you're considering writing these files yourself, whether in HDF5 or any other file format, there's always a balancing act between many competing factors. If you can afford to spend lots of time writing, because that's a one-off operation and reading is what needs to be really quick, you can make choices about compression levels accordingly. And knowing the specific use cases helps too: if you know you'll access the diagonal really frequently, or only ever columns and never rows, you can let that influence the decisions you make about how to store it.
Most of the time, admittedly, people don't know that; they don't think about it, and I don't think the field has come to a consensus on whether it more often wants groups of genes or groups of cells. So we often end up picking a middle ground, some compromise between the two. We'll explore a little of that in the exercises after the coffee break.

And the final point, which I probably should have put right at the start: the reason we need to worry about these optimizations is that accessing data from disk is several orders of magnitude slower than accessing it in memory. So when you're worrying about performance and how long things take, getting this wrong can easily turn seconds into hours or days without you really realizing: it will still work, it will just seem really slow. And if you've never thought about these things or been introduced to them, you might have no idea that it's just a case of reshaping how the data are stored on disk, and that you can get these great benefits, which multiply over repeated runs through an analysis. So hopefully this has introduced you a little to how people think about this kind of thing, and we'll work through a few examples, mostly using HDF5, after the coffee break. Are there any questions?