Next up is Hervé Pagès, who is going to talk to us about SparseArray objects. I'm going to have to try and be a little more strict about the time, otherwise we're all going to be here till lunch.

I'll be short. Okay, does it work? Yeah, it does. So I'm going to talk a little bit about the work I've been doing with sparse data representation in R. Is it the right arrow to change the slide? If you click the right corner, the arrows, then this one. Yeah, yeah. Okay. Sorry.

So, first, I want to discuss a bit what's currently available in R for sparse data representation. And I'm just going to talk about in-memory representation; I'm not going to discuss on-disk representations. As most of you know, the standard way to represent sparse data in R is with a dgCMatrix object. That's from the Matrix package, which is a CRAN package. It's a recommended package, so every R installation has it by default, and it's used a lot, especially in Bioconductor, where I've found more than 100 packages using this container for storing sparse data. It has been around for about 20 years now; the Matrix package is a very early package.

To complement the capabilities of the Matrix package, some people have come up with those two packages, matrixStats and sparseMatrixStats. The matrixStats package by Henrik Bengtsson provides row and column summarization operations on matrix objects. It works with ordinary matrices, but it doesn't support dgCMatrix objects. So Constantin, who is a Bioconductor contributor, came up with the sparseMatrixStats package to fill that gap. And this is a pretty amazing package, implemented in C++, very fast. It provides very fast row and column summarization for dgCMatrix objects.
So, yes, the summarization operations I'm talking about are all the rowSums, colSums, rowMeans, etc. operations. Here's what you can do with sparseMatrixStats. Just an example.

But there are some limitations to what dgCMatrix objects can do. First, of course, it's only for two-dimensional data. There is a hard limit on the number of nonzero values that can be stored in a dgCMatrix object: no more than 2^31. And it's also a concern sometimes that the nonzero values cannot be stored as integers. They can only be stored as double values. That's not optimal for the memory footprint of the object, because a double in C is twice the size of an integer. And some operations are inefficient with dgCMatrix objects. Here's an example: this is what you get when you try to cbind, for example, two dgCMatrix objects that together have more than the maximum number of nonzero values that is allowed. Yeah, it can't do it.

Here I'm showing some details about the internal representation of those objects. dgCMatrix has some sibling classes. The "d" in dgCMatrix stands for double, so it tells us that the nonzero values are stored as doubles. And there is also this lgCMatrix class, which is there to represent a sparse matrix of logical values. Zero for logical is FALSE, so the TRUE values or the NA values are stored here in a logical vector. But there is no igCMatrix class. So if you want to store counts, which are typically integer values, you have to use the dgCMatrix representation.

So I wanted to try something different. I wanted to come up with something new for storing sparse data in R, to represent sparse data in an array-like object. So I started to work on this SparseArray class. This is in the S4Arrays package.
It's not in Bioconductor yet, but I intend to submit this package soon. So it's still a work in progress. The idea is to address some of the limitations of dgCMatrix.

Here's what it looks like. I'm just using the constructor here to convert that dgCMatrix object into a SparseArray object. I'll talk a little bit more about what the SVT thing means here, but the real class of this object is SVT_SparseMatrix; SparseArray is just a virtual class. It's not limited to matrices: it can be an array of any dimension. You see zeros here, but the zeros are just an artifact of the way things are displayed. It's just the show method that displays the zeros; there are no zeros stored in the object. Of course, it's a sparse representation, so only the nonzero values are stored there. And the count is the number of nonzero values in the object, so 12 nonzero values here.

It supports any type, so you can store any atomic type: integer, raw, logical, character; complex is also supported. And there is no limit on the number of nonzero values that you can have in the object. The limit, of course, is going to be the amount of memory that you have on your machine. But as long as you have enough memory, you should be able to store as many nonzero values as you want. So for example... Just one more minute? OK, OK. So lots of nonzero values here. Of course, that is a big object: 26 gigabytes, that's the memory footprint of this object.

I was going to show some details about how things are represented internally. So quickly, the idea is to store the nonzero values in a sparse vector. That's what I'm showing on the left here. A sparse vector is just a little table with two columns: in the left column, the offsets of the values, and in the right column, the nonzero values.
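The two-column sparse vector just described can be sketched in a few lines. This is a minimal illustration in Python, not the package's actual implementation: the class name SparseVector, its fields, and the choice of 0-based offsets are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SparseVector:
    """Hypothetical stand-in for one sparse vector: two parallel columns."""
    length: int          # logical length, zeros included
    offsets: List[int]   # positions of the nonzero values (0-based here, ascending)
    values: List[int]    # the nonzero values themselves

    @classmethod
    def from_dense(cls, dense):
        # Keep only the (offset, value) pairs whose value is nonzero.
        pairs = [(i, v) for i, v in enumerate(dense) if v != 0]
        return cls(len(dense), [i for i, _ in pairs], [v for _, v in pairs])

    def to_dense(self):
        # Zeros are never stored; they are filled back in on expansion.
        dense = [0] * self.length
        for off, val in zip(self.offsets, self.values):
            dense[off] = val
        return dense

    def nzcount(self):
        return len(self.values)


column = [0, 0, 7, 0, 3, 0]
sv = SparseVector.from_dense(column)
print(sv.offsets, sv.values)
```

Only the two nonzero entries are kept; the zeros reappear only when the vector is expanded, which mirrors the remark that the zeros on the slide are an artifact of display.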
And to represent a 20 by six object, I just use six of those sparse vectors. So that's what a two by six SparseArray object looks like internally. But then this can be repeated to add dimensions. On the left, you have this 2D representation that I just showed on the earlier slide. And if you want to represent a 3D object, you just add one more level in your sparse vector tree. That's what I call this tree here: it's a tree where the leaves are sparse vectors. This leads to a very efficient representation. There's really fast random access to any part of the object, and this leads to very efficient operations in general.

So here are some benchmarks. Column selection is just amazingly fast compared to dgCMatrix. So overall it's very efficient, way more efficient than a dgCMatrix object. This one here, I don't know if you can see the number. Yeah, OK, it's not a typo. It's really what you get with a dgCMatrix object. I guess they were not designed to support this operation because it's not a common operation, but subassignment should work anyway. And with dgCMatrix, it's very, very slow.

I think I'm done. This is a long list of things that still need to be implemented to have a fully featured array-like container that you can use for all the basic operations that you expect to be able to do. But yeah, still a lot of work here.

Cool. Do we have a question for Hervé? One?

[Audience question, partly inaudible.] That's the idea. Yeah.

Are there any questions in the chat? There is a question on row lookup: why is row lookup so much slower? Is it based on how it's traversed?

Again?

Why is row lookup so much slower relative to a column operation?

Well, the representation is column oriented.
Like with dgCMatrix, it's column oriented. There is this initial binary choice. The choice could have been otherwise, to do a row-oriented representation, but it's a column-oriented representation. So accessing the columns is really easy because you have a list of six columns. If you want to pick up two of the columns, you just grab those two list elements. There's nothing really fancy you need to do, nothing complicated. But if you want to extract rows, it's more complicated because you have to do some kind of match between the row indices and those offsets that you have in those leaves. So you need some kind of hash or binary search; there are several ways to deal with this.

Okay. Thanks.

Okay. Next one here, from Quan Liu.
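The row-versus-column asymmetry from the answer above can be illustrated with a small sketch. It is written in Python and is only an assumption-laden model of a column-oriented layout: the names columns, get_column and get_row are hypothetical, offsets are 0-based, and binary search is just one of the "several ways" mentioned, not necessarily what the package does.

```python
from bisect import bisect_left

# Hypothetical 4-by-3 matrix stored column by column.
# Each column is a pair (offsets, values); offsets are sorted 0-based row indices.
columns = [
    ([0, 3], [5, 1]),   # column 0: rows 0 and 3 are nonzero
    ([], []),           # column 1: all zeros
    ([1, 3], [2, 9]),   # column 2: rows 1 and 3 are nonzero
]
nrow = 4


def get_column(j):
    """Column access: a single list lookup, then expansion to dense form."""
    offsets, values = columns[j]
    dense = [0] * nrow
    for off, val in zip(offsets, values):
        dense[off] = val
    return dense


def get_row(i):
    """Row access: every column has to be searched for row index i."""
    row = []
    for offsets, values in columns:
        k = bisect_left(offsets, i)            # binary search in this column
        hit = k < len(offsets) and offsets[k] == i
        row.append(values[k] if hit else 0)
    return row


print(get_column(2))
print(get_row(3))
```

Grabbing a column touches one list element, while extracting a row does one search per column, which is where the extra cost of row lookup comes from in this model.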