It is 8:45 UTC, so I'm going to get started. My name is Aaron Wolen. I am a software engineer at TileDB, which makes sense because this is the TileDB tutorial. The tutorial will be primarily led by Dirk Eddelbuettel, who is a principal software engineer at TileDB. You may know Dirk as the author and maintainer of Rcpp and countless other high-impact R packages. But Dirk is also responsible for the TileDB R package, and he's done an amazing amount of work to make sure R has first-class support for users who want to leverage TileDB in R. I will take over a bit later in the tutorial for one of our use-case demonstrations. Until then, I'm going to hang out in the Slack channel and monitor it for questions, so if you have any questions, feel free to post there, and I will interrupt Dirk so we can address them as they come up. With that, I will send it over to Dirk.

Perfect, thanks for the intro. Let me turn screen share on. Yes, Aaron and I are really excited to have the opportunity to present to you today. We've both been with TileDB for well over a year. I have focused mostly on the R package, and Aaron looks after bioinformatics applications, particularly VCF, so that will come up a little later. This is really the first time we're talking to the R community about what the TileDB package offers, and I hope you'll all come away with some understanding of what it can do. We aim to keep this somewhat non-technical; we have pointers to more documentation at the end of the slide deck. By and large we just want to show you, by doing, what this is about and what it can do, and keep some of the more technical aspects for a later time, because this really is the initial tutorial.

Here's a quick list of topics. The time we spend on them will not be even. Among the key topics, I will talk a little about dense and sparse arrays, as well as the underlying C++ API that we wrap, and I'll come back to that later, with shorter notes on S3 cloud access and these other points. Then a few minutes on the application examples; those four are relatively short. I will then hand over to Aaron, who has a really fantastic, longer genomics example with which we basically conclude, and then we'll wrap up. It is, of course, a little challenging to do this via Zoom without being in the same room as you. I've been giving tutorials at useR! for maybe a decade, and it's just lovely to have eye contact. We don't have that here, but at least we have four eyes and four hands. The general gist is that you're asked to put questions into the Slack channel rather than the chat here, if I have that right. Between Aaron and myself, whoever is not talking will try to keep an eye on the questions, and we will try to break frequently enough so that we can address them as they arise. I hope that's good for everybody.

So here's a screenshot of the website, which makes a very broad claim: we're aiming to make data management universal. There is one storage format, the TileDB array, which works particularly well for dense and sparse arrays, and we will show you a couple of these use cases. Multidimensional, particularly sparse, arrays fit data frames really, really well, and data frames are the bread and butter of R. That works very well, and obviously it also works with numeric arrays, matrices, and all of that.
And it's serverless in the sense that something like SQLite or DuckDB is serverless: it really is an API, a file-access layer, provided as a library to which multiple tools can connect. We're talking today about the connection from R.

As went out in the email to you, we put these slides, as well as the example programs and snippets I'm going over, into a little R package that sits in a repo on GitHub. In the email we only gave the one repo, so if you run install.packages against it but don't already have all the packages this package requires installed, it cannot resolve the other packages. The better form of install.packages is to extend the repos argument beyond just CRAN to Bioconductor and to our little ad hoc repo; then the tutorial package comes right in. When you then call library() on the installed package, it shows a little welcome message and, for example, tells you where the examples directory is; that's just the per-package examples directory.

With that, maybe I'll do a first pivot to a fresh session. I just do this in Emacs here, and the package is installed on my machine. It greets me that, yes, in my local path, in the examples directory, we see the scripts. I just cheated to keep the path shorter and made this a soft link, but you should have all these examples in the directory, as well as a function slides() that finds the PDF and brings it up, so you don't even have to type the path out. We updated the package a few more times after we first emailed you all, so internally we are now at version 0.1.3 of this package, but it doesn't really matter; the slide deck as emailed yesterday is pretty current. We are also on Slack at the conference, in the tutorial channel; Aaron and I will try to keep an eye on that and see if you have questions there. And I just noticed a note in there, right, perfect. So let me just type real quick: questions go here. With that, and a quick eye-contact check with Aaron: does anybody have questions so far? Are we mostly good and set up? Beautiful.

So the first thing I'm going to show is a really simple and quick example, and we will go over each and every one of the underlying commands in more detail later. With that, let me hop out of this and get back in here. This is the exact same code that's on the slide. I do not need to execute this first line because tiledb itself is installed for me, so I can just say library(tiledb); in this R session I already did that a second before. When you first load tiledb, it tells you what the version is, and 0.9.4 is what is currently on CRAN. It then also tells you what version of the underlying C++ library providing the TileDB functionality is loaded. In my case that's 2.4.0, because I work here and tend to work against the development version; the release version is 2.3.x. Nothing we're doing today should be affected by any differences between these. A data set we'll be using left, right, and center, because it's simple and easily understood, is Palmer Penguins, which I'm just loading. And then, because TileDB arrays (unless you go to the cloud) really make use of directories, and create from a top-level directory (the array name, the array descriptor, the array URI) a set of files below it, I'll just go to temp here, because I'm now creating a first array in a temp directory, sort of as a throwaway.
A helper function, a high-level function we've written for the R package, is just called fromDataFrame. It takes an object, here the penguins data frame from the well-known example, and writes it to the address given. There are other options; we will see them. So if I execute this first, yeah, that happens. And now it errors, just because I had already done it there before: when you ask TileDB to create a new array, the creation of the array is the initial write, and that errors out when you try the initial write and the directory already exists. You can later open an existing array and append to it, but the initial creation assumes that there's no overwrite or delete operation going on and that it creates the array for the very first time. Once that little boundary condition (does the directory already exist or not) is taken care of, all it takes is our object and a location, and it writes the directory.

We can look at what's written there, though you never really need to poke into this; even as a developer here, it is mostly opaque to me. The view we take is that we're presenting the storage not as a format specification that many competing implementations could provide, maybe get right, or implement in slightly different incantations. For us, the code is the specification: the format TileDB provides is presented through one API, and that API can be connected to directly from C++, or R, or Python, or Java, or SQL databases, and others. Then, similar to other database systems if you have ever looked under the hood, it writes into a directory where the write's begin and end timestamps are encoded as seconds since the epoch, along with a UUID handle. So this one is just where the files are; this is a zero-byte OK flag; and log, meta and schema are directories for additional data. In the directory itself are files with one file per column of the data frame, plus some other bits and pieces: if you have columns that can be null, there's a validity flag. But none of that do we really need; I just showed it because one does wonder: okay, we write to that directory, what then happens?

What really matters is that we can treat this URI here, penguins, as the source of a TileDB array. For the other direction, from the written array back into R, the accessor function we will use a lot is tiledb_array, which opens either dense or sparse arrays and recognizes which from the schema. Here I used two additional arguments: one saying that I want it as a data frame, otherwise I get a list (we'll show that later), and extended=FALSE, which just suppresses the display of the implicit row names that get added when we write without giving row names explicitly. The handle itself for the TileDB array is an S4 class with various slots; we'll hit a few of those. Some of them are Boolean toggles (do we want it this way or that way), and the as.data.frame I just used is one of them, set to TRUE here. Once we have the array opened, I can just use the square-bracket operator, extract the array, and I have a data frame back. And this is now essentially the same as the penguins data frame that went in, including transparent treatment of missing values and the different variable types, with only small differences from R.
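For reference, here is roughly what that quick example looks like in code. This is a minimal sketch, assuming the tiledb and palmerpenguins packages are installed; the temp-directory path is just a throwaway choice.

    library(tiledb)
    library(palmerpenguins)

    uri <- file.path(tempdir(), "penguins")
    fromDataFrame(penguins, uri)                    # the initial write creates the array

    arr <- tiledb_array(uri, as.data.frame = TRUE,  # return a data.frame, not a list
                        extended = FALSE)           # hide the implicit row index
    df <- arr[]                                     # unconstrained read: all rows, all columns
    summary(df$body_mass_g)

Any writable path works the same way; later we swap in an S3 URI without changing anything else.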
For example, we haven't yet done anything about representing factors, because factors are really an R specialty and we take a slightly broader view in the TileDB representation: what we write down must also make sense for Python and other languages, and they don't really deal with factors. So if you write something in as a factor, you may get it back as a character variable; but as things have evolved, that's more or less the view R takes these days too when you read from CSV. I see a first question coming up here, from Brigitte, asking why use an array instead of a data frame. That is mostly just lingo, if you wish: for TileDB everything is an array, and what we would have in R as a data frame gets stored in an array. There's just no TileDB accessor function tiledb_data_frame. We'll see a bit more on that; the ability to store as a TileDB array extends to numeric arrays as well as to data frames. I hope that answers the question.

Okay, so that was the very first introductory example. Here's a follow-up slide on what we just saw: when we access the array unconstrained, with no additional selection on rows or columns, we get a data frame back of all the observations of all the variables, because we asked to have it returned as a data frame. What we will look into a little more closely in the coming minutes are the differences between dense and sparse arrays and when we might want one or the other, and how to operate with indices, a feature that requires sparse rather than dense arrays. We won't really talk about some of the available tuning options, such as tile extents and layouts, today; that's for a second tutorial. They can affect performance in the large. It wouldn't matter for the penguins, but when you write gigabytes (and we will show how to access arrays of that size) it matters a little; it's a slightly more advanced use.

All right. Let me start with dense arrays, because dense arrays are conceptually simple, and dense arrays are also where TileDB historically started. Dense arrays tend to be a bit more limited in the sense that they require the same type for the indices, whether rows, columns, or third and fourth indices, so you cannot mix types there as we will do later with sparse arrays. Here we have the very first example, quickstart_dense, which the TileDB documentation starts with at the very top in a C/C++ example and some Python examples. I'm showing you here, and we'll come back to this a bit later, basically how we would do these operations with more atomic TileDB commands that correspond more directly to the underlying API. When we ultimately want to write an array, we first have to create a schema. A schema is a combination of a domain and attributes. The domain is really important because that's what organizes the array. In this simple example we're saying we're going to have rows and columns ranging from one to four, with max values of four, and both INT32; and we're just having two of those. So a simple vector of the two dims gets assigned to dom; we then use dom, the domain, in the creation of the schema, where we say we're going to have a single attribute, which in a data frame would be a value column rather than an index column. In this case, very boring, also INT32. And we refer to the attribute column just as "a". That makes a complete schema.
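To make that concrete, a small sketch of the schema pieces just described, in the style of quickstart_dense: two INT32 dimensions over 1..4 and one INT32 attribute named "a". The constructor calls follow the R package, but treat the exact argument order as approximate.

    dom <- tiledb_domain(dims = c(tiledb_dim("rows", c(1L, 4L), 4L, "INT32"),
                                  tiledb_dim("cols", c(1L, 4L), 4L, "INT32")))
    sch <- tiledb_array_schema(dom, attrs = c(tiledb_attr("a", type = "INT32")))
    sch                                             # printing shows the schema metadata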
And just as we had the helper function fromDataFrame, which really generalizes and uses the same scheme these underlying calls have, a TileDB array is then created by saying where we want to create it and with what schema. The schema can also be queried. Once we have this created, and in this case remember we created a four-by-four, we can write the array by creating a four-by-four matrix, or rather an array, and assigning to it. With that, let me pivot back to the corresponding example, because it is actually nicer to show this directly in the code. So: that creates the dom, that creates the schema. At that point I can look straight at the schema variable, which is really just a display of meta information, a little bit of header information: whether we selected column- or row-major order (I kept the defaults here), a couple of other things, compression options, things we don't really need to look into too much; but we see the two dimensions here with their two names, and the attribute is already in there. Now I can write. Again, I could look into that directory and see the underlying files. The data I'm trying to write here is a four-by-four array, and I can just write it in: first open the array, then do a write assignment with the square-bracket operator, and I can get the same thing back.

If I say nothing, if I just open it and query the array with the defaults, I get, very R-default-ish, a list of the columns for a dense array; and if I don't say anything else, I get rows as well as columns and the attribute. However, if I say I'd rather have a data frame and run that, then it does as asked and I get a data frame back, printed here the proper way with row names, and the class of what comes back is just data.frame, nothing fancy. You have a data frame, and you can then go down the pipeline and do other things with it; I often use data.table, we also do tibbles, everything that derives from a data frame. Similarly, because this was a two-by-two, sorry, a four-by-four, two-dimensional array, which is also a matrix, we can in this case say: give us a matrix. We have some application cases where people really want matrices, which is why we put that option in. More recently, someone actually really wanted three-dimensional arrays, and it turned out I had been focusing too much on the matrix special case and wasn't treating dimensions in excess of two all that well. That was a quick bug fix not long ago, and version 0.9.4 has it: if you write a five-dimensional array and say as.array=TRUE, you get your five dimensions back.

So that was the first half of dense arrays. I showed all these examples with the generic tiledb_array accessor function and the different qualifications for the return type. But we don't have to stick just numbers into dense arrays; we can also write data frames into dense arrays. Here is a dense-array data frame example, so let me just go over the code in the second half of that file. For the URI here I'm just saying tempfile(). That's a really easy trick, of course, for tests and demos, because tempfile() gives a new path and you can't have a problem writing to it. Now I'm just doing the same, but writing the raw penguins dataset rather than the cleaned one. And when I look at it again, it comes back basically the same as it went in: characters, numbers, values, so we're happy.
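Continuing that sketch: creating the dense array from the schema above, writing a four-by-four matrix into attribute "a", and reading it back in the different shapes just discussed. The as.data.frame and as.matrix toggles follow the slide discussion and recent package versions, so treat them as assumptions.

    uri_dense <- tempfile("quickstart_dense")
    tiledb_array_create(uri_dense, sch)             # 'sch' from the schema sketch above

    A <- tiledb_array(uri_dense)
    A[] <- matrix(1:16, 4, 4)                       # write via the square-bracket operator

    tiledb_array(uri_dense)[]                               # default: list of rows, cols and a
    tiledb_array(uri_dense, as.data.frame = TRUE)[]         # as a data frame
    tiledb_array(uri_dense, as.matrix = TRUE)[]             # as a matrix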
One of the big differences between dense and sparse arrays is that the underlying storage for dense arrays really requires a cell to be present at every intersection of the indices. For a two-dimensional array, rows by columns, that means no holes, which is really how we think of a data frame: when we print one, we also see it printed without holes, contiguous. So it's contiguous vectors, and that is good for some cases, but it's a bit more restrictive for really powerful applications. That's why we'll be pivoting a little more toward sparse and working a little more with sparse. But there are cases where dense is better, and it's hard to make a sweeping statement that one truly dominates the other; it really depends on your problem domain.

So that was dense, my first sub-segment. Any other questions? I see that Brigitte and Aaron were busy over in the chat. Aaron, any hand wave, anything we need a break for? No, I don't think so. We had one or two questions in the Slack channel, but you addressed the first one, and the second one was about whether you can store different data types in a single array. Yes, I mentioned that in passing, and maybe I ate my words almost too quickly. The big constraint on dense arrays is that the rows and columns have to be the same type. When we take a data frame like penguins and write it as a dense array, which is the default when you don't say sparse=TRUE, we're actually cheating and writing it as a one-dimensional dense array, because there's only the implicit index of the row names. Every data frame has row names, row numbers; we can always create an index by just numbering from one up to the row dimension, dim()[1], and that's basically what we're doing here. That allows a dense array (a dense index, just row indices) to have columns of different types, matching the data frame capability we could otherwise not match. So yes, very much so: we can have different data types in the different columns, just not in the indices. That ultimately gets limiting, because once you have really large data frames you want to index on more things than just the row numbers, and that's where sparse enters. And coincidentally, that's what we're coming to now.

It's a little hard for us to keep track of who is here and who isn't, but I can sort of hand-wave because I recognize a few names popping up. This is a shout-out to Martin Maechler, because here is a sparse matrix example straight from the Matrix package, motivating what we can do with sparse; the first way to show that, of course, is by actually creating a sparse matrix. Hold on, this is annoying; let me just take this one out because I don't need it anymore, and then it will no longer ask which R session I want because I only have one. So: Matrix package loaded, then set.seed so it's reproducible. I think this is literally a numeric sparse matrix that I've used in another package of mine when I wanted to work with sparse matrices, so this is 100% Matrix and nothing TileDB yet. I now have a sparse matrix here of type dgTMatrix. There are different types the Matrix package supports; the most common one is dgCMatrix, but it has a more complicated indexing scheme. What we're using here is dgTMatrix, because it corresponds more closely to (first index, second index, value) triplets, and we'll see that in just a second.
Basically, the internal S4 representation of a dgTMatrix maps to the TileDB representation quite naturally, which is actually pretty cool. And just as we had a fromDataFrame generator, I then created a fromSparseMatrix generator. This one just writes the object at the given URI, here again a temp file to keep it simple, and chk opens the array again to read it. These two lines give us a round-trip check: we see that the chk matrix I read back is exactly the same, indistinguishable I think, and I can pass it through all.equal against the one we wrote in.

So what's different now? Sparse matrices are very powerful in many big-data applications simply because we often have data sets with natural sparsity; you can have matrices where not every cell is present. Here I had one where, of the 8 by 20 = 160 cells, I filled just 15, so just under 10%. And as we'll see in the next example, it works really well with data frames too. The main advantage of enabling sparse=TRUE in our bread-and-butter function fromDataFrame is that I can now also designate some indices. Here, for the penguins, I'm saying: let's use species and year. I should add that the two get slightly different treatment internally. Species was a character variable, and for character variables we basically say that the domain runs from null to null, because you can't really meaningfully enumerate the set of all possible strings, so the domain is basically unspecified. That has one great advantage: once you write a sparse array where an index dimension is not limited, you can immediately append to it, because you can never write past the declared maximum value. However, if I made year an index column here and did not widen its domain, say to 2000 through 2021, I would not be able to write beyond the penguins data set, which, if memory serves me right, has observations from three years, 2007, 2008 and 2009. And a quick example I had prepared does just that: set one up, but put a wider range on the years so that we can append to the data. We'll have a fuller example on that.

If I don't say anything else, the data frame comes back the way it went in. It may be slightly rearranged relative to our original; that is not too dissimilar to how data.table works, if you know it, when you key some variables: once you give an index on a particular column, it will use that key by default and sort by those values. I forgot what head() on penguins was, but actually I can do that now. So for starters the order is different, and we're lucky here because it also started with Adelie and in 2007, so it's the same year; we get the 2007 vintage before the 2008 and 2009 ones. But because we designated two columns as indices, they come back as the first two columns, so the column order is different from what we originally had. You can't do an all.equal there, but you can do an all.equal on each of the columns. What we can then do, once we have a data frame with an index, is open it again. I've shown you newdf, and nothing has changed here: the dim of the unconstrained read matches the original data, 344 rows. The nice thing now is that when we have indices we can actually use them, and TileDB is a lot about performance with really large data; that's where it matters. So I chose to represent the indexing by columns as a free function where we set selected range constraints on the array objects that we opened.
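Before turning to selected ranges, a hedged sketch of that sparse data-frame write: species and year become index (dimension) columns, everything else becomes attributes.

    uri_sparse <- tempfile("penguins_sparse")
    fromDataFrame(penguins, uri_sparse, sparse = TRUE,
                  col_index = c("species", "year"))

    newdf <- tiledb_array(uri_sparse, as.data.frame = TRUE)[]
    dim(newdf)             # still 344 rows, but sorted by the index columns
    head(newdf[, 1:4])     # species and year now come back as the first two columns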
So selected_ranges will modify the x object; x is our TileDB array with its slots, and selected ranges, as you can see, will just say none when not set. The convention here is that we can submit a list with as many elements as there are dimensions, named or unnamed; if they're unnamed they just have to be in the same order, and if they're named you can constrain a subset of the dimensions. Here I have two in the example and I submit two, basically saying: I want the year to go from 2007 to 2008, boundaries included, but for species I only want Gentoo, so from Gentoo to Gentoo, which amounts to an equality constraint. When I then look at that, my 344 is down to 80 because I expressed a selection constraint. And that's pretty cool, because we will see examples later where the URI is not a temp file but something remote: by selecting, from where we operate, what subset we want, we transmit the request for the subset to the backend, and only the selected data comes back. That matters a lot for remote and really large-scale operation.

A more recent addition to our arsenal is that we can select not only on the dimensions but now also on the attributes. That is still a little more restricted; certain things work, other things don't, but it already works very well for numeric conditions. In the more bare-bones API representation I'm saying: let me initialize a query condition on one of the attributes. Here I'm selecting body mass in grams with a cutoff value of 6,000; this column is INT32, which we currently have to specify, and my condition is greater-or-equal. I then assign this query condition via the query_condition accessor, an S4 accessor, to the array object. If I now look at x, we have selected ranges as well as a query condition; and when I query, my selection of 80 is down further, to just three rows now, because I had the initial index selection of 2007 or 2008 and species Gentoo only, but I also wanted the body mass to be over 6,000. That's pretty neat, and that's in the CRAN version.

What I have here is on GitHub (I don't know if I showed this one already; you'll see it in a second) but not yet on CRAN, because we only release every couple of weeks. This is something that got added just in the last week or two, after the CRAN release: we now do a little bit of clever parsing of R expression trees. I'm just making this one a little richer than the one we had. So we see here that with the current constraint, which is greater than 6,000, I get three rows back. I can now also say: parse me a... oh yeah, sorry, I'm riffing. It's an AND that we translate to, but the R expression uses the double ampersand. See, that's what I get for changing the code on the fly. And when we run this now, we will have two rows. Yay.

Dirk, we had two questions in the channel. Someone asked about your selected ranges: why did you cbind the same value twice? Well, yeah, very good, excellent question. Sorry, I was so performance-focused here. I think we will have examples on that later, but think about a really large data frame of a million rows, indexed by years or dates, and say you want multiple chunks; say you want something about all the birth years in the 20th century.
You hypothesize, say you're doing social science, that people born in the first year of a decade have different outcomes, so you want everybody born in 1900, 1910, 1920, 1930 and so on. For that, you would use the cbind trick, because the full specification of what I'm sending in here is actually a matrix, and using cbind this way is just a really quick way to create that matrix; a little quicker, I find, than writing it out the long way. But they are equivalent, because... let me look at selected_ranges; I'm riffing without a safety net, which I love. Okay, I get the same one. So I'm basically just using this format because it's quicker than the other one, particularly when you only use one or two values. One thing I haven't done, because nobody has asked but probably should, is the special case of equality: if you only submit one value, it could of course trivially be expanded to two. We haven't done that yet, just to keep it simpler. But for selected ranges, the condition really is a list of matrices, and each row of a matrix expresses a range constraint. I hope that helps.

Yeah, another good question: someone asked for a clarification on the difference between a query condition and a selected range. Yes: one operates on dimensions and one on attributes. We, and especially the core team, think very much in terms of the bare-bones functionality in the system rather than the high-level one. We discussed already that maybe this should be more seamless between the two. But because dimensions and attributes are treated differently internally, the selection on an actual index dimension is different from the selection on attributes. So you basically have to remember how you wrote out your array and what each column is, which the schema function gives you back: it tells you whether something is a dimension or an attribute, and you proceed accordingly. And tiledb_query_condition_init is a little similar to a SQL statement, because that's basically what we're doing there: we essentially have relational operations (column, type, value, operator) as well as Booleans between them. As of right now, the only Boolean supported is AND; OR isn't there yet, but that's planned in the underlying API. You generally have a column, so an attribute, plus an operator and a parameter value, and we can combine and nest them, just as I had an AND here, for more complicated expressions. But, for example, we currently can't have expressions that span two columns, or things like that. So that's relatively new and we quite like it, because it's richer than just the selected-ranges feature that we had for a couple of months. That one is very good, but it basically forces you to think ahead: will I want to query on this or that column, and should I make it an index column? If a column is not an index column you could not constrain on it before; with query conditions, you now can. So we're quite excited about that. And for the question of wanting more than one value: if, for example, one wanted Adelie and Gentoo, you would here say cbind of Adelie and Gentoo for species, because then you would get all species greater than or equal to Adelie and less than or equal to Gentoo. I forget where the third one sorts, whether it falls in the middle or not, but that's basically what's happening. So that was a quick overview of sparse.
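Before moving on, here is a sketch pulling those selection features together. The accessor and constructor names follow the R package as discussed, and parse_query_condition() only lives in the GitHub version at this point, so treat the details as assumptions.

    arr <- tiledb_array(uri_sparse, as.data.frame = TRUE)

    ## constrain the index dimensions: year in [2007, 2008], species equal to "Gentoo"
    selected_ranges(arr) <- list(year    = cbind(2007, 2008),
                                 species = cbind("Gentoo", "Gentoo"))

    ## constrain an attribute: body_mass_g >= 6000 (the type has to be given explicitly here)
    query_condition(arr) <- tiledb_query_condition_init("body_mass_g", 6000L, "INT32", "GE")
    nrow(arr[])            # only the matching rows come back

    ## the cbind trick generalizes: each row of a list-element matrix is one (lower, upper)
    ## range, e.g. the first year of every decade for a hypothetical large array 'bigarr'
    ## indexed by a numeric 'year' dimension
    decades <- do.call(rbind, lapply(seq(1900, 1990, by = 10), function(y) cbind(y, y)))
    ## selected_ranges(bigarr) <- list(year = decades)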
We'll come back to a bit more of that in the application examples. All good, I see no pressing questions, so I'll move on a little. This is a bit more here on the query condition again: what column, what cutoff value, what type. It's a little restrictive because when we run this piece of code we don't actually have a connection to the array yet, so we can't infer the type; we'll have to work on that a little. So when you actually want a float value, you have to trick a little and put a dot in there; I have an example for that later. And if you install tiledb from the GitHub repo, you get parse_query_condition; otherwise that's hopefully on CRAN soon, because we're in pretty good standing at CRAN, with no open errors, so when we upload something it can appear in short order.

Oh yes, and the other thing I hadn't shown, but it's here as the last one: we had done everything so far with row conditions, but you can also open a TileDB array representing a data frame by saying only give me... hold on, what just... oh, newdf, not df, my bad. This is basically like dplyr: select rather than filter. You say which columns you want, as opposed to which rows; the two can of course be combined, and you always get the index columns back as well. Hope that makes sense. That way we can subset horizontally and vertically. And I think that's what I had for sparse arrays.

Yes, and the reason I showed the Gentoo and penguins example with the years is that for certain problem domains you do want incremental writes and appends to your arrays, and for that you just have to plan ahead and set the range values on your domains large enough. For example, with geospatial data the dimensions are encoded numeric, not character, so they don't have the null-string, empty-string trick; for numeric dimensions we have done things like just setting the domain to float min and float max. Then whatever additional values you may submit can get written, but you no longer have a logical constraint, say, that the x,y location is on your continent or planet or whatever; but that's basically how it works. So for incremental writes, make sure you don't limit yourself with the domains. And if you just write from a data frame with fromDataFrame and don't specify domain values, the function doesn't know any better, can't read your mind, and will set as the domain the values observed in the data frame. Then you can always write in between, but not beyond, in terms of values. I hope that makes sense. One other important aspect: individual writes are all immutable, which means they don't interact at all, are independent, and hence highly parallelizable. That's really neat: if you have really large data, you don't have to write one part after another and wait for completion; with sufficient compute resources you can spray the writes at an array in parallel.

Question by Paula about what function just gives you the index. I don't really think we have one for that, just like that... well, you could return the schema, right? You could print the schema of your array, and the dimensions function would return just the dimensions of the array.
It's a complicated question in the sense that, suppose you have a really, really large data set written on the cloud somewhere far away, and you just want the index. We're performance-obsessed, so we cannot just give you the index back, because that is already expensive: you actually have to traverse all the tiles to pick up all the index values. What we do offer, happily, is always a cheap operation: for every index we have the bounding values, so we can immediately give you back the minimum and maximum value of the index, which is sort of the outer hull. But if you wanted to query the full index, that's expensive; then we would have to bite that bullet, and you could do that query with maybe just one attribute and get the values back that way. But it's a good question, good suggestion.

I have a few more slides here with a few words on fromDataFrame and tiledb_array, sort of the bread-and-butter functions we've seen. fromDataFrame is really what I use day in, day out, for test examples and checking things, because data frames are just such a bread-and-butter data structure: all we want is to take one and write it to an array. We can write dense, the default, as well as sparse by setting the toggle; when indices are not given, it adds ad hoc row indices in either the dense or the sparse case. But once you pick a sparse array representation by setting the toggle when you write, you can have multiple index columns, and these can be numeric or character, so you can really index the way you would index a data frame, as we saw in the penguins example where we indexed on the species as well as the year. By default it writes compressed, so it's efficient; we can pass different array attributes and parameters to the function; and we can use append mode if we have high enough domain values, which I alluded to and will show again a little later.

In the other direction, tiledb_array is our main reader function. In the first versions of the tiledb package we had different accessors for dense and for sparse; it was done similarly in some of the other APIs, but we internally normalized a few things in the API to unify this. Now we just use the same function for access whether it's dense or sparse: you always just say tiledb_array, and when you read what was written it will tell whether it's sparse or dense, so you don't have to say it on the read. In the R package helpers I put in some niceties that I showed you earlier on a slide: returning a data frame, a matrix or an array; selecting by rows as well as by columns; row ranges via the dimension constraints, as well as the query conditions I added recently; and selecting which columns to return. That's what I had on the high-level functions.

We had a good question in the channel about checking for duplicates in the index: are duplicates allowed in arrays? Yes, and that's a change that came because a customer wanted it for something; I think it came initially from the Python API and some use case with pandas, I can't remember. So that's another flag that I hadn't mentioned, and I think it just follows from what one sets for sparse: when you say sparse, it also allows for dupes, whereas in a dense array you can't, since the row indices there are incremental. So duplicates are absolutely allowed in the sparse case. I can't say much to William's question in the chat there; that looks wrong, because I tested that.
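A short, hedged sketch of those two conveniences: column selection on read, and append mode on write. The mode argument follows recent fromDataFrame versions, and 'more_penguins' is a hypothetical data frame with matching columns.

    ## read back only two attributes; the index columns species and year come along anyway
    sel <- tiledb_array(uri_sparse, as.data.frame = TRUE,
                        attrs = c("bill_length_mm", "body_mass_g"))
    head(sel[])

    ## appending only works if the year domain was declared wide enough at creation time
    ## fromDataFrame(more_penguins, uri_sparse, mode = "append")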
So on a normal setup that should work; maybe we'll sort that one out later, or in the chat channel. fromDataFrame definitely takes tibbles and data.tables and data frames, whatever inherits from a data frame, and just coerces it down to a data frame before processing. So maybe tibble is not on your machine; I can't quite read the error, but that's surmountable, not lethal, and we can get to it. But let me maybe carry on here, if that's okay and there are no other questions.

We really think of TileDB as a unifying API, implemented in the C++ layer, for the different application languages and implementations that consume it. So what matters to us a lot is that the underlying API functions are present in the different access packages. Underneath, the R package really maps one-to-one to each of the C/C++ functions. So instead of using the highly compact, expressive R functions I've shown you, one can also do it more atomistically, more bare-bones, which one sometimes needs for more complicated cases or cases I haven't yet covered in the high-level functions. This basically is the long-form version of the quickstart_dense example we started with: again, two dimensions, a vector of the two for the domain, one attribute, create a schema from the dims and the attribute, create the array, set up data, open the array for writing. We haven't said much about row- or column-major order because the differences in behavior are not that meaningful at our level for an intro tutorial; just take that as it is. The way data then gets written through the queries, when you do it by hand, is that you essentially take each vector, here the data vector, and assign it to a particular buffer, with as many buffers as you have columns in the data set you're writing; you submit the query, you finalize the query, and you check the query. That works very well if you want or need to write it that way, though of course we don't recommend it, because the higher-level functions are more expressive. If you're eagle-eyed you will notice that this is a sequence of operations that all share the same first argument type, the query, so it lends itself to piping, if that floats your boat. We can run the same example with the native pipe in R 4.1.0; I guess that speaks to someone asking earlier whether R 4.1.0 would be needed: you couldn't run this example as written without it, but you can just drop in the magrittr pipe and then it works the same way. So basically everything that's in the API can be accessed directly, and if you go to the pkgdown documentation for the package you see a really long list of functions, because there is so much. It can be a little overwhelming, which is why fromDataFrame for writing and tiledb_array for access can be so helpful.

Here's another example one sometimes needs: we currently don't have a high-level wrapper for metadata consolidation. Basically, every write you do writes to the array; you can append to an array and write multiple times. When you read back without having consolidated after multiple writes, it has to piece the metadata together by hand, and you can help it by consolidating first after a sequence of writes. You would do that by calling array_consolidate. And here we show that when you ask for a configuration, you get a config object back, which in R's case is just like a hash map, a named vector, and you can set all the elements by their names.
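As a reference for the long-form path just described, here is a hedged sketch of the atomic write calls. The function names exist in the R package, but treat the exact signatures, and the use of the native |> pipe (R 4.1.0; swap in magrittr otherwise), as assumptions rather than gospel.

    uri_ll <- tempfile("quickstart_lowlevel")
    tiledb_array_create(uri_ll, sch)                # 'sch' from the dense schema sketch

    arr  <- tiledb_array_open(tiledb_array(uri_ll), "WRITE")
    data <- 1:16                                    # one value per cell of the 4 x 4 array

    tiledb_query(arr, "WRITE") |>
      tiledb_query_set_layout("ROW_MAJOR") |>
      tiledb_query_set_buffer("a", data) |>         # one buffer per attribute column
      tiledb_query_submit() |>
      tiledb_query_finalize()
    tiledb_array_close(arr)

And, similarly hedged, the configuration-and-consolidation pattern from the end of that section; the config key names vary between core versions, so treat them as placeholders.

    cfg <- tiledb_config()
    cfg["sm.consolidation.mode"] <- "fragment_meta" # consolidate the fragment metadata
    cfg["vfs.num_threads"]       <- "8"             # more parallelism on the VFS layer
    ctx <- tiledb_ctx(cfg)                          # functions pick this context up implicitly
    array_consolidate(uri_ll, cfg = cfg)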
So in that config example we're just saying: okay, read in parallel, write in parallel; on the virtual-file-system abstraction, also use multiple threads; and consolidate the fragment metadata. Then we take all these settings and create a new context object. The context object is a hidden global that all the functions access: if you don't give an explicit context argument, they pick up the one that sits in the environment of the package, so this is basically a hidden pass from here to there. It then runs with the context object, picks up the set configuration, and consolidates. That's the second example of what to do with the low-level API.

All right, any other questions? Paula's question... yeah, not quite. Let me go back to an example, I went too far. So the example you had: is this subsumed in the TileDB query condition? Not really. What happens here is we're just opening an array, selecting ranges or setting a query condition, and doing all of this on the array object, the S4 container; and in the behavior implemented by the accessor, all these things get set up. So the tiledb_query setup, making the different buffers ready, submitting the query, finalizing the query: all of that happens in this one line, because the accessor is high-level. I mentioned fromDataFrame and tiledb_array; what I should also have said is that tiledb_array then gives us square-bracket read and write operators, and those also abstract a lot of the low-level stuff away so you don't have to do it by hand. They basically do this setup before the query submission and query resolution. Hope that answers it. Great.

With that, Aaron and I have a couple of quick ones left. How about I do the couple of quick items on other API features, four short segments of a few minutes each, and then we take a break before I get to the application examples. I think that keeps us on course relatively well and gives a good logical point for the break. The first one is S3, because everybody is excited about the cloud, of course, and TileDB already works on the cloud; TileDB has been essentially cloud-native since day one. When you build TileDB, or when you just install the tiledb package, it comes with the features to use Amazon S3 as well as Google Cloud Storage and Azure. I mostly work with AWS. I commented this out here and show it on this slide as well: when you want to use a particular bucket in your namespace, you need your secret access key and key ID. If those sit either in the TileDB config object or as environment variables, in your normal session, your bash shell, or wherever you set environment variables on Windows, then you can access it that way. And that I'll show you; it's one change we made in the 24 hours between the first cut of the slides and this version. With this one you now see in Emacs that my little clock is spinning: it wasn't instantaneous, because it didn't go to my local disk, it went to the cloud, I think to Amazon's data center in the east. Oops, that's not what I wanted; that's the array object I wanted to show, and this is what I wanted to set up. Then we query, and again it takes a little longer than from the local disk, but not very much longer. And this example you should be able to run as well, because we set up two buckets for useR! 2021; Aaron will show the other, much larger one later.
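For completeness, a hedged sketch of that S3 setup: the bucket name and region are placeholders, and the credentials would normally already sit in your shell environment or in the TileDB config rather than in a script.

    Sys.setenv(AWS_ACCESS_KEY_ID     = "...",       # placeholders, never hard-code real keys
               AWS_SECRET_ACCESS_KEY = "...",
               AWS_DEFAULT_REGION    = "us-east-1")

    s3uri <- "s3://my-example-bucket/penguins"      # hypothetical bucket
    fromDataFrame(penguins, s3uri)                  # same call as for a local directory
    tiledb_array(s3uri, as.data.frame = TRUE)[]     # reads now go over the network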
And this one, while it is on AWS, is a public bucket, so I think you can get to it even without a set-up AWS ID and key; all it needs is our software's ability to parse an S3 URI. That is actually quite magic: I ran all the other examples with URIs that were in my temp directory or a temp file, and just by prefixing with the cloud scheme, plus buckets you have read access to (either public ones or ones set up by your research group or company), you can get to the cloud immediately. Everything else we said about selecting rows, selecting columns, and sub-sections works transparently from the cloud. So that one's nice.

Another feature that's useful, and where we have had support for a couple of months now (I'll maybe show it from the code example first), is that we are able to support the Arrow memory format, because the arrow package allows you to construct Arrow objects without us actually hard-linking against it. So in R I just loaded the arrow package, and here I created a really trivial, small int8 vector of three values, just a sequence. This is now an Arrow object, and we can use, and do use, a feature of the arrow package that allows a straight-up C-level interface. If you're a C programmer, it's a bit like accessing via pointers: you basically grab a pointer to the data and a pointer to the schema. There's a little bit of clever trickery going on here, because the pointers are really memory pointers; what R sees is just a double, but what happens behind the scenes with it is the magic. So these two calls allocate them, and then, because memory got allocated, you should be a good citizen and give it back to the system, so on exit we free it. We then use an exporter function that exists in the arrow package and just hand it the pointers for the data and the schema, in the types that Arrow can use. Now we've exported from Arrow (the "export" is from Arrow's viewpoint), so we've exported from Arrow to TileDB. We can do the opposite by using these opaque handles, aa and as, which I can't really print; they're still just these pointer values, but I can use the Arrow import function to import an array and a schema from aa and as and create a new object from that. Now I've recreated "new" back, and it's indistinguishable from the old one: it's still an int8 vector with the same payload. So that's pretty nifty. It's still relatively bare-bones, because we're currently doing this column by column, but we will be doing more there. The nice thing really is that it keeps deployment and build cleanly separated: you just use this with the tiledb package, which isn't linked or built against arrow in a hard fashion, and the arrow package, and all that's used is the C-level API that arrow provides; that's how arrow internally does some things too. We're really quite new on that, but it's useful. The slides basically show the same example and code that I had there, with the setup.

Two other features that are useful, and one that I'll show quickly... oops, that's time series, not time travel; that's what I wanted. We had said that writes are immutable and separate, and that we can also do several of them. So here I'm just going to do something really simple: I set up this very boring data frame, key and value, one to ten and one to ten (we'll see the difference in a second), and I'm just going to write it with fromDataFrame, indexed on key.
So that basically goes through. I then store my current time after having written that. Now I'm just going to wait one minute; I could have done the example with five seconds or thirty, but a minute is a bit more explicit, it makes the difference easier to see. Then we're just going to open the array again (you see that on the line where my cursor currently is), overwrite the value column by just adding 100 to the existing values, and write it back. So our data frame d had key and value, one to ten and one to ten, but the value column we increment by a hundred. This should be finished in a second now; that felt very long. So now we open it again, write, and take the time. And once we open it and read what was just written the second time, we see that we have the new values: a hundred got added. That's what's currently in the array: we opened the array, read from it unconstrained, and we see it's 101 to 110. But we can use the same bread-and-butter function to open it as of an earlier point; really, I wanted it at the... hold on, let me just show this too. We looked at the directory for the other array before; now we had two writes, so you see that I have two directories with data, with two different timestamps. That's what happens behind the scenes: each write is immutable. But when we then come back to read and tell TileDB that we actually want to read at a particular timestamp, the earlier timestamp, and look at that data, we in fact get the earlier values back: one to ten, one to ten. The other ones are still there; if I read at a different time point (here I'm just saying "more current", which is the same as unconstrained), then I get the updated, overwritten values back. So it's a little bit like git, keeping track of things and allowing you to come back to older versions, and it can be really useful for tracing and auditing purposes. That's time travel.

A similar feature, just an additional add-on (this is just the slide of what I showed), works with encryption. Rather than writing the data straight up, we can also pass it through an AES-256 encryption filter, and when you then read it back with the encryption key, you get the exact same data back. Let me show that example real quick. Again I'm setting up a boring domain and schema, and I set up an encryption key. I've forgotten the exact specification, but AES-256 has some requirements for what is actually a valid key: you can't just write "the brown fox" or "abc", it has to be a certain length to give a minimum of entropy. This one is boring but happens to qualify. Now I can just say: write me this schema with this encryption key; I'm using the low-level functions here and doing an array create. Then I open it with the key and write to it. If I then tried to just open it without the encryption key, TileDB would tell me: no, no, this doesn't work, this is encrypted with AES-256, but you haven't given a key, and the default is no encryption. But if on the read I supply the encryption key, all is well and I get my data back. So time travel, to get access to different writes, as well as encryption, are additional features that may be useful for particular use cases. And that puts me at a little over an hour and a good place to maybe have a quick break of, what shall we say, maybe ten minutes, until five after? How does that sound for everybody?
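Here is a hedged sketch of the time-travel sequence just shown; the timestamp argument of tiledb_array follows the package version discussed here and may differ in later releases.

    tt_uri <- tempfile("timetravel")
    fromDataFrame(data.frame(key = 1:10, val = 1:10), tt_uri,
                  sparse = TRUE, col_index = "key")
    t1 <- Sys.time()                                # remember when the first write finished

    Sys.sleep(2)                                    # (the live demo waited a full minute)
    arr <- tiledb_array(tt_uri, as.data.frame = TRUE)
    df  <- arr[]
    df$val <- df$val + 100                          # second, immutable write on top
    arr[] <- df

    tiledb_array(tt_uri, as.data.frame = TRUE)[]                  # current view: 101..110
    tiledb_array(tt_uri, as.data.frame = TRUE, timestamp = t1)[]  # as of t1: 1..10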
Yeah, one sec. Yes, the question about whether the data is saved twice; I'll just answer that, I may again have been too quick. The data really is written twice, and that would be redundant and wasteful, which you can overcome when you consolidate the data. I think time travel goes away after you consolidate, but you definitely have the option of not consolidating if you want the different writes to persist and retain the ability to travel back in time.

All right, it's five past, which puts us at the end of the ten-minute break we mentioned, so all good if we continue. It's really too bad that this is virtual, because it doesn't give much of a feedback loop between the tutorial participants and us, the presenters, which makes it a little harder to pace the breaks too. But we hope that with the example package we gave you, and with that package and the tiledb package both being on CRAN, you have the ability to experiment a little and follow along with the examples. In a similar vein, I have a set of examples that illustrate other aspects, and for one of them I will have to pivot to AWS. I just realized I hadn't actually logged myself in, so give me one second. All right, sorry for that. We're back, and I now have two prompts on AWS, which we'll need in just a minute; but that's the second or third example.

The first one is a pretty interesting one too and requires a little bit of setup, which I have both on my machine and on AWS, because it involves SQL. This is quite powerful, because TileDB, as we said, really is a storage layer as opposed to a compute layer, but a storage layer can cooperate nicely with databases that have plug-ins for storage layers. One of these is MariaDB. MariaDB is the newer variant of MySQL, quite popular and quite extensible. It just so happens that when you compile, build and install MariaDB with plugin extensions, the plugins have to be compiled exactly consistently with how the server itself is compiled. So if you want to stick TileDB into MariaDB, you have to compile the TileDB plugin the same way MariaDB the server was compiled, which also means you have to have the TileDB components there, so you really have three compilation requirements. That makes it a little harder to just ship a MariaDB plugin for, say, Conda or Ubuntu, because we don't really have control over how MariaDB is built there. So we have to sort of do it ourselves, or you can do it yourself, but then you have to have all the components there for building all of these parts by hand. It's all open source; it's not hard, just a little bit of work. And it's easy to avoid the work, because you can just rely on Docker, and that's what I'm going to show you. We provide, for free, a TileDB-enabled MariaDB container that also has R in it, so you can just pull that, and with that we can work quite easily. It really is three steps, and it may look like a foreign language to you if you've never worked with Docker. I, like a lot of other people, really quite like Docker, and then it's not that far out; it's just using Docker in daemon mode, which is not that much trickier. So if you have the suitable container, you can just run it, and that's important for accessing it: when it runs as a daemon, we give it a name.
Then we say: run it interactively as a daemon, remove the artifacts when it exits, and we pass one environment variable to MySQL saying: be lenient and allow empty passwords here, because for now we're just working with local files, so there's no security issue. That really is step one. I'm here on a Ubuntu machine and have these things set up, so that I don't have to type all of this out; I just have the commands prepared. This is the first of the two: again, I'm just running Docker, giving the running instance the name tiledb-mariadb so it runs under that name, with the options I just showed, including the empty password, and starting it. That's step one. The second step is that we can start an R session inside the Docker container by saying docker exec, interactively, as the user root, in the context of the tiledb-mariadb R instance we just started; and then, if I still have my ducks in a row, it should be exactly this command here. So now I have R, the current version, the way this container was built; I made no changes to it, that's just how the container is supplied by us. Hold on, let me do this quickly. That was the first command, then the second command; now I'm in here and I'm just going to copy this one.

By the way, this is terminal multiplexing; I hope it's not confusing you too much. In the same terminal frame I have several programs running, with the help of something that works on Linux and Unix called tmux, via byobu. In the container I just started R with tiledb, and now I'm copying and pasting in... oops, that didn't work, that kills me each and every time. Let me do it explicitly rather than with the Emacs copy. Now I'm again using my friend fromDataFrame, and I'm saying: let's take the penguins example and write it to a temp file. Keep in mind this is now a temp file inside the Docker container. And I just created a TileDB array, as we've done all tutorial long.

Where it gets more interesting is that I can (hold on, I'm just going to copy these commands line by line) bring up MariaDB. We're also loading dplyr and telling R not to tell us about the warnings and conflicts that come with dplyr. The trick then is that we're connecting to MariaDB via R's database interface package, DBI. One convention for MariaDB and the plugins and extensions is that you want to call the database "test"; I think that's something that's internally hard-coded. And once you have that (oops, I just want to pick this up) it gets actually pretty nice, because now we have a connection object which as such knows nothing about TileDB; as far as it is concerned, we called DBI and the MariaDB package. So we have a connection to MariaDB, and using the connection we can then use standard DBI and tbl interactions: point the connection at where we just wrote a TileDB array and proceed with SQL operations. dplyr will pick up the tbl expressions, translate them into SQL, and send them to MariaDB in the context of a TileDB array, which is pretty nifty. So, for example, the first example (and again this just shows the power) is that we can say: from the penguins, just pull out all the columns that contain "length" in the name. And this is really pretty clever in the way it's implemented, because this query at this point is still lazy.
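A hedged sketch of that DBI/dplyr path as run inside the container; it assumes the RMariaDB package plus the MariaDB server with the TileDB storage plugin are present, as in the Docker image.

    library(tiledb)
    library(palmerpenguins)
    library(DBI)
    library(RMariaDB)
    suppressMessages(library(dplyr))

    uri <- file.path(tempdir(), "penguins_sql")     # array written inside the container
    fromDataFrame(penguins, uri)

    con <- dbConnect(RMariaDB::MariaDB(), dbname = "test")  # "test" per the plugin convention

    lazy <- tbl(con, uri) |> select(contains("length"))     # the array URI acts as the table name
    lazy                                            # still lazy: only a preview is fetched
    ## collect(lazy)                                # materializes the full result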
If you take that lazy object — and I like doing it this way too — and feed it into str(), you just see two list elements, both shallow. This thing really only has the handful of lines that got printed; it's not the materialized array. It just knows that, okay, we have two columns here, here are the first few values, and there will be more data. The way it works with DBI and the dplyr database integration is that you call collect() to actually materialize your data. Then it really goes back and says, yep, get me this — and in that case it went back and did our usual penguins business, 344 rows, but we only collected two columns, so I have an actual table with the data.

That is pretty powerful, because now we've opened the door to an entire universe of analysis and access functions from the tidyverse: dplyr, dbplyr, the DBI interface and all the rest of it. None of that knows that TileDB is underneath, because that's seamlessly provided by the TileDB plugin for the database. And while we're using this here in a Docker container for ease of deployment, nothing stops you from implementing the same thing on a departmental or group server with the same components; it's just a bit of setup to front-load for the IT department.

So that's the example I just showed: the lazy, non-materialized query, and when we put collect() in, the query is fully realized. At that point I really have all the power of SQL as provided by MariaDB, via the connection with the plugin. And yep, this is how I set it up. I took this screenshot when my screen was laid out slightly differently; I have since realized it's better to work with a larger font, so now it's just three panes, but basically the code is here on the left, after we launch the container in the background. On the screenshot I first had one session to write and another session to read, whereas here I glossed over that a little and read from the same session that I wrote from — but it's really only the file access to that directory that governs it.

My laptop went to sleep, so I don't see your questions anymore... William asks whether the dplyr query is case-sensitive. Honestly, I have no idea — let me try. I don't currently work that much day to day with SQL, but I remember there were tricks, though maybe that differs between SQL implementations, whereby column names can be case-insensitive; I'm not entirely sure. Yeah, it seems insensitive: I can also say contains("length") in lower case. Good point, good catch — either works, because SQL doesn't care.

So, any questions? We asked all of you to write them in the Slack channel; I can't keep my eyes on everything, but I think the hand-raise or the chat here would also work, so if something comes up, let us know. Otherwise I'll leave this example here, having shown the basics, and just point out that you can absolutely run this at home if you can run Docker — which runs on a Windows laptop, a MacBook, a Linux computer, just about anywhere. You just have to docker pull that container and you're in business; you don't have to build anything. And there's another trick which I may come back to — I'm not quite sure whether I still have it on a slide — because the next example is bigger data.
So this example I like quite a bit, and it's why I had to quickly jump onto AWS: it's something I did on a bigger machine there, because we were at some point curious. We have other applications and other problem domains with really large data sets and appends to them, and I just wanted to see: am I there yet with the R package and the data.frame integration? So I went out to find a data set that everybody knows and see if I could really go to Size with a capital S.

I looked around a little. Many of you may know the package with example data sets one step up from penguins: nycflights13. It's a subset of flight data provided by the FAA to the American public, summarizing all flights in the US during one calendar year, 2013, restricted to the three New York-area departure airports. I thought that was kind of neat, so I went digging for more of that data. It's a little tricky: the package has a link to a data set that's no longer there, and the FAA site is terrible to navigate, but I kept googling and searching and eventually found this, which is pretty good. In the PDF slides, when you have them, the link is live — if I click on it, Firefox comes up, maybe on a different screen than the one I'm sharing — and the same URL is spelled out in another place too. It's a top-level page with links to the data set, under some IBM marketing initiative where IBM is showing off how much they can do with data. They basically provide two data sets: one is the full FAA reporting from 1987 to 2020, a much larger span than nycflights13 with many more columns — there's a lot of nonsense in there, and 194 million rows is of course unwieldy — but if you just want to get going and test, they very handily also provide a random sample of two million rows. So there are basically two files behind this, and I started with those two files.

Then, well, what's next? The data IBM curated for us is not bad. Oh, I should go back to the license: I don't think Aaron and I can give you the data directly, because it says somewhere on this screenshot, or the one below, that the data set is Creative Commons non-commercial. So I guess we can use it, but we're not allowed to redistribute it, since we are of course a commercial entity — but any of you can get to the data this way, and that's what matters.

The subset and the full set both come as a tar.gz, and interestingly enough, inside the tar.gz is a single really large CSV file that is itself compressed again. A compressed file inside a tar.gz makes no sense whatsoever, but that's just what they did. So there's a CSV file, compressed as an xz file, inside, and then we can work with it. That allows a couple of tricks. My bread-and-butter data reader is data.table's fread, which is fantastic because it can read in parallel and do all these tricks, but it doesn't deal gracefully with compressed files. To get chunks out, I did something relatively pedestrian: I stream the compressed CSV file to standard out and use sed to take chunks of rows, and then process those. That means that every time I want a chunk, I traverse basically the entire CSV file again; if I were to do this more repeatedly, I would probably just expand the file, split it into chunks, and then write from the chunks in parallel.
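The pedestrian chunk extraction I just described could look roughly like this; the file name and chunk size are made up for illustration, and note that, as said, each chunk re-decompresses the file from the start.

library(data.table)

csv_xz    <- "airline_full.csv.xz"    # hypothetical name of the compressed CSV inside the tarball
per_chunk <- 10000000L                # rows per chunk
chunk     <- 3L                       # which chunk to pull this time

from <- (chunk - 1L) * per_chunk + 2L # +2: skip the header row; sed is 1-based
to   <- chunk * per_chunk + 1L
cmd  <- sprintf("xz -dc %s | sed -n '%d,%dp;%dq'", csv_xz, from, to, to + 1L)
dat  <- fread(cmd = cmd, header = FALSE)   # column names are carried over from the first chunk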
Splitting and parallelising is not what my example did, though, because this was just a first feasibility study, so I kept it really simple. Similar to many of the examples we've all seen around the tidyverse, dplyr, and nycflights13, there are a couple of columns of interest: the year, the arrival and departure airports, and the airline. That gave me four possible index columns, where the airline, departure airport, and arrival airport are textual and the year is a number. For the three textual ones I automatically get NULL domains, so I can always append; I only have to take care of the year variable.

This is not particularly interesting or performant code, but it did the job for me over the course of a couple of hours, because the data file is really large. It basically just checks whether the compressed CSV inside the tarball exists, then loops over the file, keeps counter information for where to read from and to, and extracts a certain number of rows at a time. It also drops a certain number of columns. The data set is very rich because they flattened it really aggressively to fit the CSV format: when a flight is diverted, there are additional columns recording from which airport it got diverted to which other airport — and bad things can happen, you might get diverted again while in a diversion, and from a second diversion to a third, so it goes up to a count of five. Remember that the data spans multiple decades, so many of those columns are basically empty. I just chucked those out, and then I needed one helper function that in one case transforms a UTF-8 character column to a plain character column, and on a few columns turns booleans and factors into character.
So that's just basically the worker function I run over the extracted data. Otherwise I'm looping: I take the chunk I'm currently looking at and write it to the URI I came in with — which is an argument here, basically a subdirectory where I'm working — as a sparse array with the four index columns I want. For the one index column that's not character, I set a low and high value so I can append to it later. I know my data will not be before the Unix epoch, so I can arbitrarily say: let time commence in 1970, the hello-world standard of time, and run through today — whereas the data really spans 1987 to 2020, so I could also have used January 1st, 1987. A little sloppy, but fine. I also keep track of how many elements I've written.

And yes, I do these things in two branches: for the first chunk I also read the column names, given whatever initial chunk option I set, and create the array; for all other chunks, when the array already exists, I just take the chunk at that point in the loop, set the column names, append to the array, and keep track of how many rows we've written — until we hit 194 million rows. I think I did that in increments of 10 million rows at a time, or something like that, and even on AWS on a decent machine it took a bit of time. Again, this wasn't written to win a Kaggle competition or to be highly performant; I just wanted it to complete. If I had to do it again, I would split the CSV file into chunks and use a quick and simple parallel loop in R to write all the chunks at once, but that wasn't really needed.

Once we have that, we can actually go in. This is the directory in which I did this, and there are two TileDB arrays: the full airline one and the two-million-row subset. create_arrays is basically the helper function I just showed on the last two slides, and I kept the two compressed CSV files — that's the data I worked with, which came out of the downloads from the IBM site whose URL I gave — plus a little helper I stored there to download the file and extract the compressed CSV out of it.

So hold on, let me copy again from the example. That's actually the directory I'm in, but it's an absolute path, so it would still work; I think I can paste all of it at once. And this now goes off against the TileDB backend, driven from the TileDB R package, against the 194 million rows — and that took three seconds or so, which is about as fast as we can reasonably go, because I got back roughly 776,000 rows: I asked for all the data from the beginning of 2000 to the end of 2000, arbitrarily, for the reporting airline United Airlines. In the beginning I was more eager and said, let's look at all the airlines and constrain only on dates, and the data that came back for that request was still so large that my R session died — it was more data than the machine could handle. So this really is big data; you have to operate on it chunk by chunk. That was a live example of working with really large data.

And I vaguely remember why I had two prompts here, so let me just check the disk usage of the flights array: it's an 11-gigabyte TileDB array — it uses compression, though relative to the source maybe not that heavy.
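Before moving on: in rough outline, the create-then-append loop I described looks something like the sketch below. The column names are placeholders, and the exact fromDataFrame() arguments (col_index, tile_domain, mode) are from my memory of the package around that time, so treat this as a sketch rather than gospel and check args(fromDataFrame) in your installed version.

library(tiledb)

index_cols <- c("FL_DATE", "UNIQUE_CARRIER", "ORIGIN", "DEST")   # placeholder column names

ingest_chunk <- function(dat, uri, first = FALSE) {
  if (first) {
    ## first chunk: create the sparse array, and give the (numeric) date dimension a
    ## wide-open domain (epoch to today) so later chunks can always be appended
    fromDataFrame(dat, uri,
                  col_index   = index_cols,
                  tile_domain = list(FL_DATE = c(as.Date("1970-01-01"), Sys.Date())))
  } else {
    ## subsequent chunks: append into the existing array
    fromDataFrame(dat, uri, col_index = index_cols, mode = "append")
  }
}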
The heavily compressed CSV file — which I guess compresses better — was at four gigs, and the full array at 11 gigs. But then we can do additional operations. We looked at query conditions earlier, so I can set up a first query condition, all arrival delays greater than or equal to 120 — I think that's minutes, stored as a float — and the same for departure delays, all greater than or equal to 120, also a float. Query condition objects initialized with the low-level API can be combined by saying: first query, second query, and the operator AND, so I can apply both at once. My copy-and-paste didn't work, sorry — let me paste it over, and then I can fire up that query. Remember that beforehand we had roughly 770,000 rows, but now I only want the particular subset with these delay properties, and that reduces it down to 22,000 rows, give or take. Once again, it takes basically no time to get this out of a really large array.

With the newer helper function that's in the GitHub repo but not yet on CRAN, we can also write essentially the same thing as a parsed expression. Note how I'm cheating here: this parser cannot know that 120 was meant to be encoded as a float, because we don't tell it anything about the schema — there's a disconnect between setting up the query condition and the schema at this point, because we haven't connected to an array yet. So it would set this up as an int, and then the query doesn't actually evaluate, because we don't have — and probably should add — a simple, smooth cast from an int to a float; you really have to have the right type in there. But if you do it this way, essentially forcing a float by writing a floating-point literal in the expression, you get the same query running. Numerically, of course, it's not exactly the same: it comes back with about 400 data points fewer than the hard-coded version, which may have to do with how analysts at the FAA wrote down the records — and which, you know, makes some sense.

So that's what we can do with the flights data set directly from an R prompt, and this slide is what I just showed — the little detail about making sure the value gets picked up as a float. And this is effectively remote access even though, in a sense, it's a local file system: we only get back the slices of data that we request. I could have done the same over cloud storage if the array sat in an accessible bucket, which right now it doesn't — it just sits on AWS on a machine — and we're efficiently getting 22,000 rows out of 194 million.
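Spelled out in the low-level API, that delay filter from a moment ago looks roughly like this. The attribute names and the array path are placeholders, and depending on the package version the condition may be attached to the array object slightly differently.

library(tiledb)

uri <- "airline-tiledb"    # placeholder path to the flights array

qc_arr <- tiledb_query_condition_init("ARR_DELAY", 120, "FLOAT64", "GE")  # arrival delay >= 120 minutes
qc_dep <- tiledb_query_condition_init("DEP_DELAY", 120, "FLOAT64", "GE")  # departure delay >= 120 minutes
qc     <- tiledb_query_condition_combine(qc_arr, qc_dep, "AND")           # both at once

arr <- tiledb_array(uri, as.data.frame = TRUE, query_condition = qc)
res <- arr[]               # combined with whatever selected_ranges() restrict the slice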
But one thing I quite like is to use one additional option for Docker: if we bind-mount a particular outside directory into the container — pwd in bash being the current working directory — it becomes accessible on the inside under a mount point. We can then do exactly the same thing of spawning MariaDB and R: start MariaDB, load dplyr, open a DBI connection to MariaDB exactly as I did locally on my machine, but have the usual DBI connection object point at the mounted directory, which from inside the running container reaches the outer file system. That way I get at the outer airline array we wrote before, and we get the same thing on 194 million rows via the dplyr logic. That will just take me a second to copy and paste, but not too long, and I think it's worth it.

So let's just — and this was one of the reasons I had two sessions here — do that on the left-hand side. And again it didn't copy... all right, that ran, and then this one I'll just restart — oops, restart inside the container is really what I wanted. The key here is the -v argument for bind-mounting an outer directory onto an inner directory inside the Docker container. No, I just rearranged the layout; that was a little hiccup there. So now we've started this, set up a DBI connection to database "test" to get access to the MariaDB extension — in this case the TileDB one — which it will resolve, I think, by finding the directory. And then — oh, sorry, my bad: I pivoted back because the screen was constrained before I rearranged it, and inadvertently started from the wrong subdirectory. Actually I can save myself the restart; let's just do it here in the Docker container. There we go: in the Docker container, with the connection object set up, instead of just "airline" — because it's not in the working directory — I need to expand the path: the mount point is where we mounted it, and then below that sits the TileDB flights array. And there we go, I get a lazy query back. It comes back immediately, I guess by just querying the schema information and getting a LIMIT 10 back of — one, two, three, four, five, six, seven — columns that match the selection criterion; again I kept the selection cheap. But now I have TileDB via SQL, on an ad-hoc array with 194 million rows that I wrote from a CSV file, which I can access directly — TileDB via R, or via the MariaDB extension.

Dirk, do you have thoughts on the pros and cons of querying directly with R versus through the MariaDB interface?
I would say it's probably driven by different use cases. When the analysis you want to perform maps naturally and you're fluent with all the dplyr verbs and idioms, then going via MariaDB is very appealing because you can immediately work with dplyr. I looked into doing what a couple of other packages have done — taking the dplyr expressions and translating them directly into something resolvable on top of an array — and it turns out that is a metric ton of work, probably as much work as we already have in the whole package. Just remember how I used selected ranges to select by rows and so on: we would basically have to pick up the filter, select, and mutate operators, work out the operations they perform, and re-implement them straight on top of the TileDB arrays. I was a little fearful that that would not only be a lot of work but also a venue for possible infelicities and bugs, and I figured that, as a first proof of concept, it's much easier to rely on what's implemented, tested, and widely used, and just go to dplyr via DBI and that interface. The main constraint is that deployment is a little harder, because you have to have MariaDB built with the TileDB extension, but it's not insurmountable. I would expect a small amount of performance to pay for having MariaDB in the middle, but most of these operations happen effectively with zero copies where possible, so I wouldn't expect it to be a deal breaker.

And can you query data present in MariaDB together with the TileDB data set? Well, what are we really doing? We're using MariaDB to access data stored in a TileDB array via the TileDB extension to MariaDB. So you can get TileDB data via MariaDB; you can't use this particular path to get non-TileDB data, but you can get non-TileDB data from MariaDB anyhow, through its ordinary tables. Maybe that wasn't directly your question, so feel free to rephrase and we'll revisit it.

As for the follow-up question: an expression such as this one — Aaron, if I highlight on the screen, do you see the highlight? Oh yeah, perfect; it's always a bit unclear whether the cursor follows. We're in a dplyr pipeline here, and I could have ten more verbs in it, but the key is that they get resolved by going via DBI, because dplyr knows this is a DBI table object. So it translates the pipeline into SQL, and we get the result by having the MariaDB SQL engine resolve it, for data that happens to be stored in TileDB arrays. It's the combination of both, and you need them together; we don't have a SQL query engine inside of TileDB, so the only way to run SQL right now is via the MariaDB extension. I hope that answers it.

Okay, great — that was what I had on the really large data in the flights example. While I'm here, let me do one piece of housekeeping so I don't forget: we started a Docker container, so let me stop it — there you go, that one's gone — and I can log off over there as well.

Great, on to the next example: a little bit of geospatial. LiDAR — light detection and ranging — is quite popular these days in a variety of fields, with lots of spatial analysis, and we have some really performant R packages, in particular lidR (I don't know quite how it's pronounced) by Jean-Romain, who is motivated by forestry examples.
There's also a lot of talk, of course, of LiDAR helping with autonomous driving, using LiDAR as opposed to full image resolution, and there are a great many public data sets as LAS or compressed LAZ files. Because these are multidimensional arrays, they map really well to TileDB. One thing we have to do, though, is read LAS files, and one good way is to go with the Point Data Abstraction Library, PDAL, which can be built such that it knows LAS and a lot of other spatial data formats, as well as the TileDB extension. So here again there's a Docker container, tiledb-geospatial, in which PDAL — which, not unlike MariaDB with its extensions, can be built in a large number of different configurations — has the TileDB configuration added. You drive a conversion of LAS or LAZ files with PDAL by giving it a JSON control file; it's just a little array with two entries saying: we want the LAS reader, so we're taking in a LAS or compressed LAS file, in this case autzen.las, and we're writing from PDAL with the TileDB writer to the autzen TileDB array, with a particular chunk size — fairly large chunks.

All of that is described pretty well on the TileDB website for geospatial, and I did pretty much the same, invoking it from the Docker container, pointing at this particular file — a demo file that has been used in a couple of other demos and is actually stored at the PDAL GitHub repository. Oh, and actually, sorry, the slides show how I set this up: I did this from R, basically just wrapping R around shell scripting. I want this particular LAS file, which I can get from a geospatial research group site at the University of Illinois — it's a particular Cook County (where I live) LAS file, and we'll visualize it in a second; I think you'll recognize the subset I chose. If the file isn't present, a little bit of conditional R sets the options to a large timeout and slurps these 450 megabytes in. I've already done that and created the array, so we don't have to do it live — both for the tutorial flow and to spare the connection — so we basically skip over that.

Similarly, if the TileDB array's top-level directory does not exist, we have to invoke Docker to create it. This is similar to the previous example: we're running Docker, and I'm using another option I didn't use before to make sure it runs under the same user id I have on this Ubuntu machine, so I end up with the same file permissions. We use the -v flag again to mount the current working directory, writing it inside the Docker container at the mount point /data, and tell Docker to start in /data. That just means the process it starts with this command has all this stuff local: the pipeline.json that I set up, where the input file is local and the output array is local, all sits in this directory. Then all you really have to do is run that, and it doesn't take very long — a couple of minutes, because it's 450 megabytes. Once that has happened, you have the array in a local directory, just as I have here, and you can reproduce the same analysis in R in exactly the same way.
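Driven from R, the conversion boils down to writing the small JSON pipeline and invoking PDAL inside the container. The image name and file names below are assumptions for illustration, roughly matching what was shown on the slides.

## write pipeline.json into the current directory, then run PDAL inside the container
pipeline <- '{
  "pipeline": [
    { "type": "readers.las",    "filename":   "/data/autzen.las" },
    { "type": "writers.tiledb", "array_name": "/data/autzen_tiledb", "chunk_size": 100000 }
  ]
}'
writeLines(pipeline, "pipeline.json")

system(paste("docker run --rm -u $(id -u)",        # run as the current user so file ownership matches
             "-v $(pwd):/data -w /data",           # bind-mount the working directory as /data
             "tiledb/tiledb-geospatial",           # image name is an assumption
             "pdal pipeline pipeline.json"))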
With that, I can now hop in and go to the code after the conditional writing. So again, TileDB is loaded, and I — hmm, wrong directory; let's try that again, now I'm in the right place. LiDAR data is geospatial, typically with x and y coordinates on the ground and z for height, as well as whatever the laser returns — often color values and what have you. Here I'm just requesting a particular chunk from that LiDAR data set — I've forgotten how large it is, all rows in those 450 megabytes — but basically I have 108,000 rows over 15 columns, and we just got that out of a TileDB array.

One of the workhorse packages for working with LiDAR data is the lidR package, which I quite like. It seems to have a slight disagreement with the PDAL-written array over one particular column, which we always need to cast to integer before we hand it to functions from the lidR package, but then I can create a LAS object from the array I've read. So I read the TileDB array back in and fed the 15 columns into the lidR package — I wasn't very creative with my names here and just called this ll, for the LiDAR LAS array — and now we're in the realm of the lidR package. It has a default plot method which, because LiDAR data is geospatial and you want to fly around it, lets you zoom and rotate. The subset I picked from this particular tile of LAS data is what used to be known as the Sears Tower and is now the Willis Tower, and the LiDAR measurements that come back just scatter at particular heights and haven't been filled in — so this is all, whatever it was, 108 stories of the Willis Tower. That's one of the two plotters one gets by default when installing the lidR package.

There's another one I quite like, which I found referenced on the GitHub repo of that package: its author, in Québec, has written — essentially reusing the rgl package — a faster and lighter plotter. You didn't see the improved access to the plotting device just now because the subset I chose is relatively small, but notice the message that the viewer must be closed before running other R code. When we plot with rgl, we take the R vectors and actually transform them into other data structures, make copies, rearrange and allocate, before passing them to OpenGL for plotting, and that takes a moment. This one is more clever and passes the R data straight through — but that means we're paused in execution while the viewer is running. It's on GitHub in a repo, lidRviewer, and I quite like it because it's lighter weight. That command works for me on my machine because I installed lidRviewer from GitHub; it may not work for you until you install that package. So that was that example.
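Condensed, the read-and-plot step looks roughly like this. The column needing the integer cast differs by file, so ReturnNumber here is only an example, and the array path is the one written by the PDAL step above.

library(tiledb)
library(lidR)

arr <- tiledb_array("autzen_tiledb", as.data.frame = TRUE)   # array as written by PDAL
pts <- arr[]                                                 # or a selected_ranges() slice

pts$ReturnNumber <- as.integer(pts$ReturnNumber)             # lidR expects an integer here (example column)
ll  <- LAS(pts)                                              # build a LAS object from the columns
plot(ll)                                                     # default rgl-based viewer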
Then we have, real quick, one more example, which connects me back to what I worked on for many, many years: financial data, mostly from exchanges, which is timestamped — these days at very high resolution — and which makes it challenging to get good example data, because selling data is a major source of revenue for the exchanges. What I found is in a registry of open, freely available data sets — and sorry, I flipped the two earlier: this is the one that's non-commercial, whereas the earlier IBM flights data is of course US government data that could be redistributed, since it's produced on US tax dollars, so everybody has access to that. This open-data registry lists the non-commercial one, and the exchange has some fairly blunt language in there, which I quoted, because that's just how exchanges tick: if you want higher-resolution data, please talk to us and we'll charge you. So what we have here is real data from the exchange, and it grows — they update it every hour — but it's a little boring in that it is minute bars rather than really high-frequency transactional data. Still, it's not so bad. When I first looked into this last fall I downloaded a few data files, and I've left it at those.

For a really simple and quick example, I again append multiple CSV files into one common array, for which I just have to pick up column names and indices, do the array description and creation in the first pass, and in all other passes just append. There's a helper that collates the date and time columns into a datetime object, drops them, doesn't do much more, and turns the result into a data.table object. In the writing step, on the first pass through the loop we create, and on all others we append — that's pretty straightforward, and once you've downloaded a little bit of data it runs relatively quickly.

Once you have that — and for this example I have to remember which directory; I think I have it in the code example, yep — we need TileDB and data.table. This particular array sits for me in a one-off directory where I worked on this; this is the list of files I ingested, just nine or so of them from last November, and this would be the worker function to ingest them, but we've done that. So I'll just do one quick example of reading it back: we open the URI as a data frame, and now we can show, for example, how to get time-series data frames out of a large array. It's a European exchange, in Frankfurt; one of the securities listed there is the car manufacturer BMW, a consumer brand everybody is familiar with. Here I just say: give me BMW, no other company, and give me an hour's worth on November the 4th — not a particularly impressive day, it just happens to be a recent day from when I downloaded the data. Then, to plot this, I use a little helper package, rtsplot I believe — a not-that-well-known package with which you can easily set layouts and combine plots. It's an hour's worth, so we get 60 one-minute bars, and below that a volume bar. Voilà — that's the data we got from there, stored as minute bars, plotted in a financial bar layout.
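The slice behind that plot is essentially a selected-ranges query on the security plus a time filter. In this sketch the dimension, column names, and array path are placeholders; for simplicity it slices on the symbol and filters the hour in R, whereas the actual example also constrained the time dimension in the query, and the tutorial used the small helper package mentioned above for the two-panel layout.

library(tiledb)
library(xts)

arr <- tiledb_array("exchange-minute-bars", as.data.frame = TRUE)   # placeholder array path

selected_ranges(arr) <- list(symbol = cbind("BMW", "BMW"))          # just this security
bars <- arr[]

hour <- subset(bars, time >= as.POSIXct("2020-11-04 10:00:00", tz = "UTC") &
                     time <  as.POSIXct("2020-11-04 11:00:00", tz = "UTC"))
prices <- xts(hour$close, order.by = hour$time)                     # 60 one-minute bars
plot(prices)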
I'm just seeing the question from Kim Yano — absolutely. The lidR package by Jean-Romain, for example, works with multiple files: you index the different files and combine them. We wouldn't do that; we write everything into one really large array and just index by x, y, and z — and, if you wish, even time, because you could have LiDAR measurements from pre-harvest, mid-harvest, post-harvest, whatever, and index by different dimensions. That makes it really rich, because you don't have this soup of dealing with a large number of files and picking the right ones: it's all in the index, and TileDB handles the access for you. That's actually a good key example of why this works so well. Same here: if you look at the website for the Amazon open data registry, each hourly snapshot is its own directory, and in each hourly snapshot is one CSV file. If you want multiple snapshots, you have to find the right files and so on; instead, you just write them all into one large array and index by time. It's much better — and that's the example I just showed on one slide.

That brings me to the end of what I had, and we should probably pause a little. Aaron, you're muted — you're unmuted. Yep, so, five minutes? Should we take another ten-minute break? Yep, that works. I will be here, and now that I no longer have to talk and make a mess with my slides and code examples, I will monitor the Slack like a hawk and try to answer questions there while Aaron gets ready, and then watch the chat for him while he presents. Great, so we'll pick it back up at ten past.

Let me just unmute for a second: Aaron and I were checking notes on the SQL question and whether we understood it correctly. I think we finally realized, by going back and forth, that the question really is: can you have a SQL query that goes against TileDB content as well as other content that MariaDB may be set up to query? And the answer is yes, absolutely, because that's the beauty of a SQL execution engine: you fire a SQL query at it and it resolves the query, and if a table it needs happens to be TileDB data, it reads it via the plug-in in this build, and if there are any other, normal database tables, it reads those too — and then joins, unions, does whatever the query says, as a database engine does. So that would work.

There we go, a bit of a Zoom glitch there. Okay, so for the last demo today we're going to look at a more biological use case, and I'm going to walk through the process of using TileDB to store results from a large collection of genome-wide association studies, which are called GWASes for short. No problem if you have never heard of a GWAS; I'm going to give you a high-level overview just for context — you don't need to be a genetics expert or anything. All you really need to know is that each individual GWAS produces millions of summary statistics, and we want to store them in a way that makes those statistics easily accessible for downstream analysis. The goal here is to provide a more in-depth example that requires some consideration of how best to store our data in TileDB, and we'll examine some of those considerations for the GWAS database we want to create through this tutorial.

Just a little bit of background to get everyone on the same page. At the highest level, a GWAS lets you identify regions of the genome that are associated with a particular trait — oh, and I just realized I'm not sharing my screen, am I, Dirk? (I was just about to point that out to you.) Thank you, there we go, hopefully everyone can see that now. Cool. Okay, so GWASes let you look for regions of the genome associated with a trait, and traits could be anything: hair color, risk for alcoholism, or the risk of developing a particular type of cancer. The way you perform a GWAS is to sequence the genomes of a large population of individuals so you can identify the genetic variants in each person — variants being the DNA sequences at a particular spot that differ from what's typically observed at that location. So 95% of people might have an A; you have a G.
We're all carrying around millions of variants, which is what makes us all unique. The question you answer with a GWAS is: which of those variants influence a trait of interest? A fun example, performed by 23andMe a couple of years ago, is: are there any genomic regions associated with a preference for strawberry ice cream? The way you carry this out is you ask, say, a thousand people whether strawberry is their favorite flavor of ice cream, yes or no, then you examine each variant site individually and perform an association test that determines whether more people with a particular variant said yes than you would expect by chance.

Here's what GWAS results typically look like. Each row is a different variant, and the identifier in the first column tells you where in the genome that variant is located: the variant in row one is on chromosome one at the 15,791st position. Then we have two letters — the two possible sequences at that location, C being the more common one and T being the alternative variant — and then our summary statistics for the association test: a beta value, which is the estimate of the effect; a t-statistic and a standard error for the beta value; and a p-value, which gives the significance of the effect. That's what the files we're going to ingest into our array look like.

The data we're working with comes from the UK Biobank, an incredible effort in the United Kingdom that has sequenced hundreds of thousands of individuals and collected all kinds of data about them — survey data, medical imaging data, you name it. It's an incredibly rich resource for biomedical researchers and a massive data set, and it has been leveraged to perform GWAS analyses for thousands and thousands of different traits, the results of which are all publicly available. You can browse the full set of results in a Google spreadsheet that I've linked at the end of this section, and for each GWAS performed there's a compressed TSV file available with the results. Each of these files contains about 10 million rows, and they're usually about 500 megabytes compressed.

Okay, so the goals we're trying to accomplish: we want to take all this data and put it into a TileDB array so the results can be sliced directly without having to download the files first — especially if you want to do something like make comparisons across traits, which would otherwise mean downloading all of those files before doing any kind of analysis. We want to query the data by genomic region, so if I'm interested in a particular gene I can easily slice all of the variants located within that gene's region. And finally, we want to query traits directly by their descriptive names, so we don't have to perform a separate join with a lookup table, for example.

Okay, let's get started. If you want to follow along live, you can use this snippet to create a copy of the tutorial script in your local directory. One of the first things the script does is use the download-GWAS-files helper function to, well, download the GWAS files — this is part of the same package Dirk was showing you earlier — and that will download six different files that have all been slimmed down for this tutorial, just to keep it light and fast; each one is only about 15 megabytes, so if you do want to follow along, they shouldn't take too long to download.
The files, like I said, are slimmed down: they only contain a subset of all the chromosomes. Another change I made was to parse the genomic positioning information out of the variant IDs — these are the four separate columns of information I extracted. The reason is that I want to use this information to create an array that's sliceable by chromosome and chromosomal position, so we needed to separate those out into their own columns.

This is a diagram of how we're going to model the data. We create an array with one dimension for the chromosomes and one dimension for chromosomal position. With that layout you can perform a typical R-style query, indexing by a row and a column name to retrieve the corresponding cell: if this is chromosome 3 and this is, I don't know, the fifth position, that would grab this cell here, which contains all the relevant information for that specific variant — all the summary statistics, stored as separate attributes in our TileDB array. Then we add a third dimension for the multiple GWAS analyses, so this is a three-dimensional array, and that third dimension is for the individual studies performed by the UK Biobank. That lets us efficiently grab specific regions for specific traits, which we can pull into R and visualize, or do any other kind of statistical work you'd want to do in R.

So these are the final dimensions, and remember: when you're designing an array, the dimension order is important. We're using phenotype as the first dimension because in our use case we typically query for GWAS results for a particular trait, so selecting one specific trait immediately filters out the vast majority of the results and TileDB can quickly zero in on the relevant slice of the array.

I'm going to switch over to R so we can look at the code and start building this. This is the example script included with the tutorial package — again, it's there if you want to follow along later rather than live. We load a couple of the packages we need; the data directory is where I'm going to download those GWAS result files, and this is the location where I'm going to create the array. This is just some housekeeping: we're going to make some ggplots, so I set up a theme that makes them look a little nicer. Run this to download the files — I've already downloaded them, so I don't need to.

On to creating the array. The first thing we need to do is create the dimensions. Our first dimension is for the GWAS phenotypes: we want to query by the names of these traits, so we set the data type to ASCII to create a string dimension. The second dimension is chromosome, which also uses strings to accommodate the non-numeric chromosomes X and Y. The third dimension is chromosome position; positions are denoted by positive integers, so we store them as uint32, and I'm limiting the domain to the size of the largest chromosome as a kind of safety check — if you're trying to ingest a position beyond roughly 250 million, there's probably a problem with your data. So we create those, then combine our three dimensions into a single domain for the array, and this is where we specify the dimension order: phenotype, chromosome, position.
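In code, those three dimensions and the domain come together roughly as below. Dimension names and the example tile extent are illustrative; string dimensions take NULL for domain and tile, while the position dimension gets the explicit integer domain.

library(tiledb)

dim_phenotype <- tiledb_dim("phenotype", NULL, NULL, type = "ASCII")            # string dimension
dim_chrom     <- tiledb_dim("chrom",     NULL, NULL, type = "ASCII")            # "1".."22", "X", "Y"
dim_pos       <- tiledb_dim("pos", c(1L, 250000000L), 10000L, type = "UINT32")  # largest chromosome ~249 Mb

dom <- tiledb_domain(dims = list(dim_phenotype, dim_chrom, dim_pos))            # order matters: phenotype first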
Okay, the next step is the attributes. The attributes will be all the remaining columns in the results files that we're not using as dimensions, and there are really two categories: two character columns for the two alleles, the reference and the alternative allele, which we store as chars so they come back to us as characters in R; and the summary statistics, which are all doubles, so we store those as float64 in our TileDB array. As far as compression goes, TileDB has a wide array of choices for compressing your data, and it's very often worth experimenting to find a good balance between performance and storage. Here I'm just using Zstandard for everything, because it provides a good balance between the two and generally works well, but if you're creating a massive array with a new schema, it's worth trying different compression options to see how they affect your results.

We do that, and then we tie the dimensions and the attributes together into our schema. I'm setting the allow-duplicates flag because there are cases where different variants share the same genomic start position, and we want to ingest both of them into the array. So we create our schema and then we create the array. I usually include this little snippet in my scripts which checks whether there's already an array at the location I want to create it at, and if there is, deletes it. Be careful with this — if you have large arrays that took a very long time to create, you probably don't want to do this — but it's handy for experimenting, and it's nice because it uses TileDB's virtual file system functionality, so it works whether the array is local or remote on S3. I think I already have an array, so I will delete it. Then we run tiledb_array_create, and that should create our new, empty array. We can verify: yep, there's the UK Biobank GWAS array, and if we look inside, there are no fragments written yet.

Now we ingest the actual results into the array. The first thing is to open it back up in write mode, which I'll do here; if we print that out we get the nice summary information telling us where the array is, that it's a sparse array, and so on. For the ingestion itself I just loop through each file — here's our vector of the six GWAS results files we downloaded — and here is where the ingestion actually happens: for each file I use the vroom package to load it into R, which is great for reading large tabular data, and then we perform the write. I'll run that, and it goes through each of the files one by one. We're doing this locally and serially because the files are very small, so it's not a big deal, but this approach wouldn't work if we were trying to store the entire UK Biobank data set, for example. As Dirk mentioned earlier, though, TileDB supports fully parallel reads and writes, so we could easily have performed this in parallel: if you're at an institute with access to an HPC system, for example, you could have dispatched batches of files to separate nodes that all write to the array simultaneously.
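Pulling the attribute, schema, and ingestion steps together, a condensed sketch might look like this. The attribute names, file list, and array location are placeholders, the allows_dups setter is how I recall the package API, and the actual script opens the array explicitly in write mode before ingesting.

library(tiledb)
library(vroom)

zstd <- tiledb_filter_list(list(tiledb_filter("ZSTD")))                    # Zstandard for everything

attrs <- c(
  lapply(c("ref", "alt"), function(a) tiledb_attr(a, type = "CHAR",    filter_list = zstd)),
  lapply(c("beta", "se", "tstat", "pval"),
         function(a) tiledb_attr(a, type = "FLOAT64", filter_list = zstd))
)

sch <- tiledb_array_schema(dom, attrs = attrs, sparse = TRUE)              # dom from the previous snippet
allows_dups(sch) <- TRUE                                                   # different variants can share a position

uri <- "uk-biobank-gwas"                                                   # placeholder array location
if (tiledb_vfs_is_dir(uri)) tiledb_vfs_remove_dir(uri)                     # start fresh (careful with real arrays!)
tiledb_array_create(uri, sch)

gwas_files <- list.files("data", pattern = "\\.tsv\\.gz$", full.names = TRUE)  # the six slimmed-down files
arr <- tiledb_array(uri)
for (f in gwas_files) {
  dat <- vroom(f)                                                          # columns named like the dims/attrs
  arr[] <- dat                                                             # sparse write of this file's rows
}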
We also offer a service, TileDB Cloud, that provides serverless infrastructure, so using that you could spin up as many nodes as you need on demand and ingest all of the result files simultaneously. Oops — I thought I had a slide for this; let's see, did I refresh? No, I don't have that slide here, but it's okay. We have functionality on TileDB Cloud that lets you write user-defined functions, so any arbitrary task can be defined as a UDF, and you can run those in task graphs distributed across as many serverless nodes as you need. Later on I'm going to show the full array I created with the UK Biobank data set, and that was ingested using TileDB Cloud UDFs — but for this little one it's fine to do it serially, as we did here.

So that's finished. To actually query the array we need to reopen it in read mode; I'm indicating that I want the results to come back as a data frame, and these are the specific attributes I want to retrieve from the array. We're going to look at a couple of different types of queries we can perform. The first is a simple R-style query where we use square-bracket indexing to subset by row and by column: we index the array to pull out just the GWAS results for water intake, one of the phenotypes they looked at, and limit to chromosome 20, and I wrap it in a tibble just so the results print nicely. Printing those out, you can see our dimension for phenotype, our dimension for chromosome, and then the third dimension, which we didn't limit, so it returns all of the results in that chromosome.

This data might be easier to visualize, so — let me get my Zoom face out of the way — we got back about 290,000 variants from that chromosome, and here we're looking at the negative log10 transform of the p-value, which spreads out the small p-values. This is called a Manhattan plot, which is typically how you visualize GWAS results.

If I want to query on all three dimensions, I use the selected-ranges approach that Dirk showed us earlier: we create a list with an element for each of our dimensions and provide the query range we want to limit to. I'm doing the same thing — results for water intake on chromosome 20 — but now I'm adding a range for the chromosome position, so we pull out only the variants located between five and six million base pairs. I attach that to the selected-ranges slot, re-query the array, and update our plot: the x-axis is the same, but you can see we've pulled out just the points for that specific region of the chromosome.

The other kind of query we said we'd like to perform is looking at results for the same region across phenotypes — maybe you have GWAS results for chocolate ice cream preference and strawberry ice cream preference and you want to see how they differ. We can run the exact same query, but now I set the phenotype dimension to NULL, indicating I want all phenotypes back for just this region of chromosome 20. We update the selected ranges, re-run it, and create the same plot, but now faceting by the phenotype dimension — and you can see, for all six of the phenotypes that were ingested into this array, we've pulled out the variants for that same region between five and six million base pairs.
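For reference, those query styles reduce to something like the following; attribute names, the array path, and the phenotype label are taken from the demo and should be adjusted, and how an unconstrained dimension is expressed can vary a little between package versions.

library(tiledb)

arr <- tiledb_array("uk-biobank-gwas", as.data.frame = TRUE,
                    attrs = c("beta", "se", "pval"))              # placeholder attribute names

## R-style indexing: one phenotype, one chromosome; the unconstrained position comes back whole
res1 <- arr["water intake", "20"]

## Three-dimensional slice via selected_ranges(): add a position window
selected_ranges(arr) <- list(
  phenotype = cbind("water intake", "water intake"),
  chrom     = cbind("20", "20"),
  pos       = cbind(5e6, 6e6)
)
res2 <- arr[]

## Same region across all phenotypes: leave the phenotype dimension unconstrained
selected_ranges(arr) <- list(chrom = cbind("20", "20"), pos = cbind(5e6, 6e6))
res3 <- arr[]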
I'm going to pause there for a second to see if there are any questions we need to get to. "Why is there a gap in the middle of the Manhattan plot?" That's a good question — oh, did someone answer it? That's the centromere, exactly right. It's roughly the middle of the chromosome; if you've ever seen a picture, chromosomes look like bow ties with a weird circle in the middle. Those regions are full of highly repetitive sequences, there generally aren't a lot of gene-coding regions there, and they're also difficult to sequence because of all the repeats, so you usually just don't have data for that part of the chromosome. But good observation.

Can you take the one preceding it? That was too domain-specific for me. Sure: the question was, if a cell indexed by chromosome and chromosome position represents more than one variant, would the data for each variant be contained as multiple sets of attributes in that cell? Yes — if we're querying it as a data frame, you would just get back multiple rows for that cell, which would normally be a single row. This is a case that crops up pretty often for our VCF product: we have a product called TileDB-VCF for variant-call data, ingested directly from VCF files, and very often different variants share one position; when you query the array they just come back as multiple rows in your results.

Okay, if there's nothing else, I'll get to the last section — which I guess I don't have the slides for, unfortunately, but that's fine. If you did follow along and created an array on your local machine, we developed a small Shiny app that lets you explore the array. It's not in the slides, but you can install it yourself — here, I'll paste the link into Slack so everyone has it. If you install and load it and run the app, you should be able to paste in the local file path for the array you created — in my case it was this — then load the array, and if everything went okay you get a green box here indicating it's a valid array and it was opened. What you can do with this app, once the array is loaded, is query it interactively without having to worry about the R syntax for now: you specify your phenotype of interest and which chromosome you want to look at, hit submit query, and it performs essentially the same kinds of queries we were performing earlier. It will also generate plots for you.

We can also look at the statistics that TileDB optionally provides — you have to enable them; I think the function is tiledb_stats_enable in the R package. Performance is of course paramount to TileDB, so we track every part of the stack and provide pretty comprehensive information about how much time was spent in each place in the code. Most critically here we can see how long it took to read the array — that's the sum of the read times — and if anything feels slow you can go through this output to see whether anything looks off, so it's very useful for diagnosing performance issues should they crop up.
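If you want those numbers outside the app, the statistics collection can be toggled around any query, roughly as below; the exact print/dump helper has varied a bit across package versions, so this is a sketch.

library(tiledb)

tiledb_stats_enable()                                           # start collecting internal timers and counters
arr <- tiledb_array("uk-biobank-gwas", as.data.frame = TRUE)
res <- arr["water intake", "20"]
tiledb_stats_print()                                            # summary incl. read time (or tiledb_stats_dump(tempfile()))
tiledb_stats_disable()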
The last thing you can do here is generate snippets, so you can perform the same query directly in R or in Python — because of course this is a multi-language format. So this app is really just meant to help you verify that your local array was created successfully and to provide some hints for exploring it without worrying too much about the syntax.

We've been working with a toy version of the data set for demonstration purposes, but we also ingested a much larger swath of the full results into an array on S3 that everyone who attends this tutorial has access to, and if I reload the app it fills that in by default. So here it is — this is the URI for that array — and I can load this one, and now everything is happening remotely: this is roughly a 200-gigabyte array on S3. And — it would help if I ran my query first — we can perform the exact same queries we were performing with the smaller one, get the plots back and the results, move along and query a different region. Of course this is all remote, so the normal caveats apply: performance depends on the speed and reliability of your connection, but generally it's very good — it should only take a second or two at most to pull back tens of thousands of these variants.

And 200 gigabytes, while fairly large, is not massive; we have other genomic data sets on TileDB Cloud, for instance, that are multiple terabytes in size, specifically for VCF data. That's a case where we have customers using it in production, so it's been highly optimized — if you do any kind of sequencing or VCF work, I highly recommend checking out some of our notebooks and exploring those data sets.

That's all I have for the GWAS use case; let me see if there are any final questions about it. Yes, the Shiny app is available online — it's hosted on shinyapps.io; I have the lowest-tier paid account, so I get, I don't know, 25 or 50 hours or something like that, and if it shuts down it's because I ran out of hours, but you can install it and run it locally as well. Yes — someone was asking about the preprocessing for the data used in the analysis: this data comes from the UK Biobank, so it's really pretty homogeneous, since it's mostly people of British ancestry, but they have very detailed methods about correcting for population structure. Yeah, so that's all I have for the GWAS use case. Dirk, do you want to take it home?

I could — or are you still on the screen share? I can take it over again; I still have it here, but I sadly don't have the updated slides, so we lost our nice summary slide. All right, everybody's back to seeing my screen then, I guess. Thanks, Aaron, that was really nice. We were, of course — everybody who has ever given a presentation knows how that goes with deadlines — finalizing things right up to today and still shuffling files around, the PDF around, so I think the PDF may be current, but I may not have picked up the most recent version of all the scripts. If something's missing, just remind us with issue tickets at that repo and we'll wrap it all up — that's why the repo is set up the way it is, so you can just install the package and have everything. The idea really is to have all the script examples we showed, including the Shiny app, in the package; we'll go over that tomorrow and check. So, in summary:
So, in summary, what Aaron and I tried to show you, and convince you of, is that TileDB is an open-source, embeddable storage engine. The underlying C++ code is all MIT-licensed, will always remain open source, and can never be taken away from you. It provides the access layer with which you can store any type of data in this open format, and we believe strongly that it is a really feature-rich and capable approach. Native to the format is full cloud operation on the three leading providers: we work a lot with AWS, just as everybody else does, but it works the same way with Google Cloud Storage, which a few bioinformatics projects use, as well as with Azure, where we have a couple of clients in Microsoft-centric environments. The nice thing with all these cloud backends is the essentially limitless scalability, because it's no longer your machines, it's their machines; you can just fire them up. And you can bring all of that in-house in the same way. We illustrated time travel and encryption; all of that is provided on top of the generic C++ API that Python and other languages access too. The R package, also MIT-licensed and on CRAN, will never be closed source and will always be open source, to be used and extended by all of us. And there is already a Bioconductor package, TileDBArray it's called, without a hyphen in there, that uses it in the context of some of their higher-level data structures. We are aiming for both high-level offerings, the interfaces we have seen for data frames, TileDBArray and others, and lower-level access; we showed some short examples, and Aaron went over that with the GWAS example. It is also fully interoperable with DBI and Arrow, as illustrated. We really think that multi-dimensional sparse arrays can store just about any type of data for you: straight-up numeric data, VCF data, data frames with varying column types. We covered a few of those, including geospatial, financial, and wider social-study data, but the sky really is the limit (a tiny code sketch of the data-frame round trip appears below), and we'd love to be in touch and help, so just contact us; some of the contact information will come up on a slide in just a minute.

As for further resources: there was an excellent post that Stavros, our founder, wrote last fall. I had come across, basically, not a formal request for comment but a laundry list of nice requirements posted by an open machine-learning group, a bunch of (as I see it) mostly European folks, who were musing about what requirements a scientific open data structure should have. They went a particular way, but we wrote this blog post, addressed basically every single one of their requirements, and showed how each maps to TileDB. So it's a really nice, detailed, thorough, but not too long description of the capabilities of TileDB if you're coming from a scientific-data angle, which maps onto enough business cases as well. We've also got the websites: docs.tiledb.com illustrates everything, with lots and lots of sub-levels in the navigation on the left of that screenshot. We added the two main GitHub repos here; the R package has a pkgdown documentation site as well, linked from the repo. And you can talk to us by email at hello@tiledb.com, and a friendly colleague of ours, or we ourselves, will be back in touch.
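To make the data-frame point from the summary concrete, here is a minimal sketch, not code from the tutorial deck; the throwaway temporary URI is made up for illustration.

    library(tiledb)
    library(palmerpenguins)

    ## Write the penguins data frame into a TileDB array in a
    ## throwaway temporary directory.
    uri <- file.path(tempdir(), "penguins_demo")
    fromDataFrame(penguins, uri)

    ## Read it back as a data frame.
    chk <- tiledb_array(uri, as.data.frame = TRUE)[]
    head(chk)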
Beyond the web address and the docs that I showed, there is an online forum for feature requests and support; GitHub I showed; we are on Twitter; and there is also a community Slack that we set up a couple of weeks ago, which is just slowly getting going. It didn't quite go from zero to a hundred like the one here for useR!, when the other online forum didn't quite stand up to the charge of all the users, but if you hit that URL, or the one at the bottom of our homepage, you can join the Slack and ask any questions there. And we're happy and chipper and hiring and growing: when I came here we were still in single digits of employees, Aaron came a little later and it was the teens, and we're now in the 30s, and I'm sure that will continue. So if you're interested in this and have some skills to make the product even more awesome, just talk to us.

And with that, I'm amazed we're only about two minutes over, so that should just about be it. I believe Sydney and the organizers want us to stop really soon, because the overall time budget is a little limited, so we'll probably have to stop the recording in just a moment. I think that's just it, but we'll be monitoring the channel while useR! is still ongoing; that Slack may then die out, I'm not quite sure what's happening, but there's our community Slack and all the other efforts. You have the GitHub repo with the code that we looked at, and we may have missed one file update or two, so file an issue ticket and keep us honest. But otherwise, thanks for all the really excellent feedback and questions. It's been a pleasure, and I hope you found TileDB interesting.