The Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and PostgreSQL configuration at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org. Welcome to another session of the Vaccination Database Talks, first edition. We're excited today to have Hari Sudan S. He is a group engineering manager at Microsoft on the Azure Cosmos DB team. He's been at Microsoft since 2008. Prior to that, he worked on the SQL Server database engine. He did his master's degree at the University of Washington and his undergrad at BITS Pilani, which is a good school. As always, if you have any questions for Hari as he's giving the talk, feel free to unmute yourself, say who you are, and fire a question at him anytime; that way he's not talking to himself. And Hari, again, thank you so much for being here. The floor is yours. Go for it.

I really appreciate this opportunity to talk with people from CMU and those affiliated with it. I think Andy already provided my intro, so I'll dive right into it. We're going to talk about HTAP with Azure Cosmos DB — hybrid transactional/analytical processing. The problem statement we were after in this endeavor is that we saw a lot of customers trying to take data out of our OLTP transactional store, building pretty sophisticated ETL pipelines, et cetera, to insert the data into some other analytics system just for doing operational analytics. The cost and effort on the customer side for this was pretty high. And we said: why don't we try to solve this problem for customers and make it seamless? But the most important thing was: can we reduce the cost for customers in getting to this operational analytics? That summarizes the problem statement we're after here. It's a pretty simple problem statement, but the way we thought about it, it was a segue to solving more problems for customers and making their presence in Azure more rewarding.

In terms of the next set of goals: when we think analytics, we didn't want to invent a new language or anything at this point. We wanted to meet customers where they are, so we tried to make it interoperable with Spark and T-SQL, where a lot of mind share exists, in order to do analytics on top of our data. Some of the value prop of our Cosmos DB OLTP transactional store is that customers take advantage of the schema freedom we provide in terms of JSON documents — they don't have to do up-front schema management, and they can evolve application semantics at a pretty high rate. We wanted to preserve that in our pursuit of enabling analytics on top of the data. Apart from the schema freedom, we are globally distributed and elastic, which means customers get local-region data access in each of the regions where their applications are present, and they are able to increase storage and throughput without a lot of pre-planning, because we can do that seamlessly while staying online for workloads. Those experiences have to carry forward in this HTAP addition to the offering. Those were some of the first principles we had in designing this.
First and foremost, our customers take advantage of low latency — single-digit millisecond latency for read and write operations on the OLTP side — and we don't want to regress on that. This means the analytical queries' disk bandwidth, I/Os, and query computation, the CPU cost, et cetera — we wanted to keep all of that off the hot path of the engine so that our OLTP workloads are not affected in any way. This is just reiterating that we want to keep the same SLAs for OLTP that we have had before. A natural derivation is that we don't do any remote I/O, et cetera, in the OLTP transaction path because of this new addition to the offering; we wanted to keep the transaction path as lean as possible.

In terms of the feature set itself, we wanted to expose analytics with archival support. The reason for archival support is that customers usually want to address a larger span of data for the analytical view — like a two-year history and so on — while confining the OLTP side to a smaller working set that they actually update and modify. We wanted to enable that here, and that means we need a much cheaper storage medium in order to hold such a large amount of data for analytical queries. Time travel queries and snapshots, which go along with it, are some of the things we wanted to add to our roadmap as we planned this architecture. And going back to where we started: reduce the total cost of ownership of the operational analytics space while keeping the data within the ecosystem.

In terms of what we are really looking at: we have customers using the SQL API — our core offering — and the MongoDB API, and so on; we have other offerings too. Customers insert their data using these APIs into our transactional store. And we have a decoupled storage medium that keeps data in a format that is efficient for analytics, which is a column-store format. We keep the storage in sync, make sure freshness is maintained and the data is transactionally consistent, while being able to serve queries with snapshot isolation to all the analytical query runtimes on the right side. This isolation — the fact that query runtimes do not reach out to the transactional store during query execution — was one of the core things we wanted to achieve.

In this picture you have in front of you, the central box — how the storage is designed, how it works with all the high availability, geo-replication, and elasticity features, how things are kept in sync, and what the format is — those are the topics I'm going to go deep into in the subsequent slides. So that's the box I'm going to focus on.

In terms of the storage design: since we said it's going to be decoupled storage, and we want to reduce the amount of processing that happens in the OLTP database engine to maintain this decoupled storage, log-structured storage seemed like a pretty good choice. Writes are batched and amortized, and the number of I/Os we do is pretty bounded, so it doesn't add an unpredictable order of operations for persisting one batch of operations. So we did go with log-structured storage. It's append-only storage, as a lot of this audience would be aware, where replaces and deletes generate invalidations for the old versions of the records.
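To make the write-path shape concrete, here is a minimal Python sketch of log-structured, segmented batching — all names are illustrative assumptions, not Cosmos DB's actual code — where mutations only ever append, and old versions turn into invalidations:

```python
# Minimal sketch of log-structured, segmented batching. Writes accumulate in
# memory and are flushed as immutable segments; replaces and deletes never
# rewrite old segments, they only record invalidations for old row locations.
# Illustrative only -- not Cosmos DB's actual implementation.

class SegmentedLog:
    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.pending = []        # records awaiting the next flush
        self.segments = []       # immutable, durably written batches
        self.invalidations = []  # (segment_id, row_index) of dead rows

    def upsert(self, key, doc):
        self._invalidate(key)    # the old version (if any) becomes garbage
        self.pending.append((key, doc))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def delete(self, key):
        self._invalidate(key)    # a delete is just an invalidation, no rewrite

    def _invalidate(self, key):
        # Linear scan for clarity; the real system tracks the latest offset
        # per key (shown later) so no scan or remote lookup is needed.
        for seg_id, seg in enumerate(self.segments):
            for row_idx, (k, _) in enumerate(seg):
                if k == key:
                    self.invalidations.append((seg_id, row_idx))

    def flush(self):
        # One sequential write per batch: the I/O count per flush is fixed
        # and amortized, regardless of how many operations the batch holds.
        self.segments.append(tuple(self.pending))
        self.pending = []
```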
Deletes generate tombstone markers and so on, and the writes are batched and done in the background. The unit of write — as I'll refer to again and again in the subsequent slides — is what we call a segment: the set of writes we have batched into one file and made durable. You can think of this as a segmented storage system, if any of you have come across such a thing before.

Now, the introduction of our format. We started this journey from scratch; we didn't have a format earlier. And we asked: should we introduce a new format of our own making, which may lack interoperability with open source capabilities, readers, and ecosystem tooling support? We decided on starting with Parquet as the base format. But Parquet, like any columnar format, needs work on top to reflect updates and deletes in an efficient way, and a table format on top is also required to make sense of the entire gamut of files present in a folder. We chose to keep it simple with at least the base format, so that the unit of read could be easily readable with the several open source libraries that exist — which I think quite a few of you might have had to deal with. It's a columnar format where a bunch of rows are grouped together before coalescing into columnar chunks; the set of rows grouped together is called a row group, and each column chunk has a set of pages, and so on and so forth.

There are two important points I want to highlight about the format in our context. The first is that Parquet requires a fixed schema before writing data, which means that, us being a schema-free document data model, we need to do schema inference. So we implemented a schema inference component that computes the union schema — there's upcasting and a lot of that — and converts it into a Parquet schema. There are some gotchas with incompatible data type conversions and so on, but parking that aside, that's the mechanism we use to derive a schema from data that was schema-free to begin with.

One of the most important things with JSON in general is that it supports nested structures — nested objects and so on, with any number of levels of nesting — and a lot of fields are optional in nature in JSON; array items can repeat zero or more times, and so on and so forth. So we need to persist the nesting and everything in a way that is captured with full fidelity, such that when you reconstruct the JSON row from the Parquet data, it is identical to the way it was sent to us. That involves quite some encoding, and we again chose to do something that is commonly done, which is Dremel encoding — this is a picture from the Dremel paper. This is where we encode the definition and repetition levels in order to disambiguate whether optional fields are present or not, how many times elements have repeated, and so on.
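Backing up to the schema inference step for a second, a hedged sketch of union-schema inference with upcasting might look like the following — the flattening and upcast rules here are assumptions for illustration, not the production rules:

```python
# Union-schema inference over schema-free JSON documents, with upcasting for
# compatible type conflicts (e.g. int -> float). Parquet needs a fixed schema
# up front, so something like this has to run before writing.

UPCAST = {("int", "float"): "float", ("float", "int"): "float"}

def doc_schema(doc, prefix=""):
    """Flatten one JSON document into {dotted.path: type_name}."""
    schema = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            schema.update(doc_schema(value, path + "."))
        else:
            schema[path] = type(value).__name__
    return schema

def union_schema(docs):
    """Union the schemas of all documents; upcast on compatible conflicts."""
    merged = {}
    for doc in docs:
        for path, t in doc_schema(doc).items():
            prev = merged.get(path)
            if prev is None or prev == t:
                merged[path] = t
            else:
                merged[path] = UPCAST.get((prev, t), "string")  # last resort
    return merged

docs = [{"id": 1, "price": 10, "tags": ["a"]},
        {"id": 2, "price": 10.5, "meta": {"source": "api"}}]
print(union_schema(docs))
# {'id': 'int', 'price': 'float', 'tags': 'list', 'meta.source': 'str'}
```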
This encoding also has wide adoption in the Parquet libraries and the Spark ecosystem, so that's something we adopted. Our backend is done in C++, so we built it on top of the C++ library, but the reason we adopted this encoding is that quite a few readers in the OSS ecosystem already do the decoding pretty well.

Summing up whatever we spoke about in bits and pieces, coming back to the big picture a little: we have OLTP transaction traffic coming to the Cosmos DB shards, and each of these shards maintains its own analytical storage footprint in a decoupled storage. Any analytical runtime — say Spark or T-SQL — would just talk to the decoupled storage, contain the I/Os to just that medium, and be able to serve analytical queries. That sums up the big picture before I go further in depth into the log-structured storage itself. What does INV stand for? I'm going to come to that, Andy, in a bit — those are what we call invalidation files, and I'll be deep diving into them.

Alright, going a bit deeper into the next level of the log-structured storage. Let's say we have writes, and we have persisted the first set of 100 records into segment one, the next set of 90 records into segment two, the next set of 80 records into segment three, the next set of 90 records into segment four, et cetera. The important thing to note here is that even though we said append-only storage, the subsequent blocks are not necessarily physically collocated with each other: 1.parquet is written to storage stamp one, 2.parquet is written to storage stamp two, et cetera. We did this deliberately in order to spread the bandwidth, because since we are targeting a cheap storage medium for this, the bandwidth constraints might be a little harsh. We wanted to spread the writes across several such stamps to stay within the bandwidth limits, keep the cost low, and keep the latencies for writes showing up in analytical storage low.

Which means that in order to logically stitch things together into a consistent view of the database at any point in time, we need something like a root segment — that's what we call it; in other database systems you would have heard of checkpoint files, manifest files, et cetera. Every segment ID has a corresponding physical URI; that's how they are tied together, and these physical URIs — 1.uri, 2.uri — could actually be in different data centers within the same region. That enables us to be resilient to capacity constraints within any one particular data center, and so on.

What I'm showing here is a very small miniature view of the root segment — there's much more complicated stuff in it, but this conveys the essence. In terms of the statistics, we have things like the total number of records and the number of invalid records at any point, and so on and so forth. And that is one number that keeps changing: if I appended some keys into segment one, and I have replaced or deleted them in segments two and three, then the number of invalidations in segment one is going to keep going up. If a file becomes completely invalid, our query runtimes will ignore it.
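The shape of such a root-segment entry, going by the description above, could be sketched like this — field names are assumptions for illustration, not the real manifest format:

```python
# Illustrative shape of a root-segment (manifest/checkpoint) entry, based on
# the talk's description. Field names are assumptions, not the real format.

from dataclasses import dataclass, field

@dataclass
class SegmentEntry:
    segment_id: int
    physical_uri: str        # e.g. "1.parquet" on some storage stamp
    total_records: int
    invalid_records: int     # grows as later segments replace/delete rows

    @property
    def fully_invalid(self):
        return self.invalid_records >= self.total_records

@dataclass
class RootSegment:
    entries: list = field(default_factory=list)
    last_durable_segment_id: int = 0
    last_durable_transaction_id: int = 0  # snapshot marker for new queries

    def visible_segments(self):
        # Query runtimes skip files that are entirely invalidated.
        return [e for e in self.entries if not e.fully_invalid]
```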
If it becomes more than 50% invalid, our garbage collection compaction will pick it up, and so on and so forth. The other column we have here is the transaction ID — the OLTP transaction ID these records were associated with — and this enables us to provide the snapshot isolation marker for any queries that start on the query runtimes. We also have a property bag, which holds things like the last durable segment ID, the last durable transaction, et cetera. These are the markers that a query starting right now holds on to, and they provide a version of the database that is consistent as of that particular transaction ID. This also enables us to keep persisting new writes and keep mutating the root segment with new updates, while a query that already started continues to see the earlier version of the data.

Now I'm going to go a bit deeper into the invalidations and the INV files I've been showing, to understand them better. There is a concept shown here called the log offset. It's a logical offset: it refers to the logical location where the latest version of a key is currently present in the decoupled storage medium. It's a triplet. What (1, 0, 2019) says is that the latest version of this key is currently present in segment one, row group zero, at row index 2019. The reason we maintain this is that when there are subsequent updates — replaces or deletes — we are able to generate invalidations for the old locations without having to look anything up in the decoupled storage. The invalidations we generate are persisted in the INV files; I have something to show for that on the next slide. These invalidation offsets are used by the query runtimes, pushed down as a filter predicate, to consider those rows invalid — so the runtime won't even materialize those rows; it invalidates them before materialization.

Okay, how do we actually generate these invalidations? Just an animation to show that. Let's say this is the content of the document table as of this point, and we have two documents with these keys. Say doc one is getting a new version — a replace operation; we are replacing it with version two. The current file accepting writes for the log-structured batched segment write is segment three, and it's going to allocate a new offset in three for the new version of doc one. However, it's also going to add an invalidation for the current offset of version one of doc one. Once we allocate the new offset, it is written back, and the transaction is committed with the new log offset.

You can imagine the crash recovery scenarios this opens up, where the in-memory writes in the segment-three file are not yet persisted or remote-checkpointed. So there is an operation log behind this document table — one I haven't explicitly shown on the slide — that serves as a write-ahead log for the analytical storage, so that we can recover all the versions without loss even if there is a crash. And we wait until a good number of writes have accumulated into segment three — I have shown just one here.
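A small sketch of that flow — the latest-offset map and the triplet from the slide, with illustrative names:

```python
# Replacing a key allocates a new offset in the currently open segment and
# emits an invalidation for the old offset, with no lookup into the remote
# columnar store. Names are illustrative, not Cosmos DB's actual API.

from collections import namedtuple

LogOffset = namedtuple("LogOffset", ["segment", "row_group", "row_index"])

latest_offset = {}          # key -> LogOffset of its newest version (hot path)
pending_invalidations = []  # flushed later into .inv files

def replace(key, open_segment, next_row_index):
    old = latest_offset.get(key)
    if old is not None:
        # The old location becomes garbage; queries will filter it out.
        pending_invalidations.append(old)
    new = LogOffset(open_segment, 0, next_row_index)
    latest_offset[key] = new
    return new

replace("doc-1", open_segment=1, next_row_index=2019)  # v1 lands at (1,0,2019)
replace("doc-1", open_segment=3, next_row_index=0)     # v2 invalidates (1,0,2019)
print(pending_invalidations)  # [LogOffset(segment=1, row_group=0, row_index=2019)]
```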
The segment then gets flushed to remote — and the act of flushing things to remote doesn't make them immediately visible to queries. It's only the next step, checkpointing, that actually makes them visible in the root segment. If you look at the difference between the previous root segment and this one, the difference is that segment three is now valid with a bunch of records in it — a hundred records. And not just that: segment one's invalidation count went up by one, from 10 to 11, as part of committing this. So either the cause and the effect are both visible — the new version of the document that got inserted is present in three, and the invalidation count on one goes up — or neither is. That's the atomicity here.

What's the average size of your Parquet files in Cosmos? So, there's a trade-off there between staleness and size, and I have a slide on how we balance both. We try to achieve around 300 to 400 MB per Parquet file at a minimum, so that query performance doesn't suffer due to too many small files.

In terms of the invalidation design itself, one thing we had in mind is that we don't want to do too many I/Os on the frontline path, in order to keep the pressure on the frontline transaction path low. Even though we are not doing remote I/O on the frontline transaction path, we still want to keep the latency for changes to reflect in the remote analytical storage low. One of the principles we started with was that one Parquet data file flush should involve just one invalidation file flush. One data file flush can accommodate invalidations for many past segments, including itself, but that shouldn't end up with us doing as many I/Os as the number of segments touched in order to persist those invalidations. We set that in order to keep the pressure on the storage system low. Such a design should also allow for snapshot isolation of queries: since we are not changing any data — no in-place updates — anything that started off a previous version should continue to work. And queries and ingestion should be completely decoupled, such that queries don't get blocked on writes that are trying to checkpoint, and vice versa; one doesn't hold up processing on the other.

To accommodate these requirements, the invalidation stream itself is a log stream. And the invalidation files, though they're called out as INV, are only semantically different — physically, they are just Parquet files themselves. Each one is nothing but an array of offsets, along with some metadata about which transaction made them invalid. And since we sort the offsets before writing them out, we get run-length encoding and quite a bit of compression out of the Parquet format. In terms of how we merge: if too many invalidation files build up, we use a merge strategy to restrict the number of I/Os that happen on the query side. Every time we merge, we look at the resulting file size, and if it is beyond a threshold we promote it to the next level, et cetera — something similar to the log-structured merge strategies you would have seen before.
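A hedged sketch of that leveled merge — the size/count thresholds and the heap-based k-way merge are illustrative choices, not the production policy:

```python
# Leveled merging of invalidation files: merging sorted offset lists keeps
# each level sorted (which compresses well under RLE once written back as
# Parquet) and bounds the number of INV files a query must consult.

import heapq

def merge_inv_files(sorted_offset_lists):
    """Merge sorted lists of (segment, row_group, row_index) triplets."""
    return list(heapq.merge(*sorted_offset_lists))

def compact_level(levels, level, max_files_per_level=4):
    """If a level holds too many files, merge them and push the result down."""
    if len(levels.get(level, [])) > max_files_per_level:
        merged = merge_inv_files(levels[level])
        levels[level] = []
        levels.setdefault(level + 1, []).append(merged)

levels = {0: [[(1, 0, 5), (1, 0, 9)], [(1, 0, 2019), (2, 0, 3)],
              [(2, 0, 7)], [(3, 0, 1)], [(1, 0, 42)]]}
compact_level(levels, level=0)
print(levels[1][0][:3])  # [(1, 0, 5), (1, 0, 9), (1, 0, 42)]
```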
Each of these files has the invalidated offsets sorted, again, so every merge gives us better compression. And this is what the invalidation file itself looks like: the first three columns are nothing but the log offset triplet, and the rest of the columns are metadata about what made it invalid — the timestamp and the transaction. The final column is the operation type that made it invalid; it's used in some pretty neat features, but it's not so important here. So that's the invalidation format.

Summing it all up, putting the bits and pieces together once again: we get a bunch of writes, we allocate offsets — that's part of the user transaction commit — and we write that to the table. Everything from that point onwards happens asynchronously, without blocking the transactions: schema inference, then Parquet and invalidation generation, flushing the files, and then checkpointing to make it all visible.

Now, the freshness optimizations — the ones Andy alluded to earlier. How do we ensure good freshness between the time something is processed in the transactional store and the time it is reflected in the analytical store? The more we compress that window to provide better latencies, the more it comes at the cost of generating smaller Parquet files, because not enough writes have accumulated to create a larger file. So what we do is let the last segment have several incremental flushes with several smaller files, but we don't finalize the segment until we come back and redo it with larger files — we do intermediate merges for the last segment. If there are too many trickling writes and the write rate is not high enough, this takes care that the number of files we create over the lifetime isn't too many. We keep doing intermediate merges on the last segment until enough data has accumulated that one final flush creates a good size, and then we move on to the next segment.

As in any log-structured system, we have to do garbage collection, because garbage is generated as older versions accumulate. Whenever validity drops below a particular threshold in a particular segment — more than 50%, 40% invalid, et cetera — we pick those segments up for relocation. Relocation means we relocate the valid data in the segment so that the entire segment can be declared invalid, and we interleave it with the incoming writes, so the relocated data lands in the data flow while invalidating its old locations. That's one level of GC that we do. There's another compaction we do along with a different partitioning strategy, which I'll touch on a little in the last slide.

Okay, so I think that summarizes the local storage — one unit of storage design. The rest of the slides are going to talk about how this decoupled analytical storage deals with the high availability and elasticity features that Cosmos DB provides.
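Before moving on to distribution, here's how the read-path pieces fit together in miniature — a hedged sketch where the merged invalidation set is applied as a skip-filter before rows are materialized; the row-iteration API is an assumption:

```python
# Sketch of how a query runtime might apply invalidation offsets as a
# pushed-down filter: rows whose (segment, row_group, row_index) appears
# in the merged invalidation set are skipped before materialization.

def scan_segment(segment_id, row_groups, invalid_offsets):
    """Yield only live rows of one segment; `row_groups` is a list of lists."""
    for rg_index, row_group in enumerate(row_groups):
        for row_index, row in enumerate(row_group):
            if (segment_id, rg_index, row_index) not in invalid_offsets:
                yield row

invalid = {(1, 0, 2019)}                         # from the merged .inv files
row_groups = [[{"id": "doc-1", "v": 1}] * 2020]  # segment 1, one row group
live = list(scan_segment(1, row_groups, invalid))
print(len(live))  # 2019 of the 2020 rows survive the filter
```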
Cosmos DB today provides geo-replication — we call it global distribution — and it enables customers to have data in each region of their choice. We have an internal state machine for geo-replication, just like we have an internal state machine for local high availability within a region, and we currently geo-replicate to every region that the customer configures, so there is a copy of data in every region the customer wants. Even for the analytical storage, we wanted to retain that: the analytical storage should be in every region where the customer has the transactional storage provisioned, so that they don't see any semantic differences — if they compare their OLTP views and analytical views in the same region, it should all add up. And any geo-failover, where we switch the primary, or leader, status from one region to another, should transfer the ownership status of the analytical storage as well.

One of the core things we did here: we used to do logical, record-level geo-replication, and we continue to do that. But say a customer has a petabyte of data present in one region and chooses to add another region. At that point, doing record-level replication — reconstructing Parquet on the other side, et cetera — is pretty expensive in terms of CPU and memory. So we do a physical copy of the files, and we interleave it with logical replication of the steady-state data, in order to seed the new region with the existing copy. And when configuring geo-replication, all the records will have the same log offset in each region, though they will point to a different physical file in each region, because we don't share data across regions. As this picture summarizes, we do a physical copy of the files while interleaving it with logical replication, and all of it is resumable, to make sure that any primary failovers we have in the database are able to resume the copies.

Maybe I should have asked this at the beginning — is this something the customer always gets, or do they have to turn it on? Yeah, they have to turn this on; it's also a cost for them to have it present, so it's their decision. Sorry, I don't mean geo-replication — anything that's generating these Parquet files: do you have to turn this feature on, or is Cosmos always doing this for people? No, that they also have to turn on, because they have a larger involvement in the analytical query runtime part of the story. So it's again their choice to have a copy of the data that's more optimal for this. Thanks.

Partition splits. Cosmos DB's elasticity story is such that we are able to scale to any amount of throughput and storage by scaling out. This happens seamlessly for customers — their workloads continue to be online while we do the partition splits. When we do a split, one parent partition gives birth to two or more children — it's usually two right now — and the parent partition dies after the split process is completed, which divides the logical key range of data between the two children.
It's an activity that might impact availability if it goes on for too long, because after a point the parent node's capacity will run out, so there would have to be a write pause on the system if the split takes too long. So what do we do with the analytical storage when partition splits happen? Each namespace in the decoupled storage was owned by a partition, and now that partition is going to die. But can the data die? Probably not, because it holds archival data that is present only there, so it has to continue to be referenced as valid data in the container.

To keep splits happening at a good rate, we ship only the active snapshot on the OLTP side when doing the splits. The parent partition's namespace retains the archive storage, and we keep the partition's mutation history in order to make sure we can reach all the archive storage through the current children — so there are ownership transfers in order for the children to be able to reach the parent namespace. Pictorially, you can think of the partition at the top getting split into two. We are not copying the physical files of the parent, because the data is going to get divided between two partitions, so we cannot just copy it as-is. Instead of replicating, we copy only the live data present in the OLTP store and let the children generate new files, honoring the log offsets that existed in the parent.

And not just that. When a parent dies — for example, in this depiction P0 dies by giving birth to P1 and P2, and P1 dies giving birth to P3, P4, and P5 — when P0 dies, it transfers ownership of its archive storage to P1, and when P1 dies, it transfers its own archive storage to one of its children, that is P5, and it transfers its parent P0's archive storage ownership to P3. That way, every live child partition of a container at any point has a referential ownership of at most one of the parents in its ancestry, and the query runtime is able to reach the archive storage of the entire container. And when somebody adds a new region at this point in time — to complicate things — the partitions that own archive storage also own copying their own data as well as their archive storage to the other region, in order to continue to provide the same guarantees we have spoken about so far. I'm not covering a few topics related to multi-write — we made it look a bit too simple on that — and cluster migrations and so on, but that summarizes the technical deep dive I was trying to go into.
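To make that ownership handoff concrete, a small sketch — which child inherits which ancestor namespace is an illustrative policy here; the talk's example hands P1's own archive to P5 and P0's to P3:

```python
# Sketch of archive-namespace ownership across splits: when a parent dies,
# its own archive namespace plus any ancestor namespace it was holding get
# handed to its children, so every ancestor's archive stays reachable from
# exactly one live partition. Round-robin assignment is illustrative only.

archive_owner = {}  # live partition -> ancestor namespaces it references

def split(parent, children):
    inherited = [parent] + archive_owner.pop(parent, [])
    for i, namespace in enumerate(inherited):
        child = children[i % len(children)]
        archive_owner.setdefault(child, []).append(namespace)
    for child in children:
        archive_owner.setdefault(child, [])  # children with no inheritance

split("P0", ["P1", "P2"])        # P1 inherits P0's archive namespace
split("P1", ["P3", "P4", "P5"])  # P1's own -> P3, P0's -> P4 (policy varies)
print(archive_owner)             # {'P2': [], 'P3': ['P1'], 'P4': ['P0'], 'P5': []}
```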
Stepping back: why did we do all this? This architecture is something we are using for other features as well, like point-in-time restore and a few others outside of HTAP. But for the HTAP scenario, we were able to compare the total cost of ownership between someone adopting our new solution versus the traditional way of doing ETL pipelines into a traditional analytics system. I'm not going to go into every detail here, but this is the kind of TCO comparison we have against a popular system. I'll just give it a minute on the slide — probably less than that.

What is the bronze layer? Okay, that's from Delta Lake's medallion architecture, where all the ingestion — the raw layer — comes into the bronze layer, and a lot of the ETL happens there; then it needs to be merged into existing data on the same keys, et cetera, which happens in the silver layer, and so on. So there are complicated stories there, and the processing power to do it and the end-to-end latency of reflecting the updates just go up. So it's a Delta Lake term? Correct. Okay, cool.

In terms of customer feedback, just sharing a couple that came to mind: these are customers who have really used our solution to their cost advantage and are able to run it at great scale as well.

So what's on the roadmap? Giving a different partition key to the analytical storage is something that has come up quite a bit in our customer conversations; we have that in preview right now, and it's an important win as well. We have something else in progress, which is to add metadata indexes that allow queries to prune out a lot of files based on the properties in the filter predicates and the projections. We also want to give customers a way to specify a reduced column set for persistence in the analytical storage, rather than everything, which is what we have right now. And then there's exposing as-of-timestamp queries: since we have all the versions, we should be able to travel back in time, so we do want to expose that in the query layer and complete the story. Those are some things that come to mind in terms of customer-facing differentiated offerings; in terms of the internal architecture, we have quite a few improvements lined up that are going to benefit other features built on top of this as well.
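For the as-of-timestamp item, the storage already retains every version, so conceptually a query pins an older snapshot marker and ignores anything committed after it — a hedged sketch of the idea (this was still roadmap at the time of the talk):

```python
# An "AS OF" query, conceptually: pick, per key, the newest version whose
# commit transaction ID is at or before the requested marker, and ignore
# invalidations committed after it. Purely illustrative.

def rows_as_of(versions, txn_id):
    """versions: key -> [(commit_txn, doc), ...] in commit order."""
    snapshot = {}
    for key, history in versions.items():
        visible = [doc for txn, doc in history if txn <= txn_id]
        if visible:
            snapshot[key] = visible[-1]  # newest version at or before txn_id
    return snapshot

versions = {"doc-1": [(10, {"v": 1}), (25, {"v": 2})]}
print(rows_as_of(versions, 20))  # {'doc-1': {'v': 1}}
print(rows_as_of(versions, 30))  # {'doc-1': {'v': 2}}
```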
I realize the reduced column set is a future feature, but what percentage of the columns in the row store do people actually want to include in the column store? The feature is basically saying: only shred out a subset of the columns in the table. Do you have an idea of roughly what percentage of the columns people would want to include? Can you see, from the analytics of the queries actually running, whether most queries only touch 10% of the columns in the table, 20%, and so forth? Got it. Yeah, I don't have those statistics with me right now, but it's something we can definitely observe in our query runtime telemetry. I'm guessing it's going to come down to around 40 to 50% — that's pretty much the set of columns they are interested in. Because we are in the JSON document model, a part of the schema is more or less fixed — that's the bread and butter of the application semantics — and a part of the schema is more evolving and very sparse. It's likely the fields that are present in every row that they can do very meaningful analytics on, so I do believe it's going to be around that percentage mark.

In terms of learnings, we have quite a few, but synthesizing the top two. If you look at this architecture, the OLTP store is also responsible for the analytical storage, which lives in a decoupled medium, so there is an availability dependency: we need to do recovery, et cetera, and as we increase the Parquet file size, the amount of recovery we have to do also increases when the OLTP primary fails over. We innovated quite a bit here in order to reduce the impact of this recovery on the OLTP store's availability. When a primary swap happens due to load balancing or node failures, we want it to happen fast and not be stuck behind recovery of this remote storage. So we built quite a few features in this space, such as asynchronous recovery, in order to keep the recovery time attributable to analytical storage of constant order. We also did things like treating the incremental file flushes as durable for the purpose of recovery, in order to keep optimizing on that.

Another thing in general: greenfield development, as much as it's good for going to market and gathering customer traction, is quite challenging — especially in a PaaS service where we own the data — in terms of the upgrade and servicing stories. We have to continue supporting the customers who are on all the different formats from the different iterations of our offering. Quite a bit of effort goes into planning those updates in a non-destructive and impactful way; that's something we continue to learn from and optimize, but it does take quite a bit of time.

And last but not least — I'm almost at the end — the team is hiring a lot. This is just one of the fish to fry; we have much bigger fish to fry in terms of running a globally distributed database service: a lot of latency-reduction problems in doing geo-failovers, et cetera, and we are also embarking on distributed SQL. A lot of great problems to solve. Happy to hear from the bright minds at CMU and have them guide us on where we should go and how we should plan for what's next. Alright, thank you. Thanks for your time. I'm open to questions.

Most people here on the call actually aren't from Carnegie Mellon, so there are other awesome people beyond CMU — although we are awesome. Okay, I open the floor to the audience: if anybody has any questions, feel free to unmute yourself and fire away at Hari. Otherwise, I will ask my questions and you all suffer.

It's interesting that you're making heavy use of Parquet for almost everything. Since you're running Parquet at massive scale, what aspects of Parquet do you find frustrating or limiting that you would actually want to improve? If you could change the Parquet spec, what would you change? Do you think it could have a better compression ratio, or better computational efficiency for processing the files?
What's the main limitation you see with Parquet? So I would say two. One is the support for schema evolution. For example, in our schema-free JSON model, a column can be an integer on one day but a string on another day — it's rare, and it doesn't happen that often in a carefully designed workload, but it happens. In those cases, as far as I know, Parquet doesn't support the same column name having two different data types. And that is quite important for fidelity, because we are trying to push the data cleaning aspect to customers at the query layer — they can better make sense of why it would have happened — so it's important to reflect it correctly to the customer side so they are able to process it. But right now we are not able to reflect it well unless we provide another option, called full fidelity schema inference, for customers to opt in to. And that makes it pretty hard to write queries, because you would have to say SELECT column_a_int FROM table and SELECT column_a_string FROM table and whatnot, so it's not very easy to design for. That's one thing. The second thing would be some level of support for mutating a file once it's written — what would it take to, say, add some metadata into a property bag, et cetera? Right now, once it is written, that's it, and that makes things a little harder. Those are the two things I would say.

That's the big problem the Apache Hudi folks, or Onehouse, have, right — you have to maintain the delta for the updates and then merge it later. Right. Yeah.

All right, a question from the virtual audience. Yeah, two quick questions. Number one is the grooming from OLTP into your analytical store: what is the time interval you do that at, and how do you tune it — is it in seconds, minutes? We try to do that every 30 seconds now; that should be the case. We try to force a flush with whatever records have come in in the last 30 seconds, and that generates an incremental file. And when a good number of records have accumulated for the current segment, we finalize it into one larger file later. And the second part is: do you support time travel queries? Currently, our storage is able to support it, but we haven't worked on exposing that in the query layer yet; that query surface area has to be hooked up. Cool, thank you. Very nice work. Thanks. Anybody else?

When did Cosmos DB get proper SQL? Is it recent, or was it a few years ago? T-SQL, you mean? Yeah. That was only made possible through this, because we have our own version of SQL querying — not exactly T-SQL, but it is based on ANSI SQL. It does provide a good amount of feature coverage, but that is for querying our OLTP data. Okay.
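Backing up to the full fidelity schema option from that Parquet answer: the idea is that each (property, type) pair becomes its own type-suffixed column, so no data is dropped — a tiny sketch, with Python type names standing in for the real suffixes:

```python
# "Full fidelity" workaround for type conflicts: when the same JSON property
# arrives with different types, each (name, type) pair becomes its own
# type-suffixed column, at the cost of more awkward queries (SELECT a_int,
# a_string, ...). Suffix naming here is illustrative.

def full_fidelity_columns(docs):
    columns = {}
    for doc in docs:
        for key, value in doc.items():
            col = f"{key}_{type(value).__name__}"  # e.g. a_int, a_str
            columns.setdefault(col, []).append(value)
    return columns

docs = [{"a": 1}, {"a": "one"}, {"a": 2}]
print(full_fidelity_columns(docs))
# {'a_int': [1, 2], 'a_str': ['one']}
```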
I guess my question is more about JSON data — of course, you see a lot of it, so you can characterize it. How regular is it? You mentioned there's the core data, where you have certain keys or attributes in the documents, and then there's sort of random stuff. And you said sometimes people make mistakes — random data where 99% of the time a field is an integer, but someone throws a string in there. How common do you see this in real applications? It's not common in production-quality workloads. It's more like development-time workloads. Yeah, so it's not as common as we thought.