The Carnegie Mellon Quarantine Database talks are made possible by the Stephen Moy Foundation for Keeping It Real and by contributions from viewers like you. Thank you. All right guys, thanks for coming to the final talk this semester in the Quarantine Database series. Today we have Shawn Ma, who's the Senior Technical Director at PingCAP. As always, if you have any questions, please unmute yourself, say who you are and where you're coming from, and feel free to interrupt at any time. And again, we want to thank the Stephen Moy Foundation for Keeping It Real for sponsoring us this semester. And we want to thank Shawn for doing this, because it's 5 a.m. in Shanghai, where he's at. We appreciate him getting up this early to talk to us about his database. So with that, Shawn, go for it. The floor is yours. Hey, hello everyone. So today I'm going to talk about TiDB and its long journey to HTAP. A little bit about myself: I'm Shawn Ma. I'm the Product Manager and Tech Lead of the Real-Time Analytics team at PingCAP. Previously, I was Tech Lead of the Big Data Infra team at NetEase Research. My main interests are distributed systems and query engine construction. So TiDB has a lot of stories to tell, but this time they will be told from the HTAP perspective. From the very beginning, we had a slogan describing our HTAP ambition. We said we are 100% of transactional processing, but 80% of analytical processing. That was our slogan three or four years ago. But it was very confusing, because people asked us: why is it 80% and not, say, 79%? We couldn't answer, because 80% was just a number that came to our minds. We just meant that it's a little bit less powerful than our TP design. So it's 80%, but it's just a number, no real meaning. So we changed the slogan to something else: we said it's an HTAP database instead of "80% of AP."
So HTAP was a fancy name back then, and we thought it could cover the abilities of our database, so we used the term. But it was a very, very long road after we adopted the HTAP label, because we needed to make it a real HTAP database. So a little background on TiDB itself. TiDB, in its ancient times, was inspired by Google Spanner; as we always say, Google Spanner built a nice architecture, and we were inspired by the Spanner paper. As of the pre-GA versions, around 0.7, TiDB was a scalable database, built to be MySQL compatible because we wanted to attract users from the MySQL world. It has transparent sharding, and the sharding itself is range sharding. It was also built for strong consistency and transactional support, regardless of the sharding underneath. So basically, from a user's perspective, you can consider TiDB to be a very, very large MySQL. TiDB actually has two layers. One layer is the TiDB server, which you can consider the database front end: the computation layer of TiDB. It's basically a stateless SQL layer, and back then it would not shuffle any data between servers during a query. The TiDB server itself is a very traditional design, because the upper half of a database, even in the NewSQL world, is not that different: you take SQL in, build an AST, produce a logical plan, run optimizations, those kinds of things. Nothing very special. So it's a full-featured SQL layer, and we built almost everything ourselves. For the optimizer, currently it's RBO plus CBO, not a purely Cascades-style design. We have secondary index support and online DDL. The more complicated things happen on the storage side, which has two major components: the distributed storage, TiKV, and the Placement Driver.
The Placement Driver itself you can consider as something like a cluster scheduler plus a timestamp oracle, because the transactional model we are using is MVCC-based, so we need a timestamp oracle; that will be touched on later. The storage layer is a lot of different TiKV nodes. When a user reads or writes data, the client needs to contact the Placement Driver first to locate which TiKV node the data lives on, and then it operates directly on that TiKV. The client always keeps a metadata cache of the placement information, so it does not go back to the Placement Driver every time. The Placement Driver also embeds its own scheduling system: it balances the storage based on capacity or on workload pressure. Physically, a TiKV node looks like this. At the very top is gRPC, the open API towards the outside world. The layer below is the transaction layer, with MVCC. These two layers sit on top of the Multi-Raft data replication layer, which is responsible for replicating each piece of data to different nodes, guaranteeing that we have multiple copies and a consistent view across those copies. At the very bottom is RocksDB. We embedded RocksDB because it's a perfect choice if you want a server-grade key-value store as a library rather than as another system. The whole KV side is built in Rust, while the TiDB server we just introduced and the Placement Driver are built in Go. So this is the internal, logical view of TiKV. The data is cut into pieces by primary key: you can consider it a globally sorted data range in primary key order, and that key range is cut into different pieces. Each of the pieces is something we call a region.
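The read/write path just described, where a client asks PD to locate a region and then caches the placement, can be sketched like this. This is a toy model in Python, not TiKV's client code; all class and field names are illustrative.

```python
# Toy sketch of client-side region routing: the client asks the Placement
# Driver (PD) which node holds the region covering a key, then caches that
# placement so later requests for the same range skip PD entirely.

class PlacementDriver:
    """Keeps regions as (start_key, end_key, node) triples."""
    def __init__(self, regions):
        self.regions = sorted(regions)
        self.calls = 0                   # count of PD round trips

    def locate(self, key):
        self.calls += 1
        for start, end, node in self.regions:
            if start <= key < end:
                return start, end, node
        raise KeyError(key)

class Client:
    def __init__(self, pd):
        self.pd = pd
        self.cache = []                  # cached (start, end, node) triples

    def node_for(self, key):
        for start, end, node in self.cache:
            if start <= key < end:       # cache hit: no PD round trip
                return node
        region = self.pd.locate(key)     # cache miss: ask PD once
        self.cache.append(region)
        return region[2]

pd = PlacementDriver([("a", "m", "tikv-1"), ("m", "z", "tikv-2")])
c = Client(pd)
c.node_for("apple"); c.node_for("banana"); c.node_for("melon")
# only two PD round trips: "banana" was served from the metadata cache
```

A real client also has to invalidate cache entries when a region splits or moves, which is omitted here.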
So that's basically the sharding in this distributed database, and it's a range-sharding strategy. Every time you write new data into a region, the region might get bigger, and eventually it might split. Every table is cut into pieces based on the primary key, and those regions are replicated by a Raft replication group. The replicas of each region are physically stored on different TiKV nodes, to guarantee three physical replicas of every piece of data. Region split and merge are very important for rebalancing, because when we rebalance, we need to guarantee that each region being transferred is not too big. Consider a single table that starts empty: after a lot of data has been inserted, if the region never split, it might become huge, say a terabyte or more. At that size, transferring the region from one node to another would be very heavy and might cause a fierce performance downgrade. That's why we keep each region at roughly 100 megabytes. After inserting some new data, a region that reaches the threshold splits into two new regions. The split itself is not a physical split: basically, it's a logical split of the metadata in memory. You create new region descriptors, and those new regions cover the new start and end keys of the data. Later, when compaction or background work runs, if a region has been moved out, the real data is removed. And if a region becomes too small because data has been deleted, we merge it back. That's basically how we keep every region at a size suitable for region transfer and rebalancing.
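The threshold-driven logical split described above can be sketched in a few lines. This is an illustrative toy, with the threshold in key counts rather than bytes; regions are just (start, end) metadata over a shared sorted key space, so splitting rewrites no data.

```python
# Toy sketch of a logical region split: when the data inside a region's key
# range passes the size threshold, replace its descriptor with two new
# descriptors cut at the median key. No row data moves at split time.

REGION_SPLIT_THRESHOLD = 4          # keys per region; real TiKV uses ~100 MB

def split_regions(regions, data):
    """regions: list of (start, end) pairs; data: sorted list of keys."""
    out = []
    for start, end in regions:
        keys = [k for k in data if start <= k < end]
        if len(keys) > REGION_SPLIT_THRESHOLD:
            mid = keys[len(keys) // 2]          # split point: median key
            out.extend([(start, mid), (mid, end)])
        else:
            out.append((start, end))
    return out

regions = [("a", "z")]
data = sorted(["b", "c", "d", "e", "f", "g"])
regions = split_regions(regions, data)
# the over-threshold region becomes two: [("a", "e"), ("e", "z")]
```

Merging undersized neighbors back together, as the talk mentions, would be the symmetric operation on adjacent descriptors.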
And the table mapping works like this. The previous slides introduced how we built a distributed KV store, which is nothing very special. But on the SQL side, from the user's point of view, they are using SQL, not a key-value API. So we have a mapping strategy from the SQL relational model onto the KV API. Basically, your primary key is mapped to the key, and everything else is mapped into the value. That's the rule. The secondary index is also nothing special: the index key itself is encoded as the key of the key-value store, and the value is a pointer, the primary key or row ID, back to the clustered data. For transaction support, you can read the Google Percolator paper; our design is almost exactly based on that. It's basically two-phase commit, nothing very fancy, and we also have some recent optimizations, like one-phase commit, to save some latency. It's an almost decentralized design: PD is the only centralized component in our transactional world, serving as the timestamp oracle. But with optimization, this is usually not the bottleneck in daily use, because each TiDB server has a single output channel batching its timestamp requests; it's not one request per statement. So as long as you don't have tens or hundreds of different TiDB servers, PD will not be a performance problem. This transaction model is ACID compliant, built on the key-value system. Also, if you insert a row, it might come with a couple of secondary index writes, and that index consistency is guaranteed by the transaction itself. So that's a very brief overview of an early version of TiDB, before its GA version.
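The row and secondary-index mapping described above can be sketched concretely. This loosely follows TiDB's documented `t{table_id}_r{row_id}` / `t{table_id}_i{index_id}` key layout, but uses strings where the real encoding is binary, and the helper names are made up for illustration.

```python
# Simplified sketch of mapping the relational model onto a KV store:
# primary key -> row key holding the whole row, and a secondary index key
# holding a pointer (the row ID) back to the clustered data.

def row_key(table_id, row_id):
    return f"t{table_id}_r{row_id}"

def index_key(table_id, index_id, index_value):
    return f"t{table_id}_i{index_id}_{index_value}"

def insert_row(kv, table_id, row_id, row, indexed_col):
    # primary key maps to the key; everything else goes into the value
    kv[row_key(table_id, row_id)] = row
    # secondary index: index value encoded into the key, row ID as pointer
    kv[index_key(table_id, 1, row[indexed_col])] = row_id

def lookup_by_index(kv, table_id, index_id, value):
    row_id = kv[index_key(table_id, index_id, value)]   # index read...
    return kv[row_key(table_id, row_id)]                # ...then fetch the row

kv = {}
insert_row(kv, 10, 1, {"name": "alice", "age": 30}, "name")
lookup_by_index(kv, 10, 1, "alice")   # -> {"name": "alice", "age": 30}
```

In the real system the row write and the index write above would be wrapped in one Percolator-style transaction, which is what keeps the index consistent with the row.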
So we were very happy when we built that system, and we shipped the database to our customers and said: hey, we have built a very good distributed database, and you can use it as if it were your MySQL sharding cluster. What about replacing your online TP MySQL sharding cluster with TiDB? But users were usually skeptical. They said: is it stable enough? It seems to be a very early-stage project; how can we trust you? Let's try it out as a secondary cluster doing replication first. So the TP database became something like an AP database. We asked the users: is it good? They said it's pretty convenient for real-time analytics. So we built a database designed for transactional purposes, but our very early users used it in purely analytical use cases. They just used it as a backup cluster, or as a real-time replicated secondary cluster, and they did all their real-time analytics and data serving on the TiDB cluster. So why was TiDB kind of suitable for analytical processing? Before the GA version, in the very early stage, we had a feature very similar to the HBase coprocessor. Every TiKV node can perform part of the computation to speed up the query. The computation that can be pushed down to TiKV is very simple: basically filters, partial aggregation, and some top-N calculation. Those are the things we can push down from TiDB to form a partial plan on TiKV. After the partial plans are done on the TiKV side, TiDB gathers all the partial results and does the final calculation. That's how the distributed coprocessor works; it's very similar to the HBase coprocessor. So overall, even at a very early stage, TiDB had features suitable for real-time analytics.
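The coprocessor split just described, where filters and partial aggregations run on each storage node and the SQL layer only merges partial results, can be sketched as follows. Function names are illustrative, not TiDB's coprocessor API.

```python
# Sketch of coprocessor pushdown: each TiKV node runs the filter and a
# *partial* SUM/COUNT per group over its own regions; the TiDB server only
# merges the partial aggregates into the final answer.

def coprocessor(rows, predicate):
    """Runs on each storage node: filter + partial aggregation."""
    partial = {}
    for key, value in rows:
        if predicate(value):
            s, c = partial.get(key, (0, 0))
            partial[key] = (s + value, c + 1)
    return partial

def final_merge(partials):
    """Runs on the TiDB server: combine partial (sum, count) per group."""
    merged = {}
    for p in partials:
        for key, (s, c) in p.items():
            ms, mc = merged.get(key, (0, 0))
            merged[key] = (ms + s, mc + c)
    return merged

node1 = [("a", 1), ("b", 5), ("a", 3)]          # rows on one TiKV node
node2 = [("a", 2), ("b", 7)]                    # rows on another
parts = [coprocessor(n, lambda v: v > 1) for n in (node1, node2)]
final_merge(parts)   # -> {"a": (5, 2), "b": (12, 2)}
```

This also shows why a join can't be handled the same way, as the talk notes later: SUM and COUNT decompose into mergeable partials, but a join's result depends on rows living on *different* nodes, which requires shuffling.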
So one: it is scalable, so it can do data integration. And because it's designed for TP, it's very easy to put TiDB as a secondary cluster behind online databases, especially behind MySQL. We can do cross-shard queries, because the query engine simply works across shards. For databases that have been sharded, usually with a proxy-based sharding layer, there are limitations on queries across the shards, but that's not the case for us. It can also do real-time updates, because it is designed as a TP-oriented database, so it doesn't have the limitation of analytical databases that only support batch updates and insertion. And it's scalable storage for multiple data sources; compared to NoSQL, it has secondary indexes. All of that makes it a good choice for real-time data integration. This is actually from one of our users' use cases. You can see that at this early stage, users deployed TiDB like this: they have multiple MySQL clusters at their online layer, and they use Syncer, which was our component for data migration and online data synchronization into TiDB, behind a load balancer, and the data is inserted into TiDB. Their query requests can go directly to TiDB for real-time serving or analytics. So that's the very common data-integration use case. What is PD in your diagram here? PD is the Placement Driver. That's the timestamp oracle and also the scheduler. Okay, thank you. You can consider it something very similar to, maybe, Bigtable or HBase.
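PD's timestamp-oracle role, mentioned in this answer and earlier (each TiDB server batches its requests over a single channel), can be sketched like this. It's a toy model with invented names and an arbitrary batch size, not PD's real protocol.

```python
# Toy sketch of a timestamp oracle: one global counter hands out strictly
# increasing timestamps, and each TiDB server fetches a whole window per
# round trip so PD is not contacted once per transaction.

class TimestampOracle:
    """The only centralized piece of the transaction path."""
    def __init__(self):
        self._next = 1
        self.calls = 0                   # round trips served

    def allocate(self, count):
        self.calls += 1
        lo = self._next
        self._next += count
        return lo, lo + count - 1        # inclusive window [lo, hi]

class TidbServer:
    """Batches TSO requests instead of one RPC per transaction."""
    def __init__(self, tso, batch=100):
        self.tso, self.batch = tso, batch
        self.lo, self.hi = 0, -1          # empty local window

    def get_ts(self):
        if self.lo > self.hi:            # window exhausted: one round trip
            self.lo, self.hi = self.tso.allocate(self.batch)
        ts, self.lo = self.lo, self.lo + 1
        return ts

tso = TimestampOracle()
server = TidbServer(tso, batch=3)
[server.get_ts() for _ in range(4)]   # -> [1, 2, 3, 4], only two TSO calls
```

This batching is why, as the talk says, PD only becomes a bottleneck with very many TiDB servers: the RPC rate grows with the number of servers, not the number of transactions.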
So you have a scheduler to move the pieces of data, to move the shards around to keep everything balanced, and we also need the timestamp oracle. That's also PD. All right, thank you. So, in one of your diagrams, is the TiDB server on the right the same as the green thing marked TiDB on the left, here? Oh, yeah, sure. Yeah, that's the confusing part, because we don't have a separate name for the TiDB server itself, the computation layer or front-end layer. So this "TiDB" is the TiDB server. It sometimes confuses our customers and community users too, because when we say TiDB, sometimes it means the TiDB cluster or the whole TiDB project, and sometimes it means the TiDB server. Okay, thank you. So is everyone happy now? Actually, not. After a while, we interviewed the users who had been using it. For TP scenarios, it worked, setting aside that it was not very stable because it was a very early stage; remember, we're talking about pre-GA TiDB, not the version right now. But for AP scenarios, the reports were slow, because they were analyzing a lot of data; during very big joins the TiDB server might OOM, because it cannot shuffle data around; and it did not work well with existing big data systems. So basically, the problem was the mismatch between the computation power and the scalable storage. It looks like one of those tiny-head dolls you can order from an online shop, link here. We have a scalable storage system, but for analytical purposes, the TiDB server is not equally scalable.
But for TP transactional use cases, it is scalable, because you can put multiple TiDB servers on top of TiKV, and each transaction is small and tiny; that scales well. But for analytical purposes, each query can be super large. If you have a large amount of data in a scalable storage system, you've got to have a similarly scalable computation layer, one that can shuffle data around and work collectively on each single query. So we had different choices for making it scale. Even now, the TiDB server cannot exchange data between servers for a single query. So basically, at that time, our strategy for the computation layer was scale-up instead of scale-out. As discussed on the previous slides, we have the coprocessor, but the coprocessor cannot provide the full speed-up in a distributed way. For example, it cannot do joins, because a join cannot just be split into partial plans whose partial results are gathered for a final computation; a join needs to shuffle data around. And something like distinct count also cannot be sped up by the coprocessor. So the two choices we had at the time were these. First, the hard way: build TiDB its own distributed computation engine. Basically, we wanted an MPP framework. That was a quite big change at the time, very risky, and it would take a long time; the real issue was that we didn't have enough people to do it back then, three years ago. The other choice was to embrace the ecosystem: bring something that already exists into our big picture. What really happened was we built something called TiSpark.
So TiSpark is how we leverage the big data ecosystem, especially Spark, which is a very well-implemented unified computation engine. What we did was build a plugin for Spark. We hijack the Spark plan and rewrite it into something we can reuse: when we have a Spark logical plan, we rewrite it so that the lower half can be pushed down into TiKV, while the upper half remains in the Spark engine to compute. Every time you send a query, parts of it, like the aggregations or the filters, are pushed down to the TiKV coprocessor, and when Spark gets the results, the TiSpark plugin translates everything, the metadata and the row decoding, back into a format Spark can understand, and Spark continues the computation. So that's TiSpark: we leveraged it as our distributed, temporarily at least, computation engine. So, just like in the previous slides, some operators are pushed down and some remain in TiSpark. It's a tight integration with the Spark ecosystem. After this was done, we could reuse a lot of the ecosystem tools around Spark: we can connect to Zeppelin, and through Spark we can connect to BI tools like Tableau and such. Some users use it in all the different Spark ways: with scripting languages, with BI, maybe some AI just for fun (back then we didn't have any machine learning users), and also to do some ETL.
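The plan rewrite just described, where a logical plan is cut so that the lower, pushable half goes to the TiKV coprocessor and the upper half stays in Spark, can be sketched abstractly. The operator names and the split rule here are illustrative, not TiSpark's real Catalyst rules.

```python
# Sketch of splitting a logical plan: walk the operators bottom-up and keep
# pushing down until the first operator the coprocessor cannot handle; that
# operator and everything above it stay in the Spark engine.

PUSHABLE = {"scan", "filter", "partial_agg", "topn"}

def split_plan(ops):
    """ops: operators bottom-up, e.g. ["scan", "filter", ..., "project"]."""
    pushed = []
    for i, op in enumerate(ops):
        if op in PUSHABLE:
            pushed.append(op)
        else:
            return pushed, ops[i:]       # first non-pushable op: cut here
    return pushed, []

# an aggregation is split in two: the partial part goes down to TiKV,
# the final merge stays up in Spark
plan = ["scan", "filter", "partial_agg", "final_agg", "project"]
split_plan(plan)
# -> (["scan", "filter", "partial_agg"], ["final_agg", "project"])
```

The interesting design point the talk makes is that this needed no fork of Spark: it's a plugin layered on Spark's extensible planner.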
Do you modify Spark to do this, or does Spark provide APIs so that you can plop TiSpark in without having to modify anything in Spark? You don't need to modify anything. Once you put the plugin in, you can use everything as usual; it goes through the Catalyst engine and you just write SQL. Okay, thank you. So that's one of the use cases. I remember it's a fresh-food e-commerce company, I forget the name, and they built TiDB as their real-time data store, or real-time data warehouse. The data is channeled in from SQL Server, through Kafka and Flume, and they use TiDB to store all the data from both SQL Server and MySQL. Real-time data serving runs directly against the TiDB server instances, and they do data science and BI things using the TiSpark plugin. The TiSpark plugin is also the bridge by which they can transfer or archive some of the cold data back to Hadoop. So, is everyone happy now? No. TiSpark actually provides a kind of brute-force computation, with low concurrency. But sometimes people need more than that big-data shape: they have smaller-scale queries, and they want everything to be more interactive, and Spark is not very suitable for that. So we had to turn back, back to the start, to enhance our TiDB server. How long did you go with TiSpark before you said, all right, this is a bad idea, we've got to abandon it? It's not a bad idea; we're still using it even now, but it doesn't cover every scenario we want.
We hoped it would be the distributed engine for every analytical scenario, but we found that for the more agile, smaller queries, it's not suitable. Especially because our users are basically DBAs; they are not big-data application developers. Most of the DBAs we encountered are not fans of big data systems, because the big data world is more of an engineering, developer style of world. DBAs have had bad experiences with Spark, so they are not willing to maintain anything like TiSpark or any kind of big data thing. So that's also a major reason we turned back. My question is, how long did it take you to get TiSpark up and running? Maybe that's the better question: since it's just a plugin and you don't modify Spark, is it something a single engineer could get running in a month? It's not a single month. The things that need to be done are: first, the metadata handling, so you rebuild the type system, the encoding and decoding, and read the statistics into Spark; and you also have the hacky way of hijacking the Spark plan and rewriting it into something else. So it was basically three man-months: we had two people, it was one month to have the prototype and the other months to bring it to production usage. All right, thank you, that's impressive. So after TiSpark, we went back to enhance our own system. We basically optimized around the single node; for those smaller queries, a single node is really the way to go.
So we made it become smarter and smaller more efficient and smaller in a smaller scale of use cases for optimizer for the very beginning like the pre-ga version you cannot really call it the tidyb have an optimizer it's just something like you carry the st around and to do the computation just like any premature or the database that might have and later we send it to rbo plus cbo and yeah you can consider it's like a totally rewrite if the optimizer layer and now it's not like yeah and now we are experimenting on the cascade model to expose more optimization chances. For execution we for the very very early version we have the classic of a panel model it's like row by row computation and row wise execution and later for the analytical purpose we change it to the batch first we change it to the batch execution and then we change it to the to the vaporized execution for all the different components in the tidyb coprocessor and also in the tidyb and the type flash. So that's a diagram from the enhancement to show the enhancement we made from the very early version to the 2.0 so you can see that for tpch it's a very huge boost after the restructure and redesign so is everyone happy now we still have a key conflicts we haven't resolved yet so because we don't have a columnar engine and a lot of the time the people are users or our competitors challenge us it's like without the column store how you can say that it's an edge type database and so the real other biggest problem challenge is like there's no workload isolation at a time um users needs to um run every chorus on top of the same storage system so um very often that if they run something like a ties bar chorus or like a very big analytical chorus then the storage system were were like shaking and so the tp workloads were not be stable so the workload interference is very um is very very big problem for us because tp workloads is very fragile it's just like a small boat sound in in the sea and when the ap quarry 
is running they they tend to grab the every resource they they can have and to run to try to compute a single big chorus as soon as possible so that's not something that too can can be merged or can can live inside the same world together happily so we have hard questions to choose between the the two sides actually we have some um have some intermediate design try to solve the problem it's like we we first like design the storage system to be um to divided the the resource into different lanes and we give it the um we give the tp a uh independent thread pool and also the ap independent thread pool but actually it's not a good idea say that when the ap quarry is running it even using a very small thread pool of resources but it grabbed the resources and never give it a back so even with very limited the stress it can cost very high cpu contention so after different trials and errors then we decide to have another components to solve two problems once together one problem is like we need to have a column store and to isolate the resource and also we want to isolate the resources to different nodes so that's what we call the type flash um so it's a distributed column of storage and um as a companion of a type kv so it's a it'll use a rough learner to replicate the data from the type kv into the column store format so uh at resource perspective you can dedicate a total of different machines to to run type flash and we'll do the async replication from the from the type kv nodes uh and also the follower reads and learner reads or read protocols defined by the wrap paper so plus the mvcc uh together these two two things together it can provide a consistent snapshot read and also um so that provides us a perfect workload isolation and also partially on the code base is partially based on the click house so that's why we have free language in our system one is go for the type type of server and the other is a rust for the type kv and now we have a bring the bring the click 
house code base in so we have c++ code base as well so this is the diagram of the whole uh type b so the new type b design so for the for the left lower part is the storage layer and for the right hand side is the type kv and as you can see that for each region of the data if you choose to have a column store replication then it will use a learner protocol to asynchronously transfer data from the type kv to the type flash um so why this is not a hard line it's it's something because it's a synchronization asynchronous replication so that means even if the type flash notes was down it will not impact the normal transactional processing and the type type b itself were used the cbo the cost base model to choose the right access pass so if user does not define that does not have the trigger that like they they want the perfect isolation they put the type b to do the choice then the type b will do the cost model on choices and just like how database to choose a single indexes it will read the stats and to um to compute if it is a suitable workloads for the type flash then it will route the queries into type flash so basically type flash have basically in in so basically type flash itself is a updateable column store so naively we think that column store is not fit for online updates so um traditionally they were housing on databases they and support only like like batch updates like every hour or every day based on the so updates consider not easy based on the primary key but it's not absolute so um so usually people use something called delta main is divided the column engine into two pieces one pieces is like the the right optimized area the other pieces to the read optimized area so that's what we do as well i think a lot of different engines do the same thing for example the c store i consider c store to be something similar they have a they have a database embedded for the for the engine online insertion and they gradually compact everything to a stable version of 
column store and also to vaporize i also could do a lot of things to did that before and we are not something very new we just borrow a lot of idea from different systems so we have to design something like that um so for the first version we use the rsm um but it seems not for our perspective it seems not very optimal for the read performance because iosm tree have two layers and between each of layers there are a lot of things so when you do the read you need to do an n-way merge so that's what the performance uh degradation comes from so later we um had a new design we um divided the divided the whole data in a way like a tree so you can consider it something like either the b-tree or segment tree so the piece of data were not each piece of data were not have overlapping but uh but each each of the leaf nodes where i have two sections one section is the right to optimize the sections it's uh delta sections and the other section is the stable sections for from time to time we need to maintain the data sections and to compact it together with the stable sections so that's the that's our new design you can consider it's basically something like a combination of a b-plus tree and the rsm tree it it can be considered as a b-plus tree because it's um it's it has very similar structure and the only different thing is like it has very huge leaf nodes and it is it is also similar to the to the rsm tree because it has a stable and it has the uh you can consider it's a two-layer of uh rsm tree for the delta layer you can consider it's a memory table but indeed it can be dumped into the uh into the disk and the stable layer is the tier one it's a layer one so from time to time we need to be combined and compacted to a better um better organized better organized way so by doing so we can avoid the multiple multi-way merge so there was speed up the speed of the scan query and as I said it's asynchronous because we are using rough learner to replicate the data so it's a 
asynchronous way to replicate uh to replicate things so that avoids any interference from the ap side to the tp side so say that tp your ap type flash nodes uh was brought down um it were uh if you're using a strong consistent synchronized way to replicate the data then the the transactional the tikv might must need to wait for the the type flash nodes um backup but for an asynchronous replication we need we don't need to wait for that but the the cost is like how you keep consistent read so the way we do that is like we um we use the rough learner read a rough follow read and also the follow read is essentially the uh the the algorithm we're using and uh together with mvcc strategy we can provide a strong consistency so the thing is it's like uh one type to be submit a query to um one type type to be insert a data in time one in t1 or t0 in the timeline then after it's uh it's been inserted and in the t t1 timeline it's uh it's a meta query so for the t2 when the type flash received the query request you were not directly uh response based on its own data you were first send a reading that's query a reading that's request back to the leader nodes uh back to the reader leader replica ask if there's anything um newer or more fresh than the things we hold uh in this diagram the type flash nodes is uh replicating in the roughed uh in a roughed way and the progress is uh free because roughed is like the log of a roughed is like linear so you can consider each rough log replication can have a progress number so in this diagram the progress number is free so it asks for if there's newer log or roughed entries mean to be appended so uh in this diagram the the there's a newer entry is like four and after the type flash nodes received the answer that we have a newer entry it were blocked until the newer entry data comes in and then it will provide return back the read results so uh together with cc so after after you read back the data it will use the timestamp for the read 
query, and everything newer than that timestamp will be discarded. These two together give us strongly consistent reads: it guarantees that everything written can be read back, and that we always get a consistent snapshot. That's how we implement asynchronous replication with strongly consistent reads.

So the HTAP feature on TiDB is this: we are not building two separate databases where you need to choose between the TP side and the AP side, the old "100% of TP, 80% of AP" story; that's not a thing anymore. We are building HTAP in a way that, when data is written, you have both the row store format and the column store format combined, and they are always kept up to date. So analytics on TiDB is not on stale data; it always gives you the latest data, as described before. But TiFlash itself can also be used as a column store index when you don't use TiDB as your main transactional database, when you use it as a pure analytical service. In that case the column store is not an isolated columnar mirror: it can serve as a columnar index, and the CBO will decide which index is the most suitable one for a specific query.

For use cases, a very common one you can easily think of is this. Previously, users had MySQL for their online applications, and from time to time they needed to move the data from MySQL into an analytical database, with BI services on top of that. After switching to TiDB, they can combine both into one database: they can run the online app server on TiDB and also do the BI reporting directly on TiDB as well.
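To make the learner read and MVCC mechanism described above concrete, here is a minimal toy sketch in Python. This is my own illustration, not PingCAP's actual code: the column replica asks the Raft leader for its latest log index, waits until its own apply progress catches up, and then serves a snapshot read that discards versions committed after the query's timestamp.

```python
# Toy sketch of learner read + MVCC; an illustration, not PingCAP's code.

class Leader:
    """Holds the authoritative Raft log as (commit_ts, key, value) entries."""
    def __init__(self):
        self.log = []

    def append(self, commit_ts, key, value):
        self.log.append((commit_ts, key, value))

    def read_index(self):
        # The log index the learner must reach before serving the read.
        return len(self.log)


class Learner:
    """Asynchronously replicated replica that still serves consistent reads."""
    def __init__(self, leader):
        self.leader = leader
        self.applied = 0    # replication progress (the "three" in the diagram)
        self.versions = {}  # key -> [(commit_ts, value), ...]

    def _catch_up(self, index):
        # The real system blocks until async replication delivers these
        # entries; this sketch just pulls them synchronously.
        while self.applied < index:
            ts, k, v = self.leader.log[self.applied]
            self.versions.setdefault(k, []).append((ts, v))
            self.applied += 1

    def snapshot_get(self, key, read_ts):
        self._catch_up(self.leader.read_index())  # the read-index round trip
        # MVCC filter: versions committed after read_ts are invisible.
        visible = [v for ts, v in self.versions.get(key, []) if ts <= read_ts]
        return visible[-1] if visible else None
```

With two versions of a key committed at t1 and t2, a read at timestamp t1 sees only the first version, while a read at a later timestamp sees the second: writes are never lost, and every snapshot is consistent.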
Some users are using it like this: they have their online applications on TiDB, and they also use CDC to channel the data out through Kafka, with Flink as the aggregation or transformation engine, inserting the results back into TiDB. The online data gets transformed, so Flink serves as something like a real-time materialized view, and the more BI-oriented services sit on top of the materialized data.

So, is everyone happy now? The story told up to this point all happened within this year. The part that's not so good is that there are actually quite a few things we're not happy with. One is that TiSpark is, for now, still the only distributed engine in the TiDB ecosystem. We need a native computation engine in a distributed framework, built for medium-sized or small-sized interactive cases, because Spark uses a MapReduce-style shuffle and a lot of it is designed to be more ETL style. We want our native engine to be MPP style instead of MapReduce style. Also, because of the design described before, the data is channeled directly from the row store side into the TiFlash side, so there is very little chance for us to insert any transformation in between: the data is almost identical, just transposed from the row store to the column store. So we have no place to insert any ETL or data transformation, and we need something like data transformation.

Our distributed engine is almost done: we're making ClickHouse and TiFlash work together as an MPP framework, still using TiDB as the single entrance for the front end. So we're sharing the
same parser and almost the same optimizer, but with an extra distributed planning stage. It works something like this: when the user sends a query, it might switch modes. If it is an MPP-mode query, it is sent to the MPP cluster; if it is a TP, transactional-mode query, it goes through the optimizer directly and reads TiKV directly. For now the MPP cluster cannot read directly from TiKV yet, but in the next development cycle we want to make it possible to read through both the row store and the column store, which exposes more chances for optimization.

The performance looks promising, but it's not done yet. For the performance charts here on TiFlash MPP, for some of the queries we had to hand-tune whether to use broadcast joins or hash shuffle joins, because the optimizer is not quite done yet. But if the choice of join operator is okay, this is the performance we can get. The comparison is between the two engines: for the TiSpark engine we're still using TiFlash as the storage system, and for the TiFlash MPP system we're using the same storage system but a different computation engine; that's the performance difference. It's still at a premature stage, so you can expect it to get faster in the future, next year, spring, something like that. And it's not a full, complete TPC benchmark; we just put some of the results here.

Now, real-time transformation. The previous slides also showed the use cases for real-time transformation. It is very important for us because most of our use cases are real-time.
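The hand-tuning between broadcast joins and hash shuffle joins mentioned above is, at heart, a cost comparison, which is the kind of decision the optimizer will eventually make automatically. Here is an illustrative sketch (my own simplification, not TiDB's actual cost model): broadcasting ships the smaller side to every other node, while shuffling repartitions both sides by the join key.

```python
# Illustrative cost rule for picking an MPP join strategy; a simplification
# for exposition, not TiDB's actual cost model.

def pick_join(left_bytes, right_bytes, nodes):
    # Broadcast: ship the smaller side to each of the other (nodes - 1) peers.
    broadcast_cost = min(left_bytes, right_bytes) * (nodes - 1)
    # Hash shuffle: each row moves to the node owning its hash bucket; on
    # average a fraction (nodes - 1) / nodes of both sides crosses the wire.
    shuffle_cost = (left_bytes + right_bytes) * (nodes - 1) / nodes
    return "broadcast" if broadcast_cost <= shuffle_cost else "shuffle"
```

Joining a small dimension table against a large fact table favors broadcast, while joining two large tables favors shuffle, which matches the intuition behind the hand-tuning.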
If it's not real-time, people might use something else: they have a lot of other traditional data warehouse or data lake systems to choose from; they might choose Hadoop, or something like ClickHouse, or something else. When people choose TiDB for analytics, they usually want something real-time; they want fresh data. But with the current design we have no chance to insert any transformation, so Apache Flink becomes something very handy for us. When we did user interviews we found that, even before we advertised or marketed the two solutions combined, we already had a lot of users running TiDB together with Apache Flink. It's kind of a perfect match, somehow.

On the Flink side, TiDB is a perfect sink for Flink because it provides both a row store and a column store for different use cases, and even the column store supports online updates, which a lot of systems cannot do. Before TiDB, Flink had two choices. One was to use MySQL, but MySQL cannot provide enough SQL or computation ability. Or, if the data source is append-only rather than updating, they could use something like a Hadoop data lake and put the data directly onto Hadoop, but in that case they could not handle change data well. And on the TiDB side, Flink is both a streaming framework to transform the data and an engine that can also do batch workloads, like Spark can. So it seems like a perfect match, somehow.

So, is everyone happy then? I'm not sure; every time we talk to customers they bring new ideas. Maybe in the next stage we'll do something like incremental materialized views. And we're already building something to allow TiFlash to be an independent, first-class storage system, so that TiDB can
write directly into the column storage, instead of writing to the row storage and then syncing to the column storage. So there's a lot to do; we'll think hard and work hard and see if that fits customer needs. Thank you.

Awesome, thank you. I will applaud on behalf of everyone else. Again, we have a few minutes, so if you have any questions, please unmute yourself and ask Sean directly. Okay, so I know you didn't talk much about the optimizer, and I was dealing with the baby while you were talking, so I missed a part, sorry. How are you guys collecting statistics? Are you running ANALYZE? Are you computing statistics as the data comes in? What's your general approach to this?

Oh, we're running ANALYZE. There's a fast ANALYZE and a full ANALYZE, different modes. Previously we were using two different kinds of things: one is the histogram, the other is a sketch, and now we find we might focus on only the histogram. The fast ANALYZE collects only a sample of the data, and the full ANALYZE does the heavy work. I think we still don't have statistics being updated as the data comes in.

So you're saying you guys had sketches, like a Count-Min sketch, and you had a histogram, and you abandoned the sketches? Yeah, the Count-Min sketch is mostly for point queries, and for the other things the optimizer relies on the histogram. But recently we found that maybe the histogram is the better idea, so we might abandon the Count-Min sketch part. That's super interesting, because the co-founder of Splice Machine told me the exact opposite: he said they were using histograms, then they switched over to the Yahoo sketching library and abandoned all their histograms. I find that super interesting.
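For readers unfamiliar with the histogram side of ANALYZE, here is a toy equi-depth histogram (my own sketch, not TiDB's implementation): every bucket covers the same number of rows, so the optimizer can estimate a range predicate by counting the buckets that fall below the constant.

```python
import bisect

# Toy equi-depth histogram; a sketch for illustration, not TiDB's code.
# Each bucket holds the same number of rows, so range selectivity can be
# estimated by counting buckets whose upper bound is below the constant.

class EquiDepthHistogram:
    def __init__(self, values, buckets=4):
        xs = sorted(values)
        self.n = len(xs)
        self.rows_per_bucket = self.n // buckets
        # Upper bound of each bucket: the last value that falls into it.
        self.upper = [xs[(i + 1) * self.rows_per_bucket - 1]
                      for i in range(buckets)]
        self.upper[-1] = xs[-1]  # last bucket absorbs any remainder

    def est_rows_leq(self, v):
        """Estimated number of rows with value <= v."""
        b = bisect.bisect_left(self.upper, v)
        if b >= len(self.upper):
            return self.n
        # Full buckets below v, plus half of the bucket v lands in
        # (a crude within-bucket interpolation).
        return b * self.rows_per_bucket + self.rows_per_bucket // 2
```

On 100 uniformly distributed values split into 4 buckets, `est_rows_leq(50)` estimates 62 rows against a true count of 51, within one bucket's worth of error, which is the granularity such a histogram can promise.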
Okay, have you guys considered, there's some work out of Berkeley and a bunch of other places on using neural networks instead of histograms, have you looked into that, or is that too far in the future for you guys? Oh, actually I'm not sure; the optimizer is handled entirely by another team, so that's a question I'm not quite familiar with. That's fine, it's just that the optimizer part is super interesting to me; I always ask questions when I see people talk about it. I'm very sorry. No, that's okay.

So, again, I apologize, you might have said this and I missed it: the integration with Flink, that's just sitting on the outside, right? TiFlash itself does not rely on it, is that correct? Can you repeat your question, sorry? With Flink: the story with TiSpark was that you guys added a connector so that Spark can read TiKV. With Flink, did you add the same kind of connector, or are people just running Flink on the outside? For Flink, and also for TiSpark, it's from the outside; we didn't tie them in combined with our own product. Basically the Flink thing is this: for TiSpark we already have a very fat client which can encode, decode, and translate metadata, and also do plan transformation. The Flink integration is actually driven by the community: the community just took the fat client from TiSpark and inserted it directly into Flink. So yeah, because of the previous work, it was easier to do in Flink. Okay, I understand.

All right, any last questions from the audience? Okay, Sean, again, thank you for doing this; you're really busy and you got up early to talk with us, and we really appreciate it. All right guys, this is it for the Quarantine Database seminar series, the end of the semester, the end of the year. That's 30 talks, and it's
very exciting. Everything will be on YouTube soon after, I think. We want to thank the Stephen Moy Foundation for sponsoring us this entire semester, and we're looking forward to everyone coming back next year in 2021. Hopefully we'll have vaccines, and then we'll still be able to, you know, keep talking about databases. Okay.