Databases, a database seminar series at Carnegie Mellon University, is recorded in front of a live studio audience. Funding for this program is made possible by OtterTune and Google.

We're super happy that today we have Charlie Yang from Alibaba OceanBase come talk about their system. We appreciate Charlie being with us, because he's currently in China and it's 4 AM; I think he holds the record for staying up the latest to give a talk with us. I think the previous record, someone calling in from India, was 2 or 3 AM. So again, Charlie, thank you for doing this and being with us. As always, as Charlie is giving the talk, if you have any questions, please unmute yourself and ask at any time. We'd like this to be a conversation with him, not just him talking to himself for an hour.

Thank you, Andy, and hello everyone. Because I'm not a native speaker, when you ask a question I may need more time than usual to answer, so I will say sorry for that in advance. My tech talk is about the architecture of OceanBase. OceanBase is something like Google Spanner: it is a distributed SQL database. Two colleagues are here: I am the CTO of OceanBase, and we have another colleague, Guoping Wang, who is the SQL team leader of OceanBase, and we will answer questions together.

Okay, my tech talk consists of four parts. First I will briefly introduce OceanBase; then the design overview of our distributed engine, which is the lower layer of OceanBase; then the design overview of our HTAP engine, which is the upper layer of OceanBase; and finally I will introduce the TPC-C benchmark of OceanBase.

Before my tech talk, I would like to introduce mobile payment in China. When you go to some city in China, for example Hangzhou, nobody takes cash. Even a beggar on the street, a homeless person on the street, will give you a QR code, so you can just scan the QR code to finish an online payment. The right chart comes from the Double 11 day. At the exact first second after midnight on the Double 11 day, the payment traffic surges by about a thousand times, and that lasts for several minutes. So I think this case is very challenging, especially for the infrastructure and for the database, because of the scalability problem, the consistency problem, and the high-concurrency handling problem. So we decided to develop the OceanBase database, starting from 2010. OceanBase has served all payment requests of AliPay since 2017, and on the Double 11 day of 2019, the QPS of OceanBase reached 61 million. Besides AliPay and Alibaba, OceanBase is also used by more than 500 customers in mission critical scenarios, such as payment, accounting, customer information, and so on. OceanBase has set the TPC-C number one and the TPC-H number two (on the 30,000 GB data set) official world records.

The main scenarios of OceanBase are scalable OLTP and HTAP. OceanBase is a linearly scalable distributed database with strong consistency and high availability, so it's very useful in scenarios such as the Double 11 day, Black Friday, promoting a new product, and so on. OceanBase is also useful in real-time operational analytics scenarios, in one unified system together with the scalable OLTP workload. For example, when a user finishes a transaction on the Double 11 day, the merchant wants to adjust their promotion strategy immediately. So it's very useful that OceanBase supports HTAP features.
And OceanBase is compatible with MySQL, with high performance and much lower cost. In AliPay, we migrated all MySQL databases, and of course all Oracle databases, to OceanBase, and the storage cost of OceanBase is just one third of MySQL. OceanBase is used in all mission critical systems in AliPay, such as trade, payment, accounting, CIF (which represents customer information), promotion, and real-time data serving. As I mentioned before, the peak performance is 61 million QPS, and we have many clusters in AliPay. Our largest cluster has more than 200 nodes in one cluster and more than six petabytes of data, and one single table has more than 320 billion rows, in real production. OceanBase uses Paxos to achieve high availability. We have RPO equal to zero and RTO less than 30 seconds. RPO represents recovery point objective and RTO represents recovery time objective. So it means that in case of failure, no matter whether it is a server failure, an IDC failure, or even a city failure, OceanBase can recover in less than 30 seconds without any data loss.

What is your P99 for this database here? What is your 99th percentile latency for transactions?

Okay, the latency, right? Yeah, in AliPay, each transaction, for example a payment transaction, has about 20 SQL statements, and each SQL statement in OceanBase takes about, or less than, 1 millisecond. Have I answered your question, Andy?

Yes, so that's 1 millisecond per query. But when you go to commit, how much time does it take to complete the commit? So maybe one payment transaction is 20 queries; say, rounding up, is it 20 milliseconds on average to commit the transaction?

Oh, in AliPay, we use microservices in our application. For example, a payment goes through several application systems, and a payment transaction has in total about 200 SQL statements in OceanBase. So each transaction, from the view of the application, takes about, or less than, 100 milliseconds. Yeah, because each SQL statement in OceanBase takes less than 1 millisecond, actually less than 0.5 milliseconds.

Okay, okay. Okay, I understand. Thank you.

OceanBase is also used by more than 500 customers in mission critical systems. In the finance industry, OceanBase is used in debit and credit card transaction, billing, and accounting systems, such as at ICBC. ICBC is the largest bank in China, and China Construction Bank is the second largest bank in China. And also at some other banks and other financial companies. OceanBase is also used in the telecom industry, in mission critical billing and CRM systems, such as at China Mobile, China Telecom, and so on. Besides the finance and telecom industries, OceanBase is also used in a lot of high-tech industry, high-concurrency scenarios, in payment, accounting, and customer information systems, such as at AliPay, Alibaba, Ctrip, GCash, and lots of other customers.

Let's talk about the design goals of OceanBase. You know, the monolithic databases such as MySQL and Oracle have full SQL functionality, and the performance of a single node is very, very high. On the other hand, the distributed storage systems support some useful features such as high scalability and high availability, for example Bigtable, HBase, and DynamoDB, but they do not support full SQL functionality; they just support a simple key-value API or limited SQL functionality. And then we have distributed SQL databases such as CockroachDB, TiDB, and Google Spanner.
They support both high scalability and full SQL functionality, but the single-node performance of CockroachDB or TiDB is very low compared with MySQL, maybe just one third of MySQL. In OceanBase, we need high scalability, we need full SQL functionality, and we also need high performance on a single node. So you can see OceanBase as a distributed SQL database with full SQL support and high single-node performance.

We published our TPC-C benchmark result in 2019: more than 60 million tpmC, ranked number one at that time. In 2020, we published a better TPC-C result, 707 million tpmC, and ranked number one again. In 2021, we published our TPC-H result, more than 15 million QphH on the 30,000 GB data set, and ranked number one at that time. But unfortunately, just after two weeks, another database company, which I think you may be very familiar with, called XOR, published a better TPC-H result on the same data set. And since then, we have ranked number two.

Let's talk about the design overview of our lower layer. In OceanBase, each cluster consists of several zones in one or multiple regions, and we have a role called OBProxy. From the name, you can see that it is used to route requests to the OceanBase servers according to the location info of each partition. Each OceanBase server is similar to a classical RDBMS: it will compile a SQL statement to produce a SQL execution plan, and then it executes that plan. And there is one OceanBase server that will be elected to host the root service. I think this design is a little different from other distributed storage systems, because in a typical distributed storage system, the root service, which does global scheduling and global management, will be hosted in an independent process. But in OceanBase, it is integrated into the OceanBase server. Why? Because OceanBase may be deployed on just one machine, and if we have only one process, a one-machine deployment is much easier for the user. Redo logs are replicated among the zones using Paxos. And there are two kinds of transactions: a transaction for one partition is executed locally, but a transaction for multiple partitions, no matter whether the partitions are in one server or across several servers, is executed using two-phase commit.

Let me explain some basic concepts first. Each cluster has multiple zones, and you can think of a zone in OceanBase as an availability zone; in most cases, it's an IDC. Each zone has multiple OceanBase servers, and we have a concept called the resource pool. OceanBase is a multi-tenant architecture, and each resource pool has multiple resource units; each resource unit can be hosted in only one OceanBase server. You can think of a tenant in OceanBase as an independent MySQL instance in AWS RDS. A tenant has its own databases, each database has its own tables, each table has its own partitions, and each partition can be replicated into several replicas. So OceanBase divides each cluster into multiple resource pools owned by tenants. Resource isolation in OceanBase is done internally by the database, not by a container or by a virtual machine at the operating system level. In OceanBase, we use two-level partitioning: that is, hash partition as the first level and range partition as the second level, or vice versa.
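To make the two-level scheme concrete, here is a minimal sketch of how routing a row to its (hash, range) partition could work. The function name, the CRC32 hash, the partition counts, and the month-based range bounds are all illustrative assumptions, not OceanBase's actual routing code.

```python
# Minimal sketch of two-level partitioning (hash as the first level, range
# as the second), as described above. Illustrative only.
from bisect import bisect_left
from zlib import crc32

HASH_PARTS = 8                                     # first-level hash partitions
RANGE_BOUNDS = ["2021-01", "2021-07", "2022-01"]   # upper bounds of range parts

def locate_partition(user_id: str, order_month: str) -> tuple[int, int]:
    """Return (hash_part, range_part) for a row keyed by (user_id, month)."""
    hash_part = crc32(user_id.encode()) % HASH_PARTS      # level 1: hash
    range_part = bisect_left(RANGE_BOUNDS, order_month)   # level 2: range
    return hash_part, range_part

print(locate_partition("alice", "2021-05"))   # e.g. (some hash bucket, 1)
```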
And we have a concept called the partition group; it's used to do co-location, and several partitions in the same partition group will be located on the same machine. The meta table in OceanBase is used to maintain the location of each partition replica, and the underlying mechanism of the meta table is the same as for a normal user table: we can also use SQL queries to query and modify the meta table. So the meta table in OceanBase is also scalable, like a normal user table. The root service does load balancing, such as global management, partition migration, replica creation, and leader switching, similar to other distributed storage systems.

Okay. We use Paxos to achieve high availability. You know, Paxos is a quorum-based consensus protocol, and each redo log entry in OceanBase is replicated to a majority of the replicas, such as two replicas out of three, or three replicas out of five, in our typical deployments.

There is one component which is very special in OceanBase: the data consistency check. As you know, a database is a very complicated system, and we may have software bugs, especially data-consistency-related bugs. In OceanBase, we can detect this kind of bug automatically, by the system itself. At the transaction level, OceanBase verifies the cumulative checksum of the transactions. It means that if we have, say, some concurrency-related bug, OceanBase can detect it automatically because of the transaction-level checksum.

So what do you checksum? The data that the transaction wrote or read? What is the cumulative checksum?

Yes, we checksum the data. Each time we execute a transaction, when the transaction modifies some rows, we checksum those rows, all the data of those rows. And then the cumulative checksum is the accumulation over all rows modified by all transactions.

You're comparing the checksum of the transaction, is it active-active? Or are they comparing the checksum of the same transaction running in two different locations? Or is it kind of like a Merkle tree, checking that things are happening in incremental order correctly?

Oh, no, no, no. Because in OceanBase we use Paxos, each partition has several replicas, right? So each replica will apply the same transactions, and we compare across the replicas.

Okay, I got it. Yeah. Okay, thank you.

At the replica level, we also verify the checksum of each replica after major compaction. You can see a major compaction as a global snapshot: all the replicas will do the same thing at the same time, and it's executed once a day in our typical deployment. During this major compaction, we do the replica-level checksum: we compare the checksums of the several replicas of the same partition, and we also compare the checksum of the base table against the index table. And at the block level, we verify the checksum of each block after we read or write that block.

Let's talk about distributed transactions. We use two-phase commit to achieve atomic distributed transactions. The left chart is an example: Alice wants to transfer $10 to Bob, and when we commit the transaction, OceanBase will initiate a two-phase commit in the background automatically. It will do prepare as the first phase, and then, after all participants have finished the prepare phase, commit as the second phase.
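To make the flow concrete, here is a toy sketch of the classic two-phase commit just described, with Alice's and Bob's account partitions as participants. Class and method names are illustrative assumptions; OceanBase's optimization of answering the client right after the prepare phase is discussed next.

```python
# A toy two-phase commit for the transfer example above (Alice sends $10 to
# Bob, whose accounts live on different partitions). Illustrative only.
class Participant:
    def __init__(self, name: str):
        self.name, self.prepared = name, False

    def prepare(self) -> bool:      # phase 1: persist a prepare log, then vote
        self.prepared = True
        return True

    def commit(self) -> None:       # phase 2: make the change durable/visible
        assert self.prepared

    def abort(self) -> None:
        self.prepared = False

def two_phase_commit(participants) -> str:
    if all(p.prepare() for p in participants):   # phase 1: prepare everyone
        for p in participants:                   # phase 2: commit everyone
            p.commit()
        return "committed"
    for p in participants:                       # any "no" vote aborts all
        p.abort()
    return "aborted"

print(two_phase_commit([Participant("alice_partition"),
                        Participant("bob_partition")]))
```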
There are two problems in two-phase commit. The first problem is the global timestamp service, GTS. In OceanBase, we use a mechanism like a timestamp oracle. To avoid the timestamp oracle becoming the bottleneck of the system, all GTS requests on the same OceanBase server are batched to reduce the number of requests. The second problem is transaction latency. In traditional two-phase commit, which I think you may be very familiar with, after the prepare phase we need to wait for the coordinator to write a global commit log, to make sure that the global transaction is persisted, and only then can we respond to the user. But in OceanBase's two-phase commit, we can respond to the user immediately after the prepare phase. That means that after all participants have finished the prepare phase, and each participant has written its prepare log, we can respond to the user immediately. So the failure recovery phase becomes more difficult, because we have no global commit log: we need to collect the local status from all participants to decide whether to commit the global transaction or to roll it back.

The isolation level of OceanBase is snapshot isolation. We use multi-version concurrency control, MVCC, to make sure that read queries will not be blocked by write queries. There is a famous problem with snapshot isolation called the write skew problem. For example, if we have A = 1 and B = 1, and we have two concurrent transactions, A = B + 1 and B = A + 1, then under snapshot isolation we may finally get A = 2 and B = 2. I think this result is very strange, because it is not equivalent to any sequential execution of these two concurrent transactions. So OceanBase supports another mechanism, row locks; we use row locks to avoid write skew in our typical workloads. That means a user can use a SQL dialect like SELECT ... FROM table FOR UPDATE to explicitly take row locks. (There is a toy demonstration of write skew in the sketch below.)

Let's talk about linearizability, which is very important in a distributed SQL database. I will give an example first to explain what linearizability is. For example, Alice runs a long SELECT statement over some table, but she hasn't gotten a response yet. Then I run an INSERT statement to insert one row into that table and commit. After me, Bob runs another INSERT statement, inserts another row, and commits. Maybe finally Alice's query returns, and she gets Bob's row but not mine. This behavior is very strange; it is a violation of linearizability, because my transaction happens before Bob's transaction, but Alice can see only Bob's transaction, not mine. This happens because, in a distributed system, different servers have different clocks. In OceanBase, and also in TiDB, we use the GTS to achieve linearizability. Google Spanner has a very famous mechanism, which I think you may be very familiar with, TrueTime; it uses TrueTime to achieve linearizability. The pro of TrueTime is that it supports global, cross-region transactions; with a GTS, we cannot support global transactions because of the latency. But the side effect of TrueTime is that each transaction must wait, for about 7 to 10 milliseconds in Google Spanner.
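Referring back to the write skew example above, here is a toy demonstration of the anomaly under snapshot isolation. The dictionary-as-database setup is purely illustrative.

```python
# Write skew under snapshot isolation: both transactions read from the same
# snapshot (A=1, B=1), neither writes a key the other writes, so both commit.
snapshot = {"A": 1, "B": 1}          # the snapshot both transactions read
db = dict(snapshot)

db["A"] = snapshot["B"] + 1          # T1: A = B + 1, reads B from snapshot
db["B"] = snapshot["A"] + 1          # T2: B = A + 1, reads A from snapshot
print(db)                            # {'A': 2, 'B': 2} -- write skew

# Either serial order would give {'A': 2, 'B': 3} or {'A': 3, 'B': 2}.
# With SELECT ... FOR UPDATE, T1 would lock B and T2 would lock A, forcing
# the two transactions to serialize and avoiding the anomaly.
```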
CockroachDB uses another mechanism called the HLC, the Hybrid Logical Clock. But the HLC still has time deviation. That means that if the HLC times of two events are very, very close, CockroachDB cannot decide which event happened first. So CockroachDB cannot solve the linearizability problem. Yes.

The storage engine in OceanBase is an LSM tree. We have a memtable and SSTables, like a typical LSM tree implementation. We have two kinds of compaction, minor compaction and major compaction. Minor compaction is used to dump the memtable into an SSTable, and major compaction is used to merge the major SSTable and several minor SSTables into one single SSTable. And in our memtable, we maintain two indexes concurrently, which is different from other LSM tree implementations: a B-tree index and a hash index. I think in our real workload, the hash index is very important, because we have so many single-row get queries. An SSTable is divided into data blocks, and we have two kinds of data blocks: the macro block and the micro block. A macro block is mostly 2 megabytes; it's the unit of write. Only modified macro blocks need to be rewritten during compaction in OceanBase. This is different from other LSM tree implementations. For example, in RocksDB, each time we do compaction the whole SSTable has to be rewritten, but in OceanBase only modified macro blocks need to be rewritten. A micro block is mostly 8 kilobytes to 512 kilobytes, which is very similar to a data page, a data block, in MySQL or PostgreSQL. It's the unit of read, and it's also the encoding and compression unit. And we support a row cache and a block cache, like typical LSM tree implementations.

We use two-level compression in OceanBase: an OceanBase-specific compression method that we call encoding as the first level, and after the data is encoded, we use a general compression as the second level. We support different kinds of encoding methods, such as dictionary, RLE, constant, difference, prefix, and so on. And we also support some cross-column encoding methods, for example column-equal and column-prefix; I have two charts at the bottom of my slide. In our workload at AliPay, we found that several columns in a table are often very similar, so we developed the column-equal and column-prefix encoding methods. For the general compression, we support zstd, snappy, and zlib, and finally we use zstd by default, because the compression ratio of zstd and zlib is higher than the other compression methods, and zstd is faster than zlib. (There is a small sketch of this two-level scheme below.)

Hey, I have a question here. So I assume your LSM tree is using a key-value model; how do you store relational data, like a table, in this key-value model? Like, in TiDB, you have the key as the primary key and the value as all the columns. So what does OceanBase do?

Yeah, I think it's not a simple key-value model in OceanBase. The key is the primary key, but the value contains the columns. Our storage format, I think, is a little similar to a traditional database, and we can support operations such as locating a single column: we have a column index inside our value. I think you may be very familiar with this; a column index is used in traditional databases, not in LSM tree implementations. You can think of our storage as a combination of an LSM tree and a traditional B-tree implementation. For example, in OceanBase we have the macro block, which comes from traditional B-tree implementations. Okay.
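As a rough illustration of the two-level compression idea above, here is a sketch that applies a dictionary encoding pass and then a general compressor. zlib stands in for zstd only to keep the example dependency-free; the structure, not the codec choice, is the point, and the serialization format is made up.

```python
# Two-level compression sketch: level 1 is a column encoding (dictionary
# here), level 2 is a general compressor applied to the encoded bytes.
import json, zlib

def encode_dictionary(values):
    """Level 1: replace repeated strings with small integer codes."""
    dictionary = sorted(set(values))
    codes = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [codes[v] for v in values]

column = ["alipay", "alipay", "taobao", "alipay", "taobao"] * 1000
dictionary, encoded = encode_dictionary(column)
level2 = zlib.compress(json.dumps([dictionary, encoded]).encode())

raw = json.dumps(column).encode()
# Compare raw size, general compression alone, and encoding + compression.
print(len(raw), len(zlib.compress(raw)), len(level2))
```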
OceanBase supports online schema change. Online schema change is very important in our workload: when a DBA does a DDL operation, we must make sure that the DDL operation will not affect the online business. We support online schema change through a multi-version schema mechanism: OceanBase allows the rollout of a new schema while the older version is still in use, like some other systems such as Google F1 and other distributed SQL databases. And OceanBase supports SQL plan management. When a DBA does a DDL operation, we get new physical plans, but OceanBase will not switch the traffic to the new plan immediately. We switch the traffic to the new plan step by step: 1% at first, and then 5%, and 10%, and so on. And after the new plan has been proven to be better than the old plan, we replace the old plan with the new plan, and also the old baseline with the new baseline.

Index maintenance in OceanBase is very similar to a system like Google F1. In an index maintenance operation, we must make sure that the user can always read an index which is consistent with the base table. So when we add an index, we grant the write capability and backfill the index data first; after the index table is consistent with the base table, we grant the read capability last. The drop index operation is the reverse: we revoke the read capability first, to make sure that the index table will not be read anymore, and then we can revoke the write capability and purge the index data. (A small state machine sketch of this sequence follows below.)
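Here is the promised state machine sketch of the add/drop index sequence just described: writes are granted before the backfill, and reads only after the index has caught up with the base table. The state names are assumptions for illustration, not OceanBase's internal schema states.

```python
# Online index build/drop as a sequence of schema states. Illustrative only.
from enum import Enum, auto

class IndexState(Enum):
    INVISIBLE = auto()     # index exists in schema but takes no traffic
    WRITE_ONLY = auto()    # new DML maintains the index; reads still skip it
    BACKFILLING = auto()   # historical rows copied in; writes still applied
    READABLE = auto()      # index is now consistent with the base table

ADD_INDEX = [IndexState.INVISIBLE, IndexState.WRITE_ONLY,
             IndexState.BACKFILLING, IndexState.READABLE]
DROP_INDEX = list(reversed(ADD_INDEX))   # revoke reads first, writes last

for state in ADD_INDEX:
    print("add index ->", state.name)
```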
So let's talk about the HTAP engine. Because of the limited time, I will not dig into much detail of our SQL engine, and our SQL team leader is also online, so if you have any question related to our SQL engine, you can just unmute and ask your questions.

How do we build an HTAP system? I think an HTAP system is a combination of an OLTP system, like Oracle, and an OLAP system, like Impala. OceanBase is a distributed HTAP SQL database, so our SQL engine is a little similar to Oracle's parallel execution engine, but the difference is that we need to consider two problems. The first problem is the network, because we are a distributed system. The second problem is the HTAP workload: we need to consider both OLTP and OLAP workloads. We have several replicas, and each replica can do different things. There are several challenges in HTAP: how is the data organized, row store or column store? How are resources isolated? How do we guarantee the OLTP and OLAP query performance in the same system? How does the optimizer choose the data access method, how do we choose the execution engine, and so on?

In OceanBase, how do we handle the HTAP storage problem? As you know, OceanBase uses Paxos, so we always have several replicas in our system, and we can use several replicas to do OLTP and several replicas to do OLAP. If the OLAP workload is very heavy, we use dedicated replicas to process the OLAP queries. That is kind of like a combination of row store plus column store: some replicas use row store, and some replicas use column store. But if the OLAP load is light, OceanBase just processes OLTP and OLAP on the same replica, with a mixed row-column storage. This flexibility comes from the Paxos replication.

Let's talk about resource isolation. In OceanBase, within each tenant we have a concept called the resource group. OLTP and OLAP queries can be hosted in different resource groups; we use resource-group-based logical isolation. There are several resources: CPU, disk, memory, and network. CPUs are isolated through Linux cgroups. Disks are isolated through user-level I/O scheduling, which is developed by ourselves. Memory and network isolation are not supported now, but we will add network and memory isolation in a future version.

In HTAP, the SQL optimizer should be exceptionally smart. OceanBase uses a cost-based optimizer; it's a bottom-up, System R-like optimizer. Our optimizer will choose between different execution engines, storage engines, plan caching methods, and so on. For example, we will choose whether to use the row store or the column store, which replica to use, whether to execute in our serial execution engine or the parallel execution engine, and whether to use the plan cache. In OLTP workloads, the plan cache is very, very useful, but in OLAP workloads, I think it's not useful. The parallel execution framework has a big overhead for distributed OLTP queries, because of the plan serialization overhead and the execution startup overhead. So in OceanBase, the optimizer needs to generate the plan based on the scenario. If the query is very big, we consider it an OLAP query and generate a parallel execution plan. But if the query is very light, a small query, like a single-row get or put, we generate a serial execution plan, and it's executed using DAS. DAS is a module of OceanBase. (A toy sketch of this serial-versus-parallel choice follows below.)

The performance key points of OLTP and OLAP are different. The performance key point of OLTP is the execution overhead, but the performance key points of OLAP are an efficient pipeline, parallel execution, vectorized execution, and so on. And OLTP needs high concurrency and low latency, while OLAP has low concurrency and can tolerate high latency. So in our OLTP queries, the execution logic is to ship data, but in our OLAP queries, the execution logic is to ship the function. This is different. For example, in our serial execution, we have the module called DAS, which I mentioned before. DAS is used to ship data instead of shipping the function. In OceanBase, the data is distributed on different machines, and in our OLTP workload the amount of data accessed is very small, so the bottleneck in serial execution is the overhead of the execution framework. So in order to eliminate the overhead of parallel execution, we just ship data instead of shipping the function; we do not ship a second execution plan in our DAS implementation. We have an independent DAS layer that encapsulates the remote access capabilities, such as index lookback and remote scan: we just ship the data. And it encapsulates some transaction control logic, such as triggers, constraints, and so on; the OLTP logic is very complicated, with triggers, constraints, foreign keys, and so on.

For parallel execution, we support distributed plans. A distributed plan is divided into multiple DFOs (data flow operations) based on the data transmission operators, the exchange operators. The exchange operator is used to shuffle data between machines. Our parallel execution framework is very similar to Oracle's. A DFO is a basic parallel execution unit; it has its own degree of parallelism and is executed on one OceanBase server, and a query coordinator is in charge of DFO scheduling. I have an example in the right chart, which is an example of a distributed plan for a join.
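Here is the promised toy sketch of the serial-versus-parallel plan choice described above. The row threshold, the DFO names, and the returned dictionary shape are all made-up illustrations, not OceanBase's optimizer logic.

```python
# Toy plan choice: small OLTP queries get a serial plan (ship data to one
# server), big OLAP queries get a parallel plan of DFOs connected by
# exchange operators and scheduled by a query coordinator.
OLAP_ROW_THRESHOLD = 100_000   # made-up cutoff between "light" and "big"

def choose_plan(estimated_rows: int, parallel_degree: int = 8) -> dict:
    if estimated_rows < OLAP_ROW_THRESHOLD:
        # OLTP path: low startup overhead, plan-cache friendly; remote
        # partitions are read by shipping data, not shipping the function.
        return {"kind": "serial", "remote_access": "ship data"}
    # OLAP path: pipeline of DFOs shuffled through exchange operators.
    return {"kind": "parallel",
            "dfos": ["scan_orders", "scan_users", "hash_join", "aggregate"],
            "degree": parallel_degree}

print(choose_plan(500))          # serial plan
print(choose_plan(5_000_000))    # parallel DFO plan
```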
So finally, I will talk about the TPC-C benchmark of OceanBase. We published our industry paper at VLDB this year. In the TPC-C benchmark, there are five types of transactions: New-Order, Payment, Order-Status, Delivery, and Stock-Level. The tpmC is the metric used in TPC-C; it represents the number of New-Order transactions per minute, and in the TPC-C benchmark the percentage of New-Order must be at most 45%. The TPC-C benchmark also specifies the data scale: each warehouse produces at most 12.86 tpmC. That means that if we want to reach 707 million tpmC, we need at least 55 million warehouses, which is about four petabytes of data in just one replica. (The arithmetic is checked in the sketch below.) TPC-C also specifies cross-warehouse transactions, and a cross-warehouse transaction in a distributed SQL database means a distributed transaction: the distributed transaction ratio in TPC-C is 10% for New-Order and 15% for Payment. The isolation level required by TPC-C is ANSI serializable, and, as you may be very familiar with, ANSI serializable allows the write skew problem. Yeah. And a TPC-C run must last at least two hours, and the performance fluctuation must be less than 2%.

In our benchmark configuration, we have a total of more than 2,000 cloud virtual servers; the OceanBase TPC-C test is hosted on Alibaba Cloud. We have more than 1,500 OceanBase servers, plus several clients (RTEs), several web servers, and several monitor servers. Our TPC-C benchmark ran for about eight hours: because it was the first distributed SQL database in the TPC-C test, we ran longer, eight hours rather than two. And we have about 55 million warehouses in our TPC-C test.

We had several challenges in our TPC-C test. The first challenge is the durability requirement: because we use Paxos, we need at least three replicas. In our TPC-C deployment, we used two data replicas and one log replica, because we would not have had enough storage with three data replicas. The second challenge is loading the initial data. Loading the initial data in the TPC-C benchmark is not trivial, because the data is so huge: in one replica, we have about four petabytes of data, and we have so many servers in our OceanBase cluster. So it is configured as one replica with no redo log at first, and the rows are batch-inserted into the OceanBase database. After all the rows are inserted, we use partition re-replication to get two data replicas and one log replica. And in the TPC-C test, there is a table called the item table. The item table is read by every New-Order transaction, but there are no writes to it during the test. So the item table is replicated to every server, in a cascaded manner; because we have so many OceanBase servers, we must do it in a cascaded manner.

By the requirement on the performance throughput, the tpmC fluctuation should be less than 2%. This is very challenging for an LSM-tree-like storage engine, so we use a fine-tuned compaction strategy; we tuned the compaction strategy for a long, long time. And some CPU resources for the background operations are reserved in our TPC-C configuration: as I remember, we reserved about 20% of the CPU for all background operations, such as compaction, data replication, and so on.

This is the TPC-C result, the tpmC. From this chart, you can see that the TPC-C performance is linearly scalable: when we need more tpmC, we can just add more OceanBase servers, and the performance will scale up linearly. And this is our result for the tpmC delta. From this chart, you can see that the smallest and largest top deltas are 0.03% and 0.37%, respectively, and the smallest and largest bottom deltas are -0.03% and -0.81%, respectively. So the fluctuation of OceanBase is less than 2%, and we meet the TPC-C specification.
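The warehouse arithmetic from the talk can be checked directly. The ~70 MB of initial data per warehouse used below is a rough outside estimate, not a number from the talk; it is only there to show the petabyte-scale figure is plausible.

```python
# Checking the TPC-C sizing math described above.
TPMC_PER_WAREHOUSE = 12.86           # cap from the TPC-C specification
target_tpmc = 707_000_000            # the 2020 OceanBase result

warehouses = target_tpmc / TPMC_PER_WAREHOUSE
print(f"warehouses needed: {warehouses / 1e6:.1f} million")   # ~55.0 million

bytes_per_warehouse = 70e6           # rough estimate, not from the talk
print(f"one replica: ~{warehouses * bytes_per_warehouse / 1e15:.1f} PB")
```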
This is the result of the durability test of our TPC-C benchmark. In TPC-C, we need to kill all roles of the database. In OceanBase, we have several roles: the root service, which includes the GTS (the global timestamp service), and the normal OceanBase servers. The GTS in OceanBase is integrated into the root service, so we can just kill the root service and kill a normal OceanBase server to verify the durability features of OceanBase. In our test, we ran for about eight hours, and the durability test ran for a bit less than two hours: we killed the root service in the first phase and then killed an OceanBase server in the second phase. When we kill the root service first, you can see from the graph that OceanBase recovers almost immediately, because although the root service hosts the GTS, it does not host any normal partition, so it can recover immediately; and our GTS uses a dedicated Paxos election algorithm, not the normal Paxos replication used by the normal partitions. And when we kill an OBServer, OceanBase recovers in several tens of seconds without any data loss. You can see from the chart that the performance does not drop to zero, because each OBServer hosts just a portion of the partitions, not all of them. It's a distributed SQL database, so only a few partitions in OceanBase are affected.
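As a back-of-the-envelope sketch of why throughput does not collapse when one server is killed: assuming partition leaders are spread evenly across the cluster, only about 1/N of the partitions pause until re-election. The even-spread assumption and the numbers below are illustrative, not the published benchmark figures.

```python
# Rough failover impact estimate for the kill-an-OBServer test above.
servers = 1500                # roughly the TPC-C cluster size from the talk
recovery_seconds = 30         # the stated RTO bound

affected = 1 / servers        # fraction of partition leaders on the dead node
print(f"partitions affected: {affected:.4%}")
print(f"throughput dip: ~{affected:.4%} for up to {recovery_seconds} s")
```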
Okay. So, because of the limited time, I will not dig into much more detail of OceanBase. If you have any questions, me and Guoping Wang are very glad to talk with you.

Thank you so much for spending time with us. All right, we have a few minutes for questions. If anybody has them, go for it. I will say there was one from earlier; do you want to unmute yourself and ask it?

Yeah. Hi, Charlie. Thank you for this talk. So when you talked about the storage engine, you said you're using micro and macro blocks.

Okay, you mean the storage engine, the macro block and the micro block?

Yes. So this is to optimize your disk access during reads and writes, yes?

You mean it's an optimized version of the LSM tree?

Oh, I see. It's okay, I think I'm confusing it with something else.

Okay. The difference between OceanBase's LSM tree and a normal LSM tree implementation, I think, is the macro block. In a typical LSM tree implementation, there is no macro block; it just has the micro block, right? Because each time we do compaction in a typical LSM tree, the whole SSTable is rewritten, but in OceanBase only the modified macro blocks need to be rewritten. In our typical workload, for example our workload in AliPay, when we do compaction, maybe less than half of the macro blocks are rewritten, so we save a lot of CPU cost. And especially in some write-heavy workloads, the writes are somewhat sequential by value, so lots of macro blocks are not modified at all, and the macro block mechanism of OceanBase is very efficient.

I have a follow-up question here. So do you keep order across macro blocks? Because in traditional LSM tree systems, inside one sorted string file you will see all the key-value pairs in order. But now you have macro blocks. So do you keep order between those macro blocks, or when you want to do a range scan, will you need to merge a lot of blocks? Because now the minimum unit of sorted data is the macro block, which is two megabytes, instead of an SSTable file.

Okay. Although the macro block is two megabytes, yes, the size is relatively large, right? But we also maintain the order between macro blocks.

Oh, that sounds interesting.

And there is almost no cost to maintain the order of the macro blocks. Why? Because each time we do a compaction, we scan all the macro blocks: when we find a macro block that is not rewritten, we don't do anything, but when we find a macro block that needs rewriting, we write a new macro block, and we maintain an index over the macro blocks. The index of the macro blocks maintains the order, by primary key order.

Oh, that sounds interesting. Thank you.

All right, Gavin, go for it.

Yeah, can you folks hear me, or does my microphone not work?

If you speak a little louder, you'll be okay.

Okay, yeah. Thanks for the presentation. My question was, do you use any libraries to interact with the disk directly, like with NVMe drives? Something like SPDK or anything like that? Or do you just use regular pread and pwrite?

Yeah, we just use direct I/O, and in most cases, I think, it is pread and pwrite. We have tried SPDK and, say, non-volatile memory and other new hardware, but we found that in our typical workload it is not useful, because in our typical workload at AliPay the bottleneck is the storage, and not the IOPS but the storage capacity. So compression is very useful, very, very important. And because we use very fast SSDs at AliPay, pread and pwrite are enough for our workload; we don't need SPDK or other new hardware. But I think maybe after OceanBase is migrated to the cloud environment, SPDK or RDMA may become more important.

Okay, thank you. That was really interesting.

All right, any other questions from the audience? So my first question is: I mean, the system is stable, you're running AliPay off of it. What's the next sort of major challenge you feel like you guys need to tackle? Is it the backend of the system, or is it SQL compatibility with Oracle and MySQL, or is it improving the query optimizer? What's the major thing where it's like, okay, we need to spend a lot of time on this? Because it sounds like it's an LSM, a distributed architecture you have; you're not going to rewrite any of this anytime soon. So what's the next big challenge for you guys?

Okay, the next big challenge. I would like to introduce the earlier challenges first. When we started to use OceanBase in AliPay, there were lots of challenges. I think the most challenging thing in AliPay several years ago was the consistency and stability problem, because it's a database: when we used OceanBase for our first application in AliPay, we had to make sure that there were almost no bugs in our online system, because it's used in mission critical scenarios. So yeah, we did a lot of things, such as code review and lots of tests, pressure tests, failure tests, and so on, and we also replayed the real workload in our offline environment. That was a lot of work; I remember that I reviewed almost every line of code during that time, a lot of code. But for the future, yes, I think there are also some other challenges, and I think they come from lots of directions. You have talked about the compatibility problem: lots of customers outside AliPay have lots of compatibility-related requirements. Our customers in China, I think, are a little special.
Our customers in China don't want to change any code in their applications, any code at all. OceanBase is compatible with MySQL, and that part is open sourced; OceanBase is also compatible with Oracle, and that part is not open sourced. Yeah, even our Oracle users don't want to change any application code; they will get our support team to migrate the data for them and migrate the application for them. So there are a lot of challenges there. And in the future, we will also focus on some other areas, such as HTAP: we will improve the performance of our OLAP, especially the real-time OLAP performance, and also some cloud features, such as how to make OceanBase serverless and how to make OceanBase more cost-efficient on the cloud. Yeah.

Okay. I mean, I wouldn't say an application developer telling you they don't want to have to change their application to use your database is unique to China, right? That's pretty much everyone, right? So I guess, now related is my last question. My last question is related to this compatibility point. Which of the two is actually harder to support, MySQL or Oracle? I'm guessing Oracle, but I wanted to get your take on why. Like, what is it about it that makes it so much harder?

Oh, the answer is very straightforward: the compatibility surface of Oracle is much larger than MySQL's. As we have measured, the feature set of Oracle is about three times that of MySQL. But with MySQL compatibility, there were also some interesting problems. For example, when we supported MySQL, we found that MySQL has lots of bugs, about several hundred bugs, and the OceanBase team submitted those bugs to the MySQL community. The MySQL community admitted some of the bugs, but the other bugs they did not admit; they said that users are used to these kinds of bugs, so it's a feature. So in OceanBase, when we made it compatible with MySQL, we also made it compatible with some actual bugs in MySQL, because MySQL users are very used to them. Yeah, it's very interesting.

Okay. I guess my last question will be, and I'll have Chi, my student, translate for me: do you have to license something from Oracle to be allowed to have SQL compatibility and catalog compatibility? Because they're notorious for suing people if they copy things.

You mean there may be a patent or license, this kind of problem?

Yeah, like Oracle will sue you. There have been cases in the past where companies have copied Oracle's wire protocol, and they will sue you. I was wondering if you had those issues.

Okay. Actually, the protocol of OceanBase is not the same as Oracle's. OceanBase is compatible with Oracle: the user can use Oracle features, OCI, stored procedures, like in Oracle; you can use them in OceanBase. But the implementation of OceanBase's Oracle compatibility does not follow the protocol of Oracle. We just make sure that the behavior of OceanBase is the same as Oracle's, but the protocol of OceanBase, for example the client-server protocol, follows MySQL.

Got it. Okay, okay. Got it, got it. And the answer to Gavin's question is like, how can you patent a wire protocol? So you don't need to worry about that then. That's the end of that.

Yeah, and for MySQL compatibility, we are compatible with the MySQL protocol.

Got it, got it. Yeah, I understand. The protocol. Okay. Okay, this makes more sense. Okay, awesome. All right, guys. So Charlie, thank you so much for being with us. This was very great.