This is the first time I'm presenting and there are only two people, but let's go. My name is Weishan. I work at Alibaba Cloud as a database architect. The cool thing about the Alibaba database group is that our unit cuts across the entire Alibaba Group: we don't just support Alibaba Cloud, we also support Taobao, Tmall, and all the other business units within Alibaba. The SREs, the developers, and the product managers are all in the same group. This is actually quite unique within Alibaba.

Five things to talk about today: introduction and database overview first, which will be a little bit of marketing, and then numbers three and four will be a technical deep dive. I don't have a demo today because I burned through my benchmark limits last night, but I still have some numbers to show. Should I stand up? I shouldn't stand up, right? Before becoming a database architect I was a solutions engineer at Oracle, and before that an SRE at GlobalSign.

Alibaba Cloud has a wide variety of databases: OLTP databases, OLAP databases, NoSQL databases. One cool thing to mention is that during Double 11 last year, the entire workload ran on Alibaba Cloud. We were able to process over 500,000 orders per second at peak, which was quite impressive. We have different types of databases: relational, NoSQL, analytics, and assorted tools. For example, PolarDB is our cloud-native database for OLTP workloads. We have Cassandra, and we have MongoDB through a partnership with MongoDB; currently we are the only cloud vendor in the world offering MongoDB 4.2, the other cloud vendors don't have it. And then we have ClickHouse. ClickHouse is pretty cool: it's written by the Russians, it's a super-fast OLAP database, highly compatible with MySQL. These are the services we offer across the entire range.
We have PolarDB, the standard RDS, your OLAP offerings, your AnalyticDB; you can see how it all fits together. One cool thing I want to mention is DAS, the Database Autonomous Service. It's a service that monitors your databases: if one runs out of resources it can auto-scale for you, it can make smart index recommendations, and so on. Basically, the idea is to tell you that a database is going to fail before you even know it has failed.

Of course, we make a lot of contributions to the MySQL community as well; there are a lot of enhancements. We have our own MySQL build, AliSQL, which combines the different upstream changes with contributions from Percona and others, and we contribute back to upstream. We also have a service called DRDS, the Distributed Relational Database Service. It's a database proxy that lets you do sharding natively: you just configure it, the proxy knows where all the different shards and data are stored, and it routes to them automatically. We will open-source it this year as well.

White papers: over the last few years we've started releasing a lot of white papers on the designs we built, why we did them, and how we did them. One really cool one is X-Engine. You can think of X-Engine as a replacement for InnoDB in MySQL. We designed it because e-commerce gives us a very unique workload: cold data can become hot data really quickly due to promotions and so on, and we realized InnoDB couldn't handle that kind of rapid switching from cold to hot. So we had to build our own storage engine, X-Engine, to do it. The other one I'll be talking about today is AnalyticDB, our own cloud-native OLAP database.
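To make the DRDS idea concrete, here is a minimal sketch of how a sharding proxy might route a statement to the shard that owns the data. The shard names and the modulo-hash scheme are my own illustration, not the real DRDS routing logic.

```python
# Hypothetical sketch of shard routing in a proxy like DRDS.
# Shard names and the hash scheme are illustrative assumptions.

SHARDS = ["shard_0", "shard_1", "shard_2", "shard_3"]

def route(shard_key: int) -> str:
    """Pick the shard that stores rows with this sharding-key value."""
    return SHARDS[shard_key % len(SHARDS)]

# The proxy parses the WHERE clause, extracts the sharding key, and
# forwards the statement to the owning shard transparently.
target = route(42)   # key 42 maps to shard_2 under this toy scheme
```

The point is that the application talks to one endpoint and the proxy hides the shard layout entirely.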
We released a white paper on it as well. That's all for the marketing; now we'll go into the deep-dive technical stuff.

AnalyticDB for MySQL has quite a long history. It started back in 2011 under the name Garuda. After that we realized it wasn't good enough, so in 2014 we released a new version called ADS on Alibaba Cloud. Then we had another release in 2016, and today it's AnalyticDB 3.0, which is 100% compatible with the SQL:2003 standard. We've used it since 2011 to support real-time analytics for internal business units.

Here are some of AnalyticDB's features. One of the cooler ones: in traditional relational databases most data is stored in row format, which is very slow for things like sums or aggregations. That's why columnar databases appeared; systems like SAP HANA started proposing in-memory columnar stores. The problem is that those databases are weak at OLTP workloads. So we have a hybrid row-column design: it allows quick point lookups and also lets you do columnar sums and aggregations really quickly with very little overhead. Other features include columnar joins and real-time queries.

We also use a directed acyclic graph (DAG) execution engine. Think of it as a MapReduce-like concept: stage-by-stage pipelining. In MapReduce you have a lot of map tasks and then you reduce. The DAG execution engine is similar: we split the workload across different nodes, run it stage by stage, and finally consolidate the results back on the master node.

Now I'll start going into the nitty-gritty details. This is an architecture diagram of AnalyticDB. The thing to notice is that in the storage layer there is a single leader and two followers. GTM is the transaction manager, DAG is the directed acyclic graph engine, and WLM is workload management. So you have a leader and followers.
On the storage side you have the workers, again with leaders and followers, and all of these are synced through Raft consensus. On the resource side, I want to mention that in the next few months we will support a kind of hybrid partitioned table: you can keep all your hot data on NVMe SSD and your cold data on OSS. This reduces storage costs by quite a bit once you have more than 20 terabytes, 100 terabytes of data.

The other thing I want to mention is the optimizer. We support both rule-based and cost-based optimization. You might be wondering: rule-based is a design from 20 years ago, so why do we use it? I'll explain later.

Storage engine architecture: we wrote this storage engine ourselves. A few things to mention here. First, the predicate query layer. This lets you do predicate filtering at the storage layer itself. Instead of sending the whole request to storage and getting all the data back, it pushes the predicate down to the storage layer. So if you run SELECT ... WHERE name = 'Weishan', the only data returned from storage to the compute node is the rows where name equals Weishan, and then you do the joins and so on. As you can imagine, this significantly reduces the amount of data moved between storage and compute. This is really, really cool.

Transaction manager: we use two-phase commit and MVCC for snapshot isolation, and a Raft consensus manager that enforces consistency across nodes when you need distributed transactions. Caches: we cache data blocks, partition indexes, and primary-key indexes. And this part here, the gray-colored part, is cool in that the storage layer is pluggable: you can point it at object storage, at HDFS, at whatever file system you like. It's all modular.
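The predicate push-down described above can be sketched in a few lines. This is a toy model, not AnalyticDB code: the storage node applies the filter itself, so only matching rows ever cross the storage/compute boundary.

```python
# Illustrative sketch of predicate push-down: the storage layer applies
# the filter, so only matching rows are shipped to the compute node.

ROWS = [
    {"id": 1, "name": "Weishan", "city": "Singapore"},
    {"id": 2, "name": "Alice",   "city": "Hangzhou"},
    {"id": 3, "name": "Weishan", "city": "Hangzhou"},
]

def storage_scan(predicate):
    """Runs on the storage node: filter before shipping data."""
    return [row for row in ROWS if predicate(row)]

# The compute node sends the predicate down instead of pulling every row,
# then does joins and aggregation on the already-filtered result.
matches = storage_scan(lambda r: r["name"] == "Weishan")
```

Without push-down, all three rows would travel to the compute node and two thirds of the transfer would be wasted; with it, only the two matching rows move.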
Next I want to talk about some of the engineering optimizations we did. Traditional OLAP databases, Teradata and the like, serve read requests and write requests from the same nodes. You stand up a Teradata service and every request goes to it. That means when you're doing batch loading or real-time streaming at night, it hurts your reads, your reporting, a lot. Basically you have to do batch loading at night and reporting in the morning. That's the really old-school traditional approach.

What we do is split out the read nodes and the write nodes, so the read path and the write path are handled separately and never compete for resources. Writes all go into commit logs. You can think of it like Cassandra's SSTables: data is written to a memtable and flushed to SSTables, and the SSTables get compacted. The concept is similar, but for compaction we use parallel MapReduce jobs, which is really, really quick.

In the diagram on the bottom right, a client issues a write query and it goes to the write node; then it goes to a read node to do a select, so it wants to read what was just written. There are two ways to handle this. The read node can pull the data from the write node if you need guaranteed session consistency. Or, if you don't really care, you can do a bounded-staleness read and just read from the distributed file system below. It's really up to you. Some jobs, some queries, require real-time consistency, so they have to read from the write node; others don't care and just read from the file system.
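The read-path choice above can be modeled in a toy sketch. This is my own simplification: a session-consistent read pulls the freshest entries from the write node, while a bounded-staleness read serves whatever the distributed file system has already persisted.

```python
# Toy model of the two read paths: session-consistent reads pull from
# the write node; bounded-staleness reads use the (slightly lagging)
# distributed file system snapshot.

write_node_log = ["v1", "v2", "v3"]   # latest committed writes
dfs_snapshot   = ["v1", "v2"]         # persisted copy, lags a little

def read(session_consistent: bool):
    if session_consistent:
        return list(write_node_log)   # pull fresh data from the write node
    return list(dfs_snapshot)         # cheaper read, bounded staleness
```

The trade-off is explicit: the consistent path costs a round trip to the write node, the stale path scales out freely across read nodes.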
And you can add read nodes; the partitions will just auto-balance themselves.

One thing that gives a lot of data-warehouse designers headaches is where to put the indexes, because indexes are not cheap for writes. Every index you add slows down writes, so you have to think: do I put it on the primary-key column, on this column, on that column? And it's problematic: every time a new OLAP query shows up, it can blow up the database, because with no index it does a full table scan. What we did is index every single column. We don't care; we index everything. So whatever query you throw at AnalyticDB, or ADB, it doesn't matter. You don't have to think about which columns to index before you run a query; just throw everything at it, it's fine.

For example, on the left there's a select query that uses name, sex, city, and some JSON data type. We have inverted indexes and bitmap indexes, and all this filtering can be done at the index layer itself. If your query is reasonably selective, you don't even need to touch the data files at all; all the filtering is done at the index level. This is really, really cool. I don't think many other databases do this.

I mentioned earlier that one of the key engineering features is the hybrid row-column storage. We have the detail file and the detail metadata file. The detail metadata file is kept in memory, always cached in RAM, and it holds things like statistics, dictionaries, min and max values, and distinct counts. All of these help with pruning at query time: based on this statistical information, the engine knows what it can prune.
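The index-level filtering described above can be sketched as bitmap intersection. This is an illustration of the general technique, not the actual ADB index format: each (column, value) pair maps to a set of row ids, and conditions are ANDed by intersecting sets before any data file is read.

```python
# Sketch of filtering entirely at the index layer: each (column, value)
# keeps a bitmap (here, a set) of matching row ids, and the engine
# intersects the bitmaps before touching the data files.

index = {
    ("city", "Hangzhou"): {0, 2, 3},
    ("sex",  "M"):        {0, 1, 3},
}

def lookup(*conditions):
    """AND all conditions by intersecting their row-id sets."""
    sets = [index[c] for c in conditions]
    result = sets[0]
    for s in sets[1:]:
        result &= s          # set intersection = bitmap AND
    return result

rows = lookup(("city", "Hangzhou"), ("sex", "M"))
```

Only the surviving row ids are then fetched from storage, which is why a selective query may never read a data block at all.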
In the overall design, the data in each table partition is maintained in a single file called a detail file, separated into multiple row groups of fixed size. Within a row group, the values from the same column are stored together in data blocks next to each other. So even when you need to query multiple columns, they're stored in different data blocks but right beside each other, and the disk can pick up all the data blocks together. You don't need random access; it's not a random-access I/O pattern anymore, it's a fast sequential I/O pattern.

Then there are complex data types. On the left, all the data blocks for the columns, with their headers, are fixed size. But things like JSON or vectors are unpredictable in size; they could be 1 MB, 10 MB, 16 MB. If you try to accommodate them in the same file, you're going to have problems, because sometimes you read a 4 KB or 8 KB block and sometimes it's 16 MB, which doesn't make sense. So we have a design very similar to Postgres's TOAST tables: if you have JSON or another complex data type, what the main data file actually stores is just a pointer to an additional file. Instead of storing the value inline in the data blocks, it's a pointer to an external file. This keeps things quick.

Some people might think: if you index every column, writes are going to be very, very slow. So how do we make sure the indexes don't conflict with write performance? One design breakthrough was the concept of baseline data and incremental data. We have a main index file and a main data file, and all writes go... maybe I'll just walk through this. There's incremental data and baseline data. Writes always go to the incremental data, which is later merged back into the main data file.
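The TOAST-like spill for oversized values can be sketched as follows. The size threshold and file ids here are made-up illustrations; the idea is just that fixed-size data blocks store either the value inline or a pointer to a side file.

```python
# Sketch of the TOAST-like design: small values are stored inline in the
# fixed-size data block; oversized values (e.g. big JSON) are spilled to
# an external file and the block stores only a pointer.

MAX_INLINE = 64        # bytes; threshold is an illustrative assumption

external_files = {}    # stands in for the additional files on disk

def store_value(value: bytes, file_id: int):
    """Return what gets written into the data block for this value."""
    if len(value) <= MAX_INLINE:
        return ("inline", value)
    external_files[file_id] = value    # spill the big value to a side file
    return ("pointer", file_id)        # block keeps a fixed-size pointer

small = store_value(b'{"a": 1}', 1)           # stays inline
big   = store_value(b"x" * 1000, 2)           # spilled, pointer stored
```

Because every entry in the main file is now bounded in size, block reads stay predictable regardless of how large individual JSON values get.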
Here's how we do it. First, we make the current incremental data immutable. At the same time, in an atomic operation, we create a second incremental data file that accepts all incoming writes. Until the merge of the baseline data and the frozen incremental data completes, all of them serve incoming requests. Once the merge is complete, the old data is removed, and queries are served by the new baseline data plus the new incremental data.

Advanced optimizer. Oracle, for example, used to support an RBO, a rule-based optimizer, and then moved to cost-based because cost-based is much more powerful, much more advanced, which is true, and the thinking was that rule-based has no place in the world anymore. But we think otherwise. In an OLAP database, sometimes you don't really need to do planning. For example, for a point lookup you already know you're going to use the index. So it's better to encode that as a rule: for point-lookup queries, don't even do planning, just do it this way. In our scenario, with so many users querying ADB all the time, it doesn't make sense to waste CPU cycles planning trivial queries like point lookups. So we just have a rule that says: every time you receive this kind of request, if there's an index on it, use that index straight away.

Adaptive join auto-optimization: at runtime we collect row counts, and based on the predicates and the rows returned, we decide at runtime whether A should join B or B should join A. This is quite standard optimization. Parallel queries: MySQL 8 has it, I think; we have it as well. Partition pruning.
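The baseline/incremental merge steps above can be sketched as a toy LSM-style scheme. The dictionaries stand in for on-disk files; this is my own simplification of the freeze-swap-merge sequence, not AnalyticDB internals.

```python
# Toy model of the write path: writes land in an incremental buffer; to
# merge, the buffer is frozen, a fresh one accepts new writes, and the
# frozen data is folded into the baseline (an LSM-tree-like scheme).

baseline = {"a": 1}        # merged, indexed main data
incremental = {"b": 2}     # recent writes, not yet merged

def merge_step(new_writes):
    global baseline, incremental
    frozen = incremental                 # 1. freeze current incremental
    incremental = dict(new_writes)       # 2. new incremental takes writes
    baseline = {**baseline, **frozen}    # 3. fold frozen data into baseline
    # 4. the frozen file is dropped once the merge completes

merge_step({"c": 3})
# reads now consult baseline {"a": 1, "b": 2} plus incremental {"c": 3}
```

Writes never block on the merge, because they always target the fresh incremental buffer; queries simply consult both structures until the swap finishes.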
If your WHERE clause filters by, say, date, and that date falls in two partitions, we read just those two partitions and ignore the rest. This is very standard optimization as well.

One really cool thing about this database is that our optimizer and execution engine are storage-aware, because it's all cloud-native. We put some smarts into the storage layer itself. As I said earlier, there's predicate push-down, join push-down, and index-aware joins at the storage side. The optimizer doesn't even need to handle all of this: when it sends the query to the storage side, the storage knows how to do the filtering and push-down itself. I think there's only one other database in the world doing this right now: Oracle Exadata. It's a single-rack machine where the storage is really expensive and has smarts built in, a smart engine built in. And it's very expensive.

Then a little about the execution engine. We use code generation, similar to LLVM, so queries are compiled at runtime, and it's SIMD- and CPU-cache-friendly. One thing I want to touch on is the DAG engine. The DAG idea was used first in MapReduce and later by Spark with its RDDs. Basically it's stage-by-stage pipelining. The problem with MapReduce is that every map stage has to hit the disk: you store the temporary data on disk, and then the reduce side reads it back from disk and processes it again, and you can have multiple stages. And in MapReduce, if one stage fails you need to retry, or you need to build in retry logic.
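Partition pruning by date, as described above, can be sketched directly. The partition names and ranges are illustrative; the engine compares the predicate's range against each partition's bounds and scans only the overlapping ones.

```python
# Sketch of partition pruning: given a date-range predicate, only
# partitions whose ranges overlap it are scanned; the rest are skipped.

from datetime import date

partitions = {
    "p2019_q1": (date(2019, 1, 1), date(2019, 3, 31)),
    "p2019_q2": (date(2019, 4, 1), date(2019, 6, 30)),
    "p2019_q3": (date(2019, 7, 1), date(2019, 9, 30)),
}

def prune(lo: date, hi: date):
    """Return only the partitions a BETWEEN lo AND hi query must read."""
    return [name for name, (start, end) in partitions.items()
            if start <= hi and lo <= end]     # range-overlap test

hit = prune(date(2019, 3, 1), date(2019, 5, 1))   # spans Q1 and Q2 only
```

A query touching March through May reads two of the three partitions and never opens the third.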
Spark RDDs work the same way, but Spark is more powerful because you can have an unlimited number of stages, chain whatever you want all the way to the end, and do the whole computation in memory. That's why Spark is really quick. So MapReduce and Spark are different implementations of a DAG engine, and we use a DAG engine as well. The reason is that, in essence, AnalyticDB for MySQL is an MPP database: there are multiple nodes, and the data is fragmented all over the place. So at the optimizer side you generate an execution plan, that plan needs to be carved up and distributed to all the different worker nodes, and you need an engine to run the stage-by-stage computation and return the results to the coordinator.

Benchmarks. This is one of the benchmarks from our white paper. Everything, how we ran the tests and so on, can be found on GitHub, so you can reproduce it yourself if you have doubts. On the Alibaba Cloud console, when you buy an ADB instance, you don't get a single instance; you get clusters of instances, what we call node groups. You can have two, four, or six node groups. When you buy two node groups, one will be the read node group and one the write node group. For the white paper we ran three kinds of queries: full scan, multi-table join, and point lookup. These are very common data-warehouse queries. It's very hard to see from here, but basically this is saying that ADB performance is awesome; go look at the white paper and check it out. This was a recent benchmark that we did. It's quite hard to read, so there are a few things to note.
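The stage-by-stage DAG execution described above can be sketched in miniature. This is a toy model: each stage consumes its upstream results directly from memory, Spark-style, instead of spilling intermediate data to disk between stages as classic MapReduce does.

```python
# Minimal sketch of stage-by-stage DAG execution: each stage is a
# function plus its upstream dependencies, and results stay in memory
# as they flow from stage to stage.

stages = {
    "scan":   (lambda deps: [1, 2, 3, 4], []),
    "filter": (lambda deps: [x for x in deps[0] if x % 2 == 0], ["scan"]),
    "sum":    (lambda deps: sum(deps[0]), ["filter"]),
}

def run(stage):
    fn, deps = stages[stage]
    return fn([run(d) for d in deps])   # upstream results never hit disk

result = run("sum")   # scan -> filter evens -> sum = 2 + 4 = 6
```

In the real system each stage would also be carved up across worker nodes, with the coordinator collecting the final stage's output; the in-memory pipelining is the part this sketch shows.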
We cut some of the query runtimes off at 250 seconds, because for one of the queries, I think it's query one, MySQL actually took more than 10,000 seconds. We tested against MySQL, Presto, Spark, and Impala. We had to remove queries 5, 8, 9, and 18 because they hit out-of-memory errors in Presto. You can see that in most of the comparisons we're a lot faster; I'm talking about orders of magnitude faster. Of course, the comparison against MySQL is not 100% fair. The reason is that when you buy an ADB instance you get two clusters, whereas for MySQL we created one big box that added up all the instances' CPUs and ran the comparison on that single box. So it's not fully apples-to-apples for MySQL, but for Spark, Presto, and Impala it is apples-to-apples. Whenever we do a benchmark, we always publish it on GitHub along with the scripts, so you can reproduce it if you like.

I was trying to prepare a benchmark demo yesterday, but I used up all the credits on my Alibaba Cloud internal test account, so I basically broke my test account. So I don't have a demo today, but if you'd like, give me an email address and I can get you a test account to try it out. That's all I have. Thank you.