Hello everyone, and thank you for staying after the last presentation today. My name is Øystein Grøvlen, I work for Alibaba Cloud, and I'm here to talk about PolarDB, which is a database architecture for the cloud. Before joining Alibaba I worked for Oracle in the MySQL optimizer team for 10 years, and I have been working on databases for 25 years or so.

At Alibaba we have a lot of databases. I'm not sure how much you know about Alibaba, but you can think of it as the Chinese version of Amazon, PayPal, eBay, YouTube and what have you, combined. It is the biggest e-commerce site in China, and we also have a cloud offering that is the biggest in China. The e-commerce side of Alibaba has sales of around one trillion US dollars a year; I believe that is something like 70 lakh crore rupees. There are petabytes of data, and on average around 100 million transactions a day. But that load is not even over the year; on some days there is much more. There is a special day in China called Singles' Day, because it is 11/11, all ones, where the total Alibaba sales are many times higher than what Amazon sees on Cyber Monday or Prime Day, and the Cyber Monday numbers cover the entire US sales, so you can see that Alibaba's sales alone are much higher. When November 11 arrives, we need to be able to handle a load that is 100 times higher than normal. If you look at the load approaching midnight, it shoots up: people have already put their goods in their shopping baskets and are just waiting to press "order" on this day. So we need a setup where we can keep the latency stable for all these customers; they should not experience any slower processing than they are used to.

One good thing about the cloud is that you do not have to buy all the hardware you need. If you are, say, a company preparing for something like November 11, you do not need to buy all the hardware required to handle that load. If you use the cloud instead, you move your capital expenses to variable operational expenses, so you can just pay for extra compute when you need it. The forecasts say that by 2020 about 83 percent of enterprise workloads will be in the cloud.

And it is not only about load. We see that the amount of data people store grows and grows, and one of the main reasons is that it is no longer just humans entering data into the database. With the Internet of Things there are a lot of sensors and devices out there storing data into databases, so the storage needed is increasing exponentially.

So we want a database that is specially tailored for the cloud. It needs to be scalable: it needs to scale automatically when you get more load, and it needs to scale both on load and on storage. It needs to be highly available: you need multiple copies of the data, so that if one copy is not available you are still able to serve your customers, you need to be able to fail over to another machine if the current machine fails, and the user should experience no downtime. And we want this database to integrate with all the other cloud services.
You have cloud services for security, for monitoring, for artificial intelligence and so on. So here is where PolarDB comes in, which is Alibaba Cloud's answer to, or solution for, these requirements. PolarDB uses special hardware in order to provide good performance for the database. It does auto scaling, you have the usual pay-for-what-you-use model, you get all the security from the cloud, and there is a lot of the automation that people are used to from the cloud.

First I will talk about the storage part of this architecture. PolarStore is our specialized distributed storage layer, used by PolarDB. PolarStore has a lot of chunk servers; each chunk contains some gigabytes of data, and each chunk is kept in several copies across the storage. The database connects to this storage by embedding a special library that communicates with the storage servers. This PolarSwitch component knows which chunk server should hold each chunk, so PolarDB does not just see the storage as if it were local storage; all the magic happens in the PolarSwitch layer.

Special hardware is used here to get good performance. We use an RDMA network, which is a protocol where you can actually write into memory on another computer. This is used both from the database down to the chunk servers and for communication between chunk servers. The chunk servers have both Optane drives and NVMe SSDs. The Optane drives, which are non-volatile memory, are used to get a short latency, a low latency: when you write, the chunk server writes a write-ahead log to the Optane drive and can then acknowledge back to the client that the write has been stored, while the actual chunks go to the SSDs asynchronously. That part is not on the critical path; you do not have to wait for it before acknowledging the write. I will show a small sketch of this write path below.

Let's see, I think I lost my speaker notes there, but I think I have said what I wanted. The slides are uploaded on SlideShare, so if you go to the Rootconf schedule and click on my presentation you will see the link to the slides.

As I mentioned earlier, you have the libpfs library of PolarFS that interfaces to PolarStore, and it runs in user space, so you do not have the overhead of context switching when going to storage, and there are no system calls and no data copying when you do the I/O. This PolarFS and PolarStore architecture was presented in a paper at the VLDB conference last year.

Then we go on to the next point in the architecture, and that is the separation of storage and compute. You want to separate storage from computation so that you can scale them independently. If you need more compute, as we do for example for November 11, you do not have to add a lot of extra disk at the same time, because you do not necessarily need more storage. This is a principle used by many of the cloud-specific databases: Amazon's Aurora, for example, also separates compute and storage, and Azure does the same with Hyperscale, I think that is the name of that database.
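To make that write path concrete, here is a minimal sketch in Python. The names (ChunkServer and so on) are mine, purely for illustration, and this is not PolarDB or PolarFS code: the point is simply that the write is logged on Optane and acknowledged right away, while the data is flushed to the NVMe SSD in the background, off the critical path.

```python
import queue
import threading


class ChunkServer:
    """Toy model of the acknowledge-after-WAL write path described above."""

    def __init__(self):
        self.optane_wal = []              # low-latency write-ahead log (Optane)
        self.ssd_chunks = {}              # chunk data on NVMe SSD
        self._flush_queue = queue.Queue()
        threading.Thread(target=self._flusher, daemon=True).start()

    def write(self, chunk_id, offset, data):
        """Critical path: append to the WAL, then acknowledge the client."""
        self.optane_wal.append((chunk_id, offset, data))
        self._flush_queue.put((chunk_id, offset, data))
        return "ACK"                      # client sees only the WAL latency

    def _flusher(self):
        """Background path: apply logged writes to the SSD-resident chunks."""
        while True:
            chunk_id, offset, data = self._flush_queue.get()
            self.ssd_chunks[(chunk_id, offset)] = data


if __name__ == "__main__":
    cs = ChunkServer()
    print(cs.write(chunk_id=42, offset=0, data=b"page image"))
```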
So the architecture is that you have PolarStore, and you have PolarDB on top of it; as I already mentioned, it uses the PolarFS file system, and on top of that we have an intelligent proxy that routes the requests to the different replicas of PolarDB. I will show more details on this. This architecture can be used with different databases, MySQL, PostgreSQL and so on. I will talk about MySQL here, since that is what I know about and what we have done most work on. So PolarDB is actually a modified MySQL, and mostly a modified storage engine, the InnoDB storage engine in MySQL.

What is the difference between PolarDB and a traditional MySQL? The big difference is that when you are replicating in MySQL, you are sending the binary log to the different nodes, and they all store their data in local storage, while in PolarDB a shared storage is used. There is one single master that does all the writes, and the replicas can read from the shared storage.

There are several advantages to this. One is fast scaling: if you need to add another replica, you can add it in a short time because you do not have to copy any data; all the data is already there in the shared storage. If you need to provision a new replica in the traditional MySQL architecture, you would have to copy all the data over to the local storage of the new replica. Also, the machines we use for the replicas do not need the amount of storage that you need in MySQL, so as you add more and more replicas the cost saving increases, because we can use cheaper machines, or the user can rent machines in the cloud with lower capacity, and still be able to serve the replicas.

Another difference is that in PolarDB we use physical replication. If you know about MySQL replication, you have heard about the binlog, which is a logical representation of the changes you have made, row by row if you use row-based replication, which you should. That means that a master in MySQL does a lot more writes: it needs to write the data to local storage, it needs to write the binlog to local storage, and it needs to write the redo log of InnoDB, the physical log that you need in order to do recovery of your data. Then it sends the binlog over to the slave, and the slave writes the data, the binlog it received, the binlog it generates, and its redo log to storage. So there are a lot of writes here that we can optimize by using the redo log instead, because then the master just writes the data and the redo log to shared storage, and the slave can simply read what it needs from shared storage. I will show a rough count of these writes below.

Another advantage of physical replication comes with data dictionary changes. One problem with MySQL replication is that if you do, for example, an ALTER TABLE that takes very long, this example shows ADD COLUMN, then the operation takes, say, one hour on the master, and after that it is sent to the slave, and the slave needs another hour to do the operation, because the statement is not sent to the slaves before it has completed on the master. And while the slave is doing the operation, no other load can proceed on the slave, because it needs the new version of the table first. But if you have shared storage, then the change on the slave is just a metadata update: you just need to tell the slave to use the new version of the data files in the shared storage.
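As a rough illustration of that write amplification argument, here is a small Python snippet that just counts the write streams listed above for each scheme. The labels are illustrative, not MySQL internals, and the exact set of writes can vary with configuration; the point is only the relative difference.

```python
# Write streams per committed transaction, as described in the talk.
BINLOG_REPLICATION = {
    "master":  ["data pages (local)", "binlog (local)", "InnoDB redo log (local)"],
    "replica": ["relay log (local)", "own binlog (local)",
                "InnoDB redo log (local)", "data pages (local)"],
}

PHYSICAL_REPLICATION = {
    "master":  ["data pages (shared storage)", "InnoDB redo log (shared storage)"],
    "replica": [],   # replicas only read the redo log and the pages they need
}


def total_writes(scheme):
    """Count write streams across all nodes for one committed transaction."""
    return sum(len(writes) for writes in scheme.values())


if __name__ == "__main__":
    print("binlog replication  :", total_writes(BINLOG_REPLICATION))    # 7
    print("physical replication:", total_writes(PHYSICAL_REPLICATION))  # 2
```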
This ADD COLUMN example is somewhat moot in 8.0, because in 8.0 MySQL added instant ADD COLUMN, but there are other schema change operations that still take a long time and that would block the slaves from continuing.

So how does this actually work? I will look at the changes we have done to InnoDB in this case. In InnoDB there are a lot of small internal transactions, mini-transactions, that represent each operation done to the physical layout of the database. That is not the same as the transactions you have at the user level; it is more like one such transaction changes some data in some page. Sometimes such a transaction will have multiple log records: for example, if you do a B-tree split, you have one log record to make a new page, one log record to copy data from the old page to the new page, log records to change the pointers between pages, and so on. All these redo log records are written to the redo log, and that has to happen before you commit the transaction; that is what is called write-ahead logging. Traditionally, in order to get better performance, you do so-called group commit: you group transactions together so that you can write multiple transactions to disk in one write. Otherwise, if you have to write one transaction, wait for the I/O to complete, and only then write another transaction, it takes too long. The data itself is flushed asynchronously to disk, so if the node crashes, all the data pages are not necessarily up to date. When you come back up again, you have to do redo, and redo means taking the log and replaying it on the data so that it becomes up to date, because you have the whole log, but you do not necessarily have up-to-date data in the physical image on disk when the crash happens.

The idea of our physical replication is that the slave, or replica, is basically doing redo continuously. The master works as before: it writes the redo log records before commit, write-ahead logging, and flushes the buffer pool asynchronously. On the slave side, the slave reads the log records and applies them to the pages continuously. What actually happens is that there is a hash table: each log record is hashed on its page ID and put into a bucket, and then each bucket of the hash table can be processed in parallel. So you can do efficient redo by applying all these log records in parallel, because you know that log records for different pages do not conflict.
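Here is a simplified sketch of that idea in Python, with names of my own invention rather than InnoDB's: redo records are grouped into hash buckets keyed on page ID, and each bucket is replayed by its own worker, touching only pages that are currently in the buffer pool.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

N_BUCKETS = 8


def partition_redo(log_records):
    """log_records: iterable of (lsn, page_id, change). Group them by page bucket."""
    buckets = defaultdict(list)
    for lsn, page_id, change in log_records:
        buckets[page_id % N_BUCKETS].append((lsn, page_id, change))
    return buckets


def apply_bucket(records, buffer_pool):
    # Records for different pages never conflict, so each bucket can be
    # replayed independently; within a page we keep LSN order.
    for lsn, page_id, change in sorted(records):
        page = buffer_pool.get(page_id)
        if page is None:
            continue              # page not in memory: skip, apply lazily on read
        page.setdefault("changes", []).append(change)
        page["lsn"] = lsn


def parallel_redo(log_records, buffer_pool):
    buckets = partition_redo(log_records)
    with ThreadPoolExecutor(max_workers=N_BUCKETS) as pool:
        for records in buckets.values():
            pool.submit(apply_bucket, records, buffer_pool)


if __name__ == "__main__":
    buffer_pool = {1: {"lsn": 0}, 2: {"lsn": 0}}        # pages 1 and 2 are cached
    redo = [(101, 1, "insert row"), (102, 2, "update row"), (103, 7, "delete row")]
    parallel_redo(redo, buffer_pool)
    print(buffer_pool)                                  # page 7 was skipped
```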
That is also an advantage of physical replication, because with the normal binlog replication in MySQL, all the transactions that are in the same group commit can be applied in parallel, but for the next group you have to wait until those have completed before you can again apply in parallel. So the degree of parallelism is lower with the group-commit approach than when you actually parallelize every operation that goes to a different page.

One basic thing here is that we say the state of the slave, or the snapshot of the slave, is the latest log record parsed. So here the slave's state is the snapshot at T4, because T4 has been parsed and put into the hash table, while T5 has not yet been parsed by the slave.

There are some things we need to take care of when we do it like this; we have had to change InnoDB to make sure that the master and the slaves stay synchronized. One thing I forgot to mention is that these log records are only applied to pages that are in memory. For a page that is referred to in a redo log record but which the slave does not have in memory, the slave just ignores the redo log record for now, because hopefully, if it later needs the page, these updates will already have been applied to the page on the master and flushed to disk, so it will actually see the updates. If not, it will still have the log records and can redo them at that point, in order to bring the page up to date with the snapshot, the T4 snapshot in this case. So if, for example, it reads a page from disk and there are log records from earlier than T4 that have not been applied, it has to redo them at that point; I will show a small sketch of this below. But that means there will be a lot of log records queued up on the slave, and sooner or later you need to delete them.
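As a toy illustration of reading such a page from the past, here is a small Python function with made-up data structures, not InnoDB's: the replica reads the page image from shared storage and, if queued redo records for that page are newer than the image but no newer than its snapshot, replays them before using the page.

```python
def read_page(page_id, shared_storage, pending_redo, snapshot_lsn):
    """pending_redo: {page_id: [(lsn, change), ...]} queued by the replica."""
    page = dict(shared_storage[page_id])          # image may be older than snapshot
    for lsn, change in sorted(pending_redo.get(page_id, [])):
        if page["lsn"] < lsn <= snapshot_lsn:     # bring page up to the snapshot
            page.setdefault("changes", []).append(change)
            page["lsn"] = lsn
    return page


if __name__ == "__main__":
    shared_storage = {7: {"lsn": 90}}
    pending_redo = {7: [(95, "update row"), (103, "delete row")]}
    # The record at LSN 95 is applied; LSN 103 is beyond the snapshot and waits.
    print(read_page(7, shared_storage, pending_redo, snapshot_lsn=100))
```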
What happens then is that the master tells the replicas its checkpoint LSN. When a database does checkpointing, it writes all the dirty pages from before a certain point to disk, so it knows that every log record from before the checkpoint LSN is no longer necessary to keep, because those pages have been written to disk, and you can delete all log records that are earlier than the checkpoint.

There also needs to be some synchronization between the master and the slaves around the undo log. The database has an undo log, or rollback segments, which means that if a transaction is in flight and you need to roll it back, it needs to find the operations so it can actually undo them. The master will delete undo log records when it thinks they are not needed anymore, but the replicas have to make sure it does not delete undo log that is too new, before they are finished with it. In MySQL, PostgreSQL and many other databases this undo log is used for so-called multi-version concurrency control, which means it is not only used for rollback; it is also used to find an earlier version of a row or page. When you start a transaction, you want to see that snapshot of reality for the rest of your transaction, so if someone later does some updates, those updates are undone for this particular transaction so that it sees a consistent view. So the replicas need to tell the master the oldest read view of their current transactions, so that it does not delete undo log that is newer than this.

The title of the slide is "pages from the past": when you read in a page and it is not up to date, you have to find the log records and redo them. But what about pages from the future? What if your snapshot is at T4, that is the state on the replica, the read-only node, but the primary has already written some newer version of the page to disk, and the read-only node would read a page that is newer than its snapshot? How can it get back to the old snapshot it actually needs? The answer is that this should not be possible; we just prevent it from happening. The read-only nodes tell the master which version of the pages they are on, and the master holds back flushing those pages until the replicas are up to date.

So those are some of the details of the changes we have made to InnoDB to make this work in a shared-storage, physical-replication setup.
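Putting those pieces of bookkeeping together, here is a hedged sketch in Python of the protocol as I understand it from the talk: the master advertises its checkpoint LSN so replicas can purge queued redo records, replicas report their oldest read view so the master does not purge undo log they still need, and replicas report how far they have applied so the master does not flush pages from the future. The class and method names are mine, not InnoDB's interfaces.

```python
class Replica:
    def __init__(self):
        self.applied_lsn = 0            # how far this replica's snapshot has come
        self.oldest_read_view = 0       # oldest transaction snapshot still open
        self.pending_redo = []          # [(lsn, page_id, change), ...]

    def on_checkpoint(self, checkpoint_lsn):
        # Pages up to checkpoint_lsn are durable on shared storage, so the
        # queued redo records at or below it are no longer needed.
        self.pending_redo = [r for r in self.pending_redo if r[0] > checkpoint_lsn]


class Master:
    def __init__(self, replicas):
        self.replicas = replicas

    def min_read_view(self):
        # Do not purge undo log newer than this: some replica transaction may
        # still need it to reconstruct its MVCC snapshot.
        return min(r.oldest_read_view for r in self.replicas)

    def may_flush_page(self, page_lsn):
        # Hold back flushing page versions that are ahead of some replica's
        # snapshot, so replicas never read "pages from the future".
        return page_lsn <= min(r.applied_lsn for r in self.replicas)


if __name__ == "__main__":
    r = Replica()
    r.applied_lsn, r.oldest_read_view = 120, 100
    r.pending_redo = [(95, 7, "update row"), (130, 8, "insert row")]
    m = Master([r])
    r.on_checkpoint(110)                # purge queued records at or below LSN 110
    print(r.pending_redo, m.min_read_view(), m.may_flush_page(150))
```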
So what will the setup normally look like? You have an application, a proxy, and then a single master, and then you have multiple replicas, and you can add as many replicas as you need. The proxy does load balancing and read/write splitting, which means it sends the writes to the master and the reads to the replicas. It can balance the load so that each replica gets about the same amount of load, and if the master goes down it is able to switch: another replica becomes the master, and the proxy makes sure to route the update load to the new master.

We do read/write splitting, but we also want the application to be able to see its own updates. If we send the writes to the master and then send a later read from the same transaction, or the same connection, to a read replica, we must make sure that the update from the master has already been applied there, so that you see your own updates. For example, first someone updates the name of a user to Jimmy and commits the update, and later selects the name for this user ID; the application expects to see Jimmy and not the previous name. What we do is that when you perform the update on the master, it returns the log sequence number, the LSN of the log record for this update, and the read replica makes sure that this log record has actually been applied before it responds to the read request. The load balancer can also be smart: if some replica is lagging behind, it can send the read to a fresher replica, so the delay before you get the right version is not that big. I will show a small sketch of this below.

As I said, we currently only support a single master, but the goal is to be able to have multiple masters, because right now the update load is limited by what one node can process. The idea we are working on is to have multiple availability zones, with a master in each availability zone, and redo log shipped between the availability zones to update the different database servers.

We are also addressing what is called HTAP; I'm not sure how many of you have heard about HTAP before, it has been popularized over the last few years. It stands for hybrid transactional and analytical processing: the ability to do both OLTP and OLAP on the same system, both transaction processing and analytics on the same data. The idea is to have special read-only nodes where you run your OLAP queries, so that they do not interfere with the online transaction processing but can still access the same data. The assumption with this shared storage is that its bandwidth is so big that it can serve both purposes. This is a big advantage, because in many systems you have one OLTP system and then you have to copy all your data over to some analytics system, for example into Redshift or some other system that can handle your complex queries. But if you can do the complex queries on the same shared storage, then you do not have to do this copying, and you also get more up-to-date data to work on, because you do not have the lag of refreshing your analytics system.
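Here is a rough sketch of that read/write splitting with read-your-own-writes, written in Python with stand-in classes; the names are mine, purely to illustrate the idea, and this is not the real PolarProxy API. The master returns the LSN of each write, and the proxy only answers a later read from that session once a replica has applied at least that LSN.

```python
import time


class Master:
    def __init__(self):
        self.lsn = 0

    def execute(self, sql):
        self.lsn += 1                 # every write advances the log; return its LSN
        return self.lsn


class Replica:
    def __init__(self, master):
        self.master = master
        self.applied_lsn = 0

    def execute(self, sql):
        return f"{sql!r} answered at LSN {self.applied_lsn}"


class Proxy:
    def __init__(self, master, replicas):
        self.master, self.replicas = master, replicas

    def execute(self, session, sql):
        if sql.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
            session["last_write_lsn"] = self.master.execute(sql)
            return session["last_write_lsn"]
        needed = session.get("last_write_lsn", 0)
        replica = max(self.replicas, key=lambda r: r.applied_lsn)  # freshest replica
        while replica.applied_lsn < needed:
            time.sleep(0.001)
            replica.applied_lsn = self.master.lsn   # stand-in for redo application
        return replica.execute(sql)


if __name__ == "__main__":
    master = Master()
    proxy = Proxy(master, [Replica(master), Replica(master)])
    session = {}
    proxy.execute(session, "UPDATE users SET name = 'Jimmy' WHERE id = 1")
    print(proxy.execute(session, "SELECT name FROM users WHERE id = 1"))  # sees Jimmy
```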
But in order to support OLAP with MySQL: since the shared storage has the data in InnoDB format, we need to use something that understands the InnoDB format here, and that is MySQL. So what we want to do is make MySQL better at analytics. One of the problems today is that in MySQL you can only have one thread per user connection. When you open a connection to MySQL you get a thread, and that thread does all the work for this user, except some I/O; it is the main worker for the session. So what we are working on is a parallel query project, where you can have multiple threads executing a single query. We have one leader, which is the original thread, and then we use multiple worker threads that read parts of the table or index in parallel, and we are able to push down joins, aggregations and filtering to these worker threads, so they can do these operations on parts of the data in parallel, and then we merge the results. I will show a small sketch of this below.

The results so far show pretty good scaling for some queries, like TPC-H query 6, which is a single-table query where you aggregate information. We see almost linear scaling if you use 32 workers on our 32-core machine, and for pretty big data volumes; the biggest one here is something like a 40 gigabyte database. DBT-3 is an open-source version of the TPC-H benchmark, which is a standard benchmark for analytic queries. The goal is to be able to do this on the same system as the OLTP. We can also scale close to linearly for some joins, but we have some issues with InnoDB and scaling on secondary indexes in parallel. This is a known problem that has been reported by other people too: if you scan InnoDB secondary indexes in parallel, you do not get linear scaling. So, for example, for a join that uses secondary indexes, the scaling is more on the order of 15x instead of the 30x that you see for this query.
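As a toy sketch of that leader and worker split, here is a Python illustration of my own, not MySQL's implementation: the leader partitions the table into ranges, each worker scans its slice and does the filtering and aggregation locally (the push-down), and the leader merges the partial results, in the spirit of a TPC-H query 6 style single-table aggregation.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(rows, predicate, aggregate):
    """Scan one slice of the table, filter and aggregate locally."""
    return aggregate(r for r in rows if predicate(r))


def parallel_aggregate(table, predicate, aggregate, merge, n_workers=4):
    chunk = max(1, len(table) // n_workers)
    slices = [table[i:i + chunk] for i in range(0, len(table), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(worker, slices,
                            [predicate] * len(slices), [aggregate] * len(slices))
    return merge(partials)


if __name__ == "__main__":
    lineitem = [{"qty": q, "price": p} for q, p in zip(range(1000), range(1000))]
    revenue = parallel_aggregate(
        lineitem,
        predicate=lambda r: r["qty"] < 24,                     # pushed-down filter
        aggregate=lambda rows: sum(r["price"] for r in rows),  # pushed-down SUM
        merge=sum)
    print(revenue)
```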
So to sum up the advantages this architecture gives for databases in the cloud: the first point is that we can independently scale the storage and the compute. We can have nodes that are good at compute but do not have much storage for the processing part, and then you have the back end that does the storage for you, which means lower cost as you add more read replicas. The shared storage has high throughput to multiple nodes and low latency through the Optane hardware, and you get high availability through multiple copies of the same data, plus fast scaling. We saw that physical replication gives you less I/O, since you do not have to write the binary log and so on, you get non-blocking DDL, and you get efficient parallel redo on the slaves, since you can apply each page in parallel. And with parallel query execution we also get lower latency for the complex queries in this architecture.

So that is all I have, thank you. Any questions?

Q: How do we handle the split-brain problem when you have the shared storage shared between master and slaves? Let's say, with three replicas, two of them assume they have become master and start sending the same data to the shared storage. How is that situation handled in PolarDB?

A: I think you need a quorum. I'm not the expert on the storage layer, but as far as I understand you need a quorum, so if you do not have two of the copies, I don't think you can continue as master.

Q: But I'm not sure how, within a cluster, if there is a gap in communication and two nodes assume the master role at the same point in time because of that lack of communication, and so two nodes at the same time act as master, how we handle that; the data would get corrupted in that case.

A: But if you have three nodes and you need a connection to at least one other node in order to be a master, doesn't that solve your problem?

Q: Let's say I have five nodes and then two nodes assume they are master in the cluster; there is a communication gap within the cluster and two nodes at a time act as master.

A: I don't know the details there, but one thing that was on the slide, which I didn't mention, is that the Raft protocol is used to coordinate this at the storage level. In fact they implemented their own ParallelRaft, because the original Raft protocol was not efficient enough for this purpose.

Q: Is there any other method, let's say if the shared storage got corrupted for some reason, how can we recover other than the snapshot way, with automated backups? Do you take backups of the storage?

A: Yes, I think you take the backup at the storage level, so as a database user I don't think that is visible to you. I don't have any experience with that, but I don't think so, because there should be sufficient capacity in the storage layer for this.

Q: I have a question here. What's the largest deployment that you have in production? What's the data size?

A: I don't know the details, but I think there are a couple of petabytes, because so far the major motivation for moving from RDS to PolarDB has been to get around the limitation of what you can store on a single machine. So for many of the first customers it was actually about getting bigger storage. I don't think I mentioned that PolarDB was originally made for MySQL 5.6; we have been upgrading it, merging it with MySQL 8.0, so there is a beta release coming out soon with MySQL 8.0 as well. But because of the extra hardware it has only been available in China so far; as far as I understand they are moving it to the other data centers. I'm not sure, Alibaba has two data centers here in India, but I don't think it is available there yet.

Q: So the scaling in MySQL is to add more read replicas, and that's for the read load, right? How about the write load?

A: Okay, so as I said, that is the limitation: you can only write as much as one master can handle. If you need to go beyond that, you currently have to shard. The goal is to be able to support multi-master, but I think that still needs some work. Also, Amazon Aurora until recently supported only one master in their system too; they have some activity on multi-master, and I'm not quite sure what the state of that is.

Q: So would that mean that PolarDB scales very well for read-type queries and less so for write-type queries?

A: Yes, for writes it does not scale beyond what one node can do, but often, if you can split off all the read load, that is sufficient.

Q: So in a PolarDB cluster, it writes to only one master, is that right?
A: Yeah, there is one master, the writes go to this master, and that is the only one that writes to the shared storage.

Q: Okay. Does PolarDB handle isolation levels?

A: Yes, there is no difference from the usual there. If you require, for example, serializability, then all your load would have to go to the master, I would think, but for the lower isolation levels you can use multi-version concurrency control to make sure that you see the right version of the data, and then I guess you can use the synchronization I showed between master and slaves to make sure you get the same view.

Thank you.