Okay, good morning everyone, and first of all I'd like to thank you all for joining this session. My name is Nirbhay Choubey, I work for MariaDB Corporation, and I primarily work on MariaDB Galera Cluster. I have prepared this session at a level between beginner and advanced, so that people at every level get something out of it. Just to begin with, I would like to know how many of you have tried Galera Cluster. Wow, okay, and is anybody using it in production too? Great. Before that, I just wanted to share that MariaDB Galera Cluster is basically MariaDB plus a library called Galera, which comes out of a different company called Codership, a Finnish company. These two join together to bring the clustering capability to MariaDB as well as to MySQL server. This morning I found this interesting quote from an American comedian which I think suits distributed systems really well; it sort of defines the expectation we have toward a distributed system: we expect it to be running all the time, and even if a part of it comes down, we expect the availability to be there. So I'll start off my presentation with a quick introduction to what MariaDB Galera Cluster is and some key features that it provides.
So as I said, it's MariaDB server plus the clustering functionality it gets through the Galera library. MySQL and MariaDB historically had a replication feature, but it was asynchronous: you can connect multiple slaves to one master, and the data from the master asynchronously gets onto the slaves. Asynchronous is the key word, as in there can be lag, so what people used to do is route all the writes to the master and make sure that reads happen through the slaves. Because of this lag we wanted a better technology with which you can write to any of the nodes, and that's what Galera Cluster provides. It's a synchronous replication mechanism wherein the updates are synchronously replicated to all the nodes connected to the cluster, and then you can query by connecting to any of the nodes, using the native tools which are available for MySQL/MariaDB. If time permits, I can show you a quick demo of how it works, just a simple demo. As I said, once the cluster is set up you can read from or write to any of the nodes: you can throw a SELECT or an INSERT at any node and it is expected to work in a synchronous fashion. Another plus point of using Galera Cluster is that whenever it finds a node inconsistent, as in, while it was applying a write set something went wrong and the node became out of sync with the rest of the cluster, the node is kicked out, which means it won't perform any further operations, to avoid further inconsistency. That's a cool thing, because having inconsistent nodes in one cluster is not a good idea, so Galera takes care of it automatically and throws the inconsistent node out of the cluster. Let's say you have a cluster of four nodes and you want to join one more: you can easily connect one on top of it, and everything, the membership control, will be taken care of automatically. Then there's something called true parallel replication. We do have parallel replication in the master-slave architecture, but in Galera Cluster the parallelism comes at the row level: when a query or an update hits one of the nodes, the update is replicated to all the other nodes, which are effectively slaves with respect to this query, and the updates can be applied in parallel on all the other nodes with multiple threads. As I said, the look and feel to a client will be as if the client is accessing a standalone server; there won't be a difference. And the last and final point is that it's really, really easy to set up a cluster; it's just a matter of running a few scripts. Last week I was trying to simulate this using Docker; it's not complete yet, but people have started doing it, setting up a cluster using Docker and all that stuff. Right, so here is a simple diagram that represents what a cluster looks like. As you can see, these nodes labeled n1, n2 and n3 are basically MariaDB servers, and underneath that, in the greenish color, is a patch which gives MariaDB the clustering capability. That's called the wsrep patch, and wsrep stands for write-set replication. At the bottom is the Galera replication library, which does the actual replication and provides the synchronous capability.
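To make the write-set idea concrete, here is a toy model in Python of synchronous, certification-based replication. This is my own sketch, not Galera's actual algorithm: real certification works on sequence numbers and key signatures, but the first-committer-wins idea is the same.

```python
# Toy model of certification-based replication (NOT Galera's real algorithm):
# every committed transaction ships a "write set" (the changed rows keyed by
# primary key) to all nodes; a transaction whose keys overlap a concurrently
# certified write set loses and is rolled back.

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}          # key -> value, the replicated state

    def apply(self, write_set):
        self.data.update(write_set)

def replicate(nodes, write_set):
    """Synchronously apply one write set on every node in the cluster."""
    for node in nodes:
        node.apply(write_set)

def certify(in_flight, write_set):
    """First committer wins: conflict if the key sets overlap."""
    return not (set(in_flight) & set(write_set))

cluster = [Node("n1"), Node("n2"), Node("n3")]
tx1 = {"row:1": "alice"}
tx2 = {"row:1": "bob"}          # touches the same key, so it conflicts with tx1

replicate(cluster, tx1)
print(certify(tx1, tx2))        # False: tx2 would be rolled back
print(all(n.data == {"row:1": "alice"} for n in cluster))  # True
```

The point of the sketch is only that conflict detection needs the keys of the changed rows, which is exactly why Galera insists on row-based replication later in this talk.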
Now, talking of the versions that are available right now: MariaDB currently has 5.5 and 10.0 in GA, which means they are ready for production. Whenever a community server version is released, it takes like one or two weeks for the same version of the Galera release to come out. The last versions that got released are 5.5.43 and 10.0.19, and the equivalent Galera server versions have been released as well, so they kind of stay in sync with the standalone server releases. But for the third one, 10.1.4, what we have done is merge the code bases of Galera and the standalone server, which means we have everything together now: there won't be any separate package for the Galera server, the standalone server itself is Galera-ready. It's in beta right now, so if you want to use it for testing purposes you can, but it's not ready for production yet. And the MariaDB Foundation publishes, makes available, the packages for all the major Linux distributions, like RHEL, Ubuntu, Debian, openSUSE and so on. Okay, so that was a brief introduction to what MariaDB Galera Cluster is. Now I'll quickly jump to how you can set up a cluster, and while covering these points I'll try to point out the important things one should keep in mind while setting up a cluster. In order to start a cluster there are certain mandatory things that we need to keep in mind, and all these bullet points are system variables which are mandatory for running a MariaDB Galera server. The first one is wsrep_provider; wsrep, as I said, is short for write-set replication.
So what happens is, whenever an update happens on one of the nodes, that update, that delta, is replicated to all of the nodes, and that delta is the write set. So wsrep_provider is basically the library that gives the server its synchronous capability. Currently it's libgalera_smm.so, and that's the library we need to use; this library itself comes from Codership, which is a separate company. If you do not set this variable, that is, if you set wsrep_provider to none, your server will be as good as a standalone server. The second one is wsrep_cluster_address. It is the IP address of any of the nodes in the cluster: let's say you have a cluster which already exists, and you have a new node which wants to join it; using this option you point the new node at the cluster. So wsrep_cluster_address takes the IP address of any node which is already part of the cluster, and it's not just one single IP, you can actually provide multiple IPs separated by commas. The third one is binlog_format. If you're familiar with the traditional replication of MySQL/MariaDB, it provides three different formats for replication: row-based replication, statement-based replication and mixed replication. In statement-based replication we just replicate the statement that hit the server, be it INSERT, SELECT, UPDATE, anything, we just replicate that particular statement. But that has its own disadvantages, and the most important one in terms of Galera is that you cannot derive the keys out of that command. That's why there is a limitation here: you have to use row-based replication. With row-based replication you are actually replicating the entire change, not just the command.
The change has all the keys and the rest of the data, and using the keys Galera can internally find out if there is a conflict between multiple nodes. With statement-based replication there is no way to do that conflict detection, and that's why we need the row-based replication format; it is a mandatory setting. The final thing is the default storage engine. Galera in principle requires a storage engine which is transactional, because let's say your application is connected to two nodes and both are trying to update the same row: in that situation only one should win and the other should be rolled back, and that's where we need transactional capability. In the MySQL/MariaDB world, InnoDB and TokuDB are currently the prominent engines which provide transactional capability, and that's why InnoDB is the default setting right now. Support for TokuDB is in progress, but there is no commitment yet. Another key thing while you are setting up the cluster is to make sure that the number of nodes in the cluster is odd. How many nodes should be there in the cluster is a classical state-machine replication problem. Let's say we have an even number of nodes: during a network partition the cluster can divide into two equal parts, both partitions think they are in the majority, they both start taking new updates, and the data goes out of sync. That's called split brain, and split brain is really bad for a cluster, because then we'll have two versions of the data, which is a big problem. To avoid that, it is always recommended to have an odd number of nodes.
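Putting the mandatory settings together, a minimal my.cnf fragment might look like this. The IP addresses and the library path are placeholders; the exact path to libgalera_smm.so depends on your distribution and package:

```ini
[mysqld]
# the Galera library that provides the synchronous replication capability;
# setting this to "none" makes the server behave like a standalone server
wsrep_provider = /usr/lib/galera/libgalera_smm.so

# address(es) of node(s) already in the cluster, comma-separated
wsrep_cluster_address = gcomm://192.168.0.11,192.168.0.12,192.168.0.13

# row-based events carry the keys Galera needs for conflict detection
binlog_format = ROW

# Galera requires a transactional storage engine
default_storage_engine = InnoDB
```

On the very first node of a new cluster you would then bootstrap, for example with `service mysql bootstrap` or by starting the server with an empty `gcomm://` address.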
So an odd number means three, five, and so on. And let's say you do not have a machine which is powerful enough, or which doesn't have enough disk: you can even use the Galera Arbitrator as a node. So you can have two MariaDB nodes, and instead of a third you can use the Galera Arbitrator. It's a stateless daemon, but it does count as one of the members, so it will keep the overall cluster protected from the split-brain scenario. Yeah, so once your setup is in place, you can bootstrap the very first node by using the first two commands, that is, service mysql bootstrap, or starting the server with --wsrep-new-cluster. This has to be done for the first node, the one which is going to form the cluster. Once the first node is up, you can start the other nodes by pointing them at this node. The third way to bootstrap a node, or bootstrap a cluster, is with this empty gcomm://. gcomm is usually followed by an IP address, but in this case it's empty, because it's the first node and logically it is not pointing at any other node; it's the very first node of the cluster. So using these three options you can bootstrap the first node. Once the first node is up and the other nodes start to join the cluster, a joiner needs a state transfer. State transfer is, in database terms, the transfer of the entire database and its objects from one node to another, so that the joiner node becomes a replica of the first one. Once it is in sync with the cluster, it will be part of the cluster and start taking queries. In Galera Cluster there are two kinds of state transfer. The first one is snapshot state transfer (SST), which is basically a complete copy of data from one node to another; the node sending the data is called the donor, and the receiving one the joiner. The second one, which was introduced a little later, is called incremental state transfer (IST). Say there was a node which was already part of the cluster and has some state, but it was taken out of the cluster for some maintenance, and now it wants to join back. It has some state, but it is not in sync with the cluster. When it joins back, Galera detects that it already has some state and does not need to perform an entire snapshot transfer, so it will just send the incremental write sets which are missing on this node. That's incremental state transfer. Now coming back to the first one, snapshot state transfer: there are multiple methods currently available, and they all have their own strengths and weaknesses. The first one is wsrep_sst_method = rsync; it uses the rsync daemon and client to transfer the state from one node to the other. The second mechanism is xtrabackup (and xtrabackup-v2); these two methods require the xtrabackup tool from Percona, and using that you can transfer the state. The last one is mysqldump, the classic tool, which dumps the data on the donor node and transfers it to the joiner node. Out of these four, xtrabackup takes locks for the least amount of time, so for a huge amount of data it is considered the best choice. The SST API itself is quite simple, and using it you can write your own SST implementation: if these methods do not suit your needs, you can implement one yourself. These four are basically scripts which are run on the joiner and donor nodes. The last setting here is wsrep_sst_donor.
It's another system variable, which you can set to force the usage of a particular node as the donor. For example, if we omit this and there are four nodes and a new node is trying to connect to the cluster, Galera can decide to pick any node as the donor. Now, when a donor starts donating its state, it temporarily gets out of the cluster in order to donate the state. So using this option you can force Galera to pick one of the nodes that are already there; it's again a comma-separated list of node names (you can give a name to each and every node). Now, we talked about SST; incremental state transfer is possible because every node maintains a cache, in circular fashion, of all the updates that it applies on the local node. So every node has a cache of write sets. What happens in IST is that the joiner node joins with a particular sequence number; it is effectively saying: look, I have all the transactions up to, say, sequence number one thousand, so if you have all the transactions after one thousand, just give me those. The joiner joins with this intention, Galera looks at all the nodes to see which node has all these missing transactions, picks that one as the donor, and the incremental delta gets transferred from that node. If this is possible, Galera will always prefer IST first, and if it's not possible it will fall back to SST. Then, schema upgrades. In an evolving application we always need to change the schema, meaning the DDL: we need to alter tables based on our changing needs. Now, schema changes are a bit tricky on a live system, because you might have many applications already querying other nodes, and if you change the schema on one of the nodes, first of all it has to be backward compatible, because the schema change might break the existing queries.
I have a small example, which I'll be presenting in a couple of slides. So, Galera provides two mechanisms to upgrade a schema: the first one is total order isolation, and the second is rolling schema upgrade. Total order isolation, or TOI, is the default method. With TOI, at the parsing stage the master detects that there is a DDL; what it does at the parsing stage is take that particular DDL command and send it to all the nodes across the cluster, and then it proceeds with the execution. Galera will also make sure that this particular DDL happens in the same slot on every node, and that's why it's called total order. Because of this, the DDL executes at the same time, in the same slot, on all the nodes. Now the problem is, if the DDL command is going to take time, a big ALTER TABLE for example, it will stall the entire cluster. If it's short and quick, that's okay, but if it is going to take time, TOI might not be the best option; still, it is the one enabled by default. So if you're creating a table, it will right away get replicated to all of the nodes, and the binlog format internally used for this is statement, for obvious reasons, because you are just transferring the DDL command itself. On the other hand there is rolling schema upgrade, or RSU. What internally happens there is that when you execute a DDL command on one of the nodes, the node gets desynced from the cluster, the command gets executed, and during this period whatever writes hit the node get buffered.
The buffering makes sure that we are not missing anything, and once the DDL has been performed, the node syncs back, picks up all the buffered updates, and basically rejoins the cluster. Because of this, the DDL itself has not been replicated to the other nodes, but the plus point is that the DDL operation has not stalled the whole cluster. It turns out, though, that this makes it a manual operation: you will now have to run the DDL on all the nodes one by one, and because we are going in one-by-one fashion, we have to make sure that these schema changes are backward compatible. I have tried to come up with the simplest example that might help you understand this backward compatibility. Let's say there is a table with two columns, and the application inserts in a positional fashion, for example INSERT INTO t VALUES with two integers for column one and column two. Now let's say that, using rolling schema upgrade, somebody runs the first ALTER command on one node. What this command is trying to do is add a column between column one and column two, so it actually changes the column order. The ALTER TABLE will work fine on that node, but when the data actually comes in, those INSERT queries will start failing. So this is not a backward-compatible ALTER command or schema change. That's why there is the second ALTER, which is backward compatible, as in even after running it your inserts will work fine, until you go and update your application as well. This is one of the simplest examples I could think of, and that's why in rolling schema upgrade one has to make sure that the schema changes are backward compatible. In Galera there are actually multiple ports which are opened for communication between the nodes: when nodes are connected, they are connected using TCP, and by default the changes are replicated in a non-encrypted way, so there is an option to encrypt the communication, the traffic between the nodes, using SSL. The two options I've mentioned here, socket.ssl_cert and socket.ssl_key (set through the provider options), are the bare minimum that you can use to encrypt the communication between the nodes; by providing this certificate and key you encrypt the traffic. One important thing to note here is that the certificate and key pair has to be copied to all the nodes of the cluster; it has to be the same on all the nodes. IST falls under this encryption, as in the IST traffic is encrypted, but SST, on the other hand, is not affected by this particular encryption, because snapshot state transfer is taken care of by different scripts, as we said; for example, rsync uses a mechanism which is outside the server, so it does not get encrypted. How to secure the SST, then? The xtrabackup script provides its own encryption mechanism, so using the xtrabackup options you can encrypt an SST which happens over xtrabackup, but rsync and mysqldump do not support encryption as of now. In the initial introduction I talked about parallel replication. The parallel replication actually happens on the slave side, the node which is trying to apply the write sets from the master node. The master and slave terminology here is just to say that the master is the node where the original command was executed, and the rest of the nodes are slaves because they are applying the changes which were done on the master. Galera provides a system variable called wsrep_slave_threads; using this you can start multiple applier threads, which will try to apply the write sets in parallel on each node. So the question is: how many applier threads do we need? Every application has its own load.
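The sizing rule discussed in this section can be sketched as a tiny helper. The function name and the exact formula are mine, not an official Galera recommendation; it just combines the two hints from the talk: the observed parallelism from wsrep_cert_deps_distance, and roughly four times the CPU core count as the ceiling.

```python
# Hypothetical helper, not an official formula: turn the two hints from the
# talk (wsrep_cert_deps_distance and the CPU core count) into a candidate
# value for wsrep_slave_threads.

def suggest_slave_threads(cert_deps_distance: float, cpu_cores: int) -> int:
    # wsrep_cert_deps_distance is the average spread of sequence numbers
    # whose write sets can be applied in parallel; ~4x the cores is the
    # practical upper limit mentioned in the talk.
    upper_limit = 4 * cpu_cores
    return max(1, min(int(cert_deps_distance), upper_limit))

print(suggest_slave_threads(cert_deps_distance=56.2, cpu_cores=8))  # 32 (capped)
print(suggest_slave_threads(cert_deps_distance=3.4, cpu_cores=8))   # 3
```

In practice you would sample the status variable under a realistic workload before settling on a value, since the distance varies with the query mix.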
So there is a status variable called wsrep_cert_deps_distance. It basically tells you the average distance between the highest and the lowest sequence numbers of the write sets which can be applied in parallel. This will give you a hint of how many threads you want, and the maximum limit would be approximately four times the number of your CPU cores. So using these two, you can work out how many applier threads your cluster needs. Okay, so MyISAM is one of the important storage engines if you're using MySQL, and it is still used for the system tables, but as I said, MyISAM is non-transactional, and that's why the support for MyISAM is experimental. You can use it, but use it with caution, because since it's non-transactional, conflicts cannot be rolled back. So by default the replication of MyISAM updates is turned off; you can still enable it using wsrep_replicate_myisam. Now, talking about the cluster, load balancing is one of the important things, as in when and how you are routing your queries. With MariaDB Galera Cluster there are many options for load balancing. One of the classical options is HAProxy, with which you can route the queries to different nodes in the cluster based on different policies. The second option is MaxScale, another tool from MariaDB Corporation which can be used to perform load balancing. And the third one is Galera Load Balancer, from the company that ships the Galera library. Now, which load-balancing policy is best depends on your application's traffic, the pattern of queries your application runs. The first one is read-write splitting. This is something wherein you try to route all the write queries to certain nodes and the reads to the rest of the nodes.
It's more of a master-slave architecture, and it is good because it avoids conflicts: let's say you have writes being thrown at all the nodes; you will occasionally get conflicts. To avoid that, you can do this read-write splitting, so that writes go to one of the nodes and reads can go to any of the nodes. The other policies are round robin and least connected. Least connected is the default policy in GLB, the Galera Load Balancer; as the name suggests, it sends the next query to the node which has the least number of connections. Right. So another important thing is: how do you make sure that your data outlives a disaster? The mechanism here is that you have the MariaDB Galera cluster running in one data center, and you relay everything to another node sitting in some other data center using the classical MySQL/MariaDB replication. In the diagram, the red arrows represent the Galera replication, the synchronous replication, and the blue one is the classical MySQL master-slave replication. So what is happening here is that all the updates which happen in the cluster get relayed, transferred asynchronously, to a remote MySQL/MariaDB server; we are basically using both replication mechanisms in this particular architecture or topology. I'll share some of the configuration for how this can be done. Because the slave server S1 is asynchronous and sits in some other data center, it is not affected by a disaster that takes out the primary data center. Here are the settings which can be used: log-bin enables binary logging, because we now need the data to be logged so that a slave can read that information, along with log-slave-updates so that changes applied through Galera also reach the binary log. The server ID should be the same across all the cluster nodes. Then the other settings are there to enable replication using GTIDs. Now, the limitations: as I mentioned,
we need to have a primary key on the table. In 10.1 a new variable got introduced, called innodb_force_primary_key; if you enable it and somebody tries to create a table without a primary key, that command will fail. It's just a mechanism to enforce that. Then, as I said, only the InnoDB storage engine is supported at the moment. And there is a cap on the transaction size, because a huge transaction will incur a huge lag or stall the cluster; you can control that using wsrep_max_ws_size, which defaults to one gigabyte. A second option, wsrep_max_ws_rows, exists, but it is currently not enforced. For network partitioning, or split brain, as I have already mentioned, make sure you have an odd number of nodes, and if you want, you can even use the Galera Arbitrator as one of the nodes; but the arbitrator itself is stateless, so it cannot be used to connect a client and start receiving updates. Multi-master conflicts, as I mentioned, happen when you are hitting, say, an INSERT on multiple nodes and the INSERT or UPDATE is trying to update the same primary key or the same key.
There is a conflict, and in that situation the winner gets the chance and the other transaction gets rolled back. To soften that, if it's an autocommit query, we can set wsrep_retry_autocommit to a certain value so that the query automatically gets retried when a conflict happens. There are also some status variables which give you information about where conflicts are happening the most, like on which node, and based on that you can even derive the hot tables, which should then be updated only from one node. The status variables at the bottom of the slide give you that information. Galera Cluster also dumps the write sets that it fails to apply into a log file; these write sets can later be viewed through mysqlbinlog to see what went wrong. Detecting slow nodes: there are certain status variables using which you can determine which of the nodes are slow, and then deal with them. Okay, I should be going a bit faster. Pitfalls: make sure that when a new node joins a cluster, it inherits the data from the cluster; if that particular joiner node actually has more, or more recent, information than the cluster, it should not be joined that way, because it will lose its own information while joining. This is one of the questions I received on the IRC chat. Applications should also be aware that the auto-increment values are non-sequential across the nodes, to avoid conflicts; your application should expect that the auto-increments will not always be sequential. This is controlled by the wsrep_auto_increment_control variable. Another pitfall is the principle of least variation: make sure that the configuration across all the nodes is as close as possible. For example, the SST method has the joiner and donor concept, so let's say one node has rsync SST and another has xtrabackup; that will be confusing.
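The non-sequential auto-increment pitfall is easy to picture with a small sketch. This mimics what wsrep_auto_increment_control does by spacing out values per node, the way auto_increment_increment and auto_increment_offset work; the node numbering here is illustrative, not how Galera actually assigns offsets.

```python
# Sketch of interleaved AUTO_INCREMENT streams: each node effectively uses
# auto_increment_increment = cluster size and its own auto_increment_offset,
# so two nodes never hand out the same id, but neither stream is 1, 2, 3, ...

def next_ids(node_offset: int, cluster_size: int, count: int, start: int = 0):
    """Return the next `count` AUTO_INCREMENT values this node would hand out."""
    ids = []
    value = start
    for _ in range(count):
        # advance to the next value congruent to node_offset (mod cluster_size)
        value += 1
        while value % cluster_size != node_offset % cluster_size:
            value += 1
        ids.append(value)
    return ids

node1 = next_ids(node_offset=1, cluster_size=3, count=3)  # [1, 4, 7]
node2 = next_ids(node_offset=2, cluster_size=3, count=3)  # [2, 5, 8]
print(node1, node2)
# The streams interleave and never collide, but neither is sequential.
```

This is why an application that assumes consecutive ids from one table will be surprised on a multi-master cluster.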
So that's the thing. A bit about the MariaDB project: we now have our source completely on GitHub, and you can go to the Knowledge Base to get more information. You can report bugs and feature requests on JIRA, which is highly appreciated. You can also ask questions on the maria-discuss mailing list, and you can even chat with the developers and community members on IRC, on the #maria channel. That's it.