So the original title of this talk was supposed to be "Corpus Callosum: Partition Tolerance Testing of Galera", but Stewart told me to make it less esoteric and less spoofy, so I changed it to "Distributed Systems Testing with netem and Docker". Anyway, it doesn't matter.

This is the picture I originally used for that title. It shows the corpus callosum, the structure between the left and right brain whose absence causes split-brain, which is a real medical condition. I was trying to draw an equivalence between the medical split-brain and the split-brain you see in distributed systems.

Some seed quotes I put here: "The network is reliable", one of the fallacies of distributed computing. "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." "Never attribute to malice that which is adequately explained by stupidity", which is Hanlon's razor, and my variant: never attribute to Byzantine failure that which can be explained by an ill node. A Byzantine failure is one in which a node with bad intent can bring down a whole cluster. Tolerating that is one of the requirements of distributed systems, one that many systems do not satisfy, and satisfying it requires a great deal of complexity. But this talk is not about Byzantine fault tolerance.

So this talk is about distributed systems testing of Galera with netem and Docker. Galera, which I explain later on, is basically a synchronous replication plugin for MySQL. It provides synchronous replication across multiple nodes and uses something called extended virtual synchrony. The database here is Percona XtraDB Cluster (PXC), which exposes the write-set replication API (wsrep) that the Galera plugin implements.

Then there is traffic control; some of you may be familiar with this or may have played with it. It is a command called tc, which provides a lot of qdiscs (queueing disciplines) for shaping traffic. netem is one of those qdiscs. It derives from NIST Net, and it lets you add packet loss, delay, corruption, and whatnot.

The other actors: Docker itself. I have used Docker, but in future anything else could be used, like rkt or LXC or who knows what. For load generation I am using sysbench, and I also use the Random Query Generator (RQG). I remember there was a talk earlier today on random data; RQG is something like that, a fuzz tester for MySQL that generates lots of queries from a predetermined grammar.

For the network I am using a dnsmasq container. dnsmasq is nothing but a DHCP and DNS server, and I am using it for DNS. Actually, DNS is not the goal itself but a means to an end, because it is required for the cluster to work. Docker, as you know, still has very simplified networking: linking is not comprehensive enough for a full-blown cluster. It is fine for something like a Redis client and a Redis server, but not for a synchronous replication cluster. So I had to build my own dnsmasq container for this.
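To give a flavour of tc and netem before going further, here is a minimal sketch; the interface name and the numbers are illustrative, not the exact values from my tests:

```bash
# Attach a netem qdisc to an interface and add delay with jitter,
# random loss, and occasional corruption.
tc qdisc add dev eth0 root netem delay 100ms 20ms loss 5% corrupt 0.1%

# Inspect what is attached, then remove it again.
tc qdisc show dev eth0
tc qdisc del dev eth0 root
```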
And I use nsenter for manipulating the network namespace of the containers. I could probably use docker exec instead, now that docker exec is available for injecting commands into containers, but nsenter is what I am using currently and it works perfectly fine. You can manipulate any of the namespaces: PID, mount, network, user, and so on. I am manipulating the network namespace to attach a qdisc to the network interface of the container: not the veth interface on the host side, but the one inside the container, which is usually eth0.

And then there is Jenkins, of course, because I needed this to be part of our QA testing; there are a lot of QA tests built on top of this. The problem with distributed systems testing is that in many cases it is assumed that there is no packet loss, as in those seed quotes: no packet loss, no delay, and so on. Earlier, that is how the QA tests for PXC were run. Now I have added this so that I can create virtual WANs and LANs and such.

I just wanted to illustrate that distributed systems testing is like the Kobayashi Maru. Some of you are familiar with Star Trek; it means an unwinnable situation. What I mean is: you run some tests, you spot a bug, you fix it; you run the tests again and you find some more. Distributed systems are like that. But that is part of the fun, right?

So why was this testing done? Briefly: to test the P in CAP. CAP, as you know, was proposed by Eric Brewer in his PODC 2000 keynote, and from there it caught on and became one of the standards by which you judge distributed systems. Scalability was something else I needed to test. The real reason: fun. That is my reason, anyway.

One more thing: I wanted to test tolerance to latency variance, because the problem with latency is not so much that it is high or low, but its variance (by variance I mean the square of the standard deviation of the latency), which is what creates instability in a distributed system.

For Percona XtraDB Cluster you do not have what some other distributed systems have, like etcd, which the CoreOS folks have spoken about, where you have a leader and a quorum, or Cassandra, where again you can establish a quorum. In a synchronous replication cluster, where reads and writes have identical requirements, that is, you can read anywhere and write anywhere, you need consensus and not just a quorum. So the network is very, very important, and so is latency. To simulate real-world networks for synchronous replication, delay and partition were the two impairments that were added, and there are many tests built around them.

And this is how I illustrate Galera. To put it simply, it gives you what is called virtual synchrony, as opposed to one-copy equivalence in distributed systems. One-copy equivalence is where you have, say, five nodes, but the five nodes behave as if there were a single node in terms of replication; that is something very hard to guarantee.
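Tying the nsenter and tc pieces together, here is roughly how a qdisc lands on eth0 inside a container's namespace; the container name is illustrative:

```bash
# Find the container's init PID (its main process, mysqld here),
# then enter only its network namespace and attach a netem qdisc.
PID=$(docker inspect --format '{{.State.Pid}}' pxc-node1)
nsenter --target "$PID" --net \
    tc qdisc add dev eth0 root netem delay 50ms 10ms loss 2%
```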
I think Jim Gray wrote a wonderful paper on that; I forget its name, but it is in the further reading. So, Galera: to put it simply, and for this short talk, I would call it a synchronous replication plugin for MySQL, but there is a whole lot of documentation, and a lot of papers around extended virtual synchrony, that you may find useful; I have linked those in the further reading as well. This is just how it looks.

Galera operates on something called optimistic concurrency control. You may also know the idea as lock elision or transactional memory in locking parlance. What happens is that one node runs a transaction, commits, and sends the write-set to the remote nodes; it does not wait for an acknowledgement from the other nodes, it just commits. If there is a conflict on another node, which is what we call a certification conflict, that node aborts its local transaction and applies the remote one. This is done to ensure very high throughput. Round trips happen only when the membership changes, that is, when the group membership changes, which is not that often: whenever a node dies or leaves, that is what I mean by a membership change.

Galera is implemented as a finite state machine. All the nodes are supposed to be in the Synced state, but they often go to Donor, where a node transfers its state to a new node, and so on.

The problem with this kind of cluster is that one node can bring the whole cluster down because of certain issues. As I mentioned, Byzantine failure is where a node intentionally tries to bring the cluster down with corrupt packets or induced latency or some such. But even without malice, a faulty node with a bad network interface or a bad disk can slow down the writes and introduce artificial latency.

I have listed some of the tests here. Chaos testing is what Netflix introduced as Chaos Monkey: you take a distributed system, randomly remove nodes, and introduce rapid membership changes in the cluster. I actually call mine Chaos sapiens, because it is not a monkey but an intelligent being introducing chaos into the cluster. That test is very interesting: the nodes are removed non-gracefully, as if you were pulling out the cable, and the length of time a node stays away from the cluster determines the state transfer it requires when it rejoins, whether a full state transfer from another node or an incremental one. There are a lot of race conditions around that.

There is also flow control testing with sysbench, because in distributed systems you have something called back pressure. Back pressure is where one node is sending packets to another and the receiver says: I cannot take any more, hold on. There are two kinds, synchronous and asynchronous back pressure, and in this case the back pressure depends on how full the receive queue is. That is a whole other topic. Earlier, when we tested with sysbench, there was no delay; now I have added delay. Galera also has something called segments, to introduce the concept of WANs and LANs; they are configured roughly as in the sketch below.
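A sketch of how segments might be assigned, under the assumption that the PXC image passes extra arguments through to mysqld (image and container names are illustrative; gmcast.segment is the actual Galera provider option):

```bash
# Nodes tagged with different segments behave as if they sit in
# different data centers: traffic within a segment stays local, and
# one node per segment relays across the "WAN" link.
docker run -d --name pxc-dc1-n1 percona/pxc \
    --wsrep_provider_options="gmcast.segment=1"
docker run -d --name pxc-dc2-n1 percona/pxc \
    --wsrep_provider_options="gmcast.segment=2"
```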
With segments, the cluster can distinguish a wide-area-network node from a local one, which lets you simulate data centers, and that is what I am doing here. Network loss was one of the other tests: I drop random packets. The network loss test was the main reason I wrote all of this, because a cluster has something called a primary component: when there is a network partition, the nodes that hold the quorum form the primary component, and they are the only ones that can take reads and writes. The other nodes can serve what you would call a dirty read, but they cannot take writes, obviously. The network loss testing was for that. For the future I plan on introducing more fun tests; as I said, it is like the Kobayashi Maru, there is always something more you can break and fix.

Oh yes, that is a random fortune I display every ten minutes so that I keep track of time. This one I added myself: there is no higher menace than distributed systems testing.

So this is roughly how the chaos testing works. Nodes are killed at random, and fewer than half of the nodes are chosen because, obviously, I want to keep the quorum. I am using docker inspect to get the PID, and I am sending SIGKILL directly, because you cannot proxy SIGKILL from Docker to the process inside it, which is mysqld; you can proxy other signals in Docker, but obviously not SIGKILL. There is configurable sleep and retry logic around it. I actually tried to use docker restart with a timeout of zero to simulate a SIGKILL, but that does not work; I think it is a bug, or maybe not, but it does not matter.

Then there is the network loss test, where I choose a subset of nodes and either detach or keep the loss qdisc at the end. I have something called a reconciliation period, a sleep in between, and after that I run sanity checks. There are two primary objectives, depending on whether I detach or keep the loss qdisc at the end: to check whether the primary component forms at all, or how long the cluster takes to recover. To mention briefly what happens when a node has packet loss: the node is evicted through something like STONITH. The nodes talk among themselves and say that this node is bad, and when all of them agree, that node is kicked out. Simple. The nodes maintain a history of all the bad nodes, and for an evicted node to rejoin the cluster, it has to be restarted.

Quickly, to the containers. Why not virtualize? Because I needed performance and there was no need; that is Occam's razor. I just needed isolation among the nodes, and namespaces provide that, so why virtualize? There is also the simplicity: the network is very simplified here. Each Galera node requires three ports, and now imagine 20 nodes; that is a big headache I did not want. Plus portability, reproducibility, and all that, because this is part of QA testing and not just something I run on my laptop.
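To make the chaos test concrete, here is a sketch of the kill loop described above; node names, the sleep default, and the restart step are illustrative:

```bash
# Pick fewer than half the nodes at random, SIGKILL mysqld inside each
# (as if the cable were pulled), wait out the reconciliation period,
# then run sanity checks against the survivors.
NODES=(pxc-node1 pxc-node2 pxc-node3 pxc-node4 pxc-node5)
VICTIMS=$(( (${#NODES[@]} - 1) / 2 ))            # always keep the quorum
for node in $(printf '%s\n' "${NODES[@]}" | shuf -n "$VICTIMS"); do
    pid=$(docker inspect --format '{{.State.Pid}}' "$node")
    kill -9 "$pid"          # SIGKILL cannot be proxied through Docker
done
sleep "${RECONCILIATION_PERIOD:-60}"             # configurable sleep
# ...restart the victims and run sanity checks on the cluster here.
```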
The harness runs in our Jenkins every day or so, or whenever I want, and there are multiple tests, as I said.

QEMU vis-a-vis Docker: QEMU still has some interesting stuff. You can do things like simulate NUMA nodes; you can create, say, a two-socket, four-core topology, or sixteen sockets, with QEMU. I think it uses libnuma for that. That is something very interesting with QEMU that I used to do.

On to container networking. (How much time do I have? Ten minutes? Okay, that is good.) I chose Docker for reasons of performance again, since I did not want to virtualize, and it also gave me an abstraction of channels. By channels I mean Hoare's channels: C. A. R. Hoare, the professor who introduced communicating sequential processes, the concept of communication you see in the Go language. I could create that abstraction here. For the container networking, linking did not help me; linking is what Docker does when you use --link.

So I had to use a dnsmasq container. Basically, I am not using the DHCP side of dnsmasq, because Docker does not allow you to replace the built-in address allocation it comes with, which is a pain in a way: currently every container you start gets a sequentially incremented IP address, and if you want your own IP addresses it is a bit of a pain. This dnsmasq container has a hosts file passed to it through a Docker volume; every new PXC container that comes up writes its IP address to that file, and the dnsmasq container sends itself a SIGHUP every second so that it rereads the file (a sketch follows below). It is a hack, but as I mentioned in my talk yesterday, I am also evaluating alternatives like Weave and Open vSwitch. This works for my testing so far. Then again, dnsmasq also has some issues, the DHCP one being one of them, and Docker has issues too: if you restart a container, it does not retain its IP address. Actually they fixed that, but the fix introduced some bugs, so they reverted it. So I have to do some clumsy stuff there.

Now, about the noise I introduce. Initially I was attaching the qdisc to the bridge itself; when there are multiple containers, Docker uses a Linux bridge with a virtual interface per container. But then I realized that attaching to those virtual interfaces shaped egress only, not ingress: outgoing packets, not incoming. Handling ingress would have required something like an intermediate functional block (IFB) device. What I realized was that I can use nsenter and attach the qdisc to the eth0 of the container itself, and that simplified a lot of things.

With netem I am doing all of this: packet loss, delay, corruption, duplication, reordering, and whatnot. It allows for some very complex loss models, actually. One more thing I would like to mention is that Galera supports both TCP and UDP. When I used things like duplication and reordering with TCP, I obviously did not see any issues, because TCP handles these things by itself; the application never saw them.
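Here is the dnsmasq hack from above as a sketch; paths and names are illustrative, and it assumes the hosts file lives on a volume shared by all the containers:

```bash
# Inside each PXC container, at startup: publish name and IP address.
echo "$(hostname -i) $(hostname)" >> /shared/hosts

# Inside the dnsmasq container: serve that file as an additional hosts
# file, and SIGHUP dnsmasq every second so it rereads it.
dnsmasq --no-daemon --addn-hosts=/shared/hosts &
DNSMASQ_PID=$!
while true; do
    kill -HUP "$DNSMASQ_PID"
    sleep 1
done
```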
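And to show the range of impairments netem can combine on a single qdisc, a sketch with illustrative values (the second percentage on loss and reorder is a correlation with the previous packet):

```bash
tc qdisc add dev eth0 root netem \
    delay 100ms 20ms distribution normal \
    loss 2% 25% \
    duplicate 1% \
    corrupt 0.5% \
    reorder 10% 50%
```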
So in the future I am going to try UDP and multicast with duplication and reordering, to see how the application itself handles them, because the application also has some logic around these, at least around reordering.

Other noise: libeatmydata. I use Stewart's libeatmydata to disable fsync. What I plan to do is not only remove fsync but also introduce artificially slow fsync, to simulate a really old hard disk, or a tape, something like that. Currently this removes fsync entirely, which takes storage out of the picture so that I can stress the network alone and see how storage correlates with the network. To do this I needed to get the library into the container: the Docker image has the library installed beforehand, and I pass LD_PRELOAD as an environment variable through Docker; you can use that for this (a sketch below).

The eviction mechanism I think I have already mentioned: the nodes each keep a counter, and once the threshold is reached they apply STONITH, shoot the other node in the head, which is very popular.

Core dumps were another issue I had with Docker, because the concept of isolation in containers in general is not that bulletproof. Even with namespaces, not all devices are passed through, and something like a sysctl, if you run it inside a privileged container, can change it for the host as well, especially the core_pattern sysctl. This required some weird workarounds, particularly because mysqld drops its privileges after startup, like a setuid program does, and there were ulimit issues too. So I had to pass a common Docker volume to collect the core dumps, with some hacking around all of this (a sketch below). For the time being, that is how it has to be done, because otherwise there is no way to get core dumps, which is a sad thing; otherwise I would have to use VMs again.

This is about WAN segments in Galera: they simulate data centers, and things like joiner starvation, that is, when a new node joins, how long it is starved waiting for a state transfer from the other nodes.

Anyway, there is a lot of proof-of-concept code, and a lot of Dockerfiles and Fig files. I include Fig files so that you can orchestrate and bring up the cluster easily: you just run fig scale with bootstrap set to 1 and members set to, say, 10, and it brings up an 11-node cluster (a sketch below). I gave a talk on this yesterday, so you can look at its recording tomorrow, or do a time travel.

This is the Jenkins setup, but there are more tests that I have not added at that link; you can contact me for those, of course. To do: more orchestration, and a signal proxy, not for SIGKILL but for other signals; I use signals to generate core dumps, so I have to proxy them, and currently I am doing pkill on mysqld.

Future work? I want to do more fault injection with memory poisoning, madvise with MADV_HWPOISON, and I want to simulate ENOSPC with something like the full device, /dev/full. NUMA I want to do as well, but I do not think I can do it with containers; probably I need QEMU or something, let us see, or maybe it can be added to Docker. Or not.
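Going back to libeatmydata, a minimal sketch; the library path and image name are illustrative, and it assumes the image already ships the library:

```bash
# Preload libeatmydata into every process in the container (including
# mysqld), turning fsync into a no-op.
docker run -d --name pxc-node1 \
    -e LD_PRELOAD=/usr/lib/libeatmydata.so \
    percona/pxc
```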
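And the core dump workaround, roughly; the paths are illustrative, and the suid_dumpable piece is my assumption about what such a setup needs, not something stated in the talk:

```bash
# core_pattern is global to the host, so set it once on the host and
# bind-mount the target directory into every container.
sysctl -w kernel.core_pattern=/cores/core.%e.%p.%t
sysctl -w fs.suid_dumpable=2   # mysqld drops privileges; allow dumps anyway
# Inside the container the core size ulimit must also be raised
# (ulimit -c unlimited).
docker run -d -v /cores:/cores --name pxc-node1 percona/pxc
```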
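And the Fig orchestration, as a sketch; the service names bootstrap and members come from the Fig files in the repository:

```bash
# One bootstrap node plus ten joiners: an 11-node cluster.
fig scale bootstrap=1 members=10
```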
Anyway, more network stuff: reordering, rate limiting, and a lot more. It is fun what you can do with netem and these containers.

This is all the further reading, and there is some really good literature in here. "Don't Settle for Eventual Consistency" is about eventual consistency. On Byzantine faults, "Reaching Agreement in the Presence of Faults" is a very good paper, and "The Network is Reliable" is a good one too. The last one is the paper on extended virtual synchrony, the protocol for virtually synchronous replication, which, by the way, Corosync also implements. Corosync, which is very popular and which is behind MySQL group replication in MySQL 5.7, uses extended virtual synchrony as well.

This is about me. The slides will be available; putting the location of the slides on the slide itself does not really work, right? I think Stewart will put them up somewhere, and I will also tweet about it. That is it. These are the image credits. Thank you. Do you have any questions? You can ask me now or later; I am also available on IRC and a lot of other places. I prefer IRC, though not for the next five days, because I do not have it on this machine. If you have any questions on Docker, Galera, netem, databases, containers, namespaces, you can ask me. Okay. Thank you very much. Thank you.