Okay. So, CTDB was first talked about at LCA, I think eight years ago, by Tridge. And a lot has changed. Tridge has moved on, the previous maintainer has moved on, and Amitay is now the CTDB maintainer. I think it's fair to say we are the only two full-time developers, so we see things through new eyes. We're trying to make some changes and trying to find the time to do that, and this is our story.

For those of you who don't know what CTDB is, it's basically a small clustered database that lets Samba servers on different hosts share the information they need. The most obvious example is locking information: if clients connect to Samba servers running on two different hosts and try to lock files, you need to make sure there are no clashes, so that only one client can lock a file and make changes. We won't say more than that; instead, let's dive into the functionality.

So, what does CTDB do? The first thing is cluster membership and leadership. When nodes join and leave the cluster, this needs to be recognized; we may need to do some reconfiguration, and we need to make sure the databases are consistent across whichever nodes are currently in the cluster. For leadership, we don't do the quorum thing: we simply elect a leader, and one node is the leader. Theoretically, disregarding performance, we can go all the way down to one node out of a five or ten node cluster surviving and still provide full service. Of course, if you provisioned ten nodes, you did it for a reason, and you won't get the performance.

The most important thing is that we provide a clustered database for Samba to store information in, and when nodes leave and join the cluster we need to do a database recovery to make sure the databases are consistent. CTDB also has an underlying messaging system built in, and Samba uses this so that Samba instances on different nodes can communicate. We manage services, including Samba and NFS; public IP addresses are kind of regarded as a service too. There's a whole bunch of services we can manage and monitor, to make sure they haven't gone bad and we don't need to fail a node out of the cluster.

IP address management is special for us. We manage IP addresses, we do failover, and we do consistency checking of the IP addresses, to make sure they've landed on the nodes we think they're on, and that when we've released them they're not still there. And we do logging, which sounds like a strange thing to put on the functionality slide at the front, but there's a five-to-seven-minute snippet later in the talk where we go on about logging.

Okay, what's the current architecture? Well, we have a bunch of daemons, three of them, and they're the processes that exist for the lifetime of CTDB when it runs. We've got the main daemon, which does a whole bunch of things; we've got a recovery daemon, which notionally handles database recovery; and we've got a logging daemon, which is why I mentioned logging. Then, because CTDB uses a forking model to avoid blocking the main daemon, making sure it can respond in real time to requests, we fork a heck of a lot of processes. We won't spend a lot of time talking about those today, but most of them probably aren't going away anytime soon.
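As a minimal sketch of that forking pattern — a hypothetical lock helper, not CTDB's real code — the blocking work (here a blocking fcntl() record lock) happens in a forked child, and the parent only ever waits on a pipe, which the real daemon would watch from its event loop:

```c
/*
 * Sketch of the helper-process pattern, not CTDB's real code: the
 * blocking work (a fcntl() record lock) happens in a forked child;
 * the parent only waits on a pipe, which in the real daemon would
 * be wired into the event loop, so it never blocks.
 */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int pipefd[2];
	if (pipe(pipefd) == -1) {
		perror("pipe");
		return 1;
	}

	pid_t pid = fork();
	if (pid == 0) {                  /* child: do the blocking work */
		close(pipefd[0]);
		int fd = open("example.tdb", O_RDWR | O_CREAT, 0600);
		struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
		fcntl(fd, F_SETLKW, &fl);   /* may block for a long time */
		(void)write(pipefd[1], "!", 1); /* tell the parent we have it */
		pause();                    /* hold the lock until killed */
		_exit(0);
	}

	/* parent: stays responsive; here we just show the handshake */
	close(pipefd[1]);
	char c;
	(void)read(pipefd[0], &c, 1);    /* "lock acquired" notification */
	printf("helper %d holds the lock; the parent never blocked\n",
	       (int)pid);
	kill(pid, SIGTERM);
	waitpid(pid, NULL, 0);
	return 0;
}
```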
Now comes the strange mapping of function to daemon that has happened as CTDB has tried to grow up. So, we have our three daemons. Cluster membership is in the main daemon: when a node joins the cluster, messages go back and forth between the main CTDB daemons and we rearrange the cluster. However, cluster leadership is done from the recovery daemon. Database access from Samba and any other clients is done in the main daemon; database recovery is done in the recovery daemon. The main daemon provides the cluster-wide messaging service. Public IP address management is spread across the daemons: you configure your public IP addresses in the main daemon, but failover is handled out of the recovery daemon, and the consistency checking is done there too. Service management: main daemon; it's kind of isolated there. And logging is handled in its own daemon, which is very neat.

So the question is: why do we need a makeover? Why do we need to worry about doing anything? Well, CTDB is now more than seven years old. It's not a proof of concept anymore; it's been in production for more than five years, so we know it works. Then why worry about a makeover? The main reason is the limitations imposed by the design and the implementation. Most often, a casual user would not notice these limitations, but when we start doing performance testing and putting heavy load on the system, that's when they start appearing. Plus, just like any other software, CTDB has grown over the years. Lots of new features have been added, and some of them are more like hacks than features, because it was too difficult to redesign that part of the code just to add that particular feature. There are lots of things like that.

What about refactoring? Why redesign from scratch when we can refactor bits and pieces and keep improving the code? And we have done that. For example, CTDB keeps track of message registrations: all the Samba processes, which are basically clients of CTDB, register for the messages they are interested in, and when a message appears, CTDB does the job of distributing it to the various Samba processes. Now, this message list was actually a linked list, so obviously there was a potential performance problem. So it was changed to a hash table, and so on. Things like that are easy to do.
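A minimal sketch of that kind of refactor, with made-up types rather than CTDB's real data structures: registrations keyed by a 64-bit "srvid" live in a hash table, so delivering a message only walks one short chain instead of the whole registration list.

```c
/*
 * Illustrative only, not CTDB's real data structures: clients
 * register interest in a 64-bit srvid; delivery is O(1) on
 * average instead of O(n) over all registrations.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS 1024u

struct registration {
	uint64_t srvid;            /* message id the client wants */
	int client_fd;             /* where to forward matching messages */
	struct registration *next; /* chain within one bucket */
};

static struct registration *buckets[NBUCKETS];

static unsigned bucket_of(uint64_t srvid)
{
	/* Fibonacci hashing: top 10 bits of the mixed value */
	return (unsigned)((srvid * 0x9E3779B97F4A7C15ULL) >> 54);
}

static void register_srvid(uint64_t srvid, int client_fd)
{
	struct registration *r = malloc(sizeof(*r));
	r->srvid = srvid;
	r->client_fd = client_fd;
	r->next = buckets[bucket_of(srvid)];
	buckets[bucket_of(srvid)] = r;
}

static void deliver(uint64_t srvid)
{
	struct registration *r;

	/* only the matching chain is walked, not every registration */
	for (r = buckets[bucket_of(srvid)]; r != NULL; r = r->next) {
		if (r->srvid == srvid) {
			printf("forward message %llx to fd %d\n",
			       (unsigned long long)srvid, r->client_fd);
		}
	}
}

int main(void)
{
	register_srvid(0x1234, 7);
	register_srvid(0x1234, 9);
	register_srvid(0x9999, 11);
	deliver(0x1234);           /* reaches fds 9 and 7 only */
	return 0;
}
```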
Again, locking is very central to CTDB. Whenever Samba needs a record on any of the nodes, CTDB has to actually get a lock on that record and tell Samba that, yes, the record is now available on that particular node. So locking is important, and we refactored the locking code so that we could improve the scheduling of locks and things like that. So yes, we have used refactoring. But some things can't be refactored easily. One such example is the CTDB protocol. The CTDB protocol is spoken by CTDB — that's the server — and on the client side it's spoken by Samba, which has its own implementation of the protocol. So if you want to change the protocol, you're in a pickle, because now you have to change not only CTDB but also Samba. Plus, it's the dream of every new developer who joins that they get to redesign everything — me being the new developer, I always want to redesign some things. And there are some advantages to redesigning, because you can get rid of some problems by simply designing them away.

Over the last three years we have seen some examples of where we noticed problems in CTDB. Some of those things really can be designed away, and I'll get into those in a little while. However, the main challenge for a redesign is preservation of knowledge. We have captured a lot of corner cases and some weird combinations of interactions between the various daemons and processes. That's very hard to reproduce correctly in a redesign unless it's captured or documented in some fashion, and with our track record for documenting things, that's going to be a very challenging task.

Okay, so let's quickly go over some of the design limitations we have in CTDB. As Martin mentioned, we have two main daemons, the main daemon and the recovery daemon, and both are quite overloaded with functionality. A lot of that functionality is really not time-critical: things like database recovery or IP address failover. However, database access — when Samba requests a particular record on a node — you want to do as quickly as possible. So there's a bit of time-critical code, and it is all mixed together like spaghetti in both daemons, which makes it very difficult to keep that code non-blocking and asynchronous. Plus, as Martin mentioned, the CTDB main daemon also acts as a transport for Samba to pass messages across nodes. That means the main daemon, which is a single-threaded process, has to do not only message passing on behalf of Samba but also all of its own work.

One example is how the election takes place — that is, the leader selection. The current algorithm is very simple: when a recovery daemon starts, it tries to become the leader, and that causes a lot of elections if you have a lot of nodes. So scalability becomes an issue: if you want to have 32 or 64 or 128 nodes, the election itself might take a few minutes. There are ideas to improve the election code.

Same thing with database recovery. The original design was just to get it working correctly. What is a database recovery? When a node leaves or joins the cluster, you basically consolidate the databases from all the nodes and then redistribute them. This happens one database at a time. There's potential for doing things in parallel, but right now it happens in sequence, so if you have a lot of databases with lots of records, it's going to take a lot of time. Plus there are other limitations. Most of CTDB's state is centralized in the main daemon, even when it's only used in the recovery daemon, so every time the recovery daemon wants to use it, it has to query the main daemon for it. Things like that.

Now some of the implementation issues. The protocol is structs on the wire: very easy to implement, but in the long term it has too many issues. Because of structure packing, the structure sizes are different on 32-bit and 64-bit platforms, and it's also not endian-neutral, so currently we cannot have a heterogeneous cluster. Also, the code is hand-marshalled; there's no auto-generation of the marshalling routines. That also means we can't specify protocol versions: it's very difficult to version the protocol and say, okay, we are now moving from, say, version 1.1 to 1.2 with these changes. That's hard.
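Here's a small illustration of both problems, using a made-up packet rather than a real CTDB one: a raw struct's size depends on the ABI's padding rules and its byte order on the host CPU, while explicit marshalling pins down offsets, widths, and byte order everywhere.

```c
/*
 * Made-up packet, for illustration: the raw struct is ABI- and
 * endian-dependent; the explicit little-endian marshalling below
 * produces identical bytes on every platform.
 */
#include <stdint.h>
#include <stdio.h>

struct reply {          /* naive "struct on the wire" */
	uint32_t status;
	uint64_t dbid;  /* many 64-bit ABIs insert 4 bytes of padding
	                 * before this field; a 32-bit ABI may not, so
	                 * sizeof(struct reply) differs between them */
};

/* explicit marshalling: fixed offsets, widths and byte order */
static void put32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)v;
	p[1] = (uint8_t)(v >> 8);
	p[2] = (uint8_t)(v >> 16);
	p[3] = (uint8_t)(v >> 24);
}

static void put64(uint8_t *p, uint64_t v)
{
	put32(p, (uint32_t)v);
	put32(p + 4, (uint32_t)(v >> 32));
}

static size_t reply_push(const struct reply *r, uint8_t *buf)
{
	put32(buf + 0, r->status);
	put64(buf + 4, r->dbid);
	return 12;      /* always 12 bytes, on every platform */
}

int main(void)
{
	struct reply r = { .status = 0, .dbid = 0xfeed };
	uint8_t buf[16];

	printf("sizeof(struct reply): %zu bytes (ABI-dependent)\n",
	       sizeof(r));
	printf("marshalled size:      %zu bytes (always)\n",
	       reply_push(&r, buf));
	return 0;
}
```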
Also, the design is very simple: there's one packet per request and one per reply. So if you want to send lots of data, your packets become larger and larger. That's okay, but say you have a million records in a database and you want to do a recovery: now you want to send a million records from one node to the node that's doing the recovery. We use talloc in CTDB as the memory allocator, and talloc has an inherent limitation of 100 MB as a single block size. Because we allocate the entire buffer as a single block, with a million records we can actually overflow talloc's limit. So yes, there are issues there. And you really don't want to send one 100 MB packet anyway, because sending and receiving it will block your main daemon for a long time.
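A rough sketch of the obvious mitigation, with made-up sizes and a stand-in for the real record layout: stream the records in bounded chunks, so no single allocation approaches the limit and the event loop gets a chance to run between sends.

```c
/*
 * Illustrative chunking of a big record transfer; sizes and the
 * record layout are made up, not CTDB's real recovery code.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_CHUNK   (4u * 1024 * 1024)  /* say, 4 MiB per packet */
#define NUM_RECORDS 1000000u

/* stand-in for the marshalled size of record i */
static uint32_t record_size(unsigned i)
{
	(void)i;
	return 48;
}

int main(void)
{
	unsigned i = 0, chunk = 0;

	while (i < NUM_RECORDS) {
		uint32_t used = 0;
		unsigned first = i;

		/* pack records until the next would overflow the chunk */
		while (i < NUM_RECORDS &&
		       used + record_size(i) <= MAX_CHUNK) {
			used += record_size(i);
			i++;
		}
		chunk++;

		/* real code would marshal records [first, i) here, queue
		 * one packet, then return to the event loop */
		printf("chunk %u: records %u..%u, %u bytes\n",
		       chunk, first, i - 1, used);
	}
	return 0;
}
```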
Because we have only two daemons, we don't really have a messaging infrastructure currently. The way to pass messages to the recovery daemon is a bit ad hoc; it's just fire and forget: you send a message and that's it. So that's limiting. Plus, the user interface, the CLI provided by the ctdb tool, is unstructured: there are all sorts of commands bunched together, so it's very difficult to figure out how to use it; there's no structure to the CLI. And the same is true for the configuration: all the configuration options are dumped in one config file.

So obviously there's a need for redesign. The main issues I've pointed out are scalability, obviously, and maintainability, because in its current form it's becoming harder and harder to do any new development within the confines of the current structure. So we started this redesign exercise.

Right. So, we think we want to do this. We want to make over CTDB; we want something shiny and new. Let's start with something small, come up with some good ideas, and work from there. So what's the smallest chunk? Well, the two most important chunks in CTDB are the main daemon and the recovery daemon, so the logical place to start is the logging daemon. The reason is that it's self-contained, and we thought that whatever we did for the new logging daemon, we would use as a template for the other daemons. It looks nice and simple.

Okay, so what does the CTDB logging daemon look like? Why does it exist? Well, if CTDB is being hit hard enough that it's trying to log a whole lot of stuff, and it's just using the syslog library call to do logging over the Unix domain socket, then that operation can block if you hit it hard enough: the syslog daemon can't necessarily keep up. And there's an assumption in there that log messages are less important than keeping up. So what was done was to create a daemon that logs each message it receives using syslog, but gets those messages via a custom UDP protocol, which makes it lossy. That means CTDB doesn't get blocked by doing logging; log messages get lost instead, and that's better than not being able to provide service. What are the problems with this? Well, it's only enabled when we log to syslog, not for file logging — and file logging can block too, if you start hitting the disk hard enough. And once again, the protocol is a struct that's just thrown on the wire and read by the logging daemon.

What would we like instead? We'd like a shiny new daemon with a well-defined protocol. That would be nice. We want something that will handle all of CTDB's logging needs. So we have this big idea, and the big idea is exciting: we will create an asynchronous framework for CTDB daemons. We will use Samba's tevent framework and its tevent_req infrastructure, which is really good for doing async stuff. We will define a real protocol, not just structs, and we will auto-generate the marshalling code. We'll use all of this to write a logging daemon as a template, and then use the template for writing other daemons. It's a brilliant and grand plan.

The big problem is that logging is actually hard. Bootstrapping a logging daemon and reconfiguring it — we were doing this and we kept hitting corner cases. What happens if we reload the configuration and we run out of memory? We can't log. And all of this stuff. How do you handle errors in a logging daemon? Well, you don't. I mean, we're not in the logging business; why are we writing logging daemons? There are logging daemons out there already. There's a well-defined message format, and you can send stuff to syslog daemons via UDP. There's an RFC. So this should be really easy, and it means we can throw away the CTDB logging daemon, write some little back-end logging foo, and the world will be a beautiful place and we'll have one less thing to worry about.

So how did that work out? Well, the first Linux version was quite easy, but it wasn't merged, because we were trying to come up with a unified Samba and CTDB build. CTDB was merged into the Samba tree about a year and a half ago, or not quite that long ago. Back in May, Amitay did a bunch of stuff to get a waf build working, because Samba builds with waf rather than autoconf. However, you couldn't do a top-level make in Samba to build all of Samba and CTDB. And once we started looking at that, you start realizing, well, we're about to use Samba's debug code, but CTDB took a copy of the debug code and a bunch of other stuff a long time ago, and now we've got these two copies in the one source tree and we're going to have to resolve that. So why don't we spend a month fixing the unified build? That seemed like a good idea, instead of getting new things done. And, you know, I said it would take three days; Amitay said it would take four weeks. It took four weeks. I'm starting to believe him.

Okay, so then finally I take the first version of the new logging code for CTDB, I port it to the new unified build, with some changes to the library functions used and all that, and I post it to the list. And somebody says, well, why don't you just send to the Unix domain socket in non-blocking mode? Why do you have to go via UDP? Of course, that's a great idea. Okay, well, our syslogd doesn't speak RFC 5424 on the Unix domain socket. You've got to remember, we're not in the logging business. So then we have to learn about RFC 3164, which is the old logging standard, going almost all the way back to BSD. The location of the Unix domain socket isn't standardized, so we need to figure out which define to look for, and make some guesses on platforms that don't define the location. A whole lot of RFC 3164 is only recommended, so do we want to use it all? Do we implement it all? What do we do? Oh, and not all of it is supported: I can't remember exactly where, but if you pass some of the stuff defined in that RFC to the logging daemon — I think it was over the Unix domain socket — it would basically assume that some of it was junk and add another header. And FreeBSD supports RFC 3164 but not RFC 5424 over UDP. So, jeez, you tear your hair out.

We ended up implementing a bunch of logging options, going back to the original, which is to just use syslog by default: the default is the syslog library call, which logs to the Unix domain socket. If we log too fast, that blocks; if that happens, well, somebody can set the logging to syslog non-blocking. If for some reason they don't like that, they can use UDP; and because I wrote the RFC 5424 code, there's no way I'm throwing it away, so I'm giving people the option to use that as well.
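For illustration, here is roughly what the non-blocking variant amounts to, assuming a common (but, as noted, not standardized) socket location: format an RFC 3164-style message and send it with MSG_DONTWAIT, dropping it if the socket buffer is full.

```c
/*
 * Sketch of non-blocking Unix-domain-socket logging; the socket
 * path is one common guess, not a standard, and the message is
 * RFC 3164-style. Not CTDB's actual logging code.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/dev/log";  /* e.g. /var/run/log on FreeBSD */
	int fd = socket(AF_UNIX, SOCK_DGRAM, 0);

	struct sockaddr_un sa = { .sun_family = AF_UNIX };
	strncpy(sa.sun_path, path, sizeof(sa.sun_path) - 1);
	if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) == -1) {
		perror("connect");
		return 1;
	}

	/* RFC 3164-ish: <PRI>Mmm dd hh:mm:ss tag[pid]: message
	 * PRI 29 = facility daemon (3) * 8 + severity notice (5) */
	char msg[256], stamp[32];
	time_t now = time(NULL);
	strftime(stamp, sizeof(stamp), "%b %e %T", localtime(&now));
	int len = snprintf(msg, sizeof(msg), "<29>%s ctdbd[%d]: hello",
			   stamp, (int)getpid());

	/* MSG_DONTWAIT: never block the daemon for the sake of a log */
	if (send(fd, msg, (size_t)len, MSG_DONTWAIT) == -1 &&
	    (errno == EAGAIN || errno == EWOULDBLOCK)) {
		/* message lost; that beats not providing service */
	}
	close(fd);
	return 0;
}
```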
Okay. So, after we started this process — started thinking about this logging daemon, splitting it off, and trying to come up with a framework for writing daemons — more than 12 months passed. The above syslog options were merged into the Samba master branch, but only for CTDB's use, and then we retired from the logging business. There's no more logging daemon. Logging daemon gone. This is good. We might have to get back into the logging business at some stage, because it makes sense to promote some of this into Samba's debug code so that Samba can do non-blocking logging and a range of other things. Perhaps not all of it; we'll see.

All right, so we did our first bit, and that was really exciting: we threw away code. The next thing to do is to take some of the other functionality and move it into individual daemons, without having it overloaded into the two existing daemons. It would be nice if we did all the public IP address handling in its own daemon. It would be nice if we did service management in its own daemon. What we're calling the cluster management daemon there is cluster membership and leadership; we could do that in a separate daemon if we wanted to. And then we would have a separate database daemon.

So I started thinking about a new public IP address daemon. Still thinking about it. A single daemon that does those functions: managing the IP addresses, failing them over, and doing the consistency checking. We give it a simple management and status CLI; it doesn't really need to do anything else. And then a simple IP reallocation trigger. At the moment, because we've got really tight coupling between a whole bunch of the functions in CTDB and between the daemons, when the recovery daemon wants to do an IP reallocation run it needs to contact all the daemons on the other nodes and go, hey, what's up, what's up, what's up, and gather a whole lot of status. So it's really tightly coupled to a whole bunch of things, the main one being the status of all the other daemons: whether they're active in the cluster, whether their services are healthy, and all that sort of thing.

So let's do something different here. We'll add a simple CLI command, and we'll pass the nodes that can host public IP addresses to that command. This will be something that you run; it'll be a command-line thing. And that's okay, because this doesn't have to be fast: if you're failing over the public IP addresses on your cluster every few seconds, you've got a pretty serious problem that you need to solve elsewhere. When the cluster management daemon or the services daemon detects a status change on a node, we can have a callback configured to call this CLI command and say, hey, these are the nodes that can host IP addresses. And what's more, the callback can be a script. If we need to gather more information from all the nodes about their status, we can have a pretty simple script that calls the service daemon CLI and goes, hey, what's up, what's up, what's up?
Then it can call the cluster management CLI to talk to that daemon and go, hey, what's up? That way, the public IP address daemon is completely separate from the others. And with an interface like this that just uses a callback — currently we've got some LVS code in there, but that could move out and become a separate callback too — could we use HAProxy or something else to manage public IP addresses? There are lots of things out there. Sure we could. That'd be cool.

The service management daemon is not very interesting. There are four functions: starting up services, shutting them down, monitoring their health, and reconfiguring a service when IP addresses change. We can actually do all of that with callbacks as well, so that none of these separate daemons need to know about each other. And then the question is, can we support some other thing? I'm sure we can. Who knows?

Okay, the cluster management daemon. What does it need to do? Well, it just needs to know which nodes are connected and which nodes are active — that they haven't been banned because a node has been unable to perform its functions, and that an administrator hasn't stepped in and said pause or stop because we need to do some admin foo. And there's the leadership thing: the node we currently call the recovery master, whichever node that is, needs to coordinate database recovery and coordinate public IP address reallocation. So we need some sort of election in there. That's cool. Register callbacks with this thing for when states change. Could we support Heartbeat? (I'm really up to date.) Could we use etcd to do our cluster management for us, or anything else? We have choices.

The database daemon. Hopefully, after taking all this stuff out into separate daemons, we should be left with just the database daemon, and that's really the main focus of CTDB. We want to provide fast database access, but we need supporting infrastructure to do that. The basic operations we need to worry about are database access — read, modify, delete — database recovery, when a new node joins or an existing node fails, and vacuuming, the garbage collection.

We won't worry too much about that functionality here, but one of the key things we noticed in the new design is that we really need messaging. As I mentioned, we don't really have a framework for messaging in CTDB, and we obviously need something that scales, because now we're going to have more daemons, these daemons are going to exist on all the nodes, and so there will be more messages flying around. So there's a plan to use Unix domain datagram sockets, as Samba does. The advantage of datagram sockets here is that we avoid establishing connections, and each daemon only has to listen on one socket. Currently, if there are a thousand Samba processes, the CTDB daemon has a thousand connections, so it really has to listen on a thousand sockets in a single thread. But there is a slight problem with this design: it's asymmetric, so we need to find the socket to send a reply to. So there's a need for identification, some sort of registration for each process in this messaging framework.
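A minimal sketch of that idea, with made-up socket paths: datagram sockets mean no connections and one listening socket per daemon, but a client that wants replies has to bind its own named socket — which is effectively the registration step just mentioned.

```c
/*
 * Illustrative datagram messaging, not CTDB's real transport:
 * the client's bind() of a named socket is its "registration",
 * giving the daemon an address to send replies to.
 */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static int bind_dgram(const char *path)
{
	int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
	struct sockaddr_un sa = { .sun_family = AF_UNIX };

	strncpy(sa.sun_path, path, sizeof(sa.sun_path) - 1);
	unlink(path);                     /* clear any stale socket */
	bind(fd, (struct sockaddr *)&sa, sizeof(sa));
	return fd;
}

int main(void)
{
	/* each party owns exactly one socket, whatever the fan-in */
	int daemon_fd = bind_dgram("/tmp/demo-daemon.sock");
	int client_fd = bind_dgram("/tmp/demo-client-1234.sock");

	struct sockaddr_un to = { .sun_family = AF_UNIX };
	strcpy(to.sun_path, "/tmp/demo-daemon.sock");
	sendto(client_fd, "ping", 4, 0,
	       (struct sockaddr *)&to, sizeof(to));

	/* the daemon learns the sender's address from the packet, but
	 * only because the client bound a name for itself above */
	char buf[64];
	struct sockaddr_un from;
	socklen_t fromlen = sizeof(from);
	ssize_t n = recvfrom(daemon_fd, buf, sizeof(buf), 0,
			     (struct sockaddr *)&from, &fromlen);
	printf("daemon: %zd bytes from %s\n", n, from.sun_path);

	/* reply goes back to the client's registered socket */
	sendto(daemon_fd, "pong", 4, 0,
	       (struct sockaddr *)&from, fromlen);
	n = recv(client_fd, buf, sizeof(buf), 0);
	printf("client: got %zd byte reply\n", n);

	close(daemon_fd);
	close(client_fd);
	unlink("/tmp/demo-daemon.sock");
	unlink("/tmp/demo-client-1234.sock");
	return 0;
}
```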
So the question is: did we get all of this done? Did we? No, this is a vaporware talk. Well, we got other stuff done, so we should tell you what some of that is, because it's really exciting.

Okay, CTDB stuff. The framework for building asynchronous CTDB daemons, the logging daemon, all those experiments — we did a lot of playing around there before we decided to drop the logging daemon. There's a lot of good work in there, because it should now be a lot easier to create daemons: we understand the event framework a lot better, and doing the work we still need to do should be nice and easy. We spent lots of time getting the unified Samba/CTDB build going and chucking out a whole lot of duplicate code. In that process, and as a result of it, we lost a little bit of our portability — we'd lost some corporate knowledge — and we also had other problems reported to us. So we improved portability for Linux on POWER, AIX, and FreeBSD. We spent a huge amount of time doing lock scheduling work, because as people run more and more benchmarks against Samba with CTDB sitting underneath it, we hit bottlenecks — and I wish I understood all of this stuff; it's fascinating. We fixed IPv6 support. A lot of that was a little CLI issue: you could print machine-readable output for status and that sort of thing, there were IP addresses in there, and the separator was a colon by default, which doesn't mix well with IPv6 addresses. So we fixed that. Fixing tests took an enormous amount of time as well: we had some IPv4-specific tests that did network sniffing and so on, to make sure some of the subtleties of CTDB were working, and we spent an enormous amount of time fixing those.

And then there's Autocluster. Back eight years ago — or seven years ago — when I started working with Tridge on CTDB, he said, oh look, we keep creating these virtual clusters; we need something to generate throwaway (I didn't put the word "disposable" in here) — disposable virtual clusters. I was going to do that, but of course, as Tridge does, he went home one weekend and came back and said, hey, I've written the first version. And I've been hacking on that for a lot of years. It's for testing clustered Samba, and it's been written in bash since 2008. This is the cloud, people — written in bash. We do hope at some stage to pull some of the stuff out of there; we've done some massive restructuring, and hopefully we can replace some of the bits with some of the cloudy things like Chef. But we're not there yet. Tridge and I originally spoke about it at LCA 2009. It's a time-consuming thing: a bunch of things changed in RHEL 7, and we wanted RHEL 7 guests to test clustered Samba on — that took some time. We did some modularization work. And to test the CTDB IPv6 support, we had to change the stuff that said, no, don't use IPv6; so Autocluster now supports IPv6. If you want to hack on Autocluster, it's a great project. It doesn't have a web page — it should, shouldn't it? — but it does have a Git repo. You can check that sucker out and have the time of your life.

But wait, there's more. Well, not a lot more. Okay. So we haven't really done much, but we have identified that there's lots of redesign to be done, lots of work ahead of us. The question is: should we start with a clean slate and start implementing things? Now that we have identified various bits and pieces, that sounds good, but as I mentioned, it's a huge step from there to the beginning of working code, and we probably can't wait that long, obviously, because we have a limited development team of two — and we are also doing other stuff for our employer during that time. So an incremental update is obviously the better choice.
The main advantage is that we can harness the existing testing infrastructure, so that every step of the way we can keep testing to make sure we haven't broken anything. But then we realize we'll have to write lots of glue code that provides compatibility between the old code and the new code, and things like that. Well, that's the cost we'll have to bear.

So where do we start? The most important thing, again, as I mentioned, is messaging. And the main problem with trying to address messaging is that it's not contained in one place. We would really like to contain the protocol in one place. So what do we do? We implement libctdb. Hang on. Right. Thank you, Tridge. So, wasn't there a libctdb? Yes, there was a libctdb already. It implemented a few messages, but not very useful ones; none of the database operations were actually implemented. It provided a mostly synchronous and a little bit of asynchronous API, but it's actually quite hard to get an asynchronous API right in a thread-safe manner. So we decided to abandon it. Another thing was that there was no other consumer of libctdb — even CTDB didn't use libctdb, apart from a little bit. So we abandoned it.

And now we have something called the libctdb API. We are not doing protocol handling; we'll focus only on marshalling and unmarshalling, so that at least all the protocol structures and everything can be embedded in one place. And in future, if we change the protocol underneath, Samba doesn't need to know about it. Obviously there are two parts, the client part and the server part. So we basically started rewriting. The protocol marshalling API is done, and I've started rewriting CTDB's client interface using the libctdb API. The hope is that we'll also implement the server-side API and re-implement the CTDB server side using it.

So what about the rest? We have all these big plans. Well, we can keep hacking in our spare time — copious amounts of spare time. Yeah. The problem is that our pace is too slow to keep up with Samba releases, and we don't want to introduce major changes in between releases, so we have to do these big changes across major releases. So there's a better solution: if you know anyone, or you yourselves want to hack on CTDB, please join us. Thank you.

Questions? It's clear as mud to everyone? That's good. Okay — Amitay and Martin, thank you very much. As a token of our esteemed appreciation, there you go. You'll have to share.