My name is Vladislav Shpilevoy, and I will talk about the SWIM protocol for building a cluster. My talk will follow this plan: first I will talk about failure detection protocols, the predecessors of gossip protocols, then I will introduce the SWIM protocol and its algorithm. Afterwards I will briefly introduce the Tarantool database, in which SWIM was implemented, and show how exactly it is implemented and with which extensions, which appear to be very important in production systems. After that I will show some simple usage examples, performance measurement results, and plans for further development of SWIM as a protocol and as a part of the Tarantool database.

In the beginning there was a server, a single machine which served client requests, and everything was okay until it got overwhelmed. Many servers were added to help it, and the problem of overload was gone. This is how horizontal scaling was invented, but new challenges appeared. There emerged the task of failure detection in clusters: how to detect that a node has failed. It was needed for eviction of dead nodes, for repair of broken nodes, and for making centralized decisions in clusters, such as cluster leader change, replication master failover, or synchronous replication transaction processing. Since the 1980s there has been an incredible growth of articles, research, protocols, and algorithms trying to solve this problem, starting from simple ones, such as "let's establish connections between all nodes and just send ping messages", up to attempts to develop a reliable broadcast protocol for sending heartbeat messages. Unfortunately, all such solutions, which create all-to-all links, scale very poorly: they give quadratic complexity in message count and connection count on the network, so they just don't work in big clusters.

This is when gossip protocols were invented, also known as infection-style protocols. They are good here because they give at most linear complexity depending on cluster size. How does it work? Gossip protocols describe efficient detection and dissemination of events in the cluster; nodes exchange with each other the information they found in the cluster. The exchange is organized like the spread of an infection. Assume one node detected an event, for example it found that some other node has failed. It will send this event to other nodes one by one in random order. Nodes receiving this event will do the same: they will send it to their random neighbors, and so on, so after some time the whole cluster knows about this event. As we can see, the main difference from plain failure detection protocols is that all nodes spread the event even if just one node found it: they help each other. Gossip algorithms are usually randomized, so we get an even load per node almost for free, without any synchronization, on average of course.

Every gossip protocol follows this idea one way or another, and the differences are in the details. The first difference is how exactly to detect a failure. A simple way: let nodes just ping each other and expect an ACK message; if it works, the node is alive. More complex protocols can try indirect pings in case a direct ping has failed, because if an ACK message wasn't delivered, it doesn't mean the recipient of the ping is dead; it may mean that this particular network link was broken, so some protocols work around this by sending indirect pings. Very complex protocols can even detect a node's own failure.
A node can find this out by sending pings to everyone and seeing that nobody responds: very likely this node itself has failed, and not all the others. For example, it somehow got isolated in its network segment.

The second difference is how to define failure. Simple protocols just say there are alive nodes and dead nodes: a node is dead if it doesn't respond or responds too slowly. Some protocols define multiple discrete states; SWIM, for example, defines alive, dead, and suspected. Especially advanced protocols can define state as a continuous number, for example on a range from 0 to 1, where 0 means fully available and 1 means not available at all. And protocols are free to choose any criteria for moving nodes from one state to another, such as the number of missed ACK messages, ping latency, direct or indirect availability, and many other criteria.

The last difference is how to disseminate an event after it was found. Simple protocols just do it push-style: a node which found an event sends it to every other node, without asking any permission, with every message. Pull protocols are when events are pulled from the nodes which learned about them: for example, a node receives a message saying "give me all the events about some third node" and sends them. Advanced protocols can combine push and pull in some smart way in order to decrease the number of messages in the network.

After all, these protocols give very, I would say, tasty estimations, which are way better than what plain failure detection protocols usually give us. First, if any event happens in a cluster, for example a node failure, at least one other node in this cluster will learn about this event in constant time, not depending on cluster size. After the event is found, it will be fully disseminated to the whole cluster in logarithmic time, so every node will learn about this event. And while these protocols work, they give on average constant network load per node and at most linear load on the whole cluster. As you can see, gossip protocols are really very attractive for their efficiency in detection and dissemination of events, and especially for being really lightweight, both in understanding and in implementation. This is why they are really widely used, even though we sometimes don't know about it. For example, Cassandra, a household name in the world of databases, uses its own gossip protocol for monitoring of cluster nodes. Memberlist, a library with one of the SWIM implementations, is used in Consul. Scuttlebutt is a gossip protocol for huge clusters with tens of thousands of nodes, oriented on reducing network load. There is ScyllaDB, some kind of fork of Cassandra, which also uses gossip for monitoring of cluster nodes. Lifeguard is another SWIM-based work, which aims against false positive failure detection. And there is SWIM itself, which is now implemented in Tarantool.
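Before moving on to SWIM itself, here is a minimal, self-contained Lua sketch of the push-style dissemination idea described above. It is not from the talk: the node tables, the seen sets, and the send function are illustrative, and a real protocol would use UDP sockets and timers instead of direct function calls.

```lua
-- Minimal sketch of push-style gossip dissemination.
-- Nodes are plain Lua tables; 'send' is a direct call here,
-- while a real implementation would use UDP and timers.
local nodes = {}
for i = 1, 10 do
    nodes[i] = {id = i, seen = {}}
end

local function send(node, event)
    if node.seen[event.id] then
        return -- already infected, stop spreading
    end
    node.seen[event.id] = true
    -- Forward to a few random neighbors; each of them repeats
    -- this step, so the event reaches all N nodes in O(log N)
    -- rounds with high probability.
    for _ = 1, 3 do
        send(nodes[math.random(#nodes)], event)
    end
end

-- One node detects an event and starts the infection.
send(nodes[1], {id = 'node-5-is-dead'})
```

The key property is that each node forwards an event only once, so the total message count stays linear in the cluster size.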
What is SWIM? It's a Scalable Weakly-consistent Infection-style Process group Membership protocol. Each part of this name means some important part of SWIM. Scalable means exactly what it says: SWIM is very good at scalability. In constant time we get failure detection, in logarithmic time full dissemination; we have constant network load per node and linear network load on the whole cluster. Weakly consistent means that this protocol is fully asynchronous: there is no synchronization or blocking between cluster nodes. It can lead to the case shown on the slide, where two nodes see the state of a third node differently for some time: this third node failed recently and the event hasn't been disseminated to the whole cluster yet. But we are guaranteed that after some time all the nodes will see the state of this failed node in the same way. Infection-style means that it's a typical gossip protocol: nodes help each other to disseminate events. Group membership means that each node maintains a so-called member table: a table where every other known node is stored with its last known state, last known IP and port, and maybe some other attributes, at which we will look later.

SWIM consists of two very independent components. In this sense it's really similar to Raft, where leader election and replication are two totally independent things; in SWIM as well, we have two totally independent components which could, for example, be used without each other. The first is failure detection: the protocol for how to detect failures, and this is how it works. Each node periodically chooses a random neighbor, from those neighbors which weren't selected for the longest time, for some fairness, and sends a ping message. In this way nodes ping each other and expect ACK messages. If an ACK wasn't delivered, the node is declared suspected: it's not declared dead right away, and an investigation is started. After the suspicion, the sender of the original ping sends indirect pings through several randomly selected neighbors and expects indirect ACKs. If at least one indirect ACK is delivered, then probably the suspicion was just about a temporarily broken network link, or a lost UDP packet: SWIM is based on UDP. If not a single ACK was received, the node is declared dead.

Here the recipient of these pings was moved from alive to suspected, from suspected to dead, or back to alive, and all of these are events. They need to be disseminated: other nodes should also learn that this node is suspected or dead, and this is the job of the event dissemination component. It works as follows. All events are stored in a queue. If there are multiple events about the same node, they are merged into one, so that outdated information is not sent when a newer event overrides it; thus the queue doesn't grow infinitely, it's limited by the number of nodes in the cluster. When it's time to send a ping message, events are attached to the same packet. This is called piggybacking: we attach information from several independent components to the same packet, so these components are linked only by the transport protocol and are not related in any other aspect. Each event is sent a logarithmic number of times and after that it's dropped, because we just can't keep sending it forever over UDP and can't fit an infinite number of events into packets. The logarithm was chosen because it gives a very good probability that each event will be delivered to each node. But I didn't say a 100% probability, just a very good one; we will return to this later, it's one of SWIM's problems.
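To make the failure-detection component more concrete, here is an illustrative Lua sketch of one round. The helpers ping and indirect_ping and the member-table methods are hypothetical; this is not Tarantool's actual implementation.

```lua
-- Illustrative sketch of one SWIM failure-detection round.
-- 'ping' and 'indirect_ping' are hypothetical helpers that
-- return true when an ACK arrives before the timeout.
local function detection_round(self)
    -- Pick the neighbor not checked for the longest time,
    -- for fairness; the choice is otherwise random.
    local target = self:next_round_robin_member()
    if ping(target, self.ack_timeout) then
        target.status = 'alive'
        return
    end
    -- No ACK: do not declare death right away. The link may be
    -- broken or the UDP packet lost, so try indirect pings
    -- through a few random intermediaries first.
    target.status = 'suspected'
    for _, peer in ipairs(self:random_members(3)) do
        if indirect_ping(peer, target, self.ack_timeout) then
            target.status = 'alive'
            return
        end
    end
    -- Nobody could reach the target: declare it dead and queue
    -- the status change for dissemination.
    target.status = 'dead'
    self:enqueue_event(target)
end
```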
Now, how exactly are node statuses updated when events are delivered? A node unpacks a message and sees that some node is declared dead: what should it do? Here is an interesting example showing that it's not really that simple. There are three nodes; B is not available, node A knows this, and C doesn't. A and C exchange this information. Who is right? SWIM says: in this case, assume the worst. Okay, the worst is assumed, and A and C both think that B is dead.

Now B is back alive. How will it restore its state? Because other nodes assume the worst, B will send them messages and they will ignore them. So SWIM needed a way to refute false gossip, a way for nodes to clear their name in other member tables, and this is where the incarnation concept was introduced. Incarnation is a really simple thing, just like a logical clock. It's a counter on each node, which only this node can increment and which it attaches to every message sent from this node. And every other node remembers in its member table the last known incarnation of each other node. So if a packet with a greater incarnation than the last known one is received, we can be sure that this node is actually alive, even if we considered it dead before. In this concrete case, B after some time will learn that other nodes think it's dead. This may happen because B connected back to the cluster, or because A and C still send messages to B, hoping that it will resurrect some time. So after all, B finds out that other nodes think it's dead. B increments its incarnation and sends it to the other nodes; they update B's state back to alive, and they will help B to disseminate this information, because an incarnation update is also an event. So B's name is cleared and it's again alive. And the rule becomes this: assume the worst, but a greater incarnation means newer information.

Since SWIM uses UDP, it faces UDP problems, such as packet duplicates, late delivery of packets, and packet losses. Incarnation helps to protect us from one of these problems, late delivery: if a very old packet somehow gets delivered and contains outdated information, it will contain an outdated incarnation as well and will just be discarded. Not the whole packet, just the events which are considered outdated. The other problems of UDP just don't matter for SWIM: it's okay to duplicate a packet or to lose it, because everything is resent anyway.
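The rule fits in a few lines. This sketch is mine, not code from the SWIM paper or from Tarantool, and the member and event records are hypothetical:

```lua
-- Sketch of the "assume the worst, but greater incarnation wins"
-- rule applied when an event about a member arrives.
local function apply_event(member, event)
    if event.incarnation < member.incarnation then
        -- Outdated information, e.g. a late UDP packet: discard
        -- just this event, not the whole packet.
        return
    end
    if event.incarnation > member.incarnation then
        -- Strictly newer information always wins; this is how a
        -- falsely declared node refutes the gossip about itself.
        member.incarnation = event.incarnation
        member.status = event.status
        return
    end
    -- Equal incarnations: assume the worst of the two statuses.
    local rank = {alive = 1, suspected = 2, dead = 3}
    if rank[event.status] > rank[member.status] then
        member.status = event.status
    end
end
```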
Before we get to the SWIM API, implementation, and extensions, we need to understand something about Tarantool: what it is and how it works, to understand what will happen later. Tarantool is a general-purpose database and application server, working together on board, in one process, in one address space, so you can write your application logic right next to the database, without wasting time on the network to a remote database node. It's written in pure C, sometimes somewhere in C++, but offers interfaces in SQL, Lua, and C. Lua is the main language of our application server, in which application logic is usually written, but you can write it in pure C, of course. We offer two engines: an in-memory engine and an on-disk engine. The in-memory engine has tree, hash, and R-tree indexes; the on-disk engine uses LSM trees, like MyRocks in MySQL. The in-memory engine is persistent as well: data is persisted in a write-ahead log, so after a restart it is recovered and your data is not lost. You can write stored procedures in any of the three languages used for the application server.

Here is the architecture of Tarantool. It consists of three main threads. The first thread is the network thread: its task is reading and writing to client sockets, packing and unpacking data, batching of requests, and so on. The second thread is the write-ahead log thread. Its only task is writing write-ahead logs, tens and hundreds of thousands of them each second, and it's heavily optimized for better memory management and for avoiding frequent system calls. And there is one main thread for all the application logic and for working with the database: creation of tables and indexes, accessing them; and this thread is able to solely serve multiple requests simultaneously. How does it work? It works because this thread implements cooperative multitasking. As we know, there is preemptive multitasking, used in the Linux kernel, when processes are interrupted without being asked, and cooperative multitasking, when you are able to choose when you want to switch to the next task. Here it works as follows: this thread has tens and hundreds of thousands of so-called fibers. From the user's point of view, a fiber looks like a thread: it has its own stack, but it's implemented in user space, you can switch from one fiber to the next to execute them in this thread, and it's scheduled in user space, by the user. This is how it typically works. A user request arrives at our database and is assigned to a fiber. The fiber starts writing some transaction and calls commit. Commit goes to the write-ahead log thread, and while the WAL thread writes this to disk, the main thread switches to the next fiber, so it doesn't waste time on waiting. With this architecture, Tarantool fully utilizes all the CPU cores on which it works, not doing unneeded work.

And here are a couple of short examples which we are going to need to understand the SWIM examples. The first is configuration of Tarantool in Lua: I call the function box.cfg, pass an IP and port on which Tarantool should listen for client connections, and say that it will be a part of a replica set consisting of two nodes, this one and some other node. The second example is how to connect to a remote Tarantool using the built-in net.box module. Here I call require, which is like import in Python, to import the net.box module; I connect to some remote Tarantool, send a ping message, call some stored procedure, and so on. Both examples might look like the sketch below.
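Here is roughly what those two examples look like; the addresses and the stored procedure name are placeholders:

```lua
-- Example 1: configure a Tarantool instance. It listens on a
-- port for clients and is part of a two-node replica set.
box.cfg{
    listen = 3301,
    replication = {'127.0.0.1:3301', '127.0.0.1:3302'},
}

-- Example 2: connect to a remote Tarantool with the built-in
-- net.box module; require() works like import in Python.
local netbox = require('net.box')
local conn = netbox.connect('127.0.0.1:3301')
conn:ping()                  -- check the connection
conn:call('my_stored_proc')  -- call a stored procedure (placeholder name)
conn:close()
```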
Now we are ready to look at the SWIM interface. SWIM is also implemented in pure C and provides a Lua interface, so you can use it in the application server. It's simple to import: just call require('swim') and it's ready to use. After importing SWIM, you can create multiple SWIM nodes inside one process, in case your Tarantool is a part of several clusters at once; it happens sometimes in production. After creation you can configure it. Almost anything I say about SWIM can be configured, starting from the heartbeat rate: how often this node should send pings to random neighbors. Again, I should say it's not purely random, because we send pings only to the nodes which weren't sent a ping for the longest time, to achieve some fairness. Ack timeout is when a ping message is considered unacknowledged. URI is the IP and port. UUID is a very interesting moment: the original SWIM paper doesn't describe how we should identify nodes. The naive way would be to just use the IP and port as an ID, but it's not good when you want to change the IP and port without making other nodes think that the old IP:port and the new IP:port are two different nodes. So we identify nodes by UUID, and this not only allows you to change the IP and port without losing information in other member tables, but is also consistent with Tarantool, because Tarantool instances are also identified by UUID. GC mode is another, I would say, strange moment of the original SWIM: the paper says that if a node is declared dead, it should be deleted immediately from member tables, but that's not good if you want to catch the moment when the node comes back. So we made this behavior configurable: you can conform with the original SWIM, or turn this off so it's really usable in production. GC mode means garbage collection mode.

You can access the member table individually. You can add members manually by calling add_member, if you know the UUID and IP:port; you can remove a member from the member table manually; you can send a broadcast message on a given port; and you can send a ping message manually by calling probe_member. Why would you need probe_member if you have add_member? The problem is that sometimes you don't know UUIDs in advance: for example, they could be generated after the nodes are started. So when you don't know the UUID but you know the IP and port, you can send a ping message. A ping message in SWIM carries the UUID of the sender, and an ACK message carries the UUID of its sender as well, so with probe_member you can make two nodes exchange their UUIDs; they will learn about each other and then will ping each other automatically. You can access individual nodes: take a look at the member table size, or fetch an individual member and look at its current status, its incarnation, UUID, IP and port, and some other attributes, at which we will look later; and iterate over the member table by calling pairs. You can subscribe to events, so as not to poll the member table manually with some period: your function will be called immediately after any update. And a node can voluntarily leave the cluster, without being killed and without making other nodes think it's dead: other nodes will learn that this node left the cluster voluntarily. This is the method quit.

Here is a short example of how SWIM works. I started two SWIM nodes on different ports and subscribed to events on the left node; the subscription function just stores the event in a global variable, so I can print it later in the console. I link these nodes using probe_member; now they know about each other, and I can see this in the member table: size two means this node plus one other node. The event was delivered, I printed it, and I see the member affected by this event and the exact event: is_new. And there is more there, there are many events: is_new_incarnation, is_new_status, and so on. After I drop one of these nodes, the other member notices this, and I didn't do anything manually: as we can see, its status is shown as dead. A reconstruction of this example is sketched below.
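This is roughly how the two-node example can be reconstructed; the ports and UUIDs are placeholders, and the method names follow Tarantool's swim Lua module as I understand it:

```lua
local swim = require('swim')

-- Two SWIM nodes on different ports; UUIDs are placeholders.
local s1 = swim.new({uri = '127.0.0.1:3301', heartbeat_rate = 0.1,
                     uuid = '00000000-0000-1000-8000-000000000001'})
local s2 = swim.new({uri = '127.0.0.1:3302', heartbeat_rate = 0.1,
                     uuid = '00000000-0000-1000-8000-000000000002'})

-- Subscription on the left node: store the event in globals so
-- it can be printed later in the console.
last_member, last_event = nil, nil
s1:on_member_event(function(member, event)
    last_member, last_event = member, event
end)

-- Link the nodes: the ping carries the sender's UUID, the ACK
-- carries the responder's, so both learn about each other.
s1:probe_member('127.0.0.1:3302')

-- After a moment:
--   s1:size()            -- 2: this node plus the other one
--   last_event:is_new()  -- true for the newly discovered member
--   last_member:status() -- 'alive'
-- After dropping the second instance with s2:delete(), the first
-- one notices it on its own and the status becomes 'dead'.
```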
Now, two extensions of SWIM which appeared to be really useful, and without which SWIM was just impossible to use in production. I will show each extension as a problem which classical SWIM doesn't solve, with the extension being the solution. The first extension is anti-entropy. Here is an example: we have two nodes, A and B, which know about each other and exchange ping messages, and a node C is added which knows only about B. For example, B could be a well-known node, which every other node knows about and from which they hope to fetch the whole cluster topology. By classical SWIM, B should now send an event to node A saying "hey, there is a new node C, you can ping it as well", so nodes A and C would learn about each other. But this is UDP, so this message can be lost, and since events have a limited lifetime, it just won't ever be sent again in classical SWIM, and A and C will never see each other. This is solved by the anti-entropy component. It's a third component of SWIM, also independent of the other components, and it's very simple: with each message we just send a random subset of the sender's member table, and this allows the nodes to synchronize things which weren't disseminated by the dissemination component. Here B will always send this anti-entropy section; sooner or later it will contain A and C, and this way A and C will learn about each other eventually.

The next extension is payload. The problem is this: if we found a SWIM instance on a given IP and port, how will we learn what else works on that machine? We only know the SWIM port; we don't know on which port Tarantool works there, or whatever else uses this SWIM instance. We needed a way to disseminate not just SWIM events but arbitrary data defined by the application, and that's payload. You can set your own data, up to 1200 bytes, containing any information you want to send to other nodes, and it will be disseminated to the whole cluster just like any other piece of data, like incarnation, IP and port, or whatever else SWIM defines. In the example sketched below, I start two SWIM nodes; on one of them I start Tarantool on a given port and set the payload of this instance to the port of Tarantool and some arbitrary second key. set_payload is a Lua interface, so you can set any Lua object as a payload: a string, a number, a table, an array, a map, or raw data created from your C functions. Now I connect them, this time using broadcast, which is just the same as probe_member but sends on all interfaces using a given port, and I can see this payload on the other node: it was delivered, and I can use the delivered port to connect to the remote Tarantool, and it works.

The existence of payload makes it possible to send here not only IP and port, but also a login, password, or any other credentials, for example to access that same Tarantool, and this makes it impossible to use in open networks without encryption. You could encrypt your payload by yourself, but the other fields of SWIM would stay unencrypted; with the encryption component you can encrypt the packet as a whole. Even the UDP header is considered compromised, so its needed parts are duplicated in the UDP packet body and are also encrypted. There are several algorithms, which you choose using the set_codec method, together with the mode of operation of the algorithm and your private key, and all packets will be encrypted inside transparently and decrypted on the receiver nodes.
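A sketch of the payload and encryption usage; the ports, the payload fields, and the 16-byte key are placeholders, and the option names follow Tarantool's swim module as I understand it:

```lua
local swim = require('swim')

local s1 = swim.new({uri = '127.0.0.1:3301', heartbeat_rate = 0.1,
                     uuid = '00000000-0000-1000-8000-000000000001'})
local s2 = swim.new({uri = '127.0.0.1:3302', heartbeat_rate = 0.1,
                     uuid = '00000000-0000-1000-8000-000000000002'})

-- Any Lua object up to 1200 bytes can be the payload; here it
-- advertises the port on which Tarantool listens on this machine,
-- plus an arbitrary second key.
s1:set_payload({tarantool_port = 3313, version = 1})

-- Link the nodes with a broadcast on the SWIM port instead of
-- probe_member; the payload is then disseminated automatically.
s2:broadcast(3301)
-- Later, on the second node:
--   local m = s2:member_by_uuid(s1:self():uuid())
--   m:payload().tarantool_port  -- 3313, usable for net.box

-- The encryption component encrypts whole packets transparently;
-- every cluster member must be configured with the same settings.
s1:set_codec({algo = 'aes128', mode = 'cbc', key = '1234567812345678'})
s2:set_codec({algo = 'aes128', mode = 'cbc', key = '1234567812345678'})
```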
The next extension is the restart detector. The problem is that incarnation sometimes is not sufficient; its main problem is that it's not persistent. Let's consider an example. There are two nodes, A and B. Node A has incarnation one and a payload, the blue circle, and node B knows this information. Now A is turned off and back on, but with a different payload. Incarnation is not persistent, so after the restart A has the same incarnation but a different payload, and when it tries to tell this to node B, node B compares the received incarnation with the last known incarnation and just doesn't do anything, because it thinks it already knows everything about A. So incarnation alone is just not enough here, and it was fixed by splitting incarnation into two parts: generation and incarnation. Generation is the persistent part of incarnation, while incarnation itself works exactly like in classical SWIM, and all information is compared first by generation and then by incarnation values. Here is how it works: you can either set the generation manually, if you persist it yourself in Tarantool or somewhere else, or leave the default, which is the current timestamp. Now I configure a SWIM instance, create a second one, and connect them. I restart the first instance with a newer generation, and I can see this as an event on the other instance. So the problem of the payload not being disseminated is solved, and we also have a restart detector: this event can be seen as a new generation, and you can somehow use it in your application.

Finally, we are ready to build a cluster, as was said earlier in the talk. The formal description of the task is this: we have N data centers, in each data center we start two nodes, and we want each data center to have one replica set of two nodes. I will write a script which runs on each instance, the same script everywhere. It takes four arguments: the data center ID, the SWIM IP and port, the Tarantool IP and port, and the UUID of this node. First I write the function which will be called each time this node learns of an event in the cluster: we take a look at the event, and if it says that a new node is discovered in our data center, we add it to the nodes table. Then I activate the subscription, configure SWIM, and set the payload, so other nodes will know my data center and the IP and port of my Tarantool. Then I start a background fiber which tries to find new nodes by periodic broadcasts, one per second, and I just wait until a second node in my data center is found. After this I am ready to start Tarantool by calling box.cfg with replication, and everything works out of the box from here; a sketch of the whole script follows below. And we have built the cluster.
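Here is how that script might look; the argument handling, the SWIM port used for broadcasts, and the payload field names are my assumptions, not the exact code from the talk:

```lua
-- Sketch of the per-instance bootstrap script described above.
local swim = require('swim')
local fiber = require('fiber')

-- Four arguments: data center ID, SWIM URI, Tarantool URI, UUID.
local dc_id, swim_uri, tnt_uri, uuid = ...

-- Tarantool URIs of the nodes of our data center, self included.
local dc_uris, seen = {}, {}
local function on_event(member, event)
    if event:is_new() or event:is_new_payload() then
        local p = member:payload()
        local id = tostring(member:uuid())
        if p and p.dc_id == dc_id and not seen[id] then
            seen[id] = true
            table.insert(dc_uris, p.tnt_uri)
        end
    end
end

local s = swim.new({uri = swim_uri, uuid = uuid, heartbeat_rate = 1})
s:on_member_event(on_event)
-- The payload tells everyone our data center and Tarantool URI.
s:set_payload({dc_id = dc_id, tnt_uri = tnt_uri})

-- Background fiber: look for new nodes with one broadcast per
-- second; 3302 is a placeholder for the agreed SWIM port.
fiber.create(function()
    while true do
        s:broadcast(3302)
        fiber.sleep(1)
    end
end)

-- Wait until the second node of our data center is discovered,
-- then start Tarantool; replication works out of the box.
while #dc_uris < 2 do
    fiber.sleep(0.1)
end
box.cfg{listen = tnt_uri, replication = dc_uris}
```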
There are other applications where SWIM can be really useful. The first is monitoring; this is what gossip protocols were invented for, after all. And, surprisingly, leader election algorithms: usually it's not only about the election of a leader, it's also about checking that the leader is still alive, that is, monitoring of the leader, and this monitoring component can in theory be totally replaced by SWIM. For example, in Raft, where the elected leader is being monitored by the other nodes, we can replace this part of Raft by SWIM.

How fast is SWIM? To check this, I created a program which simulates clusters of any given size in a virtual environment; it was used mainly for unit testing, but appeared to be really useful for benchmarking as well. Here is a plot of how fast an event is disseminated depending on cluster size, if each node sends pings and events once per second. We can see that in a cluster of 25 nodes, the failure of one node was fully disseminated to all the other nodes in 4 seconds; in 6 seconds for 100 nodes, 7 seconds for 300 nodes, between 7 and 8 seconds for 500, and just 8 seconds for 800 nodes. And let's compare this with the theoretical estimation, a base-2 logarithm: we can see that it's very close. So SWIM is just really fast, and it doesn't add overhead to your application, because its implementation is very efficient, in pure C, not in Lua, so there is no garbage collection or anything else which could waste time.

So, in summary, what can SWIM give you? It can give you detection of events in the cluster, such as failures or arbitrary other events, and dissemination of these events over the whole cluster, and it's very simple to understand, to use, and to implement. It's available: firstly, it's available as an open source project, so you can just go to GitHub and take a look at all the code, commits, and documentation; and in terms of the application, it's available before Tarantool is configured, before the application is started, so you can use it to start Tarantool, as I did in the build-cluster script. It's already applied in production systems and appeared to be very good, especially in conjunction with the vshard module: Tarantool vshard is a horizontal scaling module for Tarantool, where monitoring is also a really important task. And what are the plans for further development of SWIM, as a protocol and as a part of Tarantool? First, as I said, it's a really promising idea to replace some part of Raft by SWIM, and we are going to try this. SWIM can be used for cluster auto-build and auto-configuration: you just start several Tarantools, and they discover each other and join into a replica set without any configuration. And there are some ideas about more extensions to SWIM, which could be added especially against false positive failure detections. Thanks for your attention.

Q: [inaudible question about encryption]
A: No, we encrypt in user space.

Any more questions in the room?

Q: On your graph for performance, was that measured with any network failures in it?
A: Here it was possible to add packet losses, with a certain configurable percentage.

Q: What's the exact relationship between the SWIM protocol in your implementation and Tarantool? Are they independent of each other, or do they need each other?
A: Could you please repeat?
Q: There is the SWIM implementation, and then there is Tarantool. Is it one thing, or are they independent of each other?
A: They are so independent that even in the code they almost do not intersect.
Q: Thank you.

Q: Thank you for the talk. Do any other projects use the SWIM protocol?
A: It's used in production systems. You mean outside of Tarantool?
Q: Yeah, other than that. Any additional?
A: Yeah, at least Cassandra, Consul.
Q: I thought those were using more custom kinds of gossip protocols. Are they using SWIM as a kind of...
A: Some of them use exactly SWIM, but also with some extensions. For example, such a thing as the anti-entropy extension is implemented in almost every SWIM implementation, which proves that it's a necessary thing.

Okay, thanks very much for coming to talk to us.