Thanks for coming. I'm David Disseldorp, this is Samuel Cabrero from SUSE, and we're going to speak about the witness protocol and its implementation in Samba. We don't have a PA in the room, so I'll just have to try and reach everyone. I'll start off with an introduction to Samba and CTDB, or clustered Samba. We'll then move on to the witness protocol and some of the details of the protocol itself. We have a demo, a video of the witness protocol in action, and then we'll finish with an outlook on where we're heading with this stuff and what we need to do to get it into shape for upstream.

So, starting off with Samba: hopefully you're all familiar with the SMB file server. It handles authentication and Active Directory integration, so you can join a domain or act as a domain controller. With Samba we have one main file server daemon, smbd, which is generally forked for each client connection. We have a pluggable file system back end, the VFS layer, which is what we use for things like Ceph, where we have libcephfs integration, and for Btrfs, where we have some snapshot-specific functionality. We have winbind for the authentication and ID-mapping part.

With the protocol itself, we have a bunch of state which obviously needs to be tracked by the server, for things like open files, leases and user mappings, and we store that in a database. On a standalone system we use TDB, the trivial database, for that. What was initially trivial has, I guess, had a bit of feature creep, in that it now supports things like transactions and record locking. We have multiple writers, so multiple smbd processes writing to those databases at the same time.

For clustered Samba we then have CTDB, which handles the active-active Samba cluster case: it gives us a consistent database across all of the nodes.
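To make the pluggable VFS layer concrete, a Samba share can be pointed at Ceph via the vfs_ceph module. This is a minimal, hypothetical smb.conf fragment: the share name, path and Ceph user are made up, and the available module options vary between Samba versions.

```ini
[global]
    # required for a CTDB-backed, clustered Samba setup
    clustering = yes

[cephshare]                        ; hypothetical share name
    path = /
    vfs objects = ceph             ; route file I/O through libcephfs
    ceph:config_file = /etc/ceph/ceph.conf
    ceph:user_id = samba           ; made-up CephX user
```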
It also has a number of HA features integrated, things like IP failover and service monitoring, so I guess you could say it's also a proper HA stack for a clustered Samba setup. Record placement within the CTDB database is computed based on a hash of the key, so from that we can find the location master, a node which knows the location of that record. From the location master we then have another level of indirection to find out where the record is actually placed, so records can move around the cluster to where they're needed.

Amongst the CTDB nodes we have an election of a recovery master, and in the event of an outage that recovery master performs things like cleaning up records caught mid-transaction. It also requires a cluster mutex; with CTDB that's generally placed on the clustered file system backing the Samba file servers. With Ceph we have a helper binary which uses RADOS locks for that. For IP failover, CTDB uses what's called a tickle ACK, which speeds up the process of a client reconnecting to another node after IP failover in the event of an outage.

This is a look at how things have changed, or are changing, with clustered SMB serving. With Windows, in the past they've had active-passive setups, where this file server role moves across nodes within the cluster. CTDB has always been active-active, to the point where clients are spread across those CTDB or SMB gateway nodes.

So now on to witness, starting with the definition: witness is a new DCE/RPC mechanism to inform clients about changes in the topology or the state of the cluster. Together with new features in the SMB3 protocol, it helps provide transparent failover for the cluster's clients. Additionally, it can be used to load-balance the cluster, because you can tell a client on demand to move to another cluster node.
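The record-placement idea above, where every node derives the location master from a hash of the record key, can be sketched roughly as follows. This is illustrative Python only: the hash function, node count and names are made up and are not CTDB's actual implementation.

```python
# Conceptual sketch: CTDB's real key hash and record layout live in the
# CTDB sources; everything here is a stand-in for illustration.

def tdb_hash(key: bytes) -> int:
    """A simple FNV-1a style hash standing in for CTDB's key hash."""
    h = 0x811C9DC5
    for b in key:
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h

def location_master(key: bytes, num_nodes: int) -> int:
    """Every node can compute the same 'location master' for a key, i.e.
    the node that tracks where the record currently lives."""
    return tdb_hash(key) % num_nodes

# The location master then points at the current data master, which may
# change as records migrate to the nodes that actually use them.
print(location_master(b"locking.tdb/some-file-id", 3))
```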
Before explaining how witness works, let's have a look at how failover worked in SMB1 and SMB2 clusters. In a Windows cluster we have a server that is holding the file server role. The client opens an SMB connection to this node, and if the node goes down the cluster moves the role to another server, but the client has to wait for the TCP timeout before reconnecting. So after the TCP timeout the client reconnects; the protocol does not define any failover mechanism at all in SMB1 and SMB2.

In a Samba cluster failover works better, because Samba implements additional measures to speed up the recovery, as David said before: IP takeover, gratuitous ARPs and tickle ACKs. In a CTDB cluster all nodes are active at the same time and the IP addresses are distributed between the nodes. The client opens an SMB connection, and when a node goes down the cluster enters recovery state and runs the IP takeover algorithm, which moves the IP address to another node. After that, CTDB sends a gratuitous ARP to the client to inform it about the new MAC address associated with the IP, and also sends a tickle ACK, which is a crafted TCP packet with a wrong sequence number; this has the effect of making the client reset the connection automatically, without having to wait for the timeout. So the client reconnects.

In SMB3, Microsoft added new features to the protocol to provide transparent client failover. One is the witness service, which we are going to look at now, and another is persistent handles. I'm just going to give the idea of persistent handles: when the client opens a file on a server, the server has to store what is called the file handle state, which contains information about the leases, the share modes and the locks.
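Before going deeper into persistent handles, the tickle ACK mentioned above can be made concrete. The sketch below only builds the 20-byte TCP header to show the deliberately bogus sequence number; the real sender is CTDB's C code, which also builds the IP header, computes the checksum, and writes the packet to a raw socket. The ports here are made up.

```python
import struct

def build_tickle_ack(sport: int, dport: int) -> bytes:
    """Build a minimal TCP header for a 'tickle ACK': a bare ACK whose
    sequence number is deliberately wrong (zero)."""
    seq = 0                 # bogus: almost certainly outside the window
    ack = 0
    data_offset = 5 << 4    # 20-byte header, no options
    flags = 0x10            # ACK only
    window = 0
    checksum = 0            # left zero here; a real sender must compute it
    urg = 0
    return struct.pack("!HHIIBBHHH", sport, dport, seq, ack,
                       data_offset, flags, window, checksum, urg)

# The client, seeing an out-of-window ACK, replies with its own current
# sequence number; the takeover node has no matching TCP state and answers
# with a RST, so the client tears down the stale connection and reconnects
# immediately instead of waiting for a long TCP timeout.
pkt = build_tickle_ack(445, 50000)
assert len(pkt) == 20
```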
Before persistent handles, if the server crashed this information was lost; with persistent handles the file handle state is made persistent, so when the client reconnects it can request that the server reuse the stored file handle state. In clustered environments this file handle state is also synchronized, or distributed, across the cluster, so a client can open a file on another node in the same state.

In an SMB3 cluster, all nodes are running the witness service, and the client opens two different connections; it's important to remark that these connections always go to two different nodes. When the node holding the file server role crashes, two things happen: the cluster moves the role to another server, and the witness service also tells the client about this movement, so the client can react and automatically reconnect to the other node. Now we are going to explain the demo environment we are using.

[Audience] Sure, okay, so you have those two servers. I'm assuming that you're detecting the crash either with the client or with the second server?

So the question was: how is the crash detected for failover? In the case of witness it's generally driven by the cluster, so the cluster notices an outage and then tells the client about that.

[Audience] So is there an independent entity that is detecting this, or is it the daemons themselves?

It depends on the implementation. For Samba we have CTDB basically managing the clustered SMB gateways, and in that case they notice an outage of a node: they're continuously monitoring the gateway nodes, they notice this outage, and then notify the client for failover. If the quorum, or the majority, of CTDB nodes goes down, then yes, yes.
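The witness flow just described, where a client registers with the witness service and a request then "parks" on the server until the cluster has something to report, can be sketched with a pending-request pattern. The class and method names here (WitnessService, notify_move, and so on) are invented for illustration; the real interface is the MS-SWN DCE/RPC protocol.

```python
import asyncio

class WitnessService:
    """Toy model of the witness AsyncNotify pattern."""

    def __init__(self):
        self._registrations = {}   # client name -> pending future

    def register(self, client: str) -> None:
        self._registrations[client] = asyncio.get_running_loop().create_future()

    async def async_notify(self, client: str) -> str:
        # The request sleeps server-side until a notification is ready.
        return await self._registrations[client]

    def notify_move(self, client: str, new_node: str) -> None:
        # Cluster-side event (outage or manual move): complete the
        # parked request so the client learns where to reconnect.
        fut = self._registrations[client]
        if not fut.done():
            fut.set_result(f"CLIENT_MOVE -> {new_node}")

async def main() -> str:
    svc = WitnessService()
    svc.register("win2012r2-client")
    pending = asyncio.create_task(svc.async_notify("win2012r2-client"))
    await asyncio.sleep(0)                      # request is now parked
    svc.notify_move("win2012r2-client", "ctdb-node-1")
    return await pending

print(asyncio.run(main()))
```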
Just explaining the demo setup: we have our Ceph cluster, and in front of it the Samba gateways, in this case two gateways. These gateways are using the Ceph VFS backend for Samba, and CTDB as the cluster service and key-value store for Samba. We have a bunch of changes on top of our current mainline: Samuel implemented the witness server, I worked on the async DCE/RPC server, persistent handles were implemented by Ralph Böhme from SerNet, and we also have the tevent impersonation changes from Stefan Metzmacher, also from SerNet.

It's a little difficult to see, but we do get a pretty graph on the right-hand side, so we can at least see something; later on I'll try to guide you through what's going on. These are the two CTDB nodes, and the corresponding witness server logs are the other two black squares there. Gee, it's barely readable. In this case we're just initially connecting to the cluster; this is a Windows Server 2012 R2 client. Here we can see that we've just made a connection to the cluster, and here we can see that the client has registered with the witness service: it sends this AsyncNotify request to witness, and that request basically sleeps on the witness server side, with a response sent once a notification is ready for the client.

We're trying to show... I think this is the smbstatus... yes, this is the output of a new "smb witness list" command. Well, you'll have to believe me: this says that the client is registered on CTDB node one. This here is the client name, which is just the computer name; here it tells you that the SMB connection is open to CTDB node zero, and this is the network name. Okay, so this is node one, and this is the smbstatus output, just to check
that the client is connected to an IP that is assigned to node zero. This is the output showing the association of the IP addresses with the CTDB nodes: this was .13, and .13 is on node zero. Now we are preparing the new "smb witness move" command to tell the client to move to the other node: smb witness move, the client name, and the new node.

I guess we should say at this point that in this case we're not triggering a failover; Samuel also implemented the ability to do a manual notification, to request that the client move its SMB or file server connection from one node to another. The connection is throttled, it's in QEMU. This here is the output of smbstatus, checking the address where the client is connected: this is node one. So now we tell the client to move the connection, and it goes down but the copy continues. It requires the witness service on the client to accept these notifications, or to notice and act on these changes. Now we're checking, with smbstatus and the association of the IP addresses in the cluster, that the connection has been moved to the other node.

[Audience] What is additional here? This could be achieved with Samba's gratuitous ARPs and tickle ACKs. Is this something new defined in the SMB3 protocol?

I mean, gratuitous ARPs and tickle ACKs are a Samba addition, but they're outside of the protocol. Microsoft has now defined persistent handles and the witness service just for that; the tickle ACK was something of a trick. I think the main point here is that there's no interruption at the application layer: we have failover between nodes within the cluster without the application, in this case just an Explorer copy, noticing it. It would work just the same, but in that case there's basically an outage at the SMB protocol layer and the client then forces a reconnect. In this case we have a separate witness
service which basically tells the client, please move to a different node. So it's controlled, and done at a layer separate from the application, such that the application doesn't really notice the outage or the move between servers. So it's just a question of whether this could already be achieved with durable file handles.

And now here we have moved the connection again, to CTDB node one, and the file transfer continues. He was just saying that he's moved it again from one node to another, and then moved it back again, while the copy is in progress.

So that's what we have as a prototype. We still have quite a bit of work to do for upstreaming, mostly in the DCE/RPC server layer: we have this source3 implementation of an asynchronous DCE/RPC server, and Samuel has also worked on merging the source3 and source4 implementations, which gets us to what I think is a much cleaner point, where we can then start upstreaming the witness stuff on top of that. Another possible future feature would be automatic load balancing: from the cluster side, seeing which nodes have free resources and balancing client nodes across the cluster more evenly.

This is more on the Ceph side: currently we have CTDB acting as the clustered key-value store, with Ceph backing the file server. We have another option as a key-value store, so we can potentially look at using Ceph instead of CTDB, and then also making use of RADOS classes to offload some of the compute to the database.

So, the question was about offloading compute to the database, and can I expand on that. Samba does some things like traversal, to look up specific records, and this is incredibly intensive, at least in the CTDB case; it's very inefficient. So using RADOS classes for something like that, where we can offload that operation to where the
storage, or where the key-value store, is located would be...

[Audience] RADOS itself, and then just call the...?

No, so I think we'd still use the omap layer for the most part, or at least we'd start with just basic omap on the Samba side, and then when we see what sort of thing can be nicely offloaded to the cluster we could go down that path.

Otherwise, that's it. I think we have a few minutes for questions; otherwise just grab us in the hall. Any very pressing questions? Yes?

So the question was: is there a race condition in the case where we have both things timing out, on the client side and the server side, simultaneously. For the witness case, it's important to remember that the client automatically tries to use separate nodes for the file server and the witness server. So in that case, if the file server goes down we can notify via the witness channel: we have a separate channel to basically manage the failover in that case.
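The traversal-offload idea discussed above can be illustrated with plain Python standing in for RADOS object classes (no Ceph API is used or implied here, and the record shapes are invented): the point is simply that filtering records where they are stored avoids shipping the whole database to the client.

```python
# Toy database: 1000 records with a made-up per-record lease flag.
RECORDS = {f"file-{i}": {"node": i % 3, "lease": i % 2 == 0}
           for i in range(1000)}

def traverse_client_side(records):
    """CTDB-style traversal: pull every record over the wire, then
    filter locally. Returns (matches, records_transferred)."""
    fetched = list(records.items())            # simulated full transfer
    return [k for k, v in fetched if v["lease"]], len(fetched)

def traverse_offloaded(records, predicate):
    """Object-class-style traversal: ship the predicate to the store and
    transfer only the matches back."""
    matches = [k for k, v in records.items() if predicate(v)]
    return matches, len(matches)

local, sent_all = traverse_client_side(RECORDS)
remote, sent_few = traverse_offloaded(RECORDS, lambda v: v["lease"])
assert local == remote and sent_few < sent_all
```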