Okay. So I think everyone's here now. The talk today is called Building High Performance Erlang Clients, using a framework called Shackle.

A little bit about myself: my name is Louis-Philippe Gauthier, I'm from Montreal, Canada, and I work for a company called AdGear. What we do is build a real-time bidding platform for online advertising. I've been working on this project for the last five years now, and one of the first tasks I had to accomplish was building a client to talk to the Cassandra database. At first I thought it would be easy, but unfortunately scaling clients in Erlang can be quite complicated, because you hit some bottlenecks pretty quickly.

So the problem I'm trying to solve with this framework is basically for an application to speak to a service. What's a service? A service, in this case, is something you can communicate with over a TCP socket. The service speaks a protocol, which can be ASCII or binary, and which can be either synchronous or asynchronous. Examples of services are Cassandra, Memcached, Kafka, HTTP/2 servers, etc.

Now, I had four goals for this framework. The first one was speed: we want individual requests to complete quickly, in the sense that we want to minimize latency. The second was concurrency: it's good to be able to do one request quickly, but it's even better if you can do 10,000 at the same time, really quickly. The third was safety. In Erlang we often say "let it crash" (which I'm guessing was presented in the previous talk), where you have a supervisor that restarts your process, but this comes at a cost. If you're doing 10,000 requests a second and a process keeps crashing, there's overhead to that: the crash path is usually slower, and you'll probably be logging information about the crash, so now you're writing to disk and adding I/O work.
So often you end up in a loop where you're penalizing yourself even more, and the system can crash completely. Finally, the last goal was reusability: the framework needs to be reusable for different protocols and adaptable to different concepts.

On to the design process. Over the last four years I've implemented many, many clients, for Cassandra and for other databases and services, and what follows is a series of designs, from the simplest one up to what we've arrived at now with Shackle.

The first design, the most basic, naive way of doing it: whenever a process needs to talk to the external service, it opens a new socket, does all the session setup it needs, and then sends the request. The problems with this are pretty obvious. One, every request has to reconnect, and that has an overhead cost. Two, you have to redo the setup every time; by setup I mean, for example, if you're using Cassandra you might want to set the keyspace at the beginning of the session so you don't have to resend it every time. And finally, each calling process gets its own connection, so there's no limit, and you'll probably end up hitting an OS limit eventually by running out of ports or running out of memory.

To fix this, the second design is to stop reconnecting every time: you use a gen_server, which has state, and keep the socket in that state. Now we've solved the first two problems, we don't have to reconnect and we don't have to redo the setup, but we have a new problem. The server is running somewhere in the VM, but how do we address it? Where is my server, what's its process ID? We can register the name, but even then, the caller needs some way to find that name. So design number three is to use a pool manager.
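Design two can be sketched roughly like this. This is a minimal sketch, not Shackle code: the module name, the "INIT"/"OK" wire strings, and the address are made up for illustration, and it still has design two's problem of one blocking request at a time.

```erlang
-module(db_client).
-behaviour(gen_server).

-export([start_link/0, request/1]).
-export([init/1, handle_call/3, handle_cast/2]).

-record(state, {socket}).

start_link() ->
    %% register under a known name so callers can find the process
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

request(Data) ->
    gen_server:call(?MODULE, {request, Data}).

init([]) ->
    %% connect once and do the session setup once, keeping the
    %% socket in the server state for all later requests
    {ok, Socket} = gen_tcp:connect("127.0.0.1", 8080,
                                   [binary, {active, false}]),
    ok = gen_tcp:send(Socket, <<"INIT">>),
    {ok, <<"OK">>} = gen_tcp:recv(Socket, 0),
    {ok, #state{socket = Socket}}.

handle_call({request, Data}, _From, #state{socket = Socket} = State) ->
    ok = gen_tcp:send(Socket, Data),
    {ok, Reply} = gen_tcp:recv(Socket, 0),
    {reply, Reply, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```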
I think the first pool I used when I started implementing clients was poolboy, and at the time, at least when we tried it in production at high QPS, it just exploded; it couldn't keep up with the load. The problem with this approach is that one gen_server, the manager, receives all the messages, so it's a single point of contention, a bottleneck. If the call to it is synchronous, you can only do one request at a time. If it's asynchronous, you have no back pressure: every request you send enqueues a message in the process message queue, and those queues are unbounded, so if anything happens and the clients block, the message queue just grows and grows until you run out of memory.

So, the problems, like I just said: out of memory is a possibility; if your pool manager is synchronous, it's a single point of contention, and if it can't keep up with its work, the message queue grows again. And there's one more: you now have to send two messages, one to the manager and one to the server, and if those message queues are not empty, every message waits before being processed, so you add extra latency. Chaining queues one after the other like this gives you buffer bloat, and it escalates into more and more latency.

To solve this, there's a trick in Erlang: use ETS. ETS is the Erlang Term Storage, basically a global hash table that has been optimized for concurrent reads and concurrent writes. What we do is register the pool in ETS, with information about the names of the workers and how many workers there are. Now callers can generate worker names themselves and send messages directly to those servers.

The issue with this design, in the naive implementation, is that we call the workers with gen_server:call, which is a blocking call, so each server can only handle one request at a time. We already have some concurrency because there are multiple servers, but each server is serial, so we need a way to solve that. The way to solve it is to switch to gen_server:cast; casting into the gen_server is asynchronous. But now we lose some information: with gen_server:call we receive the caller's pid and a tag for returning the response; with gen_server:cast we don't have that anymore. The trick is to introduce a queue inside the gen_server: in the state we keep a queue, and when we receive a request, we enqueue the information we'll need to send back the response; when the response arrives, we dequeue that item and reply to the caller.

This solves concurrency at the gen_server level, but once again we hit out-of-memory problems: now everything is asynchronous and there's no back pressure anywhere. If the service slows down, because of an I/O problem or anything else, the message queues just grow and you run out of memory again. The trick, once again, is ETS: we use what ETS calls counters to keep count of the number of concurrent requests per worker.

This design is pretty solid, but for performance reasons we can still improve it. The problem is that gen_server offers a lot of functionality we don't need here, so there are code paths where we're losing cycles that could go to useful work, and the queue implementation we're using, the Erlang standard library one, is also not optimal. To fix that, instead of gen_server we switched to proc_lib, which is what gen_server itself uses under the hood.
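The cast-plus-internal-queue trick from a couple of paragraphs back looks roughly like this inside a gen_server. A sketch, not Shackle code: it assumes the protocol returns responses in request order and an active-mode socket, and `From` is the caller's pid passed along explicitly in the cast, since cast doesn't provide one.

```erlang
%% Accept requests via cast, remember who to reply to in a queue,
%% and reply when the socket delivers a response.
handle_cast({request, From, Data}, #state{socket = Socket,
                                          queue = Queue} = State) ->
    ok = gen_tcp:send(Socket, Data),
    %% enqueue what we need to send the response back later
    {noreply, State#state{queue = queue:in(From, Queue)}}.

handle_info({tcp, _Socket, Reply}, #state{queue = Queue} = State) ->
    %% responses arrive in order, so the head of the queue is the caller
    {{value, From}, Queue2} = queue:out(Queue),
    From ! {response, Reply},
    {noreply, State#state{queue = Queue2}}.
```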
proc_lib gives you functions so that your process can be supervised, plus traceability functions so you can inspect the state of the server.

To recap what we keep in ETS (again, the Erlang Term Storage): the pool info, which is a tuple, including the pool type and the strategy we're using; if we're doing round robin between our workers, a counter that we increment to loop around the workers; the backlog, which is the number of in-flight requests per worker; and the queue. I forgot to mention on the previous slide that we switched the queue from the Erlang queue module to an ETS-based queue; that was for a different reason, and we'll get back to it later.

So, Shackle's architecture. At the core there are four modules that implement the core functionality. The first one is shackle_pool. shackle_pool supports different strategies, either random or round robin, and it leverages ETS so that there's no manager: we use the global hash table to distribute work across workers, and since we're only doing reads on the key where the pool info is stored, there's no contention and very, very low overhead. It's almost constant time, even as you increase concurrency. How it works: you look up the pool info, and from that you know the strategy. If the strategy is random, we generate a random number from one to the number of workers; if it's round robin, we update the counter and take the remainder modulo the number of workers to pick the next worker. After that we generate the worker name: we know the pool name and the number, so we just build an atom like, for example, my_worker_34.

The next part of the core is the backlog. The backlog protects against out-of-memory errors: it gives us back pressure, so that the message queues don't grow infinitely until the server dies. We have one backlog per server.
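A sketch of the manager-less worker selection just described. The table and key names here are made up, and a production version would precompute the worker-name atoms at pool start rather than building them per request, since `list_to_atom/1` is comparatively expensive.

```erlang
%% Pool rows like {PoolName, PoolSize, Strategy} and
%% {{PoolName, counter}, 0} are written once at pool start, so
%% per-request work is a contention-free read (plus one counter
%% bump for round robin).
server_name(PoolName) ->
    [{_, PoolSize, Strategy}] = ets:lookup(pool_index, PoolName),
    N = case Strategy of
            random ->
                rand:uniform(PoolSize);
            round_robin ->
                Counter = ets:update_counter(pool_index,
                                             {PoolName, counter}, 1),
                (Counter rem PoolSize) + 1
        end,
    list_to_atom(atom_to_list(PoolName) ++ "_" ++ integer_to_list(N)).
```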
Again, this leverages ETS. It uses counters, specifically ets:update_counter, which gives us a fast operation that is also atomic; we do a write, but we get the read for free, because the call returns the resulting value. For this table we use a special ETS option called write_concurrency, which helps reduce contention even further.

Now let's look a little at the code implementing the backlog. We use a neat trick here: you can pass multiple update operations to update_counter, and they get executed atomically. In this case we call update_counter on the backlog ETS table, with the server name as the key, passing two update operations. The first one tells it to update the second element of the tuple by zero, which is effectively a read; the second updates that same element by one, with a threshold: if we reach the backlog size, don't go higher, stay at the backlog size. Since update_counter returns the resulting values, we get back the current size, and the size incremented only if it was below the limit, so we can tell whether we're over the limit or not. That's the check we do afterwards: if the two returned values are the same, the counter wasn't increased, which means we're already at the limit, so we return false; if the value was increased, we return true, because there's still room to send more requests.

Next, shackle_server. As I said previously, it uses proc_lib, which is what gen_server implements its behaviour with; it lets your module be supervised and inspected with introspection calls. Everything inside the server itself is as asynchronous as possible: we want to keep the message loop that processes messages as tight as possible, and never block. Finally, wherever we can, we use binary matching, or function-head matching in general, so the code is as fast as possible.
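The backlog check can be sketched like this, assuming a `backlog` table created with `{write_concurrency, true}` and a row `{ServerName, 0}` inserted per server (names and the size constant are illustrative):

```erlang
-define(BACKLOG_SIZE, 1024).

%% Returns true if there is room for one more in-flight request.
%% The two update operations run atomically: {2, 0} reads the
%% current value (update element 2 by zero), and
%% {2, 1, ?BACKLOG_SIZE, ?BACKLOG_SIZE} increments element 2 by one
%% but, past the threshold, pins it at the backlog size.
check_backlog(ServerName) ->
    UpdateOps = [{2, 0}, {2, 1, ?BACKLOG_SIZE, ?BACKLOG_SIZE}],
    case ets:update_counter(backlog, ServerName, UpdateOps) of
        [Value, Value] ->
            false;  %% counter wasn't incremented: already at the limit
        [_, _] ->
            true    %% incremented: room for one more request
    end.
```

A matching decrement (update by -1, floored at 0) would run when the response comes back, releasing the slot.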
We also use iolists. An iolist is a special data type in Erlang, a list that can contain binaries and other iolists, which you can hand to an I/O driver such as a socket; it gets serialized into a binary "magically" by the VM, at the C level instead of in Erlang code. That saves allocating a new binary, as you would if you called iolist_to_binary yourself inside the server: instead of appending binaries together and reallocating memory for the result, I just build an iolist of the existing binaries. (I don't remember exactly what's allowed inside an iolist; check the spec in the docs.)

The server itself is skinny. By that I mean there's no extra functionality: we kept it as DRY as possible, there are no special code paths for anything, and the only code it contains is what it needs to work.

Then there's shackle_queue, which once again leverages ETS; I think there's a pattern here: if you're doing anything performance-sensitive in Erlang, ETS is kind of your go-to trick to cheat. One of the nice things about using ETS in this case, instead of the queue module from the standard library, is that you can handle out-of-order items. Some protocols, for example Cassandra in its newer native protocol, don't guarantee the order of responses, so if you're using a simple queue and just popping items off it, you're stuck when responses arrive in the wrong order. With ETS you can go get the key of exactly the item you need. Once again there's no contention, since we key these queues by server name and there's only one server reading each queue: one consumer, and therefore no locking.
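A sketch of the idea behind an ETS-backed, out-of-order queue (the table name is made up and the real shackle_queue differs in detail): entries are keyed by `{ServerName, RequestId}`, so any response can claim exactly its own caller.

```erlang
%% Store the caller info under the request id when the request is sent.
queue_in(ServerName, RequestId, Caller) ->
    true = ets:insert(queue_table, {{ServerName, RequestId}, Caller}).

%% Claim a specific entry when its response arrives, in any order.
queue_out(ServerName, RequestId) ->
    Key = {ServerName, RequestId},
    case ets:lookup(queue_table, Key) of
        [{Key, Caller}] ->
            true = ets:delete(queue_table, Key),
            {ok, Caller};
        [] ->
            {error, not_found}
    end.
```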
Next, we're going to build an actual client using Shackle. This client is what I use to test Shackle itself; if you go look at the GitHub repo, you'll see the unit tests implement it. The service we're going to call is an arithmetic service: it talks over TCP on port 8080, you have to set up the session to be able to do operations, and it supports two operations, addition and multiplication. The setup is really trivial: you send "init" and receive back "ok". To do an operation, the request is a request id, which is a tiny int (an 8-bit integer), the operator, which is a byte, and two operands, which are two tiny ints. The response is the request id, again a tiny int, and the result, which is a short.

To implement the client you implement the shackle_client behaviour, which has six callbacks: first options, then after_connect, handle_request, handle_data, handle_timing, and finally terminate. The lifecycle looks like this. When you start your pool, the first thing it does is create new connections, and before it does that it calls the options callback, which says which port, which IP, which TCP options and so on to use for the new connection. Once connected, it calls the after_connect callback, where you do all the session setup; like I was saying earlier, for Cassandra that might be setting the keyspace, and for this client it's sending the init and receiving the ok. Then, once the client is initialized, when a caller sends a request, it's received in the handle_request callback, which is in charge of serializing the Erlang term into a binary to be sent on the wire.
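Put together, the behaviour module's shell looks something like this. The callback names are as given in the talk; the arities are my guesses and may differ between Shackle versions.

```erlang
-module(arithmetic_client).
-behaviour(shackle_client).

-export([options/0, after_connect/2, handle_request/2,
         handle_data/2, handle_timing/2, terminate/1]).

-record(state, {
    buffer = <<>>,         %% leftover bytes from a partial TCP read
    request_counter = 0    %% used to generate request ids
}).

%% options/0        -> ip, port, tcp options, initial state
%% after_connect/2  -> session setup: send "init", expect "ok"
%% handle_request/2 -> encode an Erlang term into wire bytes
%% handle_data/2    -> decode wire bytes into {RequestId, Reply} pairs
%% handle_timing/2  -> per-request timing metrics
%% terminate/1      -> cleanup when the pool is closed
```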
When we get a response back from the service, handle_data is called; it's in charge of decoding that data and sending it back to the caller. After all of this is done, another callback is called, handle_timing, which receives timing info about the request, so we can monitor request latency and spot performance issues. Finally, when you're closing the pool, terminate is called, where you can clean up whatever you put in your client state.

So let's implement our first client, starting with options/0. options/0 has to return ok and a list of client options. In this case we know we need to talk on port 8080; we set reconnect to true, so that whenever the connection drops we reconnect; and we initialize the state, a record called state, as empty. These are the client options you can use: connect_options, which are just gen_tcp connect options; the ip of the service; the port; and reconnect, reconnect_time_max and reconnect_time_min, which drive the exponential backoff we use when reconnecting (min is the minimum delay before a reconnect attempt, max the maximum). state is your client state, which you'll receive in all the other callbacks.

The next callback to implement is after_connect, which does the setup on the connection. For this service we have to send init, so we use gen_tcp:send on the socket we receive as the first argument. If the send succeeds we get ok, and then we do a blocking receive on that socket for the response. If we receive ok, we return {ok, State}, meaning the connection and the client are good and ready to be used; if not, we return {error, Reason}, and Shackle will try to reconnect.
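Reconstructed from the description above; the exact option names and the "INIT"/"OK" wire bytes are illustrative rather than taken from the real test client.

```erlang
options() ->
    {ok, [{ip, "127.0.0.1"},
          {port, 8080},
          {reconnect, true},
          {state, #state{}}]}.

after_connect(Socket, State) ->
    case gen_tcp:send(Socket, <<"INIT">>) of
        ok ->
            %% blocking receive for the session setup reply
            case gen_tcp:recv(Socket, 0) of
                {ok, <<"OK">>} ->
                    {ok, State};
                {error, Reason} ->
                    {error, Reason}
            end;
        {error, Reason} ->
            {error, Reason}
    end.
```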
The next callback is handle_request. It receives the request term, in this case the operation with a and b as the operands, plus the state; we'll see a bit later what's in the state, but one thing it holds is a request counter, which we use to generate a request id for our protocol. So the first thing we do is generate a request id, then we build the request data. handle_request returns ok; the request id, which is used as the key for the queue, so all the data for this request is keyed by it; the data we want to send to the service; and the new state, where we increment the request counter so the next request gets a new request id.

For this example, the state has two fields: a buffer, which we initialize to an empty binary, and a request counter, which we initialize to zero. The function that generates the request id just takes the remainder modulo 255, so we never overflow the tiny int value the protocol defines.

To actually encode the request for this protocol, we use Erlang's fantastic bit syntax. The first part, the request id, is an 8-bit integer, followed by the operation, which is one byte; for the opcode we match on the atom: an addition is 1, a multiplication is 2. Finally, there are two more 8-bit integers, the operands.

Now that the data is encoded and sent on the wire, we have to handle the data coming back in the response, which we do with handle_data. The first argument, data, is what came in off the wire; from the state we extract the buffer, so we can prepend anything left over from a previous read that was incomplete. Reads can be incomplete simply because TCP, with Nagle's algorithm and friends, may squish different packets together, so a single read may contain partial data or several responses at once.
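The encoding just described, sketched with the bit syntax (a reconstruction of the example client, with the field layout as described in the talk):

```erlang
handle_request({Operation, A, B},
               #state{request_counter = Counter} = State) ->
    %% modulo keeps the id inside the protocol's tiny int
    RequestId = Counter rem 255,
    %% frame: 8-bit request id, 8-bit opcode, two 8-bit operands
    Data = <<RequestId:8, (opcode(Operation)):8, A:8, B:8>>,
    %% RequestId keys this request in the queue until the reply arrives
    {ok, RequestId, Data, State#state{request_counter = Counter + 1}}.

opcode(add)      -> 1;
opcode(multiply) -> 2.
```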
So first of all we append the new data to the buffer, and then we parse it. Parsing returns the replies and the remaining buffer, which we store back in the state for the next handle_data call. The replies are pretty simple: each is a tuple whose first element is the request id and whose second is the actual term we want to send back to the caller.

Let's check how parse_replies works. Again, it's just binary matching: if there's a complete response, it matches the first clause; if not, it falls through to the second. In the first clause we match an 8-bit integer, the request id, then a 16-bit integer, the result of our operation, and the rest matches the remainder of the binary, which we need to re-parse because there may be multiple responses in the same TCP packet we received. Once done, we return the accumulated responses and the leftover buffer in a tuple.

handle_timing, like I said previously, handles some metrics about the request. The first metric we get is "pool", the time from when the caller makes the call to when the request is sent on the socket; then from the socket to the receive, where handle_data starts, is the "service" time; and the "response" time runs from after the service to the caller receiving the actual response. In this example I'm calling statsderl, which is a StatsD client, and logging these values so I can graph them and see the trend in my request times.

Finally, terminate is called when the client is actually terminating, that is, when you're closing the pool. What you want to do there is basically cleanup: if you have a timer, for instance, like in this example, you want to cancel the timer, and maybe log that you're terminating.
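A sketch of handle_data and the parse loop, assuming the response frame described above: an 8-bit request id followed by a 16-bit result.

```erlang
handle_data(Data, #state{buffer = Buffer} = State) ->
    %% prepend the leftover bytes from the previous TCP read
    Data2 = <<Buffer/binary, Data/binary>>,
    {Replies, Buffer2} = parse_replies(Data2, []),
    {ok, Replies, State#state{buffer = Buffer2}}.

%% complete frame: accumulate the reply and keep parsing the rest
parse_replies(<<RequestId:8, Result:16, Rest/binary>>, Acc) ->
    parse_replies(Rest, [{RequestId, Result} | Acc]);
%% incomplete frame: return what we have plus the remaining buffer
parse_replies(Buffer, Acc) ->
    {Acc, Buffer}.
```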
Now that you've implemented the shackle_client behaviour, how do you use it? You need to start a client pool. To start a client pool you call shackle_pool:start, passing the pool name, the client (the module you just implemented), and options. The possible options are backlog_size, a positive integer giving the maximum number of in-flight requests per client; pool_size, the number of clients to start; and pool_strategy, the strategy used to pick among those clients.

If we call our start function and watch with ngrep, we'll see "init" sent to port 8080 on the wire, and "ok" received back. To implement the two calls to our operations, add(A, B) does a shackle:call on the pool name with the tuple {add, A, B} that we handled earlier in handle_request, and multiply is the same thing with the multiply operator. If we run it, we receive 15, and you can read the bytes: the first is 0, because the request id is 0; then 1, since it's an addition; then 5 and 10, the operands. In the response you'll see 0 again, the request id, and then 15, because five plus ten is fifteen. Calling multiply is basically the same thing, except this time the request id is 1, the operator is 2 because it's a multiplication, then 2 and 6, and the response is request id 1 with result 12, two times six.

You can also call asynchronously. To do an asynchronous call you use cast: shackle:cast with the pool name and your operation term returns {ok, RequestId}, and with that request id you can later do a receive to get the response.
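Putting it all together, usage looks roughly like this. Pool option names are as listed above, and the exact function signatures (in particular the receive half) may vary between Shackle versions.

```erlang
%% start a pool of 16 connections to the arithmetic service
ok = shackle_pool:start(arithmetic_pool, arithmetic_client,
                        [{backlog_size, 1024},
                         {pool_size, 16},
                         {pool_strategy, random}]),

%% synchronous call: blocks until the response arrives
{ok, 15} = shackle:call(arithmetic_pool, {add, 5, 10}),

%% asynchronous: cast returns a request id immediately...
{ok, RequestId} = shackle:cast(arithmetic_pool, {multiply, 2, 6}),
%% ...do other work here, then collect the response
{ok, 12} = shackle:receive_response(RequestId).
```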
For that you use shackle:receive_response with the request id; the result is exactly the same as the synchronous call, just broken into two parts, so you can make the call, do some other work, and then come back and receive the response.

Some tips and tricks. Performance tips: pattern match everything. Pattern matching is really fast, and especially if you're doing protocol work you want binary matching everywhere; the bit syntax is actually a really expressive way of defining protocols. Another thing, as I said previously: use iolists. If you don't need to concatenate your binaries to do the parsing right away, and you already know the length of your response, you can keep it as an iolist so you don't keep reallocating binaries. Also, keep your client as lean as possible: for some protocols you know what the length of the packet is, so you don't need to decode all of it; once you have the full frame you can hand it back to the caller, and the caller can take care of actually decoding the protocol. And finally, if you have some sort of global state to keep, keep it in ETS. An example is a cache: the Cassandra client we have now uses prepared statements, and those prepared statements are cached in ETS so we can reuse them, and this way they're shared between clients, not just inside one client.

There's also shackle_utils, a little module with some utilities. There's not much in there, but it's pretty useful; every client I've written so far uses it. There are info, warning and error message functions, which are logging facilities; lookup, used to look up values in proplists of 2-tuples, the format we always use for the options we pass around in Shackle; and finally now_diff, to take the difference between two Erlang timestamps in microseconds.
So, usage. Internally at AdGear we have multiple clients for internal services, but I've also published two clients that are open source: one called anchor, for Memcached, and one called marina, for Cassandra 2.1 and above. They've both been used extensively in production. Right now Shackle handles over a million requests a second at AdGear without any kind of performance issue. This is a graph of request times for a service we call pacing B, which we use to pace campaigns. You'll see that during the day the peaks are quite a bit higher; that's just because the load on the nodes is higher, since web traffic drops while people sleep. During the night, when the machines are not loaded and the VM has all the cycles it needs to be as efficient as possible, a round trip is around 250 microseconds at the 99th percentile; during the day it gets a bit worse, but that's just because the VM itself is somewhat overloaded.

Some links: Shackle is available on GitHub, and so are the two other clients I talked about. I'm planning on writing some more soon: I'm working on a Kafka client now, and eventually, when HTTP/2 gets more popular (well, it is out, more or less) and we start using it internally, there are plans to write a client for it too. Thank you. Are there any questions?

[Audience question, inaudible]

Well, the Shackle library itself offers a different pattern than gen_server; you'd really want to compare proc_lib to gen_server, because Shackle also offers back pressure, the pool and all that, which don't come with gen_server. I don't have numbers, obviously. Every step of this was tested in production under heavy load, and there were micro-benchmarks done to make sure each change was a step forward and not backwards, but I don't have numbers for every optimization we've done. I assume there would be a big difference, though.
[Audience question, inaudible]

Mnesia is built on top of it (ETS), yes, but Mnesia is made to distribute data across nodes, and in this case the pool just runs on one node; it's not made to distribute load across nodes. It could maybe be done, but that's never been a use case for us. And to answer your first question: yes, you could implement basically the same thing using gen_server, but as I said, gen_server is so general that it has many different code paths and has to check a lot of things I don't have to, because my use case is super specific. gen_server has been in the Erlang distribution for, I don't know how long, and a lot of people have added cases for different uses and for backward compatibility, so if you look at the gen_server code it's quite a big module, compared to the shackle_server module, which is maybe 300 lines.

[Audience comment: being able to understand it should be one of the selling points]

Yeah. I don't have the numbers; I can do a benchmark later and post on Twitter roughly what you could expect, but honestly it depends on your use case. For us, doing hundreds of thousands of requests a second, it makes sense to do these micro-optimizations; if you're just starting out and you have a website with 20 requests a second, you should probably use a gen_server and not something custom.

[Audience question, inaudible]

Currently at AdGear, in our clusters, we're doing over 1 million requests a second using this. No, there's no special load management: we have a share-nothing infrastructure, so we just keep enough capacity in all the parts. The only load balancing we do is at a layer above, where we use DNS geo load balancing; at the application level there's no special balancing, it just balances itself naturally.
We're handling HTTP requests, and in front of the VMs we usually have nginx, which distributes load across them. We also have DNS returning multiple A records, so if one node is down traffic goes to the next one, and it's geo load balanced too, so requests go to the closest data center. But we still need to provision enough machines in each DC for the capacity we need; there's no magic. The problem at this scale is that if some nodes fail and you automatically shift their load onto other nodes, and you don't have spare servers, you just end up penalizing the servers you're sending the load to; they can't handle the extra capacity, it cascades, and everything fails.

I guess I didn't mention it, but when the backlog is full, the server returns {error, busy}. When it's busy we prefer to fail fast instead of sending a request that will go through the whole loop and consume resources, because we have strict latency constraints: if we receive the answer too late, we don't care about it anymore. So we prefer to stop early, apply back pressure, say "okay, we can't do this, continue to the next step", and just skip talking to that database or whatever service we're talking to.

All right, well, thank you very much.