Thank you for hosting us today. My name is Mohammed, and I'm going to talk about something pretty fun that TradeGecko has been doing for the past six years: using Sidekiq to run our e-commerce infrastructure.

Quickly, who am I? I'm an engineering manager at TradeGecko. I really love technical books, and I'm a really bad impersonator: about two years ago we had a movie poster impersonation contest, and I failed miserably at it.

I want to start with a number. This is the number of jobs, of operations, that TradeGecko has run over the past year on top of our Sidekiq infrastructure. In practice it means that every second we synchronize about 350 jobs to different partners, different APIs, different stores, different users. To explain how we reached that scale, I'll first take a deep dive into how Sidekiq works and how we utilize it. Then I'll talk about the challenges we faced given Sidekiq's architecture as a codebase, how we solved them, and how that got us to that number.

By a show of hands, who has used Sidekiq before, in production, in their systems? Cool. Sidekiq, simply put, is a job server, a job processor. From your web server you push data into Redis, and the Sidekiq servers pick that data up and process it. We'll look at the internals in a bit, but that's the general overview.

Sidekiq as a gem runs in two modes: server mode and client mode. If you run it within Rails, it runs in client mode; if you run it through the bin/sidekiq executable, it runs in server mode. How you configure it and how you start the process differ quite a bit between the two.

Let me start with the journey between client and server. Here's an example of a job you can write in Rails: you only need to implement the perform method, and you do your work there. You can also specify options and arguments, but we'll see that later. And you can call any job you've defined synchronously, asynchronously, or with delayed execution.

The journey of any job you fire goes like this. As soon as you call the job via perform_async, Sidekiq wraps the job up into a hash representing the arguments, the data, the names, everything, including the queue name. On the way down, it passes through a layer called middleware. Sidekiq has client middleware and server middleware. The client middleware can choose not to send the job to Redis, and that is something you can configure: you can use default middlewares from Sidekiq extensions, or you can create your own. A middleware can take a job and decide, you know what, I don't want to send this to Redis, and simply return nothing. Your code continues executing, thinking it emitted a job, but the middleware chose not to emit it, and the job is discarded. Otherwise the job goes through the middleware chain, comes back up, and gets wrapped up and pushed into Redis.
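To make that concrete, here's roughly what a job and its invocations look like. This is a minimal sketch against Sidekiq's public API; the class name, queue, and arguments are made-up examples, not our actual code:

```ruby
class SyncProductWorker
  include Sidekiq::Worker
  # Options are optional; queue name and retry count shown for illustration.
  sidekiq_options queue: :partner_sync, retry: 10

  # The only thing you must implement: do the work here.
  def perform(product_id, partner_id)
    # ... call the partner API, update the store, etc.
  end
end

SyncProductWorker.perform_async(42, 7)          # asynchronous: push to Redis now
SyncProductWorker.perform_in(5.minutes, 42, 7)  # delayed execution (ActiveSupport duration)
SyncProductWorker.new.perform(42, 7)            # synchronous: runs inline, no Redis involved
```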
On the server side, your server is listening. We'll talk in detail about what's inside the server, but there is an agent inside it called the processor that listens on specific queues. Queues are stored as Redis lists named after the queue, and one of the processors inside the server blocks on a queue until an item arrives. If there are no items, the processor just waits on the queue; if there are jobs to process, it keeps picking them up. As it picks up a job, it passes it through something called the server middleware. The server middleware is another middleware layer that lets Sidekiq decide, I don't want to execute this job. So you can control your jobs from two points: you can decide not to push a job at all, or, once it's in the list, by the time you pick it up you can decide you don't want to run it anymore, because it's stale or outdated. Once the job makes it through the server middleware, the actual execution of the job starts.
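Both middleware layers share the same simple shape: a call method that either yields to continue the chain or returns without yielding to discard the job. A minimal sketch, with hypothetical class names and conditions:

```ruby
# Client side: runs in your web process before the job is pushed to Redis.
class DropWhenDisabled
  def call(worker_class, job, queue, redis_pool)
    # Hypothetical kill switch: silently drop jobs bound for a disabled queue.
    return nil if ENV["DISABLE_#{queue.upcase}"] == "1"
    yield  # continue down the chain; the job gets pushed to Redis
  end
end

# Server side: runs in the Sidekiq process just before the job executes.
class SkipStaleJobs
  MAX_AGE = 3600  # illustrative threshold, in seconds

  def call(worker, job, queue)
    # 'created_at' is part of the job hash Sidekiq serializes to Redis.
    return if job["created_at"] && Time.now.to_f - job["created_at"] > MAX_AGE
    yield  # run the job
  end
end

Sidekiq.configure_client do |config|
  config.client_middleware { |chain| chain.add DropWhenDisabled }
end

Sidekiq.configure_server do |config|
  config.server_middleware { |chain| chain.add SkipStaleJobs }
end
```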
Now, what happens inside Sidekiq? I'll start with the bootstrap process of a Sidekiq server. When you run bin/sidekiq, the most important object is the Sidekiq CLI, and the Sidekiq CLI has a few responsibilities. First, it validates that the bootstrap arguments are correct: it looks at your shell arguments and asks, did you pass me the right concurrency, the right queue names, and so on. It loads Rails, since it depends on it. It runs a few health checks on the Redis version: as Sidekiq evolves, certain features can only be enabled on specific versions of Redis, so sometimes the Sidekiq process exits early and says, no, I don't support this version of Redis. It also checks your Redis connection pool: if you give it a broken pool, one that cannot connect to Redis, it refuses to start. It eager-loads the server middleware, so that middleware code is loaded once into the Sidekiq parent process and the threads and workers we'll talk about later don't each load their own copies; that saves memory. And most importantly, it configures signal handling. If you're deployed in the cloud, on Heroku, AWS, or elsewhere, Sidekiq needs to listen for OS signals: if the OS says shut down, Sidekiq needs to hear that, stop the queue processors, and exit. Wiring up the signal handling code happens at this stage of the bootstrap.

Then one actor, the main actor, called the launcher, starts working, and this brings us to how Sidekiq is designed as a codebase. Everything up to this point runs on the main thread; from the launcher onward, multiple threads are spun up to enable actual concurrency.

Sidekiq uses actors. The actor model, or actor-oriented design, is popular in concurrent languages: you'll see it in Elixir, Erlang, and Scala with Akka, and many other frameworks and languages implement it or support it so people can write code in an actor-oriented style. The idea is that rather than writing your code as plain object-oriented, you write it as producers and consumers: an actor has, in a sense, an inbox and an outbox, and actors talk to each other through messages. When I want actor A to do something, I put a message in its inbox. It might be idle, it might be busy; I don't care. I drop the message in its inbox, continue my own work, and I can pick up the response in my inbox when it's done. That gives you a sort of free concurrency, or at least fewer complications when dealing with concurrency, because you think in actors: I'm modifying this actor class; what does this actor do, what should it receive, what should it send? How the actors actually talk to each other is managed by the actor framework.

Sidekiq started out in its earlier versions on Celluloid, a concurrency gem. But as Sidekiq evolved, Celluloid imposed a lot of memory overhead on Sidekiq's performance. So starting with Sidekiq 4, I believe, the maintainer and his team rewrote Sidekiq completely to take out Celluloid and replace it with their own lightweight actor system. There's a really good blog post about that. He didn't say Celluloid is bad; he said, look, for our usage Celluloid does way more than we need, so we're going to trim that down, use only what we need from it, and write our own actor system. That improved Sidekiq's performance significantly. We'll look at that later.
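Stripped to its essence, an actor is little more than a mailbox drained by its own thread. This toy version is only meant to show the shape of the idea; Sidekiq's internal actors are considerably more refined than this sketch:

```ruby
class TinyActor
  def initialize
    @mailbox = Queue.new  # thread-safe inbox
    @thread  = Thread.new { loop { handle(@mailbox.pop) } }
  end

  # Other actors "send a message": push into the inbox and move on.
  def tell(message)
    @mailbox << message
  end

  private

  def handle(message)
    puts "processing #{message.inspect}"
  end
end

actor = TinyActor.new
actor.tell(:do_work)  # returns immediately; the actor works asynchronously
sleep 0.1             # give the toy actor a moment before the script exits
```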
So that's the building block. There are four main actors in Sidekiq.

The launcher is the actor that bootstraps the system: it creates all the other main actors, and its job is to control the lifecycle of both the actor called the manager and the actor called the scheduled poller. It also listens for the signals relayed from the main thread: if somebody shuts Sidekiq down, the main thread talks to the launcher, and the launcher says, okay, let me stop the whole system, make sure everyone has stopped processing and we're back to a clean state; now we're ready to exit. The launcher also runs a heartbeat thread, which continuously publishes information and checks that Redis is alive; that heartbeat thread is just making sure things are okay.

The manager is responsible for creating another type of actor: the processors. Processors are what actually process your jobs. When you run Sidekiq and see your job running, it is a processor actor that created your job class, fed it the arguments, and set it to work. Processors can die: if your job raises an exception, the processor fails. The manager detects that and says, this processor died; let me remove it from my list of processors and spin up a new one. That makes the manager the one actor that must stay alive and must not fail. By the way, if any of these actors other than a processor fails, whether the launcher, the manager, or the poller, the whole Sidekiq system shuts down. That was intentional, so you don't end up in an inconsistent state where your Sidekiq server looks like it's running but isn't doing anything: if any main actor dies, the whole system shuts down, cleans up, and exits. When you pass your Sidekiq process a concurrency argument, the number of processors, that argument goes to the manager; it is the actor responsible for understanding and using that information.

The processor, as I said, does all the heavy lifting; all the real work you see in Sidekiq happens in this actor. Every processor you spin up goes and listens to Redis on its own. That was one of the trade-offs made later in the Sidekiq gem's design. Originally there was a Fetcher actor that listened to the queues so processors didn't talk to Redis directly. That meant fewer connections to Redis: you could maintain fifty processors with a single Fetcher. But it also created a lot of hand-off overhead, because that one Fetcher had to relay messages to all of the processors and keep every one of them busy; otherwise you weren't utilizing your server. So in the Sidekiq redesign the Fetcher's responsibility was ditched and handed to the processors. The result is more connections from Sidekiq to Redis, which matters more or less depending on your deployment and production parameters, but it was an intentional choice. It means processors are independent: they fetch their own work, pass it through the server middleware, and if the job doesn't survive the server middleware they discard it and go fetch the next item; otherwise they re-instantiate your class from the hash and run it, and until it finishes, that processor is busy. Importantly, processors also orchestrate the retry logic. When a job fails and you've configured your Sidekiq worker to retry it ten times, or to retry with a backoff strategy, it is the processor that manages that information and re-enqueues the job into what's called the scheduled set or the retry set, depending on your strategy. It also manages killing the job: if the job keeps failing, the processor decides, well, you've run twenty-five times and kept failing; time to go to the dead letter queue.

The poller is a side actor. It is important, but it doesn't do as much as the processor. The poller makes sure that any scheduled work or retried job is put back into its original queue once its backoff delay has elapsed. Say I run a job with the rule: if it fails, retry after five minutes; if it fails again, retry after ten. When the job fails the first time, the processor takes it, checks its retry rules, and puts it into the retry set, the scheduled set. The poller is not involved in any of that; it just watches that set of retried items, and at the right time it pops an item, sees which queue it belongs to, and enqueues it back. The processor doesn't care anymore once it has put the item into the scheduled set; it just focuses on processing. In terms of design, neither the processor nor the poller has to carry all the logic; each is specialized as an actor. The poller simply checks Redis every n seconds, and whenever an item's assigned timestamp has passed, it pops it and pushes it back onto its original queue.
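Conceptually, the poller's loop looks something like the sketch below. This is a simplification written against the redis-rb client, not Sidekiq's actual code; among other things, the real poller randomizes its interval so multiple Sidekiq processes don't hit Redis in lockstep:

```ruby
require "redis"
require "json"

redis = Redis.new

loop do
  now = Time.now.to_f
  # Jobs in these sorted sets are scored by the timestamp they become due.
  %w[schedule retry].each do |set|
    redis.zrangebyscore(set, "-inf", now.to_s).each do |payload|
      # zrem returns false if another process already claimed this item.
      next unless redis.zrem(set, payload)
      job = JSON.parse(payload)
      # Put the job back onto its original queue for a processor to pick up.
      redis.lpush("queue:#{job['queue']}", payload)
    end
  end
  sleep 15  # "checks Redis every n seconds"
end
```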
So this is what Sidekiq internally looks like: a few actors and a bit of supporting logic, a few retry strategies, a few classes. If you look at its source code, every actor has a class and a file: you'll find the poller, you'll find the manager, scheduled.rb, and so on. Every actor is one file, and everything that actor does is in one place.

This worked really well in the early years of TradeGecko, but as we scaled we realized there are certain patterns Sidekiq itself doesn't handle, and it doesn't have to; it's a general-purpose gem. Before I go over those challenges, here's a small diagram of how TradeGecko itself works. TradeGecko integrates with a lot of systems. We have a lot of partners, and we process everything inbound and outbound over our queues: everything that shouldn't be performed synchronously goes through Sidekiq, and everything is designed that way. Our queuing system is organized by partner type or by the urgency of the job. So TradeGecko's infrastructure deals with a lot of asynchronous events and notifications to our partners, receiving their events and processing things asynchronously.

As we scaled, a few problems started emerging from very chatty customers and partners who want to send a lot of events. We had cases where a customer onboarding onto the system would upload a CSV with 10,000 items, each of which needed to be updated and sent to other partners. 10,000 items, 10,000 jobs on the queue, and nobody else can do anything until those 10,000 jobs are done. So we realized there's a problem of how we rate-limit, or rather how we work with the rate limits our partners have. If you enqueue 10,000 items and a partner of yours can only receive 50 items a minute, you're going to wait a very long time to finish those 10,000 items. What actually happens is: 10,000 jobs enqueued, 50 pass, 9,950 fail and sit in the scheduled set, get enqueued again, 50 more go out, and that keeps going, Sidekiq juggling the jobs around until we're done. Third-party rate limits become a really big problem as you scale up.

One consequence was that our Sidekiq dynos were wasting a lot of their time: the servers were just shuffling hashes around with no actual synchronization being done. Another was the scares: suddenly 9,000 jobs are failing, what happened? Then you realize, oh, it's retries, okay, no problem. And then that happens every few seconds, so developers did not sleep well. And we want to enable our partners: when a partner says, back off, I can't handle more data, we were effectively saying, no, we're going to send you 10,000 jobs and you'll have to reject each one of them. The partner is left saying, you didn't back off at all; you still called me with 10,000 jobs, only 50 got through, and I already told you to queue up the rest. That kept happening. To enable our partners and make sure their infrastructure scales too and can handle the load, we had to find a way to back off early: as soon as something goes wrong, we say, okay, you can't handle more data, I'm going to back everything off and wait for you a bit.
So that was one problem we identified. The other problem is fair processing. I say fair because some customers have more data than others, so multiplexing priority becomes difficult, and Sidekiq doesn't give you a lot of priority control by default. You can't say this job has higher priority than that one, except in general terms: you can say this queue has higher priority than that queue. But you can't create a queue for every customer, because then you'd have to manage and juggle a huge number of queues too. So what we ended up with is this: sometimes one customer enqueues 10,000 jobs while another customer wants to enqueue 5,000, or even just two jobs, a tiny amount, but that two-job customer has to wait for the 10,000-job customer to finish processing. That customer is waiting and asking, why is it slow? Well, because other customers' data is being processed. What we wanted was a way to intermix jobs: if one partner wants to send a million operations, we have to batch it up in a way that still lets other partners get their data through too.

These two problems aren't solved natively by Sidekiq, so we had to create a new job abstraction. We created what we internally at TradeGecko call accumulating jobs. Accumulating jobs deliberately don't touch the complexity of Sidekiq. After understanding how Sidekiq works on the inside and how its API has evolved, we wanted to introduce a job abstraction with minimal assumptions. We didn't want to go hack and monkey-patch Sidekiq; we wanted an abstraction with minimal interference in Sidekiq's code that still solves these two problems. To do that, in the accumulating jobs we override perform_async and the perform operation; that's it. And we do our own bookkeeping in Redis: we create a sorted set, which means its members are unique, so if you try to enqueue duplicate jobs into it, there's no duplication. If a partner mistakenly sends us the same event a million times, our infrastructure can take that in and collapse it into one job if the job is the same. So we help our partners with that, and it helps us too: if our developers mistakenly release something into the platform that introduces duplication, the same abstraction absorbs it and keeps it contained until we escalate it; we actually have a way to escalate these duplications.

The other part is that the accumulating job controls that set, and if certain failures happen, the whole set gets backed off. Rather than dequeuing or rescheduling one job at a time, you reschedule the whole set together. When the accumulating job starts processing, it picks up items from the set in order, and you can configure it: hey, accumulating set, process a thousand items at a time and then back off. That gives the set a backoff strategy: if I have more than a thousand items, I'll process a thousand, give the queue, give the priority, to someone else, and move on; it reschedules itself. At the same time, you can define failure conditions. You can say: if you detect a rate-limit error, back off the whole set and stop processing; if you detect an error in the job itself, don't worry, set it aside and continue processing. So we can differentiate between the different types of errors and route to the right behavior: if it's just a single job failure, continue processing, don't stop; if it's a rate limit, back off the whole set and wait for your turn again. That also keeps us from calling our partners too often.
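In code, the idea looks roughly like the sketch below. Every name here (PartnerSyncJob, RateLimited, sync_to_partner) and every threshold is illustrative; our real abstraction is internal and considerably more careful, but the mechanics are the same:

```ruby
require "json"

class RateLimited < StandardError; end  # hypothetical partner-API error

class PartnerSyncJob
  include Sidekiq::Worker
  sidekiq_options queue: :partner_sync

  BATCH_SIZE = 1_000
  SET_KEY    = "accumulating:partner_sync"

  # Overriding perform_async: record only the arguments in a sorted set
  # (identical members collapse into one entry, which is the de-duplication),
  # then enqueue the runner itself with no arguments.
  def self.perform_async(*args)
    Sidekiq.redis { |r| r.zadd(SET_KEY, Time.now.to_f, JSON.dump(args)) }
    super()  # a real implementation would avoid duplicate runner jobs
  end

  def perform
    batch = Sidekiq.redis { |r| r.zrange(SET_KEY, 0, BATCH_SIZE - 1) }
    return if batch.empty?  # set drained: the accumulating job terminates

    batch.each do |payload|
      begin
        sync_to_partner(JSON.parse(payload))
      rescue RateLimited
        # Back off the WHOLE set, not one job: reschedule and stop now.
        self.class.perform_in(300)
        return
      rescue StandardError
        # A single bad job shouldn't stop the batch: set it aside
        # (escalate it, in the real system) and keep going.
      end
      Sidekiq.redis { |r| r.zrem(SET_KEY, payload) }
    end

    # We handled up to BATCH_SIZE items; yield priority to other work and
    # re-enqueue ourselves to continue with whatever remains.
    self.class.perform_async
  end

  private

  def sync_to_partner(args)
    # placeholder for the real partner API call
  end
end
```

Callers still just call PartnerSyncJob.perform_async(args); the de-duplication, batching, and whole-set backoff all happen behind that one entry point.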
When the sorted set is empty, the accumulating job considers the child set it controls done; there are no more jobs, and the accumulating job terminates. Otherwise it keeps rescheduling and re-enqueuing itself until that set is empty. That's all, so I hope this was useful. Let me see... there we go: we are hiring, so come join us and solve these problems with us. Thank you. Any questions?

Yes, that's a really good question. The client and server middleware in Sidekiq are not designed to reschedule jobs; they're designed to decide whether a job passes through or not. First, we didn't want too many job hashes being sent and re-enqueued; we could have done it that way if we wanted, but we also want to minimize the payload. In the set we encode the payload differently: rather than the big Sidekiq JSON, we encode only the arguments, because the accumulating set already controls all the other parameters of the job. That also minimizes our Redis memory consumption. Any other questions?

So, Sidekiq Pro has a feature called batching. There are differences between batching and accumulating. Batching, for example, still represents jobs as their own units; it just groups them into a different set, and it doesn't back off, so batching will still try to process everything. To my knowledge, I don't know of anything else open source in the wild that's similar to this. Any other questions? All right, awesome. Thank you.