Thanks, guys. So I'm Khoang. I'm currently a platform engineer at Viki. It's an honor to be here with you today. My talk is about building reliable and scalable workflows with Step Functions. Okay, let's get to it.
For the agenda: of course, I have to talk a bit about Viki first, and then we go on to discuss workflows. What are they? What workflows do we have at Viki? What problems did we face? Then the journey from when we discovered the problems to how we approached them and how we designed a solution. And of course, I will conclude with some lessons we learned along the way.
Okay, so Viki. We are an OTT company that offers subscription video on demand, or SVOD, with a focus on Korean and Chinese content. Currently we have, I think, more than 3,000 shows with 70,000 to 80,000 videos. The difference between Viki and other OTT services is that our subtitles are actually delivered by our fans. Anyone who uses Viki can go in and contribute subtitles, and then you will be recognized as a qualified contributor, and we even give out coupons for subscriptions.
To give you some numbers: we have around 1.5 billion minutes of video streamed per month, and more than 5 billion [unclear] translated. On the engineering side, we are quite a small team: only around 30 engineers in total, and the backend team is about half of that. We manage 50-plus microservices, with around 100 builds per day and, for deployments, maybe 10 to 20 per week.
Okay, so let's get on with workflows. What is a workflow, actually? The dictionary definition is a sequence of processes that enables some work to go from initiation to completion. Simplified, it's just a list of steps, which can be represented by a graph, a flowchart, or a diagram.
So why are workflows important? This is a bit overgeneralized, but almost anything is a workflow or can be represented as a workflow.
For example, if you go to FairPrice to buy groceries, let's list all the steps you need. Maybe first you write down all the items you need, then you go through a loop of picking up each item and checking whether you have completed the list or not, and at the end you go to the counter, pay, and go home. That can be considered a workflow for buying groceries. In engineering terms, workflows are usually associated with important business decisions and business logic.
So what are the workflows that we have at Viki? We have all kinds of workflows. One is user email verification, as you can see here. When a user clicks an email link, we check the token that is embedded in the link. Is it expired? Is it valid? If it is valid and not expired, we can go forward. Otherwise, we display an error and send a new one. This is just a simplified version of the flow.
We have another flow for app releases. For example, when we want to release a new version of the Viki app to the Google Play Store or the Apple App Store, we first start with a release build, then we run unit tests and integration tests, someone has to sign off on that, and after that we can push it to the store.
We have workflows for video encoding. Since our main business revolves around videos, of course we need to have multimedia workflows. From now on I will focus more on the multimedia side, because that's what I have been working on since I joined Viki. Feel free to stop me at any point if you have any questions.
So let me talk a bit about our content ingestion pipeline. Where do we get those videos, and what do we do with them?
Our content providers are from China and Korea, and recently Japan as well; we just launched a Japanese vertical. We give them a portal so they can go in and upload their content to our content management system, the CMS. From there it goes to our business logic in the video service (apparently the guys who started this service liked Greek gods, so it is named after one). And we have a service called Media Engine, which is in charge of encoding all the videos. After we encode the videos, they go to storage, which currently resides on AWS S3, an object storage service.
So, the problems that we saw. For example, this is an encoding flow in our pipeline. When we get a video, we first extract some information, some metadata. Then we go on to encode it in multiple resolutions, starting with, say, 240p. After 240p is done, we do four things in parallel: we release the 240p first, because we want our fans to start subtitling, and in the meantime we also encode the other resolutions and then release everything.
So a video needs to be encoded in multiple resolutions, in different formats, with different dependencies, in the sense that one video format may need something else to be encoded before it, so that it can reuse that output and package it in a different format. Let me explain that a bit more. Let's say I want to encode format D, but in order to encode format D, I have to have A, B, and C first. So there's a dependency chain there.
Our implementation for that was purely API-based, with a chain of network calls. What I mean is this: you remember, this is our video business service, where all the logic resides, and this is just an encoder. So how did we encode those?
From the business logic service, we make many API calls to Media Engine, the encoder, and when it finishes, it comes back with a callback, and then we do some post-processing, like updating the result in our database, et cetera. So for this flow to complete, first we have to do 240p. After it completes, it comes back to the video service, and when we process the callback, we trigger another three. When each of these three completes, again we do some post-processing there and release.
One problem you can see here is that it's just a chain of network calls. It's very unreliable: you cannot trust the network at any time. When it fails, it's very hard to detect. There's no way for us to know where it failed and why. Usually our content operations team would just come to us and say, "Hey, why is this video encoding taking so long? What happened to it?"
And of course, it needed to be overhauled and redesigned. So at the end of 2017 we sat down, just me and another guy (he is over there, yeah, thanks), and we wanted to redesign it and make it better. But better how, actually? We listed our priorities for the redesign. First, we have to model it as a workflow. It is a workflow, so we have to model it as a real workflow; it cannot be a chain of network calls anymore. We need visibility into the system, in the sense that when an encoding is running, we need to know which phase it is in and how long it might take to finish. We need debuggability. Of course, it has to be reliable, and it has to be scalable, because we were getting more and more content, at a rate of maybe 5x or 10x at that time.
So, okay, why Step Functions? We were considering a lot of tools for workflow orchestration: Apache Airflow, and AWS also has another service called Simple Workflow. So why did we choose Step Functions?
First, it has many cool features, particularly considering it was a young service at that time. It was the end of 2017, and it was only one year old, I think. It felt easy to use. It's easy to integrate with other parts of the AWS ecosystem; we were already using S3 and other components in the AWS ecosystem. And it has nice visualization of workflows. I will show you what that means.
Okay, this is an example from the AWS Step Functions console. You can see clearly that this is a visualization of a particular workflow execution. With the color code you can see which route it took and what happened at each step. Green means success. If it's red, it failed or needs to be retried, and blue means it is in progress. For each step you can click into it and see the input and output of that step, together with the input and output of the whole flow as well.
What are the concepts of Step Functions? We have state machines. It is mostly the same as, but not exactly, the finite state machine that we studied. So just for the purpose of this talk, please try to forget that a little bit and focus on the Step Functions concept. A state machine consists of a finite set of states, and each state has one of several types.
It is either a Task type, where we do some work in that state. It could be a Choice type, where we make a choice on some condition, like an if-else or a switch-case: is something equal to something or not. It could be a Fail or Succeed state: there we stop the execution and indicate that the flow has failed or has succeeded. It could be a Pass state, where we just pass the input through to the output and do nothing else. It could be a Wait state, where we delay the flow for a particular amount of time or until some point in time; so you can wait for 10 seconds, or you can wait until the first of January 2020 at 1 p.m. Or it could be a Parallel state, where we branch out. You can indicate that in this Parallel state you will run two, three, four, five branches in parallel. A Parallel state is considered succeeded when all of its branches have succeeded.
For the syntax, they call it the Amazon States Language, but it's just JSON with some extensions and some JSONPath. For example, this is how we define a state. This state's name is Hello World, and its type is Task. The Resource here says which resource will do this task. Right now you can see it's a Lambda function. Lambda is an AWS service where you just write the code and let AWS run it for you. You don't have to care about servers or provisioning machines in the cloud; you just focus on writing the code. This is the unique name of a particular Lambda function that you have written code for.
You just define that this function will do this task. Next indicates which state we transition to after this task. And you can have a Comment on the state.
So how do you use Step Functions? I said it's easy to use, so how can we use it? First, draw your state machine on a whiteboard or on paper, any way you like. Then define your state machine in a JSON file, in the Amazon States Language, and implement your workers.
What I mean by implementing your workers is this: Step Functions takes care of controlling the flow, any conditions, parallel branching, waiting, everything. You just have to focus on the Task states, where some work has to be done, and implement your workers to do that work. You can implement a worker as an activity, with a polling-based mechanism, in the sense that your piece of code calls the Step Functions API to say "give me a task", does the work, and then reports the result back. Or you can implement it as a Lambda function. This is only applicable to the Task type. The good thing about implementing your code as a Lambda function is that it's push-based, not polling-based. When there is a task in the queue, Step Functions triggers your Lambda function, so you don't have to poll constantly to get tasks; it is only triggered when there is an actual task. If you do it polling-based, you have to poll constantly, which just wastes your resources.
Okay, this is an example of a state machine. You have StartAt, which indicates the state your state machine execution starts with; here it is this state. You have TimeoutSeconds of ten seconds, which means that if your flow has not completed after ten seconds, it will be marked as timed out. And this is the list of states that you have in your state machine. This state is a Task type.
So you have to have a Resource, which says which resource will do this task. And End: true means this is the final state of the state machine.
So what benefits do we see when using Step Functions? Each task can be configured with exponential retry. Of course, for any task, when it fails you want to retry. We can configure timeouts: the overall timeout of the task, or the heartbeat timeout. Say you have a long-running task of two or three hours, and suddenly it hangs somewhere. You want a heartbeat for that: every two or three minutes, you call Step Functions to say, "Hey, I'm still running, don't terminate me." If you don't have this heartbeat, your worker may hang somewhere and you have no way to detect it. If you do have a way to detect it, you can restart it and let it start the work again, which may complete faster than just waiting.
Error handling: each task can report errors back to Step Functions, and the good thing is that you can make choices in the workflow depending on what type of error is reported. For example, when we do video encoding, sometimes content providers give us a corrupted file. It doesn't matter how many times we retry; that file cannot be processed. We wanted to distinguish that from, say, a database failure or a network failure, which can be retried. For database and network failures, we wanted to keep retrying for a few hours or even two days, so that when the dependency comes back, the flow continues automatically.
I just showed you the Step Functions console, so it's easy to visualize each step's input and output and the current state of the workflow.
So how do we use it at Viki?
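
As a rough illustration of the state definition, retry, timeout, heartbeat, and error-routing options described above, here is a minimal sketch in Python that builds an Amazon States Language definition as a dictionary and prints it as JSON. The state names, the error names (DatabaseError, NetworkError, CorruptedFile), the activity ARN, and the timeout values are all hypothetical and chosen for illustration; this is not Viki's actual state machine.

```python
import json

# Hypothetical activity ARN, for illustration only.
ENCODE_ARN = "arn:aws:states:us-east-1:123456789012:activity:encode-240p"


def build_definition(interval_seconds=1, max_attempts=5, backoff_rate=2.0):
    """Return an Amazon States Language definition as a Python dict."""
    return {
        "Comment": "Sketch: one encoding task with retry, timeout, and error routing",
        "StartAt": "Encode240p",
        "States": {
            "Encode240p": {
                "Type": "Task",
                "Resource": ENCODE_ARN,
                "TimeoutSeconds": 3 * 60 * 60,  # overall limit for the task
                "HeartbeatSeconds": 180,        # worker must report in this often
                "Retry": [
                    {
                        # Assumed error names for failures worth retrying.
                        "ErrorEquals": ["DatabaseError", "NetworkError"],
                        "IntervalSeconds": interval_seconds,
                        "MaxAttempts": max_attempts,
                        "BackoffRate": backoff_rate,
                    }
                ],
                "Catch": [
                    # A corrupted file is never retried; route it to a Fail state.
                    {"ErrorEquals": ["CorruptedFile"], "Next": "EncodingFailed"}
                ],
                "End": True,
            },
            "EncodingFailed": {
                "Type": "Fail",
                "Error": "CorruptedFile",
                "Cause": "The source file cannot be processed; retrying will not help.",
            },
        },
    }


def retry_waits(interval_seconds, max_attempts, backoff_rate):
    """Seconds waited before each retry: interval * rate**n for n = 0..max_attempts-1."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


definition = build_definition()
print(json.dumps(definition, indent=2))
print(sum(retry_waits(1, 5, 2.0)))
```

With a 1-second interval, 5 retries, and a backoff rate of 2, the waits are 1, 2, 4, 8, and 16 seconds, so 31 seconds in total, not counting the time each attempt itself takes. Raising the interval or the number of attempts is what stretches the retry window to hours.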
This is an example of the state machine for one flow that we have in Step Functions at Viki. I will not go into the details, but I just want to give you a sense of what the flow looks like. From this state, job completed, depending on its output, we might go to three or four different states.
In the first version, all our workers were polling-based. We did not use any Lambda functions, because we did not have the resources at that time to try them out and see if they worked. We just wanted a quick prototype. So we kept it simple: call the Step Functions API, get the work, do the work, and report back. It ran on our own servers. And at that time we decided the number of workers for each task based on our own judgment. For some tasks we thought, okay, this one may get called more often than the others, so we gave it three or five workers. Others get called less, so one worker is enough.
What are the pitfalls that we stumbled into? When the system scales, the failures scale as well. After we built that service with Step Functions, we could scale very easily: any encoding flow just triggers a Step Functions execution, and we only do the work in the tasks. As a result, we could easily do 10 or 20 times more. But when we have a bug in the code, the failure is much more catastrophic, because we are running much more than before.
Tasks fail too fast. What I mean is that we set the retry count to only five for each task, with only a one-second delay in between and an exponential base of two. So the duration from the first attempt to the final retry is only about half a minute. In that time there is nothing we can do, even if we detect the problem after one minute.
Wrong error handling when building tasks.
As I just said, we wanted to distinguish between file errors and database or network errors, but we had some bugs in the code, so we reported them wrongly. A database error would end the flow immediately. And one day we had actual database errors, and I think a few hundred of our encodings failed, and we had to run them all over again from the start.
We did not set timeouts for some tasks. So a task would be running for a very long time, and not until the operations team asked us again, "Hey, what happened with this video?", did we actually go in and check. We did not send heartbeats for long-running tasks.
There were no proper alerts set up, because this was just a prototype version. When a task failed or an execution failed, we did not receive anything to act on. So later we got to use CloudWatch metrics. When you use Step Functions, there are already some default metrics, like the number of executions and the number of tasks completed, failed, timed out, et cetera, that are piped to CloudWatch. At Viki we use SignalFx as our main monitoring system, so we use an integration from CloudWatch to pump the metrics to SignalFx, so we have a single place to look.
Okay, now we had solved some problems. It's not the end.
API calls to Step Functions can be throttled. This is quite a big lesson that we learned: when you use any third-party system or third-party product, there will always be quotas.
You cannot take all the resources. One day, I think we had about 100 videos being encoded, and suddenly we saw half of the videos fail. So we went in to look, and some of our API calls to Step Functions were being throttled. I think the limit was only around ten requests per second, and we exceeded it. We had to contact AWS support. They got back to us and increased the limits after one or two days, not immediately.
It could not scale easily. Previously we set the number of workers for each task as a hard-coded number. Task A, we feel, seems heavy: give it five workers. Task B seems light: one worker is enough. And suddenly we realized, hey, we want to give task B more workers. What did we have to do, given the limits of the system at that time? We had to change our config (you can imagine a config file with A: 5, B: 3, something like that) and then deploy again. Deploying services takes a lot of time, so we could not act on it and scale in a short time.
Using a common worker for shared tasks may block execution. At that time we tried to reuse workers as much as we could, because we didn't want to implement new workers; there was time pressure as well. One of the workers we reused checks whether something has finished or not. So imagine you have task A and task B. A depends on B, so B has to complete before A can complete. But the task to check whether A had completed was scheduled before the task to check B. Let's say the task to check A was scheduled ten times.
In that time, even if B has completed, we don't know, because the task to check B has not been processed yet. So maybe 10 minutes, one hour, something like that, could have been saved if we had actually checked B first. So the order is important.
Some of our workers were slow. In the first version we used Ruby to build our workers. On our team the two main languages are Ruby and Golang, and I was more comfortable with Ruby at the time, so we decided to write them in Ruby. But Ruby is effectively single-threaded, and for some heavy tasks it was very slow. So what did we do? We moved some workers to Golang and to Lambda functions. These went hand in hand. We said, it's better to move some workers to Lambda functions, so that if we have 10,000 tasks, Step Functions can trigger 10,000 of them in parallel and finish them at once. If we still used our own workers, we would have to do them in sequence. And since Lambda did not support Ruby at the time, we went with Golang. And since Golang is better for concurrency, it was much faster. So with the help of Lambda functions, the new workers are push-based instead of polling-based.
Okay, it seems like we solved a lot of problems. So what's the current state of the system? What has it come to?
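
As a recap of the polling-based workers and the error-reporting pitfall discussed above, here is a minimal sketch in Python (the talk's own workers were written in Ruby and Golang). It assumes a client object that exposes the Step Functions API methods get_activity_task, send_task_success, and send_task_failure, as the boto3 client does. The error names, the encode handler, the ARN, and the FakeClient stand-in are hypothetical, so the sketch runs without AWS access.

```python
import json


class PermanentError(Exception):
    """Input can never be processed (for example a corrupted file)."""


class TransientError(Exception):
    """Temporary problem (database or network); worth retrying."""


def poll_once(client, activity_arn, handler, worker_name="worker-1"):
    """Fetch one activity task, run handler on its input, report the result.

    Returns True if a task was processed, False if the poll came back empty.
    """
    task = client.get_activity_task(activityArn=activity_arn, workerName=worker_name)
    token = task.get("taskToken")
    if not token:
        return False
    try:
        output = handler(json.loads(task["input"]))
    except PermanentError as exc:
        # Reported under its own name so the state machine can stop retrying.
        client.send_task_failure(taskToken=token, error="CorruptedFile", cause=str(exc))
    except TransientError as exc:
        # Reported separately so the state machine can retry with backoff.
        client.send_task_failure(taskToken=token, error="TransientError", cause=str(exc))
    else:
        client.send_task_success(taskToken=token, output=json.dumps(output))
    return True


class FakeClient:
    """In-memory stand-in for the real client, for demonstration only."""

    def __init__(self, tasks):
        self.tasks = list(tasks)
        self.reports = []

    def get_activity_task(self, activityArn, workerName):
        return self.tasks.pop(0) if self.tasks else {}

    def send_task_success(self, taskToken, output):
        self.reports.append(("success", taskToken, output))

    def send_task_failure(self, taskToken, error, cause):
        self.reports.append(("failure", taskToken, error))


def encode(job):
    """Hypothetical task handler."""
    if job.get("corrupted"):
        raise PermanentError("unreadable source file")
    return {"video_id": job["video_id"], "status": "encoded"}


fake = FakeClient([
    {"taskToken": "t1", "input": json.dumps({"video_id": 1})},
    {"taskToken": "t2", "input": json.dumps({"video_id": 2, "corrupted": True})},
])
ARN = "arn:aws:states:us-east-1:123456789012:activity:encode"  # hypothetical
while poll_once(fake, ARN, encode):
    pass
print(fake.reports)
```

Keeping the permanent and transient error names distinct in the worker is what lets the state machine retry one and fail fast on the other; mixing them up is the kind of bug described in the pitfalls.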
Right now we have around 20 state machine types with 50,000 executions per day, and around 50 worker types with 50,000 tasks completed. We have had less than 0.02% failures in the last half of the year. I think this is the much more important point, because it makes our lives as engineers better. And in terms of business, we can now support around 100 videos encoded per day, with 30 jobs per video, which is about 3,000 jobs. The one flow I showed you corresponds to one video, but that is just a simplified version; in actuality it is much more complex.
Even though Step Functions is such a great tool, as I have presented, it still has some drawbacks and some pain points that I wish it would solve in the future.
Child state machine support. Right now Step Functions does not have the concept of a parent that triggers a child: one flow triggering another flow, waiting for that flow to complete, and then coming back. If I had child state machine support, I could build more abstraction on top of a state machine: say the first few tasks are pre-processing, then the actual work is carried out in a whole different flow, and after that some post-processing.
Signaling an execution. Right now there is no way for me to pause or resume an execution. Let's say I need to fix a bug in my worker code. I cannot pause the execution and say, "Hey, wait for me to fix my bugs first." And I cannot wait for a callback either. Let's say I call some third-party service that triggers some job, and I want to wait for that job to complete before continuing with my flow. Right now I cannot do it. I have to work around that with a loop that is constantly waiting. It's quite silly: wait for, say, one minute, check again, is it completed? No? Go back to sleep, something like that.
Dynamic branching based on the output of tasks.
Right now, in a Parallel state, I have to predefine that I want to run three branches or four branches in parallel. It would be nicer if I could run some task, get the output, maybe a list, and then, depending on the length of the list, say ten items, run all ten in parallel. That would make my life better.
So, the summary. Building a reliable workflow is difficult. We had to go through many iterations. Even with some great tools coming from AWS, it will still take you a few months of prototyping and a few iterations of bug fixing, of recognizing that some part of your design is not good, to actually arrive at a reliable and scalable version.
New tools always equal new problems. When you bring a new tool into your system, you have to deal with its own problems as well. Step Functions is great, but every time I suddenly want to scale up, I get throttled on API calls: "Hey guys, can you increase my limit?", something like that.
And it's always very important to define your priorities when you approach a problem. Most of the time you should have these criteria as your priorities.
Okay, with that I come to the end of my presentation. Do you guys have any questions?