Okay, I'm Dominic, and this is Gunny, and this is Pyeongchang. We are working on a serverless platform at Naver, and we are so glad to be here today. We're going to talk about the potential performance issues in container-based platforms and what we did to overcome those issues. We will mainly talk about Apache OpenWhisk, but I think it will also be useful for those who are working on any other container-based platform. Okay, so first, Gunny will give a speech. Please welcome him. Thank you.

Hello everyone, my name is Gunny and I'm developing a serverless platform at Naver in Korea. I'm here to briefly explain what serverless is before we talk about the main topics. How many of you have used serverless, or are using serverless, like AWS Lambda and Azure Functions? Okay, thank you. Now let's get started.

So this Google Trends graph shows that interest in serverless has increased over the last five years, and there are also many serverless conferences around the world. There is also a lot of open source related to serverless: Apache OpenWhisk, Google's Knative, and many more open source projects are actively competing, and there is even a variety of platforms that make up for what serverless lacks: authentication, security and management, monitoring, data science. And there are a lot of articles on the topic "is serverless the future of the cloud?" So what is serverless?
So when you hear the word serverless, it's easy to think of it as running without a server. But serverless does not mean there is no server. There is actually a server: the cloud vendor that provides the serverless platform manages the hardware, host OS, hypervisor, and so on. Of course, for users this is meaningless; users feel like there is no server. A serverless platform offers users the value of independence from infrastructure, and the cloud vendor provides users with a monitoring tool for observability, because users have to make sure whether their functions are working well or not. Anyway, users only need to focus on their business logic and code. I think this is one of the major benefits of serverless.

Serverless includes backend as a service, database as a service, and function as a service. Today we're going to talk about function as a service. Function as a service has five main characteristics: event-driven, isolated execution, stateless, auto-scaling, and pay-as-you-use.

The first is event-driven. The FaaS platform mainly executes the function in response to external events. External events are things like your backend service, database change events, CLI or mobile clients, cron jobs, and so on. Because it works in response to any external event, FaaS platforms are often used as a sort of glue between services or cloud environments.

The second is isolated execution. We can take advantage of this characteristic when we run FaaS on top of containers. Each function runs in an independent environment like a container or a virtual machine. This approach ensures resource isolation among functions and increases reliability when monitoring individual functions. This is also good for the system to manage resources effectively. With the recent prosperity of container technology, many of us are choosing this architecture.

The third is stateless. When a function finishes execution, each function returns the used resources to the system. At this point, all data stored in local storage and memory will be reclaimed. That is,
we are unable to reuse the context from the previous run, and usually most people assume that the function is called in a new container each time.

The fourth is auto-scaling. If there are a lot of requests, the FaaS platform can increase the number of containers to execute the function in parallel. The scale-out is usually supported by default at the platform level.

The last one is pay-as-you-use. This feature is highly related to scalability. You usually pay only for what you do with the FaaS platform. If your request comes in three times, three functions will be executed and you pay only the price of three executions. Then if the requests suddenly double and six functions are executed, you pay only for six resources. Normally CPU, memory, and execution time are taken into account when settling the payment. If you have an app that doesn't use a lot of computing resources and runs for only a few minutes a day, such as a batch program or a backend service that runs at certain times only, the FaaS platform can be a very attractive alternative.

So in the past, when you created an application, you created it as one system. This had the problem of having to redeploy the entire service even for small changes. Because of this drawback of the traditional system, microservices came to the fore. In the microservice architecture, the system is composed of many small independent services. Now people are trying to split those up into even smaller functions. I think this is a kind of trend to minimize the scope of changes, to make the system more flexible.

If FaaS is our future, let's take a look at OpenWhisk, one of the most popular FaaS open source projects, to see what FaaS looks like, what problems it contains, and how we solved those problems. Let me introduce Zhang Feng-cheng.

Okay, thanks, Gunny. I'm Zhang Feng-cheng, and next I will introduce Apache OpenWhisk. It's an open source FaaS platform which is used in IBM Cloud Functions. It's event-driven, using triggers to connect external events. And OpenWhisk
associates actions with triggers via rules, and the actions here are the functions Gunny mentioned before. They are just some snippets of code, or users' custom binary code embedded in a Docker image. So with these triggers, rules, and actions, OpenWhisk can run users' code in response to external events, like a database update, a code commit to GitHub, or just a simple HTTP request.

Now let's see the internal flow of OpenWhisk. It's quite simple, because it only has two particular components: the controllers and the invokers. We can easily find out that the controllers are used to handle all user requests forwarded from nginx, such as action create, update, and delete, and of course the invocation requests. Once a controller receives an invocation request for an action, it will forward that request to one of the invokers using Kafka. And when the invoker gets that request, it will then use a Docker container to execute that action's code.

So what's the procedure for a Docker container to process an action's invocation request?
First, we need to create a container. Second, we need to initialize the container with the action code; this injects the action code into that container. And finally, we can execute that action code inside the container.

One thing that needs to be mentioned is that if a container is already injected with some action's code, it cannot be initialized again with another action's code, because for security reasons each action needs to be isolated from other actions. This means that an invocation request for action B cannot use containers that are already initialized with action A's code.

From these three steps, we can conclude that there are three types of invocation requests. The first one is the cold start, which has all three steps: the container creation, the code initialization, and the code execution. The second one, the prewarm start: we don't need to create a new container for it, just initialize the code and execute the code. And the last one, which only has one step, the code execution, is the so-called warm start. It's clear that the warm start is the best case for an invocation request; it only takes several milliseconds.

Now let's have a look at the life cycle of a Docker container in OpenWhisk. At first, the invoker will receive an invocation request for action A and start a container creation. Once created, the container is in the prewarmed state. Then we initialize action A's code into it and the container reaches the warm status, and then we execute the action code, so now the container is busy. Once the execution is completed, the container goes back to warm, and it's ready to handle another invocation request for action A; it's only for action A, not for any other action. If there is a new request, this is a warm start, and when the second execution completes, the container becomes warm again. After some time, this container will be timed out and terminated if there is no more incoming request for
the action A, and the resources are released to the system.

In a real case, the cold start can get even worse. Let's say we have this warmed container in the system, and here an invocation of A comes to invoker zero. This is a warm start; that's great. But when the next invocation comes for a different action, we first need to delete a warmed container and start a fresh new one, and initialize the code, and finally we can execute the code. So there is an additional container deletion step on top of the cold start.

Now here comes the question: what's the best way for the controllers to decide which invoker should be selected for an invocation request? This is what scheduling means in OpenWhisk. As we already know, the warm start is much better than the cold start, so for the best system performance we need to reuse warmed containers as much as possible.

So is it difficult to achieve optimal scheduling for OpenWhisk? From this diagram it looks pretty easy, because we already know the invokers' statuses are like this. So for the invocation requests A, B, A, C, we can send the first A to invoker zero, and B to invoker one, and the second A request to invoker zero again, and C to invoker two. At the least we can get three warm starts; if we are lucky, we can even get four. It looks perfect. But is it possible to get the invokers' status while we do scheduling?
The answer is no. The reason is that the execution time of an action can be very short, like two milliseconds. That means the invokers' status can change every two milliseconds. So at this moment you may remember that the invokers' statuses are like this, and it's correct. But after two milliseconds pass, it's changed, and changed, and changed again. But you still think that the invokers' status stays the same as your first snapshot, which is totally wrong.

And there is a larger issue: scheduling can happen in many controllers at nearly the same time. So one controller needs to consider other controllers' decisions without knowing those choices, because other controllers' decisions may change the current invokers' status.

So to achieve optimal scheduling for OpenWhisk, we need to first collect the real-time resource status of all invokers, then factor in the scheduling decisions from other controllers, and from all available invokers choose the best one. And all of this should be done in only two milliseconds. Believe it or not, I think this is kind of impossible. So now you can think about it: how would you design the scheduling for OpenWhisk?

Okay, now let's have a look at what OpenWhisk does first. OpenWhisk decides the target invoker for an action in advance with a hash function. For each action, it calculates a hash value and uses the hash value and the number of invokers to get a home invoker. The home invoker acts as an index into the invoker list. So for every action, OpenWhisk can get the home invoker in advance without considering the location of existing containers.

But here comes another question: do we need to consider the capacity of the home invoker? If the home invoker runs out of its resources, should we still send the invocation request to it?
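Before answering, the home-invoker selection described above can be sketched in a few lines. This is only a rough sketch in Python (OpenWhisk itself is written in Scala, and the function names here are illustrative); it also includes the step-size probing OpenWhisk uses when the home invoker is full, which the talk explains next.

```python
from math import gcd

def step_sizes(n_invokers):
    """Numbers coprime to the invoker count, used as probe offsets.

    Coprimality matters: stepping by a coprime offset modulo the
    invoker count eventually visits every invoker exactly once."""
    return [s for s in range(1, n_invokers) if gcd(s, n_invokers) == 1]

def schedule(action_hash, n_invokers, has_capacity):
    """Return the chosen invoker index, or None if every probe fails."""
    home = action_hash % n_invokers          # the "home invoker"
    steps = step_sizes(n_invokers)
    step = steps[action_hash % len(steps)]   # per-action step size
    candidate = home
    for _ in range(n_invokers):
        if has_capacity(candidate):
            return candidate
        candidate = (candidate + step) % n_invokers  # probe the next one
    return None  # all busy: OpenWhisk falls back to a random choice

# Example: 10 invokers, invoker 3 is full, an action whose hash is 3.
print(schedule(3, 10, lambda i: i != 3))  # prints 2
```

Because the home invoker depends only on the action's hash and the number of invokers, every controller computes the same answer without any coordination.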
We should not, because there may still be other invokers in the idle state. So for this issue, OpenWhisk has another mechanism. There are the step sizes, which are coprime numbers smaller than the number of invokers. For example, if the number of invokers is 10, then the step sizes are an array like 1, 3, 7. You can think about why OpenWhisk uses coprime numbers here. So for an action A, its step size is its hash value modulo the number of step sizes. Then, once the number of invokers is decided, for each action its home invoker and its step size are also decided.

While doing scheduling, OpenWhisk will first check the home invoker, and if the home invoker has no capacity, it will add the step size to that home invoker's index and check that invoker, and again and again until it finds a suitable one. If all invokers are busy, OpenWhisk falls back to plan B: it will randomly select an invoker for the invocation request.

And for the issue that one controller needs to consider other controllers' decisions, OpenWhisk uses a sharding strategy: it divides each invoker's resources into different and independent parts, and assigns each part to a controller. With this approach, the decisions of one controller will not affect other controllers' resources. The invocation requests handled by controller zero have no effect on controller one's resources, and likewise the invocation requests handled by controller one have no effect on controller zero's resources, so there is no interference between these requests. And once an execution is over, the system releases the resources back to the corresponding controller.

As we described before, OpenWhisk uses a hash value while doing scheduling, so the invocations of A handled by controller zero or controller one will all be forwarded to invoker zero, because A's home invoker is invoker zero. And when there are no resources at all, the invocation request will be randomly sent to one of the invokers, and it will wait
until resources are available in that invoker. So with this sharding strategy, OpenWhisk does not need to consider other controllers while doing the scheduling. And with these two features, it seems that the issues we mentioned about scheduling are all resolved, and the system shows good performance.

Let's have a look at some benchmark results. The first one is very simple: we only invoke one single action during the benchmark, and we get a pretty nice TPS. This is great. Now let's try another case: we invoke 100 actions randomly during the benchmark. But we got a pretty awful TPS; it's only 90. So you might think that there could be some problems in our environment, but trust me, we ran a lot of benchmarks against the 100-actions case, and they all showed similar results. So what's wrong with the current scheduling of OpenWhisk?

Now take it easy; we will first show another benchmark result, for the Docker daemon. Okay, this is the benchmark of docker run to docker remove, and we get only 10 TPS, which is quite slow. But many people, including me, think that Docker containers are quite fast. So why does it show this?
The reason is that the Docker daemon processes requests sequentially. And for the same reason, the performance of docker pause to docker unpause also didn't look good; it only showed about 40 TPS. From many of our tests, the average time of docker run to docker remove is about 500 milliseconds to 1,300 milliseconds, and pause to unpause is about 50 milliseconds to 400 milliseconds.

Now some of you may have guessed the reason why the TPS of 100 actions is so low: maybe there are many cold starts during the benchmarks. Yes, you are correct. There are indeed many cold starts during the benchmarks. Next I will explain why this shows up in the 100-actions case.

The first reason is that there is some interference among actions. Let's say we have two actions which have the same hash value, so they have the same home invoker. And now this home invoker has four busy containers, and at some point, one of the busy containers of A finishes its job and becomes warmed. But here comes the invocation of B. What will happen?
There will be a deletion, and container creation, and initialization, and finally execution. This is a typical bad cold start. And when the container for B becomes warmed, here comes the invocation of A, and the same thing happens again: there is a delete, create, initialize, and execute. So in this example, the container creation and deletion happen on every invocation request, while there are still two invokers in the idle state.

This is the worst case, because the two actions have the same hash value. If these two actions had different hash values, this would not happen: the container creation and deletion would not occur, and all containers would be fully reused. Although this looks like quite a coincidence, in a real case this can happen easily, because there might be hundreds of actions in flight. Some actions are bound to share the same home invoker, and then cold starts happen.

Because of this, even if an action's execution time only takes about two milliseconds, due to the container creation and deletion, the actual time per invocation can be very long; it can be 650 times slower. And all subsequent requests are also delayed.

The second issue is that an invocation doesn't wait for a previous run. What does this mean?
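Before moving to the second issue, the cost of the first one can be put into numbers with a back-of-the-envelope model, using the timings measured above (~2 ms to execute the action, up to ~1,300 ms for a full delete/create/initialize cycle). This is only an illustration; the constants and the function are made up for the sketch, not OpenWhisk code.

```python
WARM_MS = 2      # warm start: just execute the action
COLD_MS = 1300   # worst case: delete + create + initialize + execute

def total_ms(n_invocations, hashes_collide):
    """Alternating requests for actions A and B on one invoker.

    If A and B share a home invoker with room for only one container,
    every request evicts the other action's warm container, so every
    start is cold. With different home invokers, only the first
    request of each action is cold and the rest are warm starts."""
    if hashes_collide:
        return n_invocations * COLD_MS
    return 2 * COLD_MS + (n_invocations - 2) * WARM_MS

print(total_ms(100, hashes_collide=True))   # 130000 ms
print(total_ms(100, hashes_collide=False))  # 2796 ms
```

Two colliding actions turn a few seconds of work into minutes, which is exactly the 650x gap between a 2 ms warm start and a 1,300 ms container cycle.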
I will use an example to explain this. Let's say that action A's execution time is about 20 milliseconds, and its home invoker is busy now. So the new request for action A will be sent to another invoker, like invoker 2. And this is a cold start: create, initialize, and execute; it will take about 520 milliseconds.

But what if we don't schedule it, and just wait? In this case, we wait for one of the busy containers on the home invoker to become warmed, which will happen 20 milliseconds later, and then we send that request to the home invoker. This is a warm start, and it only takes 40 milliseconds in total. So it's about 13 times faster than scheduling it immediately to an idle invoker. This means that if the execution time of an action is smaller than 500 milliseconds, which is the time spent to create a new container, it's better to wait for a previous run to finish than to schedule it to an idle invoker immediately.

There are still other issues in the current scheduling algorithm; you can refer to this URL to see the details if you are interested.

And because of the big difference between the TPS of one action and of 100 actions, here comes a new question: we cannot determine how much TPS our cluster can provide. For example, if there are only 100 active users and each user invokes one action, we may provide a max of 20,000 TPS. But if each user invokes 10 actions, the TPS can be reduced to 6,000, and it can even fall to 30 if each user invokes 100 actions randomly. And the max TPS also changes if the number of active users changes, again and again. So under the same environment, our system's performance can vary according to the number of active actions and the number of active users. This means we cannot decide when and how to scale out our clusters; the resource planning is very difficult.

So now you know that there are some flaws in the current scheduling of OpenWhisk. Next, Dominic will talk about what we have done to resolve these issues and what we got. Thanks.

Okay, until
now you have seen how OpenWhisk works and what the issues in the current OpenWhisk are. From now on, I will share what we did to overcome those issues. We introduced a per-action queue and pull-based scheduling, we separated container creation from the invocation path, and we introduced a couple of new components. I will share the details in the following slides.

In the open source version, the invocation request for action A is sent via the invoker0 topic, and for action B it is also sent via the invoker0 topic. It means that if an invocation of action A is delayed, then the invocation of B is also delayed. So we introduced a per-action message queue. Now each action has its own message queue, A and B. A request for action A is sent to message queue A, and a request for action B is sent to message queue B. In this way, even if one action is delayed, it does not block the others.

But if each invoker fetches messages from all of these message queues, then the same thing can happen on the invoker side. So we introduced pull-based scheduling: now each container itself pulls its own messages. One more benefit of this approach is that controllers do not need to consider the existing location of containers. Each container will fetch messages and invoke them, and fetch messages and invoke them, again and again, so container reuse is maximized.

And in the open source version, when we invoke an action and there is no container, we must create a container first; only after that can we invoke action A. It means that the container creation time is included in the invocation time, so if creation is delayed, then invocation is also delayed. So now, when we send the invocation request, we asynchronously send the container creation request as well. Then the invoker receives it, creates a container, and finally we can execute the action. Seemingly this is slower than the previous one, but actually these two paths work almost simultaneously, and once the container is created, then only these two
paths keep working repeatedly, so it can be faster. And if one container is not enough to handle all incoming requests, then we can send another container creation request, and it will create a container and invoke the action. While these red paths are active, the black paths still work as expected. It means that container creation does not affect the existing invocations anymore.

The next one is a new component. Previously, controllers had to schedule actions based on their own sharded resources. Now each invoker periodically stores its resource status in etcd, and we introduced distributed transactions before creating containers. With this, we could get rid of the fragmented per-controller resources on the controller side, and we could manage our resources globally.

The final one is a new component, the scheduler. We implemented a new queue replacing Kafka. I think many of you are already familiar with Kafka. Kafka is a very popular and famous messaging queue, and sometimes it is even considered the de facto standard in the messaging queue area. But we could not take advantage of Kafka, so let me share why we did not use it.

In Kafka, a topic is comprised of multiple partitions, and the partition is the unit of parallelism. It means that if we need two consumers, then we should have at least two partitions. And if there are more consumers than partitions, the extra consumers cannot fetch any messages. And it's not easy to change the number of partitions at runtime; normally it takes several seconds. So what we can do here is make big enough partitions in advance. But how can we define "enough"?
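The partition-to-consumer constraint just described can be modeled in a few lines. This is a toy model of the assignment only, not Kafka's real rebalancing protocol.

```python
def assign(partitions, consumers):
    """Spread partitions round-robin over consumers.

    Returns {consumer_index: [partition_indexes]}. Consumers beyond
    the partition count end up with an empty list: they sit idle and
    can never fetch a message."""
    mapping = {c: [] for c in range(consumers)}
    for p in range(partitions):
        mapping[p % consumers].append(p)
    return mapping

# 3 partitions, 1 consumer: that consumer handles all three partitions.
print(assign(3, 1))   # {0: [0, 1, 2]}
# 3 partitions, 5 consumers: consumers 3 and 4 can never fetch anything.
print(assign(3, 5))   # {0: [0], 1: [1], 2: [2], 3: [], 4: []}
```

This is why the partition count caps parallelism: adding consumers past it buys nothing, and every membership change triggers a rebalance.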
We cannot make sure how many containers will be required; some actions may require 300 containers, but others may require 1,000 containers. So we cannot decide the number of partitions in advance.

The worse thing is, if there is one consumer, then it will be in charge of all three partitions. If one consumer is added, then they will split the partitions between them, and if one more consumer is added, then all three consumers will be in charge of one partition each. This process of allocating partitions to consumers is, in the Kafka world, called consumer rebalancing. Consumer rebalancing happens whenever a consumer is added, and it takes time proportional to the number of consumers. In our test, it took around 50 seconds to rebalance 200 consumers. As you saw on the previous page, 1.3 seconds of Docker daemon latency is already quite a big overhead in the serverless world; 50 seconds is definitely not acceptable.

Also, this consumer rebalancing is controlled by Kafka; we cannot control the routing. In the future we may want to add our own custom routing rules, but we cannot do that with Kafka. So we implemented a new message queue ourselves, from scratch. It acts as a message queue for invocation requests, and, as its name stands for, the scheduler decides when and where to add more containers, and we could achieve full control of routing.

In this way, we could resolve the action interference with the per-action message queue. With pull-based scheduling, we can schedule actions without considering the existing location of containers. We separated container creation from invocation, so creation does not block invocations anymore. We could manage our resources globally with etcd and distributed transactions. And finally, we could have full control of routing with the new scheduler component.

This is our performance comparison between the open source version and our new scheduler, and it shows around 155 times more TPS. Actually, we did not enhance the TPS itself; we
just made the system show consistent TPS, no matter how many users and how many actions are used.

Okay, this is the end of our speech. Thank you for listening. So, do you have any questions?

Yes, actually... he asked whether we have considered running multiple invocations in one container. Yeah, in OpenWhisk we call this intra-container concurrency, and it is actually already supported in Apache OpenWhisk. But it may introduce some security issues, because if we run multiple invocations in a container at the same time, then they may introduce some confusion between contexts, and all the invocations will share one set of resources, so the container may crash if one of them consumes too much memory, and so on. But anyway, Apache OpenWhisk currently supports that kind of intra-container concurrency.

"My question was: if there are 500 concurrent requests to OpenWhisk, does that mean there will be 500 containers created?" Yes; actually, I think it doesn't usually work in a way that all 500 requests come to the system at exactly the same time, but if it is so, then it will create 500 containers in parallel. While processing them, if the execution time is short enough, then maybe a few of them can be reused. Okay, thank you.

Any other questions? About contribution? He asked whether we are willing to contribute this to the upstream. Yeah, sure. Actually, I'm a committer and PMC member of Apache OpenWhisk, and I am trying to contribute this implementation back to the upstream. We are working on it now. Any other questions? Thank you for listening.