Hello, my name is Manuel and I'm here with my colleague Parijat. In this presentation, we want to take a look at serverless workflows and how using service meshes can reduce their completion times.

Let me start with the obligatory introduction to serverless. The idea of a serverless platform is that developers can focus solely on writing code. When the code is ready for release, it is uploaded for the platform to use. We also need to specify what triggers our code. This may be a scheduled execution, a reaction to arbitrary cloud events, a request from our mobile apps, from our shops or factory sites, just anything that could trigger the code. The serverless platform handles those triggers and manages deployments autonomously. Function as a service means that the platform creates container instances equipped with a runtime and an event loop that calls the function for every trigger it processes. So the developers never have to think about servers again.

But naturally, developers don't just write one function. There may be plenty of functions involved in an application, large and small, to produce the desired results. Of course, over time it gets more complex, so we would organize the work into reusable function compositions. When we put the functions in the order of execution, there might be some branching, conditional invocations, mapping or error handling. A single trigger causes a couple of actions to be invoked, all of which we'd want the platform to manage and run for us. And this is what we'd like to call a serverless workflow.

Like any modeling approach, the serverless workflow captures the process in a common notation, so developers and domain experts can agree on a formal definition. A serverless workflow composes multiple invocations. It specifies the order of execution in a control flow, and it puts all actions of a workflow invocation in a shared context. Depending on the workflow language, invocations may be called actions, steps or tasks. The flow may be represented as a task graph, a flow chart or a state chart. Data maintained in the context of a workflow are called artifacts or just workflow data.

The CNCF hosts Serverless Workflow as a sandbox project to create a community-driven workflow language that meets our requirements as application developers, and I think the foundation is the perfect environment to create a vendor-neutral specification. If you'd like to know more about the work, stop by the project's booth or visit serverlessworkflow.io.

The workflow language is independent from the runtime implementation of the platform. And we asked ourselves, how can our platform achieve fast workflow completion? From a workflow perspective, the platform does two things that are on the critical path: we need to pass control from one action to the next, and we need to pass along the workflow context.

This talk is organized as follows. First, I'm going to talk about our decentralized design approach and four ways to deploy it on Kubernetes, in search of the fastest and most flexible communication. And second, my colleague Parijat is going to discuss how we can group functions to avoid most of the communication and still be able to balance load across the resources we allocate on the cluster, followed by a design to dynamically control the rebalancing.
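Before diving in, here is a minimal sketch of what the two critical-path actions look like in one workflow step: a tiny HTTP service that receives the workflow context, runs its function, and hands both control and context to the next step. All names and URLs are hypothetical assumptions; this is an illustrative sketch, not the implementation behind the benchmarks.

```python
# Sketch of a single decentralized workflow step (hypothetical names/URLs).
import json
import os
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

NEXT_STEP_URL = os.environ.get("NEXT_STEP_URL")  # e.g. http://step2.default.svc

def my_function(context: dict) -> dict:
    """The developer-provided function: transform the workflow context."""
    context["step1_done"] = True
    return context

class StepHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        context = json.loads(self.rfile.read(length) or b"{}")

        # Acknowledge the trigger first so the caller is not blocked.
        self.send_response(202)
        self.send_header("Content-Length", "0")
        self.end_headers()
        self.wfile.flush()

        context = my_function(context)           # run this step's function

        if NEXT_STEP_URL:                        # pass control + context on
            req = urllib.request.Request(
                NEXT_STEP_URL,
                data=json.dumps(context).encode(),
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req, timeout=10)

if __name__ == "__main__":
    HTTPServer(("", 8080), StepHandler).serve_forever()
```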
To run workflows, we could use a central workflow engine that dispatches invocations to function workers. We want our workflows to complete fast, but that way we would always have a centralized component that is on the communication path of all our workflow executions, and it could easily become a bottleneck that we would need to scale separately. So instead, we're using a decentralized setup. This way we can hand over control directly between instances and we can pipeline workflow invocations. Workflow control logic is evaluated at each instance.

Now let's explore four options to deploy this on Kubernetes.

Our first pattern is just plain microservices. We could use Kubernetes deployments and expose them as services. To pass control from one step to another, we can simply use service requests. The event loops here have to be implemented as asynchronous APIs to hand over control and not interlock the services. To benchmark this, we've set up a series of five steps and we've traced the time to hand over control between every step of the flow. The implementation sends a cloud event with one kilobyte of random data. The chart shows a box plot for each of the steps with the time measured between making the service request and receiving it at the next step. The median time for transmission across all steps is only 620 microseconds.

Our next option is Knative Serving, using Istio as ingress. Every Knative Service deploys with a queue-proxy. Knative Serving uses a gateway to accept connections. For a running service, the request goes through the gateway and the queue-proxy before it hits our runtime. The gateway can redirect requests even when there are no deployments, which means Knative can scale the deployment to zero instances and spin up new ones when there is new demand. This also allows for traffic splitting, for example to migrate load between revisions of a service. But the indirection comes at a cost in latency. We've measured again our series of five steps and the time it takes to hand over one cloud event with one kilobyte of data. The implementation is the same, only now we're using Knative Services that already have running instances and no cold starts. In this case, half of the deliveries take about 2 milliseconds or less. Knative Serving introduces some latency, but we've also moved to a solution that can scale to zero and that can split traffic for revision management. It uses a service mesh to do so.

Let's take a look at the third option. Knative Eventing has been around for two years and provides composable primitives to enable late binding of event sources and event consumers. One of these primitives to define a message passing topology is the sequence. It's a simple pipeline that automates the creation of channels and subscriptions. When we trigger a sequence, each stage makes synchronous calls to the destinations and the result is used for the next invocation. This makes it pretty easy for developers to use. It also provides multiple technology bindings. For example, there's an in-memory implementation, a NATS binding and a Kafka binding available for channels. We've used the in-memory channel implementation that handles the communication in our scenario with a single pod. But message passing almost always uses a store-and-forward pattern and we would expect this to add some latency. So let's take a look at the benchmarks. We've measured a sequence of five stages with plain Kubernetes services, so we're using simple deployments for our runtimes. Again, we're measuring only the time it takes from the event leaving the service until its arrival at the next service, and again, we're using cloud events with one kilobyte of data.
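For reference, here is a minimal sketch of how such a hand-off measurement could look: the sender attaches a send timestamp to a one-kilobyte CloudEvent in binary HTTP mode, and the receiving step compares it against the arrival time. The URL, event type, extension attribute and the assumption of synchronized clocks are ours, not details from the talk's benchmark harness.

```python
# Sketch of measuring one hand-off (hypothetical URL, not the real harness).
import os
import time
import uuid
import urllib.request

NEXT_STEP_URL = "http://step2.default.svc.cluster.local"  # hypothetical

payload = os.urandom(1024)                     # 1 KiB of random event data
req = urllib.request.Request(
    NEXT_STEP_URL,
    data=payload,
    headers={
        "Content-Type": "application/octet-stream",
        # CloudEvents binary content mode: context attributes as ce- headers
        "ce-specversion": "1.0",
        "ce-type": "workflow.step.trigger",
        "ce-source": "step1",
        "ce-id": str(uuid.uuid4()),
        "ce-sendtime": str(time.time_ns()),    # custom extension attribute
    },
)
urllib.request.urlopen(req, timeout=10)

# At the receiving step, the overhead of this hand-off is then roughly:
#   arrival_ns = time.time_ns()
#   overhead_ms = (arrival_ns - int(headers["ce-sendtime"])) / 1e6
```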
Interestingly, we found that with the in-memory channel, the time between the second-to-last and the last stage in a sequence always has a much higher latency. But since we're looking at the median of all deliveries here, that doesn't make much of a difference. 50% of all the event deliveries take 1.45 milliseconds or less, or 1.4 milliseconds if we left out the last step.

Our last option combines both the Knative Eventing sequence and Knative Services, because it gives us the features of both: the loose coupling and late binding of the Eventing sequence, and the ability to scale to zero and migrate between revisions of a Knative Service. Let's look at the numbers. Again, we're measuring a sequence of five stages, only here we're deploying Knative Services. And again, we're measuring the time it takes for a one-kilobyte cloud event to be passed from one service to the next. 50% of all the event deliveries take 4.59 milliseconds or less, or 4.7 milliseconds if we left out the last step.

Now, let's compare the four options and let's also take a look at what happens when we transfer larger events. The combination of Eventing and Serving had a median of 4.5 milliseconds for a kilobyte of data, but for 2 megabytes, 50% of the deliveries take almost 90 milliseconds or less. The Eventing sequence alone, invoking basic services, requires 60 milliseconds to pass on 2 megabytes of event data. Knative Serving without Eventing channels delivers the event from one instance to the next with a median of 54 milliseconds. And of course, basic services achieve the best figure here, with a median overhead of 40 milliseconds to pass on 2 megabytes to the next service. When we look at even larger data sizes, we find that the linear trends continue. We can see that the absolute overhead of sequences increases with larger data sizes, which would be due to the store-and-forward communication pattern when queueing messages.

To summarize this part, we can state that every additional interaction adds latency, whether it's a proxy or a message broker. Proxying HTTP transactions is faster than using a store-and-forward pattern, which is especially noticeable with large data. When we want to use communication features like complex routing or message queues, it's always better to separate control from data. And we've also got a feeling now for how much latency each of the solutions causes.
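As an illustration of separating control from data, here is a small claim-check style sketch: the large payload goes into a shared data store and only a small reference travels through the routing layer. The shared-volume path, URL and field name are hypothetical assumptions, not details from the talk.

```python
# Sketch of separating control from data, assuming both steps can reach a
# shared data store (here simply a shared volume path; names hypothetical).
import json
import os
import uuid
import urllib.request

SHARED_DIR = "/mnt/workflow-data"              # e.g. a shared volume / store
NEXT_STEP_URL = "http://step2.default.svc"     # hypothetical

def hand_over(control: dict, payload: bytes) -> None:
    # 1. Put the large payload into the data store...
    ref = os.path.join(SHARED_DIR, f"{uuid.uuid4()}.bin")
    with open(ref, "wb") as f:
        f.write(payload)
    # 2. ...and send only a small control event through the routing layer.
    control["data_ref"] = ref
    req = urllib.request.Request(
        NEXT_STEP_URL,
        data=json.dumps(control).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

# The next step resolves the reference only when it actually needs the data:
#   with open(event["data_ref"], "rb") as f: payload = f.read()
```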
My colleague Parijat will now present how we can avoid transferring data completely by grouping functions, while still being able to balance the load across allocated resources.

Thank you, Manuel. So far, we have looked at communication mechanisms that can be used between the functions of a workflow. We looked at how the overheads of these mechanisms increase as we try to transfer more data between the functions. And we also measured that calling a function directly with the data is actually better from a latency perspective than a store-and-forward approach. Ultimately, we are looking for ways to reduce the overall completion time of a workflow, and another factor that affects the completion time is how you package the functions of a workflow together, essentially how you group the functions. In all the previous approaches which Manuel presented, there was an implicit assumption that all the functions were running in different containers or pods that are not explicitly sharing resources.

Now, this choice can affect the workflow completion time, particularly when there is significant data that needs to be transferred between two functions, say F1 and F2. In this case, either F1 directly calls a remote instance of F2 with the data, or it calls it by reference by first writing the data locally and sending a reference that F2 pulls, or it writes to a global data store from which F2 pulls. We have measured the workflow completion time of this approach in an experiment as we increase the size of the data that needs to be transferred between the functions. In the legend, "file ref" means that we are passing data by reference and then pulling it, and "data call" means that we are calling the functions directly with the actual payload.

What is interesting in this graph is what happens when you compare this approach with an alternative one where both F1 and F2 reside within the same container or pod and explicitly share resources. Here, F1 either directly sends data to F2 or passes it via reference, but to a local instance of F2. And you can clearly see that there is a difference between these two approaches, and keeping the data local performs better. This is not very surprising, because there is a lot of research indicating that we should try to put compute and data together to speed things up.

So what I want to indicate here, at a high level, is that there are certainly benefits to be gained if we simply relax the assumption that all functions need to reside in different containers and instead package multiple functions of a workflow together inside a single container. Potentially, we can package all the functions of a workflow inside a single container to speed up the overall workflow execution time.
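To make the local-versus-remote comparison tangible, here is an illustrative micro-comparison, not the experiment from the talk: F1 invoking a co-located F2 directly versus shipping the full payload to a remote F2 over HTTP. The endpoint and the stand-in workload are assumptions.

```python
# Illustrative comparison of the two extremes discussed above.
import os
import time
import urllib.request

REMOTE_F2_URL = "http://f2.default.svc"         # hypothetical remote replica

def f2(data: bytes) -> int:
    return len(data)                             # stand-in for real work

def local_call(data: bytes) -> float:
    start = time.perf_counter()
    f2(data)                                     # same container: plain call
    return time.perf_counter() - start

def remote_call(data: bytes) -> float:
    start = time.perf_counter()
    urllib.request.urlopen(
        urllib.request.Request(REMOTE_F2_URL, data=data), timeout=30
    )                                            # other container: ship data
    return time.perf_counter() - start

for size in (1 << 10, 1 << 20, 2 << 20):         # 1 KiB, 1 MiB, 2 MiB
    data = os.urandom(size)
    print(size, "local:", local_call(data))      # remote_call(data) needs a
                                                 # reachable F2 deployment
```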
And this strategy can work well as long as the container has enough resources to handle all the incoming workflow requests. But now you are also dealing with a larger unit of deployment, which is a workflow with multiple functions that need to be scheduled together, and balancing load in this strategy can become tricky.

Consider a situation where you have multiple replicas of the workflow available, and a request going through the top container causes one of the downstream functions, F2, to start doing some heavy work. The top container may continue to accept more requests, as the initial part of the workflow may not be aware of congestion happening in the downstream functions, particularly if the workflow structure is more complex than the one shown here. But now that we have admitted this request into this container, our choice of staying local would continue to push the request down locally, potentially degrading the completion time of all the requests going through this container. Even though we have admitted the request, we can potentially redirect or rebalance it internally, from the middle of the workflow, to a remote instance of F2 if that is not loaded. That way we can still benefit from additional resources if they are available elsewhere, or we can even create resources on the fly. What I am trying to indicate here is that when we group functions together and treat that as a deployment unit, one may have to dynamically decide between keeping requests local and sending some requests to remote instances of downstream functions in other containers.

This situation can change dynamically depending on the load, and that makes the load balancing more tricky. And what happens if you now also have to transfer data between functions while rebalancing a request to a remote container? In that case, the request will have to take the hit of F2 first pulling the data and then processing it, which may also affect the completion time of this rebalanced request.

So we did an experiment to try that out. Two requests were already being processed concurrently by the top container, and then a third request arrives at the same container. Now we make a choice between pushing this request down locally within the same container or calling a remote instance of F2, so that we rebalance this request while pulling the data. And then we measure the completion time of this third request. What we find, interestingly, is that the situation is completely reversed: even under load, and depending on your application, as the size of the data to be transferred between the functions increases, it can still turn out to be better to send this third request remotely and take the hit of pulling the data than to push it down to the local instance in the same container.

So overall, what I want to say is that co-locating multiple functions of a workflow inside a single container can actually accelerate workflow completion time, and we should certainly take advantage of that. But treating a workflow as a unit of deployment, particularly under load, will also require internally rebalancing workflow executions from somewhere in the middle of a workflow; you may not know beforehand where that needs to happen, and the situation can also change dynamically.

Now, the communication mechanism that you, as a serverless platform developer, choose for communicating between functions should be flexible enough to support these use cases. It should ideally provide dynamically reconfigurable routing at runtime without having to modify the core logic of your application. It should also provide fine-grained observability to monitor the situation, and it should support configurable load balancing strategies without your application logic having to do all of that. Fortunately, we don't have to build all of this from scratch. Service meshes built on Envoy Proxy already provide most of these functionalities for microservices, and we can adapt them to our use case.

Essentially, a service mesh takes on the communication on behalf of a service and provides a large set of tools to do that effectively, such as retries, load balancing and active health checks between services, and all of this can be configured at runtime via a centralized control plane. So when, say, service one tries to call service two, it simply hands over the request to the local Envoy proxy that is running as a sidecar. The proxy, on behalf of service one, load balances this request to one of the replicas of service two while providing facilities such as retries and active health checks. This load balancing can also be configured dynamically via the control plane, to say, for example, that 70% of the requests to service two should now go to a specific replica instead of being distributed in a purely round-robin manner, or we can have even more complex load balancing policies, which Envoy already provides.
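As a rough analogue, here is a plain-Python emulation of the weighted load balancing the Envoy sidecar performs on behalf of service one. In a real mesh this logic lives in the proxy configuration and is adjusted at runtime by the control plane, not in the application code; replica addresses and weights here are hypothetical.

```python
# Conceptual emulation of weighted load balancing across replicas.
import random

REPLICAS = [
    "http://10.0.0.11:8080",   # replica 1 of service two
    "http://10.0.0.12:8080",   # replica 2
    "http://10.0.0.13:8080",   # replica 3
]
weights = [1, 1, 1]            # equal weights: roughly round-robin behaviour

def pick_replica() -> str:
    return random.choices(REPLICAS, weights=weights)[0]

# The control plane can later reconfigure the weights without touching
# service one, e.g. send about 70% of the requests to the first replica:
weights = [70, 15, 15]
print(pick_replica())
```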
So we feel that a service mesh based on Envoy is a nice choice of communication mechanism that can be adapted to our use case, where multiple functions of a workflow are co-located inside a single container or pod. It provides a dynamically configurable and unified communication mechanism for both intra-container and inter-container communication. So now, when F1 wants to call F2, it simply hands over the request to the local Envoy proxy, which could be configured to pass all requests down to the local instance of F2 to begin with. If the local instance of F2 then exerts back pressure, say via 503 status responses to multiple health checks, the control plane can intervene and reconfigure the proxies so that, for example, only 20% of the requests are sent locally and 80% go to remote instances of F2. And this situation can change again depending on the load conditions and may require further reconfiguration.

So this is how we can leverage the service mesh as an enabler for accelerating serverless workflows: it provides a unified communication mechanism that lets you benefit from locality while at the same time being dynamically reconfigurable to support rebalancing of requests when needed. The benchmark experiments I presented, which measured the workflow completion time without load and under load with rebalancing, were actually conducted in this kind of setup, with the Envoy proxy redirecting invocations from F1 to F2 either locally or remotely to other replicas of F2 running in other containers.

So overall, in this presentation we explored different communication mechanisms for function-to-function communication in serverless workflows and measured their overheads. We also measured how co-locating the functions of a workflow can accelerate workflow completion time. We looked at the load balancing challenges this approach creates and at the need for dynamically rebalancing workflow executions. And we presented how a service mesh can act as a unified enabler for this dynamic rebalancing while also being leveraged for locality-aware communication.

We have already instantiated some of these ideas and concepts in our open source serverless platform that we are developing at Nokia Bell Labs, called KNIX MicroFunctions, which provides workflow support as a first-class citizen. We already accelerate workflows by co-locating all the functions of a workflow inside a single container and then load balancing externally between replicas of these complete workflow units. We also provide a Knative- and Istio-based implementation where our workflows are packaged as Knative Services. Currently, within the container hosting the workflow functions, we provide a custom local message bus for communicating between functions. But we are moving towards utilizing the Envoy proxy that comes packaged with Istio for intra-container communication as well, and then further utilizing the service mesh in conjunction with a control plane to provide dynamic workflow rebalancing with locality-aware communication.
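To close the technical part, here is a small illustrative control loop for the back-pressure-driven reconfiguration described above; it is a hypothetical sketch, not the KNIX implementation. It probes the local F2, treats repeated 503 responses as back pressure, and shifts routing weight to remote instances; in practice the resulting weights would be pushed to the Envoy proxies by the control plane rather than applied in application code.

```python
# Illustrative back-pressure reaction loop (hypothetical names and endpoints).
import time
import urllib.error
import urllib.request

LOCAL_F2_HEALTH = "http://127.0.0.1:8081/healthz"    # hypothetical

def local_f2_overloaded(probes: int = 3) -> bool:
    failures = 0
    for _ in range(probes):
        try:
            urllib.request.urlopen(LOCAL_F2_HEALTH, timeout=1)
        except urllib.error.HTTPError as err:
            if err.code == 503:                       # back pressure signal
                failures += 1
        except urllib.error.URLError:
            failures += 1
        time.sleep(0.2)
    return failures == probes

weights = {"local": 100, "remote": 0}
while True:
    if local_f2_overloaded():
        weights.update(local=20, remote=80)           # rebalance downstream
    else:
        weights.update(local=100, remote=0)           # prefer locality again
    # here the new weights would be pushed to the proxy configuration
    time.sleep(5)
```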
So do check us out at knix.io and our GitHub repository, knix-microfunctions/knix, where we are actively developing these ideas, and also check out our Slack channel. You can also find the code for all the benchmark experiments shown in this talk in our GitHub repository at knix-microfunctions/workflow-mesh.

So with that, we finish this presentation. We thank you for your attention, and we are happy to take questions now.