Hi, thank you all so much for joining. My name is Seth Wiesman. I'm a Solutions Architect for Ververica and a committer on Apache Flink, and I'm really excited to be giving this hands-on demonstration of building stateful serverless applications using Apache Flink Stateful Functions. If you're already familiar with Apache Flink, you likely know it as a world-class stream processor. It's very popular in the data engineering space for continuous ETL, real-time aggregations, and reporting. And so there's this obvious question, right? What does a stream processor have to say about serverless? By the end of the session, I hope you walk away thinking quite a lot. To begin, I want to do a little bit of table setting. These terms serverless and stateful serverless have become a bit loaded in recent years, and a lot of people are using them to mean many different things. To me, serverless is not simply these commercial function-as-a-service products, although those certainly do fall under the umbrella. Really, it is a realization of modern infrastructure capabilities, allowing us to iterate more quickly and with more confidence. So if our business is running a web app, and business is doing really well, traffic spikes, and we need to go from one instance to three, we're no longer requisitioning hardware, installing VMs, and setting up networks. Instead, we simply increase our replica count. Stateful serverless, at its core, is really just about bringing these advances to the application layer, along with some key primitives that any real-world application needs. First, consistent, durable state: your application needs to be able to retain information it can act on in the future. Second, cloud-native fault tolerance: as we maintain that state, we want to do so in a way that leverages what this underlying modern infrastructure is really good at, to make our lives as easy as possible in production. And third, simple messaging primitives between systems.
Your business is not built on a single application, but a whole host of systems that need to communicate with each other in arbitrary and complex ways. We want to make this as easy and intuitive as possible. This diagram shows a very traditional two-tiered application architecture, something I'm sure everyone here is very familiar with, where business logic is deployed via a stateless tier, giving us those nice serverless benefits. State is then managed on a separate tier via a database or other data store; the details here are not so important. When an application receives a request or other message, something to trigger computation, it will likely communicate with that data store, potentially update some values, and then do one of three things. One, it could do nothing: it could be that this message was simply to update state; we have done that and we are finished. Two, we may send a result back to an end user: this message or trigger was to read some information or query something. Or three, we're going to invoke another service, repeating the cycle. You can think of these different components interacting with each other as a sort of data flow. And the obvious question arises: what happens when something fails? Well, a fundamental problem with this architecture is that for any failure in any call across a service boundary, it becomes very hard to reason about which of the desired outcomes were actually achieved. Applications are forced to have a method of determining whether or not to retry, or to somehow make their state updates idempotent. But what if we rethought this problem from the beginning? What if we inverted it, so that messaging runs through the database? Well, it turns out this is exactly what stream processors like Apache Flink have been doing for years to provide what we call exactly-once state semantics. Business logic remains stateless and is deployed as a separate service from data storage.
But this time, messaging is going to flow through the database, and the data store is going to invoke functions in stateless containers, supplying state as part of the payload of each message. These application functions will then return both state updates and messages to be sent to other functions. By moving messaging from the compute layer to the storage layer, state and messaging are easily made atomic. And if messaging were to fail for whatever reason, the state update is also rolled back, so retries are always idempotent. This is exactly the approach taken by Apache Flink Stateful Functions. We are using a Flink cluster for message routing and state management, while allowing the actual functions containing application logic to be deployed in a separate compute tier. This gives us a very powerful runtime where compute and state are logically co-located for consistency, but at the same time physically separated. All state accesses and updates are integrated as part of the function invocation request and response. So our business logic can be deployed however we choose: it could be a standard Kubernetes service, using an orchestration tool like Knative, or even a wholly managed service like AWS Lambda. Yet we are able to retain consistent state and messaging. So as we go through our checklist of a proper stateful serverless framework, the initial requirements under pure serverless are easily met by deploying our business logic in stateless containers separate from everything else. But what about the stateful-specific requirements? Well, to understand that, let's talk about some of the core concepts in Stateful Functions. When developing an application, you're going to implement several services, or what we call functions, that are basically small pieces of code or logic representing entities within an application.
You could, for example, define a function type representing a user, with a single instance of that function type representing a single user within our application. Think of this as an object: in object-oriented terms, the relationship between a function type and a function instance is like that between a class and an instance. These function instances are invocable through messages and do not consume resources while inactive, which is to say when they're not being invoked. What this means is the runtime can host a theoretically infinite number of function instances within a fixed, finite set of resources. And this whole thing is polyglot from the ground up. We are deploying these functions in our own containers, meaning you can do so in any language you choose. The only requirement is that the language supports HTTP, gRPC, or UNIX sockets, which is to say we support virtually every language. Communication between the Flink cluster and user code happens through a very well-defined and small protocol, certainly something you could develop against yourself. At the same time, we realize that most people don't want to do that day to day, and so the community ships a number of predefined SDKs that wrap the protocol in higher-level, idiomatic constructs for each language. There's an SDK for Python, active development on Golang and Rust, and even a Haskell SDK that recently popped up in the community, with hopefully your favorite language coming soon. Adding new SDKs is very high on our list of priorities. So we can have user code, we can write it in different languages, and lots of people can do that. Where things get interesting is that we can run these functions with dynamic messaging and consistent state. If you have used Apache Flink in the past and you're familiar with this idea of a dataflow DAG, that is completely gone. Instead, we support arbitrary communication between functions using logical IDs. And so the only thing an instance needs to know to message some other function instance is its function type and ID.
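As a concrete mental model, a logical address is just that pair: a function type plus an instance ID. Here is a toy, stdlib-only sketch of the idea; the class and field names are illustrative, not the SDK's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Address:
    # A logical address: which kind of function, and which instance of it.
    # No host names, no ports, no topology -- the runtime resolves all that.
    namespace: str  # e.g. "example"
    type: str       # the kind of function, e.g. "greeter"
    id: str         # which instance, e.g. one per user

    def __str__(self) -> str:
        return f"{self.namespace}/{self.type}:{self.id}"

# Messaging another function only requires knowing this pair.
target = Address("example", "greeter", "seth")
```

The point is that the sender never learns where the target physically runs; routing is entirely the runtime's concern.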
So: what sort of function do I want to message, and which particular instance? If we were maintaining a user function to keep track of users of our business, we would have user as our function type, and there would be an instance for myself. My ID would be seth; someone else's might be John or Igal or Gordon or whoever else. And all of this can be done with exactly-once semantics. Function instances are able to maintain local state, while the runtime ensures that messaging and state updates are integrated, so users get out-of-the-box consistency. This is true across the inputs to the application, the application state itself, and the outputs delivered from the application. And I think most importantly, all of this is done with no database required, or better put, we are using Apache Flink as our database. Flink has long provided large-scale, consistent state management through the concepts of state backends and distributed snapshots. State is stored locally within the cluster for fast accesses and is periodically backed up to simple blob storage. This could be Amazon S3, Google Cloud Storage, HDFS, an NFS drive, MinIO, whatever you already have available. In the case of failure, when a pod that is part of the Flink cluster restarts for whatever reason, it will simply download its latest snapshot and continue on processing. This means we are not reliant on stateful sets or persistent volumes for high availability of state. The only thing we need highly available in the system is our blob storage, which is the easiest thing to achieve. Using this model, organizations at scale are managing hundreds of terabytes of state within Flink applications, with the confidence that they're delivering consistent, reliable results. So that is enough on concepts. Let's take a look at some specific SDKs and actually build something. Today we're going to be looking at the Python SDK in particular, but all of these concepts translate to all of the different SDKs.
They all offer the same core primitives. So we need to begin by thinking about types. Because remote functions can be implemented in any language, and a single application can be composed of many functions written in many different languages, we need a uniform format for communication. And for that we've decided to standardize on Protobuf. If you're not familiar, it is a serialization standard out of Google that has very strong cross-language support. And so all messages passed between functions must be encoded as Protobuf, and in particular they must be encoded as Protobuf Any, which is very convenient because it contains both the logical type and the serialized bytes. Within a particular user function, you can then, quote unquote, unwrap that Any message into a specific concrete type using your language-specific Protobuf library that you can then work against. The same thing goes for state types: anything we want as consistent, durable state must be a Protobuf Any, and this allows state written in arbitrary languages to be uniformly maintained by the Flink cluster. Flink's state backends are simply going to store the serialized Any record. At the same time, we realize this is kind of boilerplatey, and it is if you're working directly against the protocol. But for all of the language SDKs, we offer higher-level constructs so that you only ever have to develop against specific Protobuf types. Using, say, the Python SDK, you will rarely if ever actually see an Any record. So, as with any good introduction to a new bit of software, we're going to start with Hello World, but make it Stateful Functions-specific. We're building a greeter application that is going to greet users of our service based on the number of times that specific user has been seen so far. So every user is going to get a personalized greeting. The first time I'm greeted it might say welcome, Seth. The second time it may be welcome back, Seth. And the third time, third time's the charm, Seth.
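To make the wrap/unwrap idea concrete, here is a dependency-free sketch of the pattern. JSON stands in for Protobuf serialization purely to keep the example self-contained, and all names here are hypothetical, not the SDK's API; the real runtime uses Protobuf Any, which pairs a type URL with the serialized bytes in exactly this shape:

```python
import json
from dataclasses import dataclass

@dataclass
class AnyEnvelope:
    # Mimics Protobuf's Any: a logical type name plus the serialized bytes.
    type_url: str
    value: bytes

def pack(type_url: str, obj: dict) -> AnyEnvelope:
    return AnyEnvelope(type_url, json.dumps(obj).encode("utf-8"))

def unpack(envelope: AnyEnvelope, expected_type_url: str) -> dict:
    # "Unwrapping" first checks the logical type, then deserializes.
    if envelope.type_url != expected_type_url:
        raise TypeError(f"expected {expected_type_url}, got {envelope.type_url}")
    return json.loads(envelope.value)

msg = pack("type.example.com/GreetRequest", {"who": "seth"})
request = unpack(msg, "type.example.com/GreetRequest")
```

Because the envelope carries its own type name, a function written in any language can check what it received before deserializing, which is what makes the cross-language uniformity work.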
Yet if someone else is greeted, they are going to get their own personalized greeting. And this is going to show off some very important primitives. So we're going to talk about messages: how does a greet request for me specifically get to a function? And state: we need to maintain, for every user, a count of how many times they have been seen. Each function instance is associated with a function type and ID, as I said before, which forms its unique address. This logical address is what we use when messaging that function. So when I am to be greeted, we're going to send a greet request to the function type greeter and the ID seth. We can see that as the input to the function; this is the message that was passed to us. Again, as I mentioned, while the runtime is using Protobuf Any, by leveraging Python 3 type annotations the SDK is able to automatically unwrap that for us. Similarly, we can send our result to another function. We'll look at the middle bit of creating the greeting in just a moment, but we are going to both pack our result, so we can avoid that Any boilerplate, and send it to another function instance. In this case we're sending it to an email sender that is going to ship out that greeting. When messaging, we're using an address: we have our function type, which is email sender, that's the sort of function I want to message. Which specific one? Well, I want to message the email sender for this specific email address. And then we can go into our personalized greeting itself. This is showing off what I think is our most powerful feature, which is durable state. All this method is doing is keeping track of the number of times this particular user has been seen so far, and then generating a message based on that count. Our state is accessed via the context, and we are able to read out our state based on some name and specify the type.
So we're keeping track of seen_count, which is a Protobuf type I've predefined. We can both read that value out and write it back. And you know what? That's it; the rest of this is standard Python. There's nothing Stateful Functions-specific about the rest of this method. The only thing we have done differently than, say, building this in your CS 101 course is that our variables are being managed via the context instead of plain instance variables. We're using our context, but otherwise it's just Python, and we get all these nice primitives like durability out of the box. So we've written our code, but we have to make it available; it's running in some remote container. The first thing we need is our function registry. This is going to map logical function types to concrete bits of code. In this example we have written both our greet function and our send email function. They're both written in Python, and they're both in the same file, but neither of those is a requirement. The send email function, for example, could be a Rust function, or it could be implemented in Haskell, and it could be running halfway around the world from our greeter. But we're going to bind these to the registry, and we're giving each one its function type, so when we shoot off a message to that type, we know how to associate it with a specific concrete method. Then we need to expose it to the Flink cluster and ensure that it actually works against our protocol. For that we ship a request-reply handler, which dispatches the invocation requests to the bound functions and then encodes their side effects, both the resulting output messages and the state updates, as an HTTP response to be sent back to the Flink cluster. And then we simply expose this handler using your favorite HTTP framework. In this example and the later examples I'm using Flask, but that is not a hard requirement; that is just something I chose to use. Plug in your favorite library here.
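Putting those pieces together, here is a stdlib-only sketch of the greeter pattern. The tiny Context class is a stand-in for the SDK's persisted-state context, and all names are illustrative rather than the real statefun API; in the actual runtime, one such state scope exists per (function type, ID) pair and is persisted durably between invocations:

```python
class Context:
    """Toy stand-in for the SDK context: state reads/writes by name.
    The real runtime persists these values between invocations."""
    def __init__(self):
        self._state = {}
    def __getitem__(self, name):
        return self._state.get(name)
    def __setitem__(self, name, value):
        self._state[name] = value

def greet(context: Context, name: str) -> str:
    # Read the durable count, defaulting to 0 on the first invocation.
    seen = (context["seen_count"] or 0) + 1
    context["seen_count"] = seen  # write back; the runtime would persist this
    if seen == 1:
        return f"Welcome, {name}!"
    if seen == 2:
        return f"Welcome back, {name}!"
    if seen == 3:
        return f"Third time's the charm, {name}!"
    return f"Hello for the {seen}th time, {name}!"

ctx = Context()  # one scope per (type, id); here, the instance for "Seth"
first = greet(ctx, "Seth")
second = greet(ctx, "Seth")
```

Note that the function body is plain Python control flow; the only thing distinguishing it from an in-memory version is that the count lives in the context rather than an instance variable.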
Okay, so greeters are interesting, greeters are fun, but that's not what you're building. Your business is not built on hello world applications, but it might be built on model serving. So we're going to take a look at building a fraud detection application, specifically for credit card transactions. As a transaction comes in, we want to build up feature vectors, which requires looking at state; we need to remember things about our users and our merchants. We need to query these functions in dynamic ways, and then we want to score the result against a model that was likely provided by our data science team, giving us back a score on whether or not we believe this transaction to be fraudulent. At that point we can take some action. Okay, let's take a look at the code for this model serving example. Again, we're going to be sticking with the Python SDK, and for simplicity all the functions are implemented in a single file as a single Flask application, but just to reiterate and make it very clear: that is not a hard requirement. It's simply for the simplicity of this demonstration. These functions could all be implemented in different languages, and they can be packaged and deployed separately; that is a supported and expected workflow of many Stateful Functions deployments. So we're going to be building up feature vectors. Whenever a transaction comes in, we need to get information that we can use to send to our model, and one of those features is a fraud count: how many times over the last 30 days has this particular account reported and confirmed fraudulent activity? The idea here being that the more often we see fraud for a particular account, the more likely we are to continue to see it in the future. It's a rolling 30-day sum because people's behavior changes, and so as things recede further into the past they become less relevant.
So our function type is ververica/counter; this is the logical type we will use to message this function. It takes two parameters: our context, which gives us access to capabilities like state and messaging, and the actual message that was sent to us. Leveraging Python 3 type annotations, we get to avoid all of our Any protocol boilerplate, and I'm using a union type here because we support working against multiple message types. So let's start with this confirmed fraud message. A record is going to come in, say from a Kafka topic, that tells us a user has confirmed fraudulent activity against a particular account. This function, I forgot to mention, is always scoped to a particular account ID: fraud count is our function type, and the account is our ID for the logical address. When this message comes in, we need to increment our count, and so all we're going to do is go into our context, read out the current count, and then increment it if it already exists, or initialize it if there has been no fraud over the last 30 days for this particular account. Once we have done that, we will simply set that value back, and we're done. So while we have switched to using a context versus local variables, we are otherwise just writing very simple Python code and getting fault tolerance and durability from the runtime. But I said we also need to do a rolling 30-day count: every time I increment this fraud count value, in 30 days I need to decrement it. Well, we're able to send messages to other functions, but it turns out we're also able to send messages to ourselves. And more interestingly, we can send messages with a delay. So after we increment our count, we are going to call send after. Where are we sending the message? Well, we're going to send it to ourselves.
We have the context, so we can get the current address, and we're going to send ourselves an expire fraud message that tells us to decrement, but we are going to give it a delay of 30 days. So this message will not arrive until 30 days after we send it. And the runtime is able to ensure this message is consistent and durable, so that if we have a failure over the course of that 30-day period, for whatever reason, this message will not be lost. We do that, and we're ready to go. So we see that expire fraud is also an accepted type, and after 30 days it will arrive. What are we going to do with it? Well, we're simply going to decrement our value. I will read out our fraud count and decrement it, and then, if it's zero, we'll go ahead and delete the state entirely. That just frees up a bit of space and makes things more scalable, but this is really an optimization detail. Otherwise, we are going to go ahead and set the new value. So if it was five, it's now four; we have decremented it and we are good to go. But storing state is only half the story; we also need to act upon it. And so the third message type that this function accepts is query fraud. Someone can message a particular instance of this function, querying for a particular account, and ask: how much fraud have you seen over the last 30 days? When we receive this, we'll simply check our state value. If it's not already set, if there is nothing there, we'll give it some default, and then we will reply, sending this message back to the caller. This is everything we need for distributed, durable, consistent state and messaging in this function. Let's now see how it's used. I have some other functions in here that we're going to skip past, but the main function in this workflow is what I'm going to call the transaction manager. This is what coordinates the whole workflow and builds up our feature vector every time a transaction comes in.
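The rolling 30-day count boils down to: increment now, and schedule a matching decrement to arrive 30 days later. Here is a dependency-free sketch of that logic, with a toy mailbox standing in for the runtime's durable delayed-message delivery; the names are illustrative, not the SDK's API:

```python
from datetime import timedelta

class Mailbox:
    """Toy stand-in for the runtime's durable delayed-message queue."""
    def __init__(self):
        self.delayed = []
    def send_after(self, delay: timedelta, message: str):
        self.delayed.append((delay, message))

def on_confirmed_fraud(state: dict, mailbox: Mailbox):
    # Increment the rolling count...
    state["fraud_count"] = state.get("fraud_count", 0) + 1
    # ...and schedule the matching decrement to arrive in 30 days.
    mailbox.send_after(timedelta(days=30), "expire_fraud")

def on_expire_fraud(state: dict):
    count = state.get("fraud_count", 0) - 1
    if count <= 0:
        state.pop("fraud_count", None)  # free the state slot entirely
    else:
        state["fraud_count"] = count

state, mailbox = {}, Mailbox()
on_confirmed_fraud(state, mailbox)  # count -> 1, one decrement scheduled
on_confirmed_fraud(state, mailbox)  # count -> 2, another scheduled
on_expire_fraud(state)              # 30 days later: count -> 1
```

The crucial property the sketch cannot show is that, in the real runtime, the delayed message survives failures: it is part of the same consistent snapshot as the state it will eventually decrement.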
So again we have our context and our message types, the main one being a transaction. Every time a user, say, swipes their credit card or does something else, we will get a transaction event that contains the account ID, the merchant ID of where they were making this purchase, and the amount of the transaction. When we see this, we are going to cache it in state; we want to hold on to it and make it available later on. Then we're going to fan out to the different functions that we are using to build up our feature vector. You can see here we are querying that counter we just defined above, and we are going to the instance for this particular account. We're also getting some merchant information and some other values. When these functions respond, we saw that our fraud count replies back with a reported fraud, and here it is. When we get this, there is a bit of business logic to ensure that we have gotten all of our features. If we haven't, we'll store that reported fraud count in state until we get all the different features back. But when we have them all, we are going to build up our feature vector and message our model. The model is likely living somewhere else; it's provided by the data science team, who are going to iterate on it and deploy it separately from the rest of the application. It will take in that feature vector, compute a score, and respond back. When it does so, we get this fraud score: a confidence score from 0 to 100 of how likely we think it is that something is fraudulent, 0 being absolutely not and 100 being this is absolutely fraud. When we get that score, we will compare it to some predefined threshold, and if it is above the threshold, say 80 percent, we will send an alert to a Kafka topic called alerts that says, hey, we think this is fraudulent. The user will see that and can act upon it: they can confirm it, and the bank will block that transaction, or they can say, you know what, this was really me, please let it go through. We're also going to delete all of our state values at the end, because we have scored and alerted on this transaction, and we don't need to retain this information any longer. As we have built all of these functions up, we are making them available via the request-reply handler, and we're packaging this as a Flask application. I've defined an endpoint, /statefun, that accepts a POST, and whenever data arrives, we'll simply send the whole payload to the handler. It will manage dispatching to our functions and encoding our effects, our state updates and our responses, and we will simply send that back to the caller of this endpoint, which is the Flink cluster. When we go to package this, let's take a look at the Dockerfile: you'll see that there is nothing Stateful Functions-specific here. This is a plain and simple Flask application; there's nothing about the Flink runtime, nothing special about it in any way. And if we look at our dependencies as well, we are including the statefun SDK, which is what wraps that protocol in a high-level API, and then we are pulling in Flask and whatever other Python dependencies we need. Were this the model function, we might be pulling in NumPy or SciPy or any of those good data science libraries; we have full flexibility here. When it comes time to deploy this, we are going to deploy it as a standard Kubernetes deployment. I have written this deployment specification, I've pushed my image, and I want 10 replicas of it, because I want to be able to scale out. We are exposing it under port 8000, but this is all stock and standard Kubernetes. Additionally, there is a service that is making it reachable. So that gets us our user code, but what about the Flink cluster? How does it know where to send messages? All right, so this file is our module.yaml. This is the configuration we give to the Flink cluster that tells it how to map logical function types
to addresses under which our functions are reachable. So we can see here I have our counter function. I have said that this is the logical function type, so when you see a message targeting ververica/counter, this is the metadata you should use. The function is exposed as an HTTP endpoint, and this is the specific endpoint you should use. We also have, at the bottom, our ingresses and egresses; this is how the functions communicate with the outside world. You saw, for example, that we were sending alerts to a Kafka topic; we're also reading our data from Kafka topics. Let's look at the example of our confirmed fraud message. I have said that this is coming from Kafka, I have given it a name, and I have my Kafka-specific configuration: where the brokers live, consumer group IDs, things like that. Then we give it a list of topics to consume from, so we are reading from the confirmed topic. We've specified our type URL, so what sort of data are we reading, and then we give it a list of targets: what function types do we want to send these messages to. We give it a list of types; the ID is implicitly pulled from the header, and it will route our messages to the appropriate function to begin that computation. Along with Kafka, we support AWS Kinesis out of the box, and then, if you're comfortable writing a little bit of Java code, we also support a whole host of other systems, including JDBC, Elasticsearch, Pulsar, Pravega, and RabbitMQ, and as we see demand we'll add more first-class YAML support for those other systems. We're going to take this file, after we have written it, and build our Docker image. The base image, flink-statefun, contains the entire Apache Flink runtime along with all of the Stateful Functions-specific runtime code, and all we need to do is copy our module.yaml file onto the image. There's no Java code to write; there is no Flink-specific code to write. I am also including a flink-conf.yaml, which is some Flink cluster configuration, but this is stock
and standard if you have written other Apache Flink applications in the past. And this is the image we are going to use to run our cluster, and I am in fact already doing that. Let's go ahead and take a look at our pods. I have a Kubernetes cluster that is running three Kafka brokers for our data. I have a data simulator that is simulating transactions and confirmed fraud events and all those good things. And then we are running our Flink cluster and our user code. I'm running three nodes in my Flink cluster, each of which has only a single core, so it's very small, and then we are running our user code with a replica set of 10, because I want to really scale out that compute. I believe I'm already port-forwarded, so I can pull up the Flink UI, and you can see that everything is up and running. This is the Flink UI; if you're not familiar, it tells us what our application is doing. We can see here we have processed, since I started this, roughly 200,000 messages. These are all calling out to our user code: everything is being routed through this application, and it is sinking the results into Kafka. If we take a look at our checkpoints, our fault tolerance, we can see that things are going through smoothly. There was a failure, but that's okay; we handled it gracefully. Currently I'm managing about 7 gigabytes of state within the Flink cluster. Remember, this is being stored locally, either in local memory or spilling to local disk, but it is always local. We are never using persistent volumes or stateful sets; MinIO is providing all of our fault tolerance. And when it is time to make a change, maybe I want to change my replica count or deploy a new version of my user code, all I need to do is apply those values that we have for our functions. I can kubectl apply -f the manifest for our stateful functions, which has our deployment YAML and our service YAML, and this will apply those changes. In this case I haven't actually changed anything. We can
also apply a horizontal pod autoscaler. So perhaps I do not want a static set of functions, but instead want to scale as my load goes up and down throughout the day. We can do that, and we will be able to do so gracefully. And we can multiplex different function modules together. So I'm running this code; perhaps I'm on the data engineering team, and so we're in charge of building the feature vectors and maintaining that state. The data science team has their Python code, that is our model; they're going to deploy that separately and make their own updates, and we can all do that gracefully and consistently. So I really appreciate you taking the time to listen to my talk today. I hope you are excited about Stateful Functions and the future of stateful serverless applications. If you have any questions, I'm always on Twitter at @sjwiesman. Also, the Apache Flink user mailing list is the most active user mailing list of any Apache project, and it's a great place to get help. Thank you so much, and I hope you enjoy the rest of the conference.