So, yeah, I'm Paris Carbone. I'm one of the core committers of Apache Flink, if you know it; it's a very cool processing system. I'm also a PhD, but I don't know if that matters in this audience. Hopefully many of you know the system.

The truth is, I noticed recently that there's a lot of confusion when it comes to processing guarantees, and guarantees in general. You see all sorts of opinionated posts in forums, not necessarily Flink forums, more like general open-source discussions and Hacker News. That is why I created this talk: I want to help some of you folks understand what is really going on there. In case you ended up in this dev room by mistake, or you were just feeling adventurous, it doesn't matter, because I don't assume you know much. I will explain the whole idea from the start and hopefully you will get it.

All right, so I have only one introductory slide. What do data stream processors do? Basically, they get a recipe of a data processing pipeline and compile it down to an execution of that pipeline. This usually ends up as a distributed dataflow that runs continuously and covers your general data processing needs. If you're considering starting with data stream processing now, I think there is no better timing, because at the moment stream processors can offer you sub-second latency and high throughput at the same time. Late data, which you get when your data is not ordered across your infrastructure, doesn't matter, because the system can actually sort the data out and process it at the right event time; there's a nice Google paper about this. And by the way, some of these pipelines run 24/7, 365 days a year, consistently without any issues. I don't know if you appreciate how hard that is, but it's crazy, right?

So if you are in DevOps, or you work with DevOps, you probably know the pain of dealing with application updates, handling failures, adding or removing workers, and reconfiguring the system in general. It's a pain, because there are one or many applications running in the background, and you need to make sure that everything keeps running correctly and you don't lose any processing or any data. That's very annoying, and if you make any mistake it can have a heavy impact on your company.

So, failures can happen. We cannot really eliminate entropy, but in general we can use a fail-recovery model to turn back time and run our computation again. Basically, we need to save our work today so that someone tomorrow can continue it, because we don't know if we'll be around tomorrow if something bad happens. In systems theory this is called the fail-recovery model, most systems actually employ it, and they also offer some guarantees with it.

Here's a map of some of the guarantees you can see in several discussion forums. It's a bit annoying, and I would say most of them on their own say nothing. If you see someone saying "exactly once" or "at least once", it says nothing. Is it exactly-once processing? Exactly-once output? Exactly-once end-to-end? It doesn't really work on its own, so I'll try to resolve this ambiguity right away.

So when we talk about guarantees, first of all we have a system, and the system is this box. It's closed, even though inside it is distributed; proper distributed systems give you the picture of a single thing. So this is a single thing.
It's a system, and then you have the outside world. This can be a database, it can be a log, it can be for example a file system. When we're talking about guarantees, we have first of all processing guarantees. Always remember: processing guarantees have to do with the system, with what the internal state of the system looks like. When we talk about output, delivery, or end-to-end guarantees, we usually talk about the outside world: the side effects of the system's processing on the outside world.

So let's start with the processing guarantees. Why are they needed in the first place? The idea is that processing creates side effects inside the system, where your application is running. Let's say you do word count. I know it's very boring, but that's the first thing that comes to mind, unfortunately. If you do word count, your count might not be correct: it might not reflect the actual state of the outside world inside the system. This is why it matters. If you work with transactions and money, it can make a huge difference if you don't get the numbers right. So less or more processing sometimes means an incorrect internal state.

Processing guarantees usually come in three flavors. At most once means the system might process less, at least once means the system might process more, and exactly once means the system behaves as if data were processed exactly once: there's no other side effect.

I will not talk about at-most-once because it's a bit boring: you can just discard records, that's all. I'll talk about at-least-once first. This is also very simple, actually. All we usually need is a Kafka, for example, or any queue or message log that's durable and lets you go back in time and replay some input. You can also do manual logging; if you remember Storm, the first version, they were doing some manual acknowledgment logging.

This doesn't reflect any real system, it's just an illustration of what at-least-once means. We have three records: green, blue and red. When a record goes inside the system, it leaves some side effects; in this case it leaves a count. The count is one, because we processed one record. It's correct. If something goes wrong, let's say we reconfigure the system, we are able to replay this record. So if we replay it we may get a count of two as well, but there is at least a count of one, thus at least once, no pun intended. That's how it works.

Now let's go to the serious stuff: exactly-once processing. This is a bit trickier. We need to make sure that data leaves side effects only once, and any underlying mechanism that does fail recovery should not impact the actual execution of the system or the application. If we go a bit aggressively and say, okay, I can solve it, it's very simple, I will just run a transaction for every record the system processes: I can just write into a key-value store all the mutations of the state, associated with the records that caused these mutations. Then I'm able to do anything I want, I can go back in time, fetch any state in the history of the system, and so forth. This is what MillWheel actually did, and it worked pretty well for Google, because they had a very good key-value store.
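To make that "transaction per record" idea concrete, here is a toy sketch; it is only an illustration under my own naming (the class `PerRecordTransactionSketch`, the in-memory maps standing in for a durable key-value store), not MillWheel's or any real system's code. The point is that every record's effect is stored together with the record's id, so a replayed record can be recognized and dropped.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy sketch of exactly-once processing via a transaction per record:
// each record's effect on the state is committed together with the record's id,
// so a replay after a failure is detected and ignored.
public class PerRecordTransactionSketch {

    // Stand-ins for tables in a durable key-value store.
    static final Map<String, Long> counts = new HashMap<>();
    static final Set<Long> processedIds = new HashSet<>();

    // In a real system both writes below would form one atomic transaction.
    static void process(long recordId, String word) {
        if (!processedIds.add(recordId)) {
            return; // record was replayed after a failure: drop it
        }
        counts.merge(word, 1L, Long::sum);
    }

    public static void main(String[] args) {
        process(1, "green");
        process(2, "blue");
        process(1, "green"); // replay of record 1
        System.out.println(counts); // green=1, blue=1 -- no double count
    }
}
```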
But apparently this doesn't work everywhere. Usually people don't have finely tuned key-value stores, perfect stores that can deal with high write and append congestion. So this is not always the best option; it's very aggressive, I would say. Perhaps we can do better, and the problem should actually be simpler to solve.

First, as I said earlier, input can usually be rolled back; in fact, durable logs make sure you can always achieve this. The process looks similar to cassettes. If you are of my generation, you probably remember how they work: a cassette is a durable log, actually. It stores an input stream, let's say sound or video, lets you play it, and at some point lets you roll back to a specific song, or offset. Durable logs today, like Kafka, do this for multiple cassettes: they can play a lot of distributed streams and then roll back to a specific set of offsets. This is very cool. It means we can rewind multiple stream partitions in parallel, in a very consistent way.

How can we use that? Since we can truly rewind streams, maybe we can do the following. Let's say we split the stream into coarse-grained parts; that's why I call it coarse-grained fault tolerance, and this is what most modern data stream processing systems actually do, in different flavors. So we split the stream into parts, and the only thing we need to do now is to process these parts, and whenever something goes wrong, revert back to where the previous part ended. For that we need to be able to rewind the input, which we can do, we have the logs, but we also need to be able to capture the global state of the system after processing all the records of a part, so we can roll everything back. We can do this in different ways.

This might seem familiar; Spark people, you might feel a bit of deja vu. This is the micro-batch model. It does the same thing in a discrete way. There is a planner that prepares part one, it sends the records, the records are flushed into the system and processed, they create some side effects, the side effects are stored as the state of the system in a state store, and then you can schedule part two, retrieve the old state, continue from there, flush all the inputs of part two, and create the side effects of part two. This all happens in a very transactional way. This is how RDD processing works, and the new Structured Streaming does something similar; there is a state store, actually, it's literally called a state store. This is a nice approach, I'm not saying it's bad. It's a fine example of discretely emulating continuous processing as a series of transactions. It's very safe and simple to understand, but there are some side effects. When you build a system like this, you need to make sure the user writes code that works on successive batches, and that affects the API. Another side effect is that there's a fairly high periodic scheduling latency. Actually, I don't know if some of you know Drizzle. Drizzle is a research project at Berkeley. They made a study that says, okay, we can actually reduce the scheduling latency by pre-scheduling many batches, so that we amortize the scheduling cost. This is true, but it comes at the cost of higher reconfiguration latency, so there's a trade-off there.
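Before moving on, here is a toy sketch of what "discretely emulating continuous processing as a series of transactions" means; the class name `MicroBatchSketch`, the in-memory log and the part size are all stand-ins of my own, not Spark's actual code. Each iteration reads one coarse-grained part from a replayable log, applies it to a working copy of the state, and commits the new state together with the new input offset; after a failure you simply restart from the last committed pair.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of coarse-grained fault tolerance: process the stream in parts
// and commit (state, input offset) transactionally after each part.
public class MicroBatchSketch {

    // Stand-in for a durable, replayable input log (e.g. a Kafka partition).
    static final List<String> LOG = List.of("green", "blue", "red", "green", "blue");

    // Stand-in for the "state store": last committed offset plus the counts.
    static long committedOffset = 0;
    static Map<String, Long> committedCounts = new HashMap<>();

    public static void main(String[] args) {
        final int partSize = 2;
        while (committedOffset < LOG.size()) {
            // Work on a copy so a failure mid-part leaves the committed state untouched.
            Map<String, Long> working = new HashMap<>(committedCounts);
            long end = Math.min(committedOffset + partSize, LOG.size());
            for (long i = committedOffset; i < end; i++) {
                working.merge(LOG.get((int) i), 1L, Long::sum);
            }
            // "Transaction": commit the new state together with the new offset.
            committedCounts = working;
            committedOffset = end;
        }
        System.out.println(committedCounts); // green=2, blue=2, red=1
    }
}
```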
I would go with something like long-running. I like long-running systems because there's no need to reconfigure: you schedule the thing once and just let it run, and if you need to reconfigure, you just reconfigure.

That's what happens with this solution, for example. Let's say we want to take a snapshot while each part is being processed. When all the green records have gone in, we halt the execution: we say stop everything, stop the channels, stop the processing, take a snapshot. In this case the snapshot contains the state plus some in-transit events, because those are part of the system state, they are inside the box. If we recover from there, we need to replay those events. This is one okay approach, but it has a problem: we stop the execution, and that's not continuous processing anymore.

So I would say we can do something better. We need to do two things: not enforce discrete processing in the API, and not disrupt the execution with any underlying mechanism. Also, I don't like these in-transit events, but we can deal with them later.

Some of you might know Leslie Lamport; he's the father of distributed systems, or arguably so, I don't know. The thing is, Lamport wrote about distributed snapshots in his classic paper, and he said that the global state detection algorithm should be superimposed on the underlying computation and not alter its execution. This really inspired us to come up with this technique in Flink, which basically snapshots the state just in time while the execution is running, and this is why I call it long-running pipeline state management.

The idea is very simple. We insert some markers; a marker actually signifies a promise that whatever comes after it belongs to the next part. So basically we need to take a snapshot whenever we reach this marker, and the marker travels along the normal data pipeline. If we flush the records along with the markers, what happens is that the markers arrive at some operators inside the system, and those operators can then take their partial snapshot. This is not a complete snapshot, it's a part of the global snapshot. Then, operators with multiple inputs have to make sure they have processed all the records of part one. What we do is prioritize the channels where we haven't received any barriers (markers) yet, so we make sure we process all the green records first, and then we take the snapshot. Then we broadcast the barriers downstream, and eventually we have a complete snapshot that is pipelined along with the data. And as you see, there are no records in transit in it. That's pretty cool, it's like magic.

So, some facts. This algorithm pipelines naturally with the dataflow, it respects backpressure and all these things. We can actually get at-least-once processing guarantees by dropping the alignment, the part where we prioritize channels; if you want, you can try it on paper, it's homework. The algorithm basically tailors the original Chandy-Lamport algorithm to create a minimal snapshot state, so we don't need to store events in transit. And it can also work with cycles, which I'll show next.
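In Flink terms, this barrier mechanism is what you switch on when you enable checkpointing. Roughly, against the Flink streaming API of that era, it looks like the sketch below; the interval and mode values are just example choices, not recommendations:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingConfig {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Inject a barrier (marker) into the sources every 10 seconds.
        env.enableCheckpointing(10_000);

        // EXACTLY_ONCE aligns the barriers as described above;
        // AT_LEAST_ONCE drops the alignment and only tracks the markers.
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        // ... define sources, transformations and sinks, then env.execute(...) ...
    }
}
```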
With cycles we do the following. This is still in a pull request that has been open for a year, sorry about that, but it's correct. The problem is that there are a lot of records inside the loop, just going around indefinitely, which means that if we run the same algorithm it will never terminate. So what we actually do is create an upstream block there and log everything that is inside the loop as part of the snapshot. We just replay it after we recover, and then we have a global state that is correct.

Okay, output guarantees. This is a topic of very high debate, so I'll try not to piss off anyone; that's why I recommend always giving a very diplomatic answer to the question. These are three possible answers to it. Can we have output guarantees, can we guarantee exactly-once output in general? Which one would you pick? Yeah, the right answer is "it depends", I think everybody can agree on that.

There are many ways to deal with this problem. We're talking about the outside world, and the outside world can be anything: a database, a database with versioning support, a file system that can roll back, a file system that cannot roll back. It depends. So in a system like Flink we have special sinks that give you exactly-once output guarantees. You probably know the concept of idempotency; that's also how Spark Structured Streaming provides its exactly-once output guarantees. It's very simple: it's a property which guarantees that no matter how many times you run something, you get the same output, which means you write the same thing into the database. We have a Flink sink that does this, and also another one that uses HDFS rolling files and truncate to write transactionally. Basically it writes files ahead into buckets, and these respect the snapshot parts: whenever a snapshot is complete, it marks that part as committed, which means it can be read; otherwise it rolls it back. It's very simple, actually.

So, okay, no design flaws, right? We have everything, I guess. But remember, there is a job manager there, and that means it can fail, probably? Well, not really, because we support high availability, which means we can have multiple instances of the job manager running in passive standby mode and retrieve the active state of a failed job manager whenever this happens. And of course ZooKeeper provides leader election, atomic writes and so on, so all the metadata associated with the active jobs running in your pipeline is there.

So that's more or less it, actually, if you start using Flink today, where we have a 1.2 release coming up. There are some cool features, like key-space partitioning and job rescaling from snapshots. There are state snapshots in RocksDB where the operator doesn't need to wait until the snapshot is complete: it can just trigger the snapshot and continue, and RocksDB reports when the snapshot is done. There are also new state structures, which are pretty cool: append-only state that serializes data and writes it straight ahead into RocksDB, which speeds up checkpointing a lot. And if you have mutable state, you would use value state or reducing state, and there's also map state coming up.
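To illustrate the keyed state abstractions just mentioned, here is roughly what a per-key counter kept in Flink's ValueState looks like; it's a sketch against the streaming API of that era (the class name `CountPerWord` and the state name `"count"` are my own), intended to be applied after a `keyBy` on the word. Because the count lives in Flink-managed state, it is included in the snapshots taken by the barrier algorithm.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// A keyed word counter whose count is stored in Flink-managed ValueState,
// so it is captured by checkpoints and restored on recovery.
public class CountPerWord extends RichFlatMapFunction<String, Tuple2<String, Long>> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(String word, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value();
        long updated = (current == null ? 0L : current) + 1L;
        count.update(updated);
        out.collect(Tuple2.of(word, updated));
    }
}
```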
We also have externalized checkpoints now. There are two ways to do checkpoints. You can do ad hoc checkpoints in Flink, saying: create a global state of the system now, I want to use it sometime later. There's another way, saying: do it periodically, but also let me cherry-pick which checkpoint to restore from, and that's externalized checkpoints. They are both very useful features, and they can run concurrently at the same time, which is very cool.

Coming up next we have auto-scaling, incremental snapshots and durable iterative processing. This is work we are doing, it's in research right now, but it will be very cool: if you want to do structured iterations on streams, you will be able to do it. Marius is one of my colleagues working on that.

And some acknowledgments: these are some core people who worked a lot, spent a lot of time fixing everything and creating all these backends, the cool APIs, and all sorts of other things.

Yeah, if you have any questions, now is the time.

[Question: the word count, is that for the output or for the input side?]

So what happens is that Kafka does this, for example; I mean, the Kafka sources do that handshake.

[Question: how does that work?]

Okay, so the question is how the input processing handshake is possible. There is a FIFO guarantee: if you use one of these durable sources, these logs, there is a FIFO processing guarantee. This is implemented already by the people at Kafka and so on. Now, if you use something like a socket and you try to do the same thing, yeah, you will have this problem; you have to implement your own protocol to ensure, let's say, FIFO channels. So yeah, it's something that involves the outside world.
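For reference, consuming from one of these durable logs in Flink looked roughly like the sketch below at the time; the connector class names (`FlinkKafkaConsumer09`, `SimpleStringSchema`) and the topic, group and host values are illustrative and have moved around between versions. The point is that the source stores its partition offsets in the snapshots, so recovery rewinds the "cassette" to a consistent set of offsets.

```java
import java.util.Properties;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class DurableSourceExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "wordcount");

        // The consumer's partition offsets become part of Flink's snapshots,
        // so after a failure the input is replayed from the snapshotted offsets.
        env.addSource(new FlinkKafkaConsumer09<>("words", new SimpleStringSchema(), props))
           .print();

        env.execute("durable source example");
    }
}
```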