Hello everyone, thank you for coming to my session today. I hope you can hear me and see my slides fine. Thanks for taking the time to attend this talk. It's the middle of the day in Japan, I think, but I'm in Berlin, Germany, where it's 5 a.m., so I can't promise that my brain will be fully functioning. I'm Marta, a developer advocate at Ververica, and today I want to give you an introduction to stream processing and Apache Flink. I tried to make this talk accessible so that you can follow it regardless of your background, whether you've heard about Flink before or not. I put a little poll here on the side so you can tell me if it's your first time hearing about Flink, if you're maybe already using it, or if you've just heard of it before; if you can click that and let me know, that'd be pretty cool. For those of you who have never heard of Ververica: it's the company of the people who created Apache Flink. This means that every day I get to work with the core maintainers of Flink and be involved in a really active and growing open source community. Besides working on Flink, Ververica offers an enterprise version of Flink called Ververica Platform, which really just streamlines a lot of the operational side of deploying and maintaining stream processing applications. Since the beginning of last year, the company has been part of Alibaba, who is one of the biggest users but also one of the biggest contributors to the open source project.

I'll start with a really quick recap that introduces you to stream processing, in case you're not familiar with the concept or come from a different background. If you have been working with data for some time, this scenario will probably be familiar to you. Not that long ago, this is what analytics used to look like: you had a bunch of transactional databases, maybe some static sources of data as well, and you would run ETL processes to integrate and combine all this data into something useful for your business. Then you would store all of this away in a data warehouse, in nicely structured tables that served some business process. Maybe you already had something like a data lake, where you just put the raw source data and used that instead, but this is basically the picture of how analytics looked, at least for me, very recently in my previous job. And so your quest for data would look something like this: you would run some really long nightly jobs, and after some hours you would have your results. But in reality, what happens is that one or more of these nightly jobs would fail, because a process ran out of memory or because someone decided to use single quotes instead of double quotes in their input. Then someone would have to wake up, fix the problem, and rerun the jobs; a lot of the time, that person was me. And someone, probably a stakeholder, would complain that their data was late. After all of this, you'd get your results, but they'd be late. In the end, if you look at it, most of the source data you're using for all these processes is continuously produced, so it doesn't really make sense that someone is waiting for yesterday's data, or that I am waking up in the middle of the night.
Because most of the logic that you use to run these ETL processes doesn't really change; it's not something that is evolving every day. What is evolving and changing all the time is the data that you process. So something we can't really escape is this idea that nowadays everything is a stream. What used to be your static batch data are now events that are continuously produced, and you should also be able to continuously process them. You have a set of event sources that can be anything these days, from connected devices and vehicles to web clicks, application logs, and financial transactions, you name it. And all of them are continuously producing events that, at some point, end up in a centralized distributed log, something like Kafka, Pulsar, or Kinesis. Over this sequence of events, you likely want to run a set of transformations or computations to mold the data into something that is useful for you. So you might want to do things like filtering out garbage, correlating events, or just doing aggregations over time. Maybe somewhere in the process you want to persist some intermediate results to long-term storage like S3 or HDFS, and in the end you publish your output to a sink. The time it takes for data to go from the event sources to your sinks might be more or less critical depending on your use case, but you certainly don't want to wait a whole day, or sometimes more, to get your results.

So, in the simplest terms I could find to explain this: stream processing is really what I just described. You have an infinite data set that is continuously flowing, called a data stream, and you have your code that performs whatever transformations you want on this data. You process this data stream event by event: your code applies the transformations and then outputs the transformed events downstream. Things start getting a bit more interesting when you want to do something that is more than just stateless operations (operations like mapping or filtering are stateless). This is when you enter the world of stateful stream processing. Here you have the same simple model as before, with your input data, your transformations, and your output, but now you have this concept of memory, or the ability to remember events as they flow through your code. This memory, or state, is exactly the real challenge of stateful stream processing: keeping track of what you have processed to influence what you're going to process in the future. You not only have to manage state distributed across multiple machines, but you also have to make sure that it doesn't just vanish when you have some kind of machine failure.

So, what is Apache Flink, the whole reason why we're here today? Flink is an open source framework and distributed engine that allows you to do exactly what I was explaining before: stateful stream processing. Flink can continuously consume data from whatever sources you want to plug it into. It applies stateful computations on these data streams, and as it processes them, it builds up context, keeping track of the state as it goes. Then it produces some output, and this could be anything: an API call, updates to a database, other data streams.
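To make this notion of "memory" a bit more concrete, here is a minimal sketch of a keyed, stateful operator written against Flink's Java API. The class, names, and the delta-computing logic are my own illustration, not something from the talk: it remembers the previous reading per key and emits the change since then.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// For each key (say, a sensor ID), remember the previous reading and emit
// the change since then. The ValueState is the "memory": Flink manages it,
// partitions it by key, and keeps it safe across failures.
public class DeltaFunction extends KeyedProcessFunction<String, Double, Double> {

    private transient ValueState<Double> lastReading;

    @Override
    public void open(Configuration parameters) {
        lastReading = getRuntimeContext().getState(
                new ValueStateDescriptor<>("last-reading", Double.class));
    }

    @Override
    public void processElement(Double reading, Context ctx, Collector<Double> out)
            throws Exception {
        Double previous = lastReading.value();   // read this key's state
        if (previous != null) {
            out.collect(reading - previous);     // emit the delta
        }
        lastReading.update(reading);             // remember the current event
    }
}
```

You would apply it with something like `stream.keyBy(...).process(new DeltaFunction())`; the point is that the state lives inside the framework, not in your own ad-hoc data structures.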
And what makes Flink really powerful, and what differentiates it from other stream processors, is the way it handles this distributed state. What makes it really flexible is that it's able to do this one-at-a-time event processing consistently. And because it is such a wide paradigm and such a flexible framework, this gives you a very solid foundation to address a wide range of use cases. The use cases that we see companies using Flink for fall into roughly three categories. At the core, you have classical stream processing use cases. These are use cases that really build on the core primitives of Flink, namely events, state, and time, where data or platform engineers are really trying to max out Flink to do complex or heavy computations with a lot of logic customization. The goal here is to maximize the performance and reliability of the systems you're building. Some examples: there are a lot of companies out there using Flink to build their core data infrastructure. One example is Netflix. They're using Flink as the basis for their internal data platform, called Keystone, and they're processing, I think, around three billion events per day. So large-scale data pipelines are a very common use case. Then, for example, you have Fujitsu, which has built a real-time IoT data platform, for example to process data from autonomous vehicles. Another example is AWS, which is using Flink for log analysis to monitor and detect anomalies in their clusters.

Then, on the other side, the rising use cases are in streaming analytics and machine learning. Here, in contrast to the core infrastructure use cases I was describing before, Flink is used a bit more at a high level and in domain-specific situations that can be easily modeled with something like SQL or Python and simple abstractions like tables. The focus is not so much on implementation details, but more on quickly being able to build the logic to meet business requirements. These are also use cases where you might have mixed batch and streaming workloads. I didn't mention this before, but Flink is also able to do efficient batch processing: it's a streaming-first framework that can also do batch efficiently. So here you might want to mix batch and streaming workloads, for example for things like backfilling historical data. The goal, again, is to maximize developer speed and autonomy, basically making users more independent in their data needs. This is made possible by using, like I said before, rapid prototyping languages like SQL and Python, which still allow you some degree of freedom, like writing your own user-defined functions or integrating with useful tools like notebooks or machine learning libraries. Some examples here on the streaming analytics and machine learning side: you have Weibo, a really big social network in China, which is using Flink to build unified pipelines for online and offline model training.
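Before more examples, here is a small sketch, mine rather than the speaker's, of what this higher-level, SQL-style use of Flink looks like in practice. The table, field names, and the use of the built-in `datagen` connector (standing in for a real Kafka topic) are all assumptions for illustration:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SqlSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // A built-in data generator stands in for a real source here.
        tEnv.executeSql(
                "CREATE TABLE readings (sensor_id INT, temp DOUBLE) " +
                "WITH ('connector' = 'datagen', 'rows-per-second' = '5')");

        // Flink compiles this continuous query down to the same streaming
        // runtime as a hand-written program; no state management in sight.
        tEnv.executeSql(
                "SELECT sensor_id, AVG(temp) AS avg_temp " +
                "FROM readings GROUP BY sensor_id")
            .print();
    }
}
```

The user writes a few lines of SQL; the partitioned, fault-tolerant aggregation state behind that `GROUP BY` is entirely Flink's problem.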
Uber also has an internal data platform that allows users to build end-to-end streaming analytics pipelines: users can just submit SQL statements, and the platform Uber built compiles everything down to Flink jobs, so users are able to define everything in SQL without writing any code. And in the same category, another company has created a platform that makes it really easy to generate features for machine learning model training. On the other side of the spectrum, there are also event-driven applications. I'm not going into a lot of detail here, because it's a bit of a new field for Flink and it might be confusing if you've never heard about Flink before. But if you are interested in stateful serverless and all of that universe, I dropped some links in here and you can check out this new API called Stateful Functions.

With this really wide range of use cases that I showed you, Flink is powering a lot of the largest companies in the world, and it serves very diverse industry verticals, anything from entertainment to IoT. To give you an idea of the scale Flink can go to: the biggest production use case that we know of is really what Alibaba is doing on 11/11, Singles' Day. On this day, Flink is backing most of Alibaba's real-time data applications, like search and recommendations, advertisements, and even the huge GMV dashboard that you see all over the media; Flink is crunching all the data to actually produce those numbers. So that you have a specific idea of the numbers here: they run Flink on thousands and thousands of machines, and at peak they are processing four billion events per second. And they do all of this with sub-second latency, including updates to the feature vectors that go into the recommendation system. So this is really Flink maxed out. We always say it's Flink at Alibaba scale, because it's the biggest-scale use case that we know of.

But this doesn't mean that you can only use Flink for huge production setups; you can also go small. At one of our conferences recently, a company called Uhopper showed how you can run Flink simply on a cluster of five Raspberry Pis to process and aggregate real-world IoT data from connected vehicles. So this is really a big contrast: in the use case I just talked about you have thousands of machines running Flink, and here you have Flink running on a mini cluster of Raspberry Pis, so it also gives you this flexibility in use cases. And you can just use your laptop, you know: fire it up, open an IDE, and then just run or debug your Flink applications locally, as the little sketch below shows.
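A minimal illustration of that last point (this snippet is mine, not from the talk): the same entry point that runs on a cluster falls back to an embedded mini-cluster when you launch it from an IDE, so a toy pipeline like this runs on your laptop as-is.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LocalSketch {
    public static void main(String[] args) throws Exception {
        // Outside a cluster, getExecutionEnvironment() gives you an
        // embedded local environment, so this runs directly from the IDE.
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(1, 2, 3)
           .map(i -> i * 2)   // any transformation works the same locally
           .print();

        env.execute("local-sketch");
    }
}
```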
Now I'd like to talk about what really makes all of this possible under the hood: what enables all these use cases, and what allows Flink to go to the scale that Alibaba takes it to, for example. So, what makes Flink, Flink? It's basically a combination of four different things that relate to how Flink was engineered from the beginning: it has a set of really flexible APIs; it allows you to do stateful processing; it's really optimized and built from the ground up for high performance; and it has fault tolerance mechanisms that allow you to achieve the highest degree of consistency, even when you have machine failures.

I would also consider the community to be something that makes Flink what it is, because it's a really, really active open source community. From Flink release to Flink release we see the number of contributors growing and growing, and it is, I think, one of the most active projects in the Apache Software Foundation. So I would say that the community is also a really fundamental part of Flink.

So now I'm going to try to give you an overview of how all of this is achieved, hopefully without getting too much into the nitty-gritty. Starting off with the APIs: Flink has a layered structure with different APIs that trade off how easy they are to use against how expressive you can get with them. At a higher level you have APIs like SQL, the Table API, and PyFlink, which are closer to a relational way of thinking about data. These APIs allow you to quickly express problems in a very concise way and let Flink do all the operational heavy lifting for you, so you don't have to worry about state management and all these things. You can use languages that are familiar to you and that are a bit more domain-specific, or a bit quicker and more immediate to use, like SQL and Python. As you go down the API stack, using the APIs starts getting a bit more complex, but you also get more and more control over the programs you are implementing, and over how they are executed. And when you reach the core building blocks of Flink, where you're really working at the level of process functions, which are the smallest unit in Flink if you want to call them that, you're dealing with state and time and have fine-grained control over all of this; you really can do anything with it. It's pretty straightforward to look at this API stack and think about where the categories of use cases I mentioned before fall. And the good news is that you don't really have to choose one API: all the APIs are integrated with each other, so you can mix and match them. You can, for example, invoke a Python user-defined function in a Flink SQL query, or you can convert a table into a stream and vice versa. So there's a lot of interplay here to really fit whatever use case and whatever level of abstraction you need.

At the core of all of it, no matter what API you choose to work with, everything boils down to the very simple model that we started with: your code defines where to consume the data streams from, which here is your source, what transformations to apply, and where to sink the results. In this case I'm using the DataStream API, one of the lower-level APIs, to write a very simple Java program that consumes temperature sensor data, in this case from Kafka, though it could be anything. And this is how you build a Flink program: you add a Kafka consumer as a source; then you apply some transformations, first a map operation that converts the temperature from Fahrenheit to Celsius; then you key your events by sensor ID; then you collect these events in a time window of five seconds; and then you calculate the average temperature and send all of this to an Elasticsearch sink. So what we are basically doing is taking an incoming stream of data from sensors and, every five seconds, calculating the average temperature for each sensor and outputting it.
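The slide with the code isn't visible in this transcript, so here is a rough reconstruction of the program just described, written against the DataStream API. Treat it as a sketch: the topic name, record format ("sensorId,tempFahrenheit" strings), and broker address are my assumptions, and it prints results instead of writing to Elasticsearch.

```java
import java.util.Properties;

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class AverageSensorTemperature {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker

        env.addSource(new FlinkKafkaConsumer<>(
                    "sensor-readings", new SimpleStringSchema(), props))
           // parse "sensorId,tempFahrenheit" records and convert to Celsius
           .map(line -> {
               String[] fields = line.split(",");
               return Tuple2.of(fields[0],
                       (Double.parseDouble(fields[1]) - 32) / 1.8);
           })
           .returns(Types.TUPLE(Types.STRING, Types.DOUBLE))
           .keyBy(reading -> reading.f0)                              // key by sensor ID
           .window(TumblingProcessingTimeWindows.of(Time.seconds(5))) // 5s windows
           .aggregate(new Average())      // average per sensor and window
           .print();                      // stand-in for the Elasticsearch sink

        env.execute("average-sensor-temperature");
    }

    // Incrementally tracks (sum, count) per window and emits sum / count.
    static class Average implements
            AggregateFunction<Tuple2<String, Double>, Tuple2<Double, Long>, Double> {
        public Tuple2<Double, Long> createAccumulator() { return Tuple2.of(0.0, 0L); }
        public Tuple2<Double, Long> add(Tuple2<String, Double> v, Tuple2<Double, Long> acc) {
            return Tuple2.of(acc.f0 + v.f1, acc.f1 + 1);
        }
        public Double getResult(Tuple2<Double, Long> acc) { return acc.f0 / acc.f1; }
        public Tuple2<Double, Long> merge(Tuple2<Double, Long> a, Tuple2<Double, Long> b) {
            return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
        }
    }
}
```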
So this is how you would write a streaming application with Flink. Underneath, this all gets converted into a logical representation of operators: your streaming dataflow. No matter what API you use, even if you use Flink SQL and are just writing pure SQL statements, in the end Flink will compile everything down into this streaming dataflow. And you don't really just run this on one machine; the whole point of using Flink is that you want to run this distributed across multiple machines, segmenting the work. What Flink does is take care of distributing the workload across all the machines, including sharding your state, so that each group of keys, like each sensor or each group of sensors in this case, is processed on a different instance.

In our case we have a window operator, which is stateful, if you remember what I said before about remembering events. What we are doing in the small code snippet I showed before is collecting events for five seconds before we trigger our average computation, so for every five-second window, Flink needs to keep track of what comes in. And Flink always stores the state locally on the instance that is processing the data: this is done either in memory on the JVM heap, or on disk in an embedded key-value store called RocksDB that is just embedded into Flink. This means that state access for your computations is always super fast, at either memory or disk speed.

And, like I mentioned before, one thing that you really don't want is to lose all this state if something fails. Flink allows you to make sure that your applications can survive any kind of failure or downtime and still produce correct and consistent results. The way Flink makes sure this happens, that your state is fault tolerant, is by taking periodic snapshots of this application state and writing these snapshots to persistent storage, like S3, HDFS, or another blob store from your cloud provider. This is done asynchronously, which means that Flink is backing up your state while it still continues to process data during the snapshotting. In this case, because we are using Kafka as a source, which is durable but also replayable, the snapshot of state will include not just your window operator's state, but also the offsets, the positions in the input stream that you are consuming. So when something goes wrong, like if you lose a worker or your job gets cancelled, Flink just automatically recovers all the embedded state from the most recent snapshot. And because here we are using a replayable source, it also resets the positions in the input stream, so you can continue processing your data like nothing happened. You can still achieve the highest level of consistency: exactly-once processing. And exactly-once here doesn't really mean that the events are processed only once; it means that even if they are processed multiple times, they only affect your application state once. Something more that Flink offers on top of this snapshotting mechanism is the possibility to trigger these snapshots manually, for whenever you need to do planned, manual backups of your application.
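As a minimal sketch of how this is wired up in code (the checkpoint interval and the bucket path are placeholders of mine; using RocksDB also requires the flink-statebackend-rocksdb dependency):

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a consistent snapshot of all operator state every 10 seconds,
        // with exactly-once guarantees for how events affect that state.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Keep working state in the embedded RocksDB store and write the
        // snapshots to durable storage (the bucket path is a placeholder).
        env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints"));

        // ... sources, transformations, and sinks as before ...
    }
}
```

The manual snapshots mentioned above are called savepoints, and they are typically triggered from outside the job, for example with the CLI: `./bin/flink savepoint <jobId>` to take one, and `./bin/flink run -s <savepointPath> ...` to restart a job from it.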
These manual snapshots allow you to handle planned downtime, like when you want to upgrade your Flink version or migrate to a new cluster but don't really want to lose your state, or when you want to make changes to your code, for example increasing the parallelism of your job, and restart the processing only when you're ready. So with this snapshotting mechanism, you can always recover and restore your application state and resume processing as if the application had never been down.

The last thing that I want to mention here is the way that Flink handles time, because it's also important for understanding this consistency story. In Flink you have support for two different notions of time: event time on one side, and processing time on the other. The easiest way to explain the difference between these two is to look at the Star Wars movies: the order in which the movies were released is not the same order in which the events actually happened in the story's timeline. In Flink, choosing between event time and processing time for your application mostly affects the latency with which you're able to process your events, and also the correctness of your results. If you want to process your events exactly in the order that they happened in the real world, you can configure Flink to use event time, and this guarantees that your results are deterministic, so always the same. In our little sensor data processing use case, for example, this would mean that even if a sensor was down for some time, let's say 10 or 30 minutes, Flink would still be able to attribute its events to the correct time. With event time, Flink gives you tooling that allows you to really reason about and handle out-of-order or even late events, and you can always choose what trade-off you want to make between result completeness and latency. On the other hand, if you only care about speed and you don't really care about how correct your results are, you can also configure Flink to just use processing time.
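In code, switching to event time means telling Flink where each event's timestamp lives and how much out-of-orderness to tolerate. A minimal sketch, with the (sensorId, epoch-millis) tuples, the 30-second bound, and the inline elements all being illustrative assumptions of mine:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventTimeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // (sensorId, epoch-millis timestamp) pairs; a stand-in for a real source
        DataStream<Tuple2<String, Long>> readings = env.fromElements(
                Tuple2.of("sensor-1", 1_000L), Tuple2.of("sensor-1", 4_000L));

        // Use event time: take timestamps from the events themselves and
        // tolerate events arriving up to 30 seconds out of order.
        readings.assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<Tuple2<String, Long>>forBoundedOutOfOrderness(
                                        Duration.ofSeconds(30))
                                .withTimestampAssigner((event, previous) -> event.f1))
                .print();

        env.execute("event-time-sketch");
    }
}
```

The `Duration` here is exactly the completeness-versus-latency knob mentioned above: a larger bound waits longer for stragglers, a smaller one produces results sooner.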
And just to recap, what makes Flink different from other stream processors is this combination of characteristics. On one side, it gives you flexible APIs that allow you to choose between ease of use and expressiveness and cover a really wide range of use cases and scales. It treats state as a first-class citizen. It has rich time semantics that mean you don't have to choose between correctness and completeness, and that also allow you to reprocess historical data consistently. It's optimized for high performance, with local state access that allows you to perform computations at in-memory speed and achieve high throughput with really low latency. And lastly, it ensures that your applications are fault tolerant and can handle failures with the highest level of consistency, if you need it.

And yes, like I said, my intention was to give you an overview of Flink, not really to dive deep into it. So if you want to know more about Flink, or if you're interested in trying it out, I'm leaving here some links that you can use to do that, depending on your background and whatever programming language you prefer. If you're a Java or Scala person, you can start with the self-paced training in the documentation. If, on the other hand, you just want to write SQL and not code at all, there's a really good GitHub repository that has a Flink SQL walkthrough. And if you're a Python person, the documentation also has a really good PyFlink tutorial. You can also get started with Apache Zeppelin notebooks; there are some guides out there that make it really, really easy to write your first Flink application. Other than that, you can visit flink.apache.org, and you can subscribe to the user mailing list if you need help, or just use Stack Overflow. The community is really responsive on both, and you will usually get an answer from a maintainer or from someone else in the community considerably fast. If you want to keep up with what's going on in Flink, for example we are about to have a new release, Flink 1.12, the best way to stay up to date is by following Apache Flink on Twitter. Another way that you can get started is to just use Ververica Platform Community Edition. It's pretty easy to set up, it's free forever, and you don't have a limit on the size of the applications that you can build with it. It also recently introduced support for Flink SQL, so it has a nice editor, a nice interface, where you can just write SQL statements and submit jobs to a Flink cluster.

Yeah, and that's it. Thank you so much. I will take questions if there are any. There's still time, so I will give it a couple of minutes, if you want to ask a question. Otherwise, you can always find me on the Open Source Summit Slack, or you can follow me on Twitter, send me a DM, and ask away. I will also leave here the link to the slides, in case you want to check them out; the links are clickable, so just go for it. If there are no questions, I will close the session. Please feel free to reach out to me at any time on any platform; I will be glad to chat and give you some more directions on getting started with Flink, if you need them. So thank you so much, and enjoy the rest of the summit.