Thank you for the introduction, and good morning everyone. Today, as the title says, I want to talk about the evolution of the open source data processing space, and because I am from data Artisans I also want to give you a bit of the Flink view of this evolution before we start, so that you can put it into context.

I have been working at data Artisans since we founded the company. We are the original creators of Apache Flink, and we created this company so that we can develop Apache Flink in the open source and also work with companies and build products based on Apache Flink that make Flink easily usable for them. These are some of the companies that are using Flink. Some of the big names are Alibaba, which is a very big Flink user and also a contributor, and Uber; interestingly, Lyft is also on there, and some Spanish companies as well. Working with these companies we have been part of the evolution of data processing, and we also have a view of what the future developments of that space could be, what they should be, and what they mean for the development of Apache Flink. That is also what I want to talk about today.

I will talk about different systems along the way, and I might misrepresent what some of these systems do, or give you the wrong date for when they were created. This is not because I am malicious or mean them any harm; sometimes people just make mistakes, so you can come talk to me afterwards and correct me if you want.

The main question, the thread through this presentation, is: how can we process data, and what are the systems available to us for processing it? I think the need for processing data is obvious. We have all these sensors, information from users, information from all these systems that we are using, and we somehow need to process that data and make sense of it so that we can make decisions in the company or wherever we are working. This need has been there basically since people had any kind of access to data; data processing is essentially what computers were invented for.

Initially we had to write our own custom programs for processing data. You would put a program onto a punch card, then maybe write assembly code, then more purpose-built programming languages came along: Fortran, then C, then Java. Since the beginning of computing, people wrote programs that were purpose-built for a specific data processing task. But programming is actually kind of hard, and that means data analysis is not available to the somewhat larger circle of people who could express a need for it, for example data scientists or business people. They could say: I want to know the average churn rate of my users, or I want to know how many users leave my telephone service in a given day or in a season. They can express this, but it is hard for them to actually write a program for it; you need software engineers who can then write that program in these special-purpose languages.

That, I think, is why the invention of databases was so important. It happened around the 1970s, we can argue about the exact date, but that is when people worked on
these relational databases at IBM, which is also where SQL came up. For the first time, this made data processing available to a larger audience, if you will, because you can express a SQL query that reads somewhat like English. Even if you don't know SQL, you can look at a query and roughly understand what is going on, and this allows more people to get access to data processing: you no longer need programmers to write these special-case programs to analyze the data. And because SQL is somewhat standardized across the database industry, at least nowadays, in the beginning it wasn't, there are also tools that generate SQL, and that makes data processing technology accessible to a much wider range of people.

I will get back to this in a while, but another important thing is that these databases also enabled a new type of application, where you have application services running somewhere that can communicate with each other and put data in a database. This allows these business-type applications where, for example, a business event like a customer signing up kicks off some computation, puts data somewhere, sends a message to another service, which in turn does its own thing. I will go back to this later.

This is basically the prehistory before the advent of big data. We are at a conference called Big Data Spain, so obviously we are somewhat interested in that. Big data arguably started when Google released the MapReduce paper, where they described a very simple programming model: you basically write a map function and a reduce function, and then the framework can take these functions and parallelize the computation across a massive number of machines. Before MapReduce people could of course write parallel programs that use multiple computers, there were supercomputers, you could use MPI and other things, but MapReduce made this available to a wider audience and provided this framework idea. And then with Apache Hadoop, which was an open source implementation of the MapReduce idea that started at Yahoo, this became available to yet again a wider audience, because the MapReduce system that Google published about was their internal system that no one else could use, whereas Apache Hadoop is a system in the open source that people can actually use. Big companies sprang up around Hadoop: there is Cloudera, and there was, or is, Hortonworks, depending on how far along that process is.

The defining idea when you use these MapReduce-type systems, or Hadoop, is that you store your data in a distributed file system; for Hadoop this is HDFS, for example. You store all the data that is coming in, and then at some later point you can go to that data lake and start asking it questions.
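To make the MapReduce programming model itself a bit more concrete, here is a minimal word-count sketch in the style of the Hadoop MapReduce API, the classic example from the paper. The class and method names follow the Hadoop `mapreduce` API as commonly documented, so treat the details as illustrative rather than exact:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for every input line, emit (word, 1) pairs.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: the framework groups all counts by word; we only sum them up.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

The point is that the user only writes these two small functions; partitioning the input, shuffling the intermediate pairs, and running everything on many machines is the framework's job.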
So keep that in mind, that this was the model we were working with at that time. And this is history repeating itself: we had custom-made programs, then we had SQL, and then we had MapReduce and this big data stuff, and it was again a bit hard to program, because you again had to write custom programs; you couldn't use SQL. But then, of course, systems came up around this. There is Apache Hive, which provides SQL for writing queries that are executed via Apache Hadoop or, nowadays, other execution engines, you can run Hive queries via Spark and so on, and there is Pig, which is another query language for this. So here we again repeat the same history: we have a system that allows us to process data but is somewhat limited to a small group of specialists, and then SQL makes it available to a wider audience, this time on the MapReduce model, on Apache Hadoop, and you can use the same BI tools and so on.

Somewhere in there, Apache Spark also came up. I'm sure Matei can tell you a lot more about this, but Spark was a bit of a revolution because it made all of this quite a bit faster. Hadoop's MapReduce was rather limited, but Apache Spark took this idea of parallelizing things to the next level, and there you could then again do SQL and all these kinds of things.

This was basically what we nowadays, and also back then, call the batch processing way of doing things. Remember, you put your data into your storage system, and then at some point you schedule a computation, for example: calculate the average churn rate over this last month of data that I have stored, and then you get the result. Or, if you want to detect credit card fraud, you have your week or your day of data, and at night you run a SQL query or some MapReduce job over it, and this gives you some results about where there might be credit card fraud, and then you can go and sort out this fraud and see what you can do. But it is in this batch mode: your data is sitting there, and then you start a query that calculates something and then finishes.

The next step in the evolution was the rise of the stream processing systems. Again, there were research systems and other systems before that, but the first popular system that took the stream processing approach was Apache Storm. This was for writing programs that run basically 24/7: you write your program using the Storm APIs, and then you have a program sitting somewhere, processing data as it comes in, in real time, and producing results in real time. So you would immediately get a notification that there is some credit card fraud, or you immediately get a notification that some customer is at risk of leaving your company and not being a customer anymore.

Before Kafka, people were using various message queues and all kinds of systems, but Apache Kafka is an important development in the stream processing space that came out of LinkedIn, and initially it was just a system for storing a stream of messages.
Just like HDFS is a distributed file system for storing data, Apache Kafka is a system for storing streams of data. It is basically a buffer between the sources that produce the data and the systems, like Storm, that immediately consume it. The advent of Kafka made stream processing a lot more popular; nowadays arguably every company that does stream processing also uses Kafka.

A bit of a detour was this thing called the lambda architecture. People were not comfortable with, or didn't believe, that these stream processing systems like Storm were reliable enough. You had your MapReduce, your Hadoop, which was a reliable system that had fault tolerance and would produce correct results when you ran it on your daily data, while Storm could have failures, so it could maybe lose messages. There are these different semantic guarantees, exactly-once, at-least-once, and so on: maybe you process your data, maybe you don't, do you make sure that you process every event exactly once, and so forth. People were skeptical, so they used this lambda architecture, where you have a stream processing system like Storm to give you real-time insights, but then every night, or every weekend, or whenever, you still use your batch processing system to process the data, and that produces the results that you actually use. The stream processing results you don't keep for the longer term; you just get immediate insights and then throw them away, and the real results, the ground truth, are still based on the batch processing system.

This changed, arguably, around 2015, when Apache Flink became a very prominent stream processing system. I will talk later about what Flink is and what it does, but because Flink had these strong guarantees, it had exactly-once fault tolerance for stream processing and for state, it managed to convince people that you can trust a stream processing system. So you don't need this lambda architecture anymore, where you have two systems, your batch system and your stream system, you manage both of them, and you write the business logic once for the MapReduce system and once for the streaming system. Flink convinced people that you don't need to do this anymore, because now we had reliable stream processing.

I had this earlier slide where I said the batch mindset is: you store all your data, and then you can ask questions later. You store your credit card data, and later you can ask whether there was maybe some fraud in there. The big thing that reliable stream processing enabled is that you can put the questions first, and then see what happens. For this fraud use case, for example, you develop a program, you put it in place, it is running 24/7, and in real time it gets all the events and can give you a warning when there is credit card fraud; it can even prevent that fraud from actually happening. This was, in my opinion, one of the bigger shifts, because it allowed real-time insight into what is happening in a company, in a factory, in all those kinds of places.

And of course, you are probably expecting this by now: with these stream processing systems it was again kind of hard to develop programs, but then at some point we also had SQL for stream processing.
So there is Flink SQL, there is Spark Structured Streaming, which has a SQL-like language, or SQL, for stream processing, and yet again we have this repeating history where SQL makes stream processing available to a wider audience of people. There are multiple players in this space: there is Flink SQL, there is KSQL by Confluent, and others.

So now, with all this understanding of where things came from and why, I also quickly want to talk about Flink, and how Flink will potentially develop in the future.

The way I think about the processing landscape is as a spectrum. On one end you have offline processing, where your data is sitting there and you can run queries on it, and on the other end you have real-time processing, where hard real-time would be something like those high-frequency trading machines sitting in the basement of the stock exchange. On the left of that spectrum we have traditional batch processing, which is more offline. The streaming analytics and continuous processing use cases, where you just want to monitor data, continuously gather some metrics, perform some aggregations, and get some results, sit in the middle: using a stream processing system for this is very good, but you can also use a batch processing system where, for example, you schedule a batch job every ten minutes over your data, so you still get somewhat real-time results. On the far right are the event-driven applications that I mentioned briefly earlier, where you have some business event that triggers some other thing, and maybe you want to set a timeout: if such and such doesn't happen, then I want to do this in the future. For these event-driven applications you really need a stream processing system such as Flink.

Talking about Flink: what are the things that you have in a stream processing system, or in a processing framework in general? In my opinion there are three parts. You have the engine, which is the thing that makes sure you can execute programs and that manages the different machines; if you have a cluster of several machines, the engine is what makes sure that the communication between those machines works and that, for example, you have a good network shuffle between them. For writing those stream programs, or batch jobs, there are the APIs. In MapReduce the API was just that you had a map function and a reduce function; nowadays we have these more expressive APIs in Flink and Spark, where you can chain together multiple calls like map, flatMap, reduce, join, window, and so on. SQL is of course also an API: you use it to specify a query that you want to run, and then the engine takes it and executes it. And the third part is connectors, because if you have this big system that is super good, and you have nice APIs for writing these programs, but you can't get data from the outside world, you can't read your files in S3, you can't read from Kafka or Kinesis or Pub/Sub or all these other systems, and you can't write data to the systems where you want to publish your results, some InfluxDB or some Elasticsearch, then it is pretty useless to have the system.
So that is the third part of a stream processing system.

For Flink, the engine is basically how you run things. Deployment: if you have ten machines, you can run Flink on those ten machines directly, call it bare metal, or if you have a cluster management framework like YARN, which is part of Hadoop, you can use it to deploy the Flink worker nodes, we call them task managers, on those machines, and then they sit there and can accept queries. There are also other systems like Mesos, or Kubernetes, which is becoming very popular these days, for running those worker nodes.

The important building blocks that the Flink engine provides are, first, a very fast network shuffle, because for processing these streams of data in real time you need to send them across different machines, and maybe you need to partition them by some key or some partitioning function. The most important building blocks, though, are state and timers. If you need to monitor temperatures from a sensor, for example, and compute the average, the thing that you keep on those machines is what we call state: you have a running count of something, or a running sum, and from that you can compute the average. Or, for these event-driven applications, some event comes in and you need to store in that state that this event arrived, and then at a future time, when another event comes in, you need to check which events you have already seen. That is why we need state on these machines. The other building block is timers, which are basically a way of scheduling computation in the future. For example, there might be an event that says a customer started to shop on my shopping portal, and then I set a timer for maybe two hours in the future. When that timer fires, I can check whether that shopping session was successful or not, and then, based on the events I have in state, I can compute some insight into why that session was perhaps not successful. So these are very important building blocks, state and timers.

And if you have state and timers, say on your ten machines sitting in your cluster somewhere, then the system needs to guarantee that this state is not lost when accidents happen: when these machines fail, or when there is some programming error and the machines go down, you don't want to lose that data. So the system needs to provide fault tolerance, and here you again have the semantic guarantees that I mentioned. You can have at-most-once processing, where you process an event at most once, which is not great because you might lose events; you have at-least-once, where you process each event at least once but potentially more than once; and then there is exactly-once, where the system makes sure that, to the outside world, it looks as if each event was processed exactly once, even in the case of failure. When failures happen, Flink, for example, can restore from a backup of that state and continue processing, and it will look as if the failure never happened.

The second component of a system is the APIs. Flink has the DataSet API, which is for batch processing.
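Before going through the other APIs, here is a minimal sketch of the state-and-timers idea, the shopping-session example from above, written against Flink's DataStream API. The event type, its fields, and the two-hour timeout are made up for illustration; the state and timer calls are the usual KeyedProcessFunction ones, so treat this as a sketch rather than production code:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// A hypothetical shop event; in a real job this would be your own event type.
class ShopEvent {
    enum Kind { SESSION_START, PURCHASE, OTHER }
    Kind kind;
    boolean isSessionStart() { return kind == Kind.SESSION_START; }
    boolean isPurchase()     { return kind == Kind.PURCHASE; }
}

// Keyed by session id: remember in state whether a purchase happened, and set a
// timer two hours after the session started. When the timer fires, we can react
// to sessions that ended without a purchase.
public class SessionTimeoutFunction
        extends KeyedProcessFunction<String, ShopEvent, String> {

    private static final long TWO_HOURS_MS = 2 * 60 * 60 * 1000L;

    private transient ValueState<Boolean> purchased;

    @Override
    public void open(Configuration parameters) {
        // State is kept per key (per session id) by the engine.
        purchased = getRuntimeContext().getState(
                new ValueStateDescriptor<>("purchased", Boolean.class));
    }

    @Override
    public void processElement(ShopEvent event, Context ctx, Collector<String> out)
            throws Exception {
        if (event.isSessionStart()) {
            purchased.update(false);
            // Schedule a callback two hours in the future (processing time, for simplicity).
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + TWO_HOURS_MS);
        } else if (event.isPurchase()) {
            purchased.update(true);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out)
            throws Exception {
        if (Boolean.FALSE.equals(purchased.value())) {
            out.collect("Session " + ctx.getCurrentKey() + " ended without a purchase");
        }
        purchased.clear();
    }
}
```

The important part is not the exact class names but that the per-key state and the timer live inside the engine, and that state is exactly what the fault-tolerance mechanism described above has to protect.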
Flink also has the DataStream API for stream processing, and it has what we call the Table API, a relational API that also contains the SQL API. This is simply because Flink evolved along with the space that we are in: initially there was the DataSet API, because people did batch processing, then at some point we went into stream processing, so Flink added the DataStream API, and then at some point we added the Table API with SQL, because we wanted to make the technology available to a wider audience. There are also a couple of more specialized APIs, for graph processing, machine learning, and complex event processing, but those are details.

The batch API is quite straightforward; everyone who has worked with Spark would understand it immediately. The DataStream API looks very similar to it, and here you can do stateful stream processing: you have access to the state and timers that I mentioned, so you can really do the low-level things like these event-driven applications. There are also higher-level APIs for windowing, which allow you to say, for example: I want to compute the average sensor temperature every hour, or every day, or maybe every hour but sliding by ten minutes, something like that. An important characteristic of the DataStream API is that, if you want to, you can have complete control over what it does. It is quite a physical API: what you program is what you get, basically. That is different from SQL, where there is typically an optimizer in between that takes a query, maybe reorders some joins, and tries to be clever about it. The DataStream API is, as we call it, a very physical API.

The Table API, or SQL, is an API that allows you to write ANSI-compliant SQL queries that can be executed on batch data or on streaming data with the same query. One important thing here is that no programming is required: you don't need to write Java code to use this. You just define some data sources and data sinks, put in a SQL query, and then it does something. It could be a query running on historic data, or it could be a fraud detection job, if you can write that as a SQL query, that runs 24/7 and produces results continuously. Another important thing about the SQL API is that it has a pluggable framework for connectors and data formats. There are connectors for Kafka, S3, and file systems, you can read formats like JSON and Avro, you can mix and match those, and people can plug in their own connectors or data formats if they need to.

This is from a blog post that I recently wrote, just so you get a feel for what this SQL API looks like. Here I am starting a docker-compose based setup that you can also check out if you go to the data Artisans blog. I start the Flink SQL client, then I look at what data sources I have defined, so I look at the tables: there are some taxi rides, like from a ride service such as Uber. I look at what the schema of the data is, and then I can just type SELECT * FROM TaxiRides (I made a little typo there), and we just get results. So this is a query that is executed on a Flink cluster and produces results in real time, without me needing to write any Java code.
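As a side note, here is a rough sketch of what the "average sensor temperature every hour" example from above could look like in the DataStream API. The source and the tuple layout are placeholders I made up for illustration; the keyBy/window/aggregate calls are the standard DataStream windowing API, but take the details as a sketch:

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class HourlyAverageTemperature {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source of (sensorId, temperature) readings; in a real job
        // this would come from Kafka or another connector.
        DataStream<Tuple2<String, Double>> readings = env.fromElements(
                Tuple2.of("sensor-1", 21.5),
                Tuple2.of("sensor-1", 23.0),
                Tuple2.of("sensor-2", 19.0));

        readings
                .keyBy(r -> r.f0)                                        // group by sensor id
                .window(TumblingProcessingTimeWindows.of(Time.hours(1))) // one window per hour
                .aggregate(new AverageTemperature())
                .print();

        env.execute("hourly average temperature");
    }

    // Accumulator is (sum, count); the result is the average for the window.
    public static class AverageTemperature
            implements AggregateFunction<Tuple2<String, Double>, Tuple2<Double, Long>, Double> {

        @Override
        public Tuple2<Double, Long> createAccumulator() {
            return Tuple2.of(0.0, 0L);
        }

        @Override
        public Tuple2<Double, Long> add(Tuple2<String, Double> reading, Tuple2<Double, Long> acc) {
            return Tuple2.of(acc.f0 + reading.f1, acc.f1 + 1);
        }

        @Override
        public Double getResult(Tuple2<Double, Long> acc) {
            return acc.f1 == 0 ? 0.0 : acc.f0 / acc.f1;
        }

        @Override
        public Tuple2<Double, Long> merge(Tuple2<Double, Long> a, Tuple2<Double, Long> b) {
            return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
        }
    }
}
```

The "every hour but trailing by ten minutes" variant would use a sliding window assigner instead of the tumbling one; the rest of the pipeline stays the same.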
So with Flink SQL I just define my data sources and my data sinks, I define my query, and then Flink does the rest.

As I said, the third component is connectors. Flink of course has the usual suspects: you can connect to Kafka, you can read from Kafka and write to Kafka, there is Kinesis, there is Elasticsearch, Cassandra, and there is someone in the process of contributing a Pub/Sub connector. And as I mentioned, the Table API has this modular library of connectors and formats that you can mix and match. This is just an example, so don't be afraid; I think it is from the blog post I mentioned, where we define a data source. It has three parts, basically: on the left you can see that we define the fields that we have in this data; in the middle we say that we actually want to connect to Kafka, so we say what the Kafka version is and how to connect to it, where the ZooKeeper and the Kafka brokers are; and on the right we say what the format of the data is, so we say this is JSON data, and there is a specification of what the JSON looks like. With that, Flink can just read this JSON data from that Kafka connector.

Going back to the stream processing landscape: this is how Flink currently covers it. You have the DataSet API for batch processing, you have the DataStream API for the streaming analytics and continuous processing use cases and for the event-driven applications, and you have the Table API, which is a unified API: you write your query once, and then Flink takes it and compiles it either to a DataStream program or to a DataSet program, based on whether you have streaming sources or batch sources in there. So that is how Flink covers this landscape.

But what is a potential next step in the evolution of Flink? This is still being discussed, and because Flink is an open source project I can't really say when, or whether, this will happen, but there is a lot of discussion in Flink and among the people at data Artisans, where most of the committers work, and there is some consensus on what the next step will be. For this we have to look again at the difference between the DataSet and the DataStream API. The difference comes mostly from the fact that if you know you have batch data, if you know that you can read all of it, then you can sort it, or do other things with it, and you can use optimized algorithms. For example, if you know how joins work in a database: if you know that your data is finite, you can use a hash join algorithm, where you first read one side of the join completely, build a hash table from it, and then read the other side of the join and join it against the hash table that you built up. In streaming you can't really do this, because your streams are unbounded; you don't know when the data is finished, so you have to use different algorithms. It has kind of historically grown that we have these two different sets of APIs, but nowadays it is not really necessary anymore. And one problem with this, which some customers are now feeling, is that you can't easily combine historic data sources and real-time data sources.
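Coming back to the hash-join remark for a moment: stripped of any engine specifics, it is a very simple algorithm. Here is a plain-Java sketch (the record layout and the key position are made up for illustration); the point is that the build phase only works because the build side is finite:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SimpleHashJoin {

    // Join two finite inputs on a string key: build a hash table from one side,
    // then stream the other side past it. On an unbounded stream the "build"
    // phase would never finish, which is why streaming joins need other algorithms.
    public static List<String> join(List<String[]> buildSide, List<String[]> probeSide) {
        // Phase 1: read the (usually smaller) build side completely into a hash table,
        // keyed by the join key in column 0.
        Map<String, List<String[]>> hashTable = new HashMap<>();
        for (String[] row : buildSide) {
            hashTable.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }

        // Phase 2: read the probe side once and look up matches.
        List<String> results = new ArrayList<>();
        for (String[] row : probeSide) {
            for (String[] match : hashTable.getOrDefault(row[0], List.of())) {
                results.add(row[0] + ": " + match[1] + " / " + row[1]);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String[]> users = List.of(new String[]{"u1", "Alice"}, new String[]{"u2", "Bob"});
        List<String[]> orders = List.of(new String[]{"u1", "order-42"}, new String[]{"u1", "order-43"});
        join(users, orders).forEach(System.out::println);
    }
}
```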
To give an example of that combination problem: you might have your historical data stored in S3 or HDFS, say the data for the last year of what happened, and then you have your real-time data, the events coming in, in Kafka. But you can't write one program that reads from both of them and does the clever thing you would expect it to do. Ideally, you could write a SQL query, and then Flink first goes and reads the historic data, and when that data is read it switches over to reading from the real-time data, and it is not really visible to the user that this is going on.

So the next big step in the evolution of Flink, and this is actually quite a big change in the runtime that will maybe require a whole year to develop, is to unify batch and stream processing, so that you don't have this difference between the different APIs anymore, so that you can seamlessly read from historic data and from real-time data, have your queries work on both, and seamlessly integrate the historical sources and the real-time sources that come in.

A bit earlier this year, Xiaowei from Alibaba gave a keynote at Flink Forward, which is a conference about Flink in Berlin, about how Alibaba uses Flink internally, and they run it at this crazy scale, tens of thousands of nodes; it is quite interesting, actually. What they are also working towards is this unification of batch and stream processing, because they have a lot of experience running Flink in production at a very large scale. So the community, which includes data Artisans and Alibaba, is trying to make this grand unification of Flink happen. Because if we extend the Table API a bit and make it do a bit more than SQL, while still having SQL, we can use this Table API as the API that seamlessly works for the batch and stream processing use cases, and the DataStream API would then be the API for when you really need access to the low-level details of what is going on, like the state and timers, or when you have these event-driven applications.

So that is really it. It is a huge step, in my opinion, this unification of batch and streaming, if we can pull it off, and I am quite excited about it. If you are interested, you can check out flink.apache.org, and you can check out data-artisans.com/blog, where you can find the article with the demo that I mentioned. If you have Docker on your machine, you can just go there, docker-compose up, and start playing around with this Flink SQL thing. And that is my company again; we are hiring if you are interested. Thanks a lot for your attention.

Maybe I have time for some questions, but I think I will also be outside to answer questions. If you have any pressing questions, you can ask them now, or I'll be outside.

Audience question: Hello. How do you manage changes to the schema, the data schema?

That is a very good question.
This is one of the trickiest questions, and the answer is: if you write a streaming program using the DataStream API, and you have your state in there, and maybe the data that you ingest changes, then we have a thing in Flink called state schema evolution, where you can change the shape, the schema, of that state. But if you use SQL, then we currently don't have a good solution for that. The problem with these long-running streaming applications that run 24/7 is that they have some state, some aggregation that they are working on, and if you change your SQL query, the optimizer might reorder the joins or come up with a completely different physical plan, and then it is very difficult to migrate that state over to the new query. So that is a thing we are very aware of and will hopefully find a solution for, but currently, for SQL, there is no good solution. I'll be outside if anyone wants to find me.