Live from San Francisco, it's theCUBE, covering Flink Forward, brought to you by Data Artisans. Hi, this is George Gilbert. We are at Flink Forward, the conference put on by Data Artisans for the Apache Flink community. This is the second Flink Forward in San Francisco, and we are honored to have with us Stephan Ewen, co-founder of Data Artisans, co-creator of Apache Flink, and CTO of Data Artisans. Stephan, welcome. Thank you, George. Okay, so with others we were talking about the use cases they were trying to solve, but you put together all the pieces in your head first, and are building out something that ultimately gets broader and broader in its applicability. Help us now, maybe from the bottom up, think through the problems you were trying to solve. Let's start with the ones that you solved first, and then how the platform grows so that you can solve a broader and broader range of problems. Yes, I'm happy to do that. So I think we have to take a step back and look at, let's say, the breadth of use cases that we're looking at, how that influenced some of the inherent decisions in how we've built Flink, and how that relates to what was presented earlier today, the stream processing platform and so on. Starting to work on Flink and stream processing: stream processing is an extremely general and broad paradigm, right? We've actually started to say that what Flink is underneath the hood is an engine to do stateful computations over data streams. It's a system that can process data streams as a batch processor processes bounded data; it can process data streams as a real-time stream processor processes real-time streams of events; and it can handle data streams with sophisticated event-by-event, stateful, timely logic, the way many applications that are implemented as data-driven microservices also implement their logic.
And the basic idea behind how Flink takes its approach to that is to start with the basic ingredients that you need and try not to impose any artificial constraints around their use. So, when I give presentations, I very often say the basic building blocks for Flink are, first, flowing streams of data, streams being received from systems like Kafka, file systems, databases, and so on. You route them, you may want to repartition them, organize them by key, broadcast them, depending on what you need to do. Second, you implement computation on these streams, computation that can keep state, almost as if it was a standalone Java application. You don't necessarily think in terms of writing state to a database; you think more in terms of maintaining your own variables or so. Third, sophisticated access to tracking time and progress, so progress of data, completeness of data. That's in some sense what is behind the event-time streaming notion: you're tracking completeness of data as of a certain point in time. And then, to round this all up, you give this a really nice operational tool by introducing the concept of distributed consistent snapshots. And just sticking with these basic primitives, you have streams that just flow, no transactional barriers necessarily there between operations, no micro-batches, just streams that flow, state variables that get updated, and then fault tolerance happening as an asynchronous background process. That is, in some sense, the core idea, and what helps Flink generalize from batch processing to real-time stream processing to event-driven applications. And what we saw today in the presentation that I gave earlier is how we use that to build a platform for stream processing and event-driven applications.
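The building blocks described above can be illustrated with a small sketch. This is plain Python, not Flink's actual API; the class and method names are illustrative assumptions. It shows per-key state updated event by event, as if it were a local variable, plus a watermark that tracks event-time completeness:

```python
from collections import defaultdict

class KeyedCounter:
    """Toy event-by-event processor: per-key state plus a watermark that
    tracks event-time completeness. A concept sketch only, not Flink code."""

    def __init__(self):
        self.state = defaultdict(int)   # per-key state, like keyed state in a stream processor
        self.watermark = 0              # "data is complete up to this event time"

    def on_event(self, key, event_time, value):
        # Update state as if it were a plain local variable,
        # not an external database write.
        self.state[key] += value
        return key, self.state[key]

    def on_watermark(self, ts):
        # Progress/completeness signal: no events with a smaller
        # timestamp are expected to arrive anymore.
        self.watermark = max(self.watermark, ts)

proc = KeyedCounter()
proc.on_event("user-1", 100, 2)
proc.on_event("user-1", 120, 3)
proc.on_watermark(110)
```

In a real system the state would also be snapshotted asynchronously in the background, which is the fourth ingredient discussed next.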
That's taking some of these things, in that case most prominently the fourth aspect, the ability to draw application snapshots at any point in time, and using this as an extremely powerful operational tool. You can think of it as a tool to archive applications, migrate applications, fork applications, modify them independently. And these snapshots are essentially individual snapshots at the node level, and then you're sort of organizing them into one big logical snapshot? Yeah, each node does its own snapshot, but they're consistently organized into a globally consistent snapshot, yes. That has a few very interesting and important implications. Just to give you one example where this makes things much easier: if you have an application that you want to upgrade and you don't have a mechanism like that, what is the default way that many folks do these upgrades today? Try to do a rolling upgrade of all your individual nodes. You replace one, then the next, then the next. But that has this interesting situation where at some point in time there are actually two versions of the application running at the same time. Yeah, and operating on the same data stream. Potentially, yeah, or on some partitions of the data stream you have one version and on some partitions you have another version. So you may be at the point where you have to maintain two wire formats; all pieces of your logic have to be written to understand both versions, or you try to use a data format that makes this a little easier. But it's just inherently a thing that you don't even have to worry about if you have this consistent distributed snapshot. It's just a way to switch from one application to the other as if nothing was shared or in flight at any point in time. It just gets many of these problems out of the way instantaneously. And that snapshot applies to code and data, or can it?
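The snapshot-based upgrade described above can be sketched in a few lines. Again this is a plain-Python illustration under stated assumptions (the `Node` class and helper are hypothetical, not the Flink deployment mechanism): every node snapshots its local state into one globally consistent snapshot, version 1 stops everywhere, and version 2 restores from the snapshot, so two versions never process different partitions at the same time:

```python
class Node:
    """Toy worker holding one partition's state (illustration only)."""
    def __init__(self, node_id, state):
        self.node_id, self.state, self.running = node_id, state, True
    def snapshot_local_state(self):
        return dict(self.state)
    def stop(self):
        self.running = False

def upgrade_with_snapshot(nodes, start_v2):
    # 1. Each node snapshots its local state; together these form
    #    one globally consistent snapshot.
    global_snapshot = {n.node_id: n.snapshot_local_state() for n in nodes}
    # 2. Stop v1 everywhere before v2 starts: unlike a rolling upgrade,
    #    the two versions never run on different partitions concurrently,
    #    so no dual wire formats are needed.
    for n in nodes:
        n.stop()
    # 3. v2 restores every partition from the consistent snapshot.
    return [start_v2(nid, state) for nid, state in global_snapshot.items()]

v1 = [Node("a", {"count": 3}), Node("b", {"count": 7})]
v2 = upgrade_with_snapshot(v1, lambda nid, st: Node(nid, st))
```

The contrast with a rolling upgrade is exactly the point made above: here there is never a moment where some partitions run old code and others run new code.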
So in Flink's architecture itself, the snapshot applies, first of all, only to data, and that is very important, because what it actually allows you to do is decouple the snapshot from the code if you want to. That allows you to do things like we showed earlier this morning: if you have an earlier snapshot where the data is correct, and you then changed the code but introduced a bug, you can just say, okay, let me actually change the code and apply different code to a different snapshot. So you can actually roll back or roll forward different versions of code and different versions of state independently. Or you can go and say, when I'm forking this application, I'm actually modifying it. That is a level of flexibility that, once you actually start to make use of it in practice, is incredibly useful. And it's been one of the maybe least obvious things when you first look into stream processing, but once you actually take stream processing to production, this operational flexibility is, I would say, very high up on the list for a lot of users when they said, okay, this is why we took Flink to streaming production and not others: the ability to do, for example, that. But this sounds then like, as with some stream processors and the idea of unbundling the database, you have derived data at different sync points, and that derived data is for analysis, views, whatever. But it sounds like what you're doing is taking derived data of sort of what the application is working on in progress and creating an essentially logically consistent view, not really derived data for some other application's use, but for operational use. So is that a fair way to explain it? Yeah, let me try to rephrase it a bit.
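The roll-back/roll-forward independence described above can be shown concretely. This is a minimal sketch, assuming a hypothetical registry of data-only snapshots; the function names and the bug are invented for illustration. Because a snapshot holds only data, any code version can be applied to any compatible snapshot:

```python
# Hypothetical snapshot registry: snapshots contain only data, no code.
snapshots = {
    "v1-good":  {"total": 40},   # taken while correct code was running
    "v2-buggy": {"total": -13},  # taken after a buggy deploy corrupted state
}

def code_v2_buggy(state, x):
    state["total"] -= x          # the bug: subtracts instead of adds
    return state

def code_v3_fixed(state, x):
    state["total"] += x
    return state

# Roll the code forward while rolling the state back: restore the
# last-known-good snapshot and run the fixed code against it.
restored = dict(snapshots["v1-good"])
result = code_v3_fixed(restored, 2)
```

The snapshot itself is never mutated, so the same good snapshot could equally be forked into a second, independently modified copy of the application.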
When you start to take this streaming-style approach to things, which has been called turning the database inside out, or unbundling the database, your input sequence of events is arguably the ground truth, and what the stream processor computes is a view of the state of the world. Now, this sounds at first super easy: views, you can always recompute a view, right? But in practice, this view of the world is not just something lightweight that's only derived from the sequence of events. It is actually the state of the world that you want to use. It might not be fully reproducible, just because either the sequence of events has been truncated, or because the sequence of events is just plain too long to feasibly recompute in a reasonable time. So having a way to work with this that complements this whole idea of event-driven, log-driven architecture very cleanly is what this natural tool also gives you. Okay, so that sounds like that was part of core Flink. That is part of Flink's inherent design. Okay, so then take us to the next level of abstraction, the scaffolding that you're building around it with the DA platform, and how that makes stream processing more accessible, how it empowers a whole other generation. Yeah, so there are different angles to what the DA platform does. One angle is just, very pragmatically, easing the rollout of applications by having one way to integrate the platform with your metrics, alerting, logging, CI/CD pipeline, so that every application you deploy there just inherits all of that. Every application developer doesn't have to worry about anything; they just say, this is my piece of code, I'm putting it there, and it's just going to be hooked in with everything else.
That's not rocket science, but it's extremely valuable, because there are a lot of tedious bits here and there that otherwise eat up a significant amount of the development time. The technologically maybe more challenging part that this solves is where we're really integrating the application snapshots, the compute resources, the configuration management and everything into one model. You don't think about: I'm running a Flink job here, that Flink job has created a snapshot that is lying around here, there's also a snapshot over there which probably came from that other Flink application, and that Flink application is actually just a new version of this one, the, let's say, testing or acceptance run for the version that we're about to deploy here. It ties all of these things together. So it's not just the artifacts from one program, it's how they all interrelate. It gives you exactly the idea of how they all interrelate, because an application over its lifetime will correspond to different configurations, different code versions, different deployments on production, A/B testing and so on, and how all of these things work together, how they interplay, right? Like I said before, Flink deliberately couples checkpoints to code in a rather loose way, to allow you to evolve the code and still be able to match a previous snapshot to a newer code version. We make heavy use of that, but we give you a good way of, first of all, tracking all of these things together: how do they relate? When was which version running? What code version was that? Having a snapshot so you can always go back and reinstate earlier versions, having the ability to always move a deployment from here to there, fork it, drop it, and so on.
That is one part of it, and the other part is the tight integration with Kubernetes. Initially the container sweet spot was stateless compute, and the way stream processing as an architecture works, the nodes are inherently not stateless; they have a view of the state of the world. This is always recoverable. You can also change the number of containers, and with Flink and other frameworks you have the ability to adjust this and so on. Including the state, including repartitioning it? Including repartitioning the state, but it's a thing where you often have to be quite careful how to do that, so that it all integrates exactly consistently: the right containers are running at the right point in time with the exact right version, and there's not a split-brain situation where some other containers happen to still be running some other partitions at the same time, or a container goes down and you have to figure out whether this is a situation where you're supposed to recover or to rescale. Figuring all of these things out together, this is what the idea of integrating these things in a very tight way gives you. So think of it the following way, right? You start with, initially, just Docker. Docker is a way to say, I'm packaging up everything that a process needs, all of its environment, to make sure that I can deploy it here and here and here and it just always works. It's not like, oh, I'm missing the correct version of the library here, or I'm interfering with that other process on a port or so. On top of Docker, people added things like Kubernetes to orchestrate many containers together forming an application, and then on top of Kubernetes there are things like Helm, or for certain frameworks there are Kubernetes operators and so on, which try to raise the abstraction to say, okay, we're taking care of these aspects that this framework needs in addition to container orchestration.
We're doing exactly that: we're raising the abstraction one level up to say, okay, we're not just thinking about the containers, the compute, and maybe their local persistent storage, but we're looking at the entire stateful application with its compute, with its state, with its archival storage, all of it together. Okay, let me peel off with a question about more conventionally trained developers and admins. They're used to databases for batch and request-response type jobs or applications. Do you see them becoming potential developers of continuous stream processing apps, or do you see it mainly for a new generation of developers? No, I would actually say that for a lot of the classic, call it request-response, or call it create, read, update, delete kind of applications working against a database, there's huge potential for stream processing, or that kind of event-driven architecture, to help change this field. There's actually a fascinating talk here by the folks from DriveTribe, who implemented an entire social network in a stream processing architecture, so not against a database, but against a log and a stream processor instead. It comes with some really cool properties, like a very unique kind of operational flexibility that lets you at the same time test and evolve and run, and do very rapid iterations over your... Because of the decoupling? Exactly, because of the decoupling, because you don't have to always worry about, okay, I'm experimenting here with something.
Let me first of all create a copy of the database, and then once I actually think that this is working out well, okay, how do I either migrate those changes back, or make sure that the copy of the database that I made is brought up to speed with the production database again before I switch over to the new version. So many of these things, the pieces just fall together easily in the streaming world. I think I asked this of Kostas, but if a business analyst wants to query the current state of what's in the cluster, do they go through some sort of head node that knows where the partitions lie, and then some sort of query optimizer figures out how to execute that with a cost model or something? In other words, if you wanted to do some sort of batch or interactive query. So there are different answers to that, I think. First of all, there's the ability to look into the state of Flink, as in, you have the individual nodes that maintain state during the computation and you can look into this, but it's more of a lookup thing. You're not running a query, as in a SQL query, against that particular state. If you would like to do something like that, what Flink gives you is a wide variety of connectors. So you can, for example, say, I'm describing my streaming computation here, and you can describe it in SQL. You can say, the result of this thing, I'm writing it to a neatly queryable data store, an in-memory database or so. And then you would actually run the dashboard-style exploratory queries against that particular database. So Flink's sweet spot at this point is not to run many small, fast, short-lived SQL queries against something that is running in Flink at the moment. That's not what it is optimized for yet. A more batch-oriented query would go against the derived data, in the form of a materialized view. Exactly, so these two sides play together very well, right?
You have the more exploratory, batch-style queries that go against the view, and then you have the stream processor and streaming SQL used to continuously compute that view that you then explore. Do you see scenarios where you have traditional OLTP databases that are capturing business transactions, but now you want to inform those transactions, or potentially automate them, with machine learning? So you capture a transaction, and then there's sort of ambient data, whether it's about the user interaction or it's machine data flowing in, and maybe you don't capture the transaction right away, but you're capturing data for the transaction and the ambient data. From the ambient data you calculate some sort of analytic result, could be a model score, and that informs the transaction that's running at the front end of this pipeline. Is that a model that you see in the future? So that sounds like a fraud use case that has actually been run, not quite... It's not uncommon, yeah. In some sense, a model like that is behind many of the fraud detection applications, right? You have the transaction that you capture. You have a lot of contextual data that you receive, from which you either build a model in the stream processor, or you build a model offline and push it into the stream processor as, let's say, a stream of model updates. And then you're using that stream of model updates: you derive your classifiers, or your rule engines, or your predictor state from that set of updates and from the history of the previous transactions, and then you use that to attach a classification to the transaction. And once this is returned, this stream is fed back to the part of the computation that actually processes the transaction itself, to trigger the decision whether to, for example, hold it back or let it go forward.
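The fraud-scoring pattern just described, a stream of model updates kept as state, with each transaction classified against the current model, can be sketched as follows. This is a plain-Python illustration; the class, the simple threshold "model", and all field names are assumptions for the example, not a real fraud system:

```python
class FraudScorer:
    """Concept sketch of the pattern above: model updates (e.g. pushed
    from offline training) update operator state, and each incoming
    transaction is classified against the current model state."""

    def __init__(self):
        # Model state, maintained from the stream of model updates.
        # Until the first update arrives, nothing is flagged.
        self.threshold = float("inf")

    def on_model_update(self, new_threshold):
        self.threshold = new_threshold

    def on_transaction(self, txn):
        # Attach a classification; downstream this feeds back into
        # the decision to hold or release the transaction.
        decision = "hold" if txn["amount"] > self.threshold else "allow"
        return {**txn, "decision": decision}

scorer = FraudScorer()
scorer.on_model_update(1000.0)               # model update stream
r1 = scorer.on_transaction({"id": 1, "amount": 250.0})
r2 = scorer.on_transaction({"id": 2, "amount": 5000.0})
```

In a real deployment the two inputs would be separate connected streams, and the model state would be far richer than one threshold, but the shape of the dataflow is the same.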
So this is an application where people who have built traditional architectures would add this capability on for low-latency analytics. Yeah, that's one way to look at it. As opposed to a rip-and-replace, like, we're going to take out our request-response and our batch and put in stream processing. Yeah, so that is definitely a way that stream processing is used: you basically capture a change log, or so, of whatever is happening in a database, or you just immediately capture the events, the interactions from users and devices, and then you let the stream processor run side by side with the old infrastructure and compute exactly that additional information, which even a mainframe database might in the end use to decide what to do with a certain transaction. So it's a way to complement legacy infrastructure with new infrastructure without having to break or throw away the legacy infrastructure. So let me ask in a different direction, more about the complexity that forms a tax for developers and administrators. Many of the open source community products slash projects solve narrow functions within a broader landscape, and there's a tax on developers and admins in trying to make those work together, because of the different security models, data models, all that. So there's a zoo of systems and technologies out there, and also of different paradigms to do things. Once systems have a similar paradigm or idea in mind, they usually work together well, but there are different philosophical tastes. Give me some examples of the different paradigms that don't fit together well. Maybe one good example was initially, when streaming was a rather new thing. At that point in time, stream processors were very much thought of as just a bit of an addition to, let's say, the batch stack, or whatever other stack you currently had.
You'd just look at it as kind of an auxiliary piece to do some approximate computation, and a big reason why that was the case is that these stream processors thought of state with a different consistency model. The way they thought of time was actually different than the batch processors or the databases, which use timestamp fields. And the early stream processors, they couldn't handle event time. Exactly, they just used processing time, and that's why, with these things, you could maybe complement the stack, but it didn't really go well together. You couldn't just say, okay, I can actually take this batch job and interpret it also as a streaming job. Once the stream processors got a better interpretation. Exactly, once the stream processors adopted a stronger consistency model, a time model that is more compatible with reprocessing and so on, all of these things all of a sudden fit together much better. Okay, so do you see vendors who are oriented around a single, unified paradigm continuing to broaden their footprints, so that they can essentially take some of the complexity off the developer and the admin by providing, you know, one throat to choke, with pieces that were designed to work together out of the box, unlike some of the zoos of, perhaps, the former Hadoop community? In other words, a lot of vendors seem to be trying to build a broader footprint, so that it's just simpler to develop for and to operate. There are a few good efforts happening in that space right now. One that I really like is the idea of standardizing on some APIs. APIs are hard to standardize on, but you can at least standardize on semantics.
Which is something that, for example, Flink and Beam have been very keen on: trying to have an open discussion and a roadmap that is very compatible in thinking about streaming semantics. This has been taken to the next level, I would say, with the whole streaming SQL design. Beam is adding streaming SQL and Flink is adding streaming SQL, both in collaboration with the Apache Calcite project. So again, very similar standardized semantics and so on, ANSI SQL compliant. So you start to get common interfaces. Which is a very important first step, I would say. Standardizing on things like... So SQL semantics across products that would be within a stream processing architecture. Yes, that's an example. I think this will become really powerful once other vendors start to adopt the same interpretation of streaming SQL and think of it as: yes, it's a way to take a changing data table here and project a view of this changing data table, a changing materialized view, into another system, right? And then use this as a starting point to maybe compute another derived view. So you can actually start to think more high-level about things: think really relational queries, dynamic tables, across different pieces of infrastructure. Once you can do something like that, interplay in architectures becomes easier to handle, because even if, on the runtime level, things behave a bit differently, at least you start to establish a standardized model for thinking about how to compose your architecture. And even if you decide to change along the way, you're frequently saved the problem of having to rip everything out and redesign everything because the next system that you bring in follows a completely different paradigm. Okay, this is helpful. To be continued offline, or back online on theCUBE.
This is George Gilbert. We were having a very interesting and extended conversation with Stephan Ewen, CTO and co-founder of Data Artisans and one of the creators of Apache Flink. And we are at Flink Forward in San Francisco. We will be back after this short break.