Live from San Francisco, it's theCUBE, covering Flink Forward, brought to you by Data Artisans. Hello again everybody, this is George Gilbert. We're at the Flink Forward conference, sponsored by Data Artisans, the company behind Apache Flink and the commercial distribution, the DA Platform, which supports the productionization and operationalization of Flink and makes it more accessible to mainstream enterprises. We're privileged to have Kostas Tzoumas, CEO of Data Artisans, with us today. Welcome, Kostas. Thank you. So tell us, let's start with a sort of idealized application use case that is in the sweet spot of Flink, and then let's talk about how that's going to broaden over time. Yeah, yeah. So just a little bit of an umbrella above that. What we see very, very consistently, in modern tech companies and in traditional enterprises that are trying to move there, is a move towards a business that runs in real time, runs 24/7, is data-driven, so decisions are made based on data, and is software-operated. So increasingly decisions are made by AI, by software, rather than someone looking at something and making a decision. So for example, some of the largest users of Apache Flink are companies like Uber, Netflix, Alibaba, Lyft; they are all working in this way. Can you tell us about the size of their deployments, something in terms of records per day or cluster size? Yeah, sure. So the latest I heard, Alibaba is powering Alibaba search: more than a thousand nodes, terabytes of state, and I'm pretty sure they will give us bigger numbers today. Netflix has reported doing about one trillion events per day on Flink, so pretty big sizes. And Netflix, I think I read, is powering their real-time recommendation updates. They are powering a bunch of things, a bunch of applications. There's a lot of routing of events internally.
I think they had a talk, definitely at the last conference, where they talked about this. And it's really a variety of use cases. It's really about building a platform internally and offering it to all sorts of departments in the company, be that for recommendations, be that for BI, be that for running stateful microservices, all sorts of things. And we also see the more traditional enterprise moving to this modus operandi. So for example, ING is also one of our biggest partners. It's a global consumer bank based in the Netherlands. And their CEO is saying that ING is not a bank, it's a tech company that happens to have a banking license, a tech company that inherited a banking license. So that's how they want to operate. So what we see is that stream processing is really the enabler for this kind of modern business, where they interact with the consumer in real time, they push notifications, they can change the pricing, et cetera, et cetera. So this is really the crux of stateful stream processing for me. So, okay, tell us, for those who have a passing understanding of how Kafka is evolving, and how Apache Spark and structured streaming are evolving at Databricks, what is it about having state management integrated that, for example, might make it easy to elastically change a cluster size by repartitioning? What can you assume about managing state internally that makes things easier? Yeah. So I think really the sweet spot of Flink is that if you are looking for a stream processing engine, and for a stateful stream processing engine for that matter, Flink is the definition of this. It's the definitive solution to this problem. It was created from scratch with this in mind. It was not sort of a bolt-on on top of something else. So it's streaming from the get-go.
Then we have done a lot of work to make state a first-class citizen. What this means is that in Flink programs you can keep state that scales to terabytes; we have seen that. And you can manage the state together with your application. So Flink has this model based on checkpoints, where you take a checkpoint of your application and its state together, and you can restart at any time from there. The core of Flink is built around this state model. And you manage exactly-once semantics across the checkpointing? It's exactly once, it's application-level exactly once. We have also introduced end-to-end exactly once with Kafka. So Kafka, Flink, Kafka, exactly once, fully consistent. Okay, so let's drill down a little bit. What are some of the things that customers would do with an application running on, let's say, a big cluster or a couple of clusters, where they want to operate both on the application logic and on the state, that having it integrated makes much easier? Yeah, so it is a lot about a flipped architecture and about making operations and DevOps much, much easier. So traditionally what you would do is create, let's say, a containerized stateless application and have a centralized data store to keep all your state. What you do now is the state becomes part of the application. So this has several benefits: it has performance benefits, it has organizational benefits in the company, like autonomy between teams. It gives you a lot of flexibility in what you can do with the applications. Like, for example, scaling an application. What you can do with Flink is, you have an application running with a parallelism of 100, and you are getting a higher volume and you want to scale it to 500. So you can simply, with Flink, take a snapshot of the state and the application together, and then restart it at 500, and Flink is going to redistribute the state. So no need to do anything on a database. And Flink can reshard the state? It will reshard it and it will restart.
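The rescaling flow just described, snapshot the state and the application together, restart at a new parallelism, and let the engine redistribute the keyed state, can be illustrated with a small conceptual sketch. This is plain Python standing in for the idea, not Flink's actual API; the function names and the hash-based partitioning are illustrative stand-ins for Flink's key groups.

```python
# Conceptual sketch (not the Flink API) of rescaling a stateful streaming
# job: snapshot keyed state, restart at a higher parallelism, and
# redistribute each key to its new parallel task by re-hashing.

def partition(key, parallelism):
    """Assign a key to a parallel task; a stand-in for Flink's key groups."""
    return hash(key) % parallelism

def process(events, parallelism, tasks=None):
    """Run a toy keyed count over events; `tasks` holds per-task state."""
    if tasks is None:
        tasks = [dict() for _ in range(parallelism)]
    for key in events:
        slot = tasks[partition(key, parallelism)]
        slot[key] = slot.get(key, 0) + 1
    return tasks

def snapshot(tasks):
    """Checkpoint: collect all keyed state into one consistent snapshot."""
    merged = {}
    for slot in tasks:
        merged.update(slot)
    return merged

def restore(snap, parallelism):
    """Restart at a new parallelism: re-hash every key into the new tasks."""
    tasks = [dict() for _ in range(parallelism)]
    for key, value in snap.items():
        tasks[partition(key, parallelism)][key] = value
    return tasks
```

The point of the sketch is that no external database is touched: the snapshot carries the state, and the restore step reshards it to whatever parallelism the restarted job runs with.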
And then, one step further, with the product that we have introduced, the platform which includes Flink, you can simply do this with one click or with one REST command. Oh, so the resharding was possible with core Flink, the Apache Flink, and the DA Platform just makes it that much easier, along with other operations. Yeah, so what the DA Platform does is it gives you an API for common operational tasks that, we observed, everybody deploying Flink at a decent scale needs to do. It's based on Kubernetes, but it gives you a higher-level API than Kubernetes, where you can manage the application and the state together. It gives that to you in a REST API, in a UI, et cetera. Okay, so in other words, it's sort of like, by abstracting up even from Kubernetes, you might have a cluster as a first-class citizen, but you're treating it almost like a single entity, and then under the covers you're managing the things that happen across the cluster. So what we have in the DA Platform is a notion of a deployment, which is, think of it as a cluster, but it's basically based on containers. So you have this notion of deployments that you can manage. And then you have a notion of an application, and an application is a Flink job that evolves over time. And then you have a very bird's-eye view on this. When you update the code, this is the same application with updated code. You can travel through its history, you can view the logs, and you can do common operational tasks, like, as I said, rescaling, updating the code, rollbacks, replays, migrating to a new deployment target, et cetera. Let me ask you, outside of the big tech companies who have built much of the application management scaffolding themselves, how do you democratize access to stream processing, given that these capabilities are not in the skill set of traditional mainstream developers?
So question: the first thing I hear from a lot of sort of newbies, or people who want to experiment, is, well, it's so easy to manage the state in a shared database, even if I'm processing continuously. Where should they make the trade-off? When is it appropriate to use a shared database, maybe for real OLTP work, and then when can you sort of scale it out and manage it integrally with the rest of the application? So when should you use a database and when should you use streaming, right? Yeah, and even if it's streaming with the embedded state, you know. Yeah, that's a very good question. I think it really depends on the use case. So what we see in the market is that many enterprises start with a use case that either doesn't scale, or is not developer-friendly enough, with this database-application separation, and then it quickly spreads out in the whole company and other teams start using it. So for example, in the work we did with ING, they started with a fraud detection application, where the idea was to load models dynamically into the application as the data scientists are creating new models, and have a scalable fraud detection system that can handle their load. And then we have seen other teams in the company adopt stream processing after that. Okay, so that sounds like where the model becomes part of the application logic, and it's a version of the application logic, and then the version of the model is associated with the checkpoint. Correct. So let me ask you then, what happens when you're managing, let's say, terabytes of state across a cluster, and someone wants to query across that distributed state? Is there in Flink a query manager that knows about where all the shards are, and the statistics around the shards, to do a cost-based query? So there is a feature in Flink called queryable state that gives you the ability to do very simple, for now, queries on the state.
This feature is evolving, it's in progress, and it will get more sophisticated and more production-ready over time. And that enables a different class of users. Exactly. To be frank, I wouldn't use it for complex data warehousing scenarios, that still needs a data warehouse, but you can do point queries, and in the future slightly more sophisticated queries. So this type of state would be different from, like, in Kafka, where you can store the commit log for X amount of time and then replay it. It's in a database, I assume, not in log form, and so you have faster access. Exactly, and it plays together with the log. So you can think of the state in Flink as the materialized view of the log at any given point in time, with various versions. Okay. Yeah. And the way replay works is, roll back the state to a prior version and roll back the input log to that same logical time. Okay. So how do you see Flink spreading out, now that it's been proven in the most demanding customers, and now we have to accommodate skills where developers and DevOps don't have quite the same distributed systems knowledge? Yeah, I mean, we do a lot of work at Data Artisans with financial services, insurance, very traditional companies, but it's definitely something that is work in progress, in the sense that our product, the DA Platform, makes operations much easier. This was a common problem everywhere. This was something that tech companies solved for themselves, and we wanted to solve it for everyone else. Application development is yet another thing. And as we saw today in the last keynote, we are working together with Google and the Beam community to bring Python, Go, all sorts of languages into Flink. Okay, so that'll help at the developer level. At the development level. And you're also doing work at the operations level with the platform. And of course there's SQL, right? So Flink has streaming SQL, which is standard SQL.
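The "state is a materialized view of the log" idea, and the replay mechanism described above (roll the state back to a prior version and roll the input log back to the same logical time), can be sketched in a few lines. This is an illustrative Python model of the concept, not Flink's checkpointing implementation; the class and record shapes are assumptions for the example.

```python
# Conceptual sketch (not Flink internals): state as a materialized view
# of an input log. A checkpoint stores the view together with the log
# offset it corresponds to, so a replay rolls both back to the same
# logical point in time.

class MaterializedView:
    def __init__(self):
        self.view = {}          # current state: key -> latest value
        self.offset = 0         # position in the input log
        self.checkpoints = []   # list of (offset, copy of view)

    def apply(self, log):
        """Consume the log from the current offset, updating the view."""
        for key, value in log[self.offset:]:
            self.view[key] = value
            self.offset += 1

    def checkpoint(self):
        """Snapshot the view and the log position together."""
        self.checkpoints.append((self.offset, dict(self.view)))

    def rollback(self):
        """Restore the latest checkpoint: state AND log offset move back
        in lockstep, and replay resumes from the returned offset."""
        self.offset, view = self.checkpoints[-1]
        self.view = dict(view)
        return self.offset
```

Because the offset and the state are captured atomically, replaying the log from the returned offset reproduces exactly the view that would have existed without the rollback, which is the essence of the consistency guarantee being discussed.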
And would you see, at some point, actually sort of managing the platform for customers, either on-prem or in the cloud? Yeah, so right now the platform is running on Kubernetes, which means that typically the customer installs it in their Kubernetes clusters, which can be either their own machines or a Kubernetes service from a cloud vendor. Moving forward, I think it will be very interesting to move to more hosted solutions, to make it even easier for people. Do you see a breakpoint or a transition between the most sophisticated customers, who either are comfortable on their own premises or were cloud-native from the beginning, and then sort of the rest of the mainstream? You know, what sort of applications might they move to the cloud, or might coexist between on-prem and the cloud? Well, I think it's clear that every new business is on the cloud, that's clear. Yeah. There's a lot of enterprise that is not yet there, but there's big willingness to move there. And there are a lot of hybrid cloud solutions as well. Do you see mainstream customers rewriting applications because they would be so much more powerful in stream processing, or do you see them doing just new applications? Both, we see both. It's always easier to start with a new application, but we do see a lot of legacy applications in big companies that are not working anymore, and we see those rewritten. And very core applications, very core to the business. So could you be sort of the source of analytic processing for the continuous data, and then that sort of feeds a transaction, and some parameters that then feed a model? In other words, you could augment existing OLTP applications with analytics that inform them in real time, essentially. Absolutely. Okay, because that sounds like something that people would build around what exists.
Yeah, I mean, you can think of stream processing, in a way, as transaction processing. It's not a dedicated OLTP store, but you can think of it in this flipped architecture, right? Like, the log is essentially the redo log, and then you create the materialized views. That's the write path. And then you have the read path, which is queryable state. This is this whole CQRS idea, right? Yeah, command query responsibility segregation, exactly. So this is actually interesting, and I guess this is critical. It's sort of like a new way of doing distributed databases. I know that's not the word you would choose, but it's like the derived data, managed by sort of coming off of the state changes in the stream processor, that goes through a single sort of append-only log. And then, on reading, how do you manage consistency on the materialized views, the derived data? Yeah, so we have seen Flink users implement that. So we have seen companies really base their complete product on this CQRS pattern. I think this is a little bit further out. Consistency-wise, Flink gives you exactly-once consistency on the write path. What we see a lot more is an architecture where there are a lot of transactional stores in the front end that are running, and then there needs to be some kind of global single source of truth between all of them. And a very typical way to do that is to get these logs into a stream, and then have a Flink application, which can actually scale to that, create a single source of truth from all of these transactional stores. And by feeding the transactional stores into this sort of hub, I presume, some cluster as a hub, and even if it's in the form of sort of a log, how can you replay it with sufficient throughput? I guess not to be a data warehouse, but to have low latency for updating the derived data. And is that derived data, I assume, in non-Flink products?
Yeah, so the way it works is that you can get the change logs from a database, you can use something like Kafka to buffer them up, and then you can use Flink for all the processing. And doing the reprocessing with Flink, this is really one of the core strengths of Flink. Basically, what you do is you replay the Flink program together with the state, so you can get really, really high-throughput reprocessing there. And where does the super high throughput come from? Is that because of the integration of state and logic? Yeah, that is because Flink is a true streaming engine. It is a high-performance streaming engine, and it manages the state. There's no tier... Crossing, no boundary. There's no tier crossing, there's no boundary crossing when you access the state. It's embedded in the Flink application. Okay, so you can optimize the I/O path. Correct. Okay. Very, very interesting. So it sounds like the Kafka guys, or the Confluent folks, their aspirations, from the last time we talked to them, don't extend to analytics. I don't know whether they want partners to do that, but it sounds like they have a similar topology. But I'm not clear how much of a first-class citizen state is, other than the log. How would you characterize the trade-offs between the two? Yeah, so obviously I cannot comment on Confluent, but what I think is that the state and the log are two very different things. You can think of the log as storage. It's a kind of hot storage, because it's the most recent data. Yeah. But you cannot query it, it's not a materialized view, right? So for me, the separation is between processing, state, and storage. The log is a kind of storage, a kind of message queue. State is really the active data, the real-time active data, that needs to have consistency guarantees, and that's a completely different thing. Okay, and it's almost like you're managing, under the covers, a distributed database. Yes, kind of, yeah.
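The pipeline described here, change logs from several transactional stores buffered in a log like Kafka, then processed by a stateful program into a single source of truth, with reprocessing being a replay of the same program, can be sketched conceptually. This is illustrative Python, not a real CDC format or the Flink API; the `(timestamp, key, value)` record shape and function names are assumptions for the example.

```python
import heapq

# Conceptual sketch (not Flink or Kafka APIs): change logs from several
# transactional stores are merged into one ordered stream, and a stateful
# consumer with embedded, local state builds a single source of truth.
# Records are (timestamp, key, value); last writer wins.

def merge_changelogs(*logs):
    """Merge per-store change logs (each ordered by timestamp) into one stream."""
    return list(heapq.merge(*logs, key=lambda rec: rec[0]))

def materialize(stream, state=None):
    """Stateful consumer: the view lives in embedded state, so there is
    no per-record round trip to an external store."""
    state = dict(state or {})
    for ts, key, value in stream:
        state[key] = value
    return state

def reprocess(*logs):
    """Reprocessing = replaying the whole buffered log through the same program."""
    return materialize(merge_changelogs(*logs))
```

The embedded state is what the throughput argument above rests on: each record updates local state directly, rather than crossing a boundary to a remote database, and a full replay is just the same loop run from the start of the buffered log.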
A distributed key-value store, if you wish. Okay, yeah. Okay, and then that's exposed through multiple interfaces: the DataStream API, the Table API, SQL, other languages in the future, et cetera. Okay, so going further down the line, how do you see the sort of use cases that are gonna get you across the chasm from the big tech companies into the mainstream? Yeah, yeah. So we're already seeing that a lot. So we're doing a lot of work with financial services and insurance companies, a lot of very traditional businesses. And it's really a lot about maintaining a single source of truth, becoming more real-time in the way they interact with the outside world and the customer. Like, they do see the need to transform. If we take financial services and investment banks, for example, there is a big push in this industry to modernize the IT infrastructure, to get rid of legacy, to adopt modern solutions, to become more real-time, et cetera. And so they really needed this, like the application platform, the DA Platform, because operationalizing what Netflix did is gonna be very difficult, maybe, for non-tech companies. Yeah, I mean, it's always a trade-off, right? Some companies build, some companies buy. And for many companies, it's much more sensible to buy. That's why we have software products. And really, our motivation was that we worked in the open-source Flink community with all the big tech companies, we saw their successes, we saw what they built, we saw their failures, we saw everything, and we decided to build this for everybody else, for everyone that is not Netflix, is not Uber, that cannot hire software developers so easily or with such good quality. Okay. All right, on that note, Kostas, we're gonna have to end it, to be continued, with Stephan next, apparently, and then hopefully next year as well. Nice. Thank you. All right, thanks, Kostas. Thank you.
All right, we're with Kostas Tzoumas, CEO of Data Artisans, the company behind Apache Flink, and now the application platform that makes Flink run for mainstream enterprises. We will be back after this short break.