Live from San Francisco, it's theCUBE. Covering Flink Forward, brought to you by Data Artisans.

Hi, this is George Gilbert. We are at Flink Forward, the conference for the Apache Flink community, sponsored by Data Artisans, which is the company commercializing Flink. And we have with us now Enrico Canziere. Wait a minute, I didn't get that right. Canzoniere. Yeah, that's good. Sorry. Enrico is from Yelp, and he's going to tell us how Flink has taken Yelp by storm over the past year. Why don't we start off with where you were last year, in terms of your data pipeline and what challenges you were facing?

Yeah, sure. Yelp is a Python company, in the sense that we've developed most of our software in Python. So until last year, most of our stream processing was happening in Python. We had developed an in-house framework for Python stream processing, and that was really all we had; there was no Flink running. Most of the applications were built around a very simple interface, a process-message function, and that was what we expected developers to use. So, no real abstraction there.

Okay, so in other words, it sounds like you had a discrete task, request-response or batch, and then a handoff to the next function. Is that what the pipeline looked like?

The pipeline was more of a streaming pipeline: we had a Kafka topic as input, and developers would write this process-message function, where each message would be individually processed. That was the semantics of the pipeline. Then we would put the result of that processing into another Kafka topic, and run another processing function on top of that. So we could very easily have two or three processing tasks, all connected by Kafka topics.
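The pattern Enrico describes, a bare process-message function per stage with stages chained through Kafka topics, can be sketched in plain Python. This is a toy illustration, not Yelp's actual framework; the topic and function names are hypothetical, and plain lists stand in for Kafka topics:

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical stand-in for Kafka topics: each topic is just a list of messages.
topics: Dict[str, List[str]] = defaultdict(list)
topics["reviews"] = ["5 stars", "1 star", "3 stars"]

def run_stage(in_topic: str, out_topic: str,
              process_message: Callable[[str], str]) -> None:
    """Consume every message from in_topic, apply the user's
    process_message function, and produce the result to out_topic."""
    for msg in topics[in_topic]:
        topics[out_topic].append(process_message(msg))

# Stage 1: normalize each review down to its numeric rating.
run_stage("reviews", "normalized", lambda m: m.split()[0])

# Stage 2: flag low ratings. Stages communicate only through topics,
# so there is no shared state, no windowing, and no time semantics.
run_stage("normalized", "flagged",
          lambda r: "low" if int(r) <= 2 else "ok")

print(topics["flagged"])  # → ['ok', 'low', 'ok']
```

Each message is handled strictly one at a time, which is exactly why anything that needs to look across messages (a window, a join) has nowhere natural to live.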
Obviously, there were big limitations to that kind of architecture, especially when you want to do something more advanced that includes windowed aggregation or, especially, state management.

Because Kafka has several layers of abstraction, and I guess you'd have to go pretty low-level to get all the windowing and state management capabilities.

Yeah, it becomes really hard. You basically have to implement it yourself, unless you're on, say, the Confluent Platform, or you're using what they call Kafka Streams, but we are not using that.

Oh, okay.

So we were trying to implement that on top of our simple Python framework, from zero.

So tell us how the choice of Flink came about. Where did it hit your awareness, and how did you start mapping it onto this need, this pipeline that was Python-based?

Yeah, so we had, I think, two main use cases last year that we were struggling to get right, to really get working. The first one was an S3 connector, and the challenge there was to aggregate data locally, scale it to hundreds of streams, and then, once we had aggregated the data locally, upload it to S3. We were really struggling to get that working because in the framework we had, there was no real abstraction for windowing: we had this process-message function and were trying to implement all of that inside it. And because we were using very low-level Kafka consumer primitives, getting scalability was not that straightforward. So that was one application that was pretty challenging. The other one was really a pure stateful application, where we needed to retain the state forever. It was doing a windowed join across streams. Obviously the challenges in that case are even bigger, because we would have to implement state management from the ground up, and all the time semantics.
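To see why windowing is painful to bolt onto a bare process-message interface, here is a minimal hand-rolled tumbling-window count in plain Python, a sketch of the kind of code Yelp would have had to write itself (all names are illustrative). Even this toy version has to track window boundaries and buffered state by hand, and it still has no fault tolerance, no event-time handling, and no way to recover its state after a crash:

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # tumbling windows of one minute

# Manually-kept state: counts per (key, window_start). If the process
# dies, this dict is simply lost -- there is no checkpointing.
window_counts = defaultdict(int)

def process_message(key: str, timestamp: int) -> None:
    """The only hook the simple framework offers: one message at a time.
    All windowing logic has to live inside it."""
    window_start = timestamp - (timestamp % WINDOW_SECONDS)
    window_counts[(key, window_start)] += 1

# Feed some messages: (key, unix timestamp in seconds)
events = [("page_a", 5), ("page_a", 42), ("page_b", 59),
          ("page_a", 61), ("page_b", 130)]
for key, ts in events:
    process_message(key, ts)

print(dict(window_counts))
# → {('page_a', 0): 2, ('page_b', 0): 1, ('page_a', 60): 1, ('page_b', 120): 1}
```

Multiply this by late data, rescaling the consumer group, and crash recovery, and "implement it yourself on low-level Kafka primitives" stops being straightforward.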
Yeah, we basically had no event-time semantics; we were not supporting that. So we looked at Flink because of its event-time support: now we could actually do event-time processing. And state management comes already implemented, which is way different from implementing it from the ground up. And then obviously the abstractions, the streaming primitives: you have windows, you have a nice interface you can use, and that makes it easier for the developers who are writing the code.

So let's start with the state management. Help us walk through what capabilities in state management Flink has, relative to the lowest-level abstraction in Kafka, or perhaps to what Spark Structured Streaming might provide.

Yeah, so I think the nice features in Flink are really around the fact that state management is implemented and fully supports Flink's clustered approach. For example, if you're using Kafka, Flink already provides, in its Kafka connector, a way to represent the state of a Kafka consumer. Also, for operators, if you have a flat map or you have a window, state for windows is already fully supported. So if you are accumulating events in your window, you don't need to do anything special: the state will be automatically maintained by the Flink framework. That means that when Flink takes a snapshot, a checkpoint or a savepoint, all the state that was there gets stored in the checkpoint, and you'll be able to recover it.

For the full window?

Yeah.

Because it understands the concept of the window when it does a checkpoint.

Yeah, because there's native support in Flink for that.

And what's the advantage of having state integrated with the compute, as opposed to compute plus some sort of API to a separate state manager?

It's definitely code clarity. It's a big simplification of how you implement your streaming application.
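Event time, the feature that drew Yelp to Flink, means windows close based on timestamps carried inside the events, not on when messages happen to arrive. A toy illustration in plain Python, with a deliberately simplified watermark rule (real Flink generates watermarks through configurable strategies; nothing here is Flink's actual API):

```python
from collections import defaultdict

WINDOW = 10        # tumbling event-time windows of 10 time units
MAX_LATENESS = 5   # watermark trails the largest timestamp seen by 5 units

buffers = defaultdict(list)   # window_start -> buffered event values
emitted = {}                  # window_start -> final window result
max_event_time = 0

def on_event(event_time: int, value: int) -> None:
    global max_event_time
    buffers[event_time - event_time % WINDOW].append(value)
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - MAX_LATENESS
    # A window [start, start + WINDOW) is complete once the watermark
    # passes its end -- even if its events arrived out of order.
    for start in [s for s in buffers if s + WINDOW <= watermark]:
        emitted[start] = sum(buffers.pop(start))

# Out-of-order arrival: the event stamped 3 shows up after the one stamped 12.
for t, v in [(1, 10), (12, 1), (3, 5), (17, 2), (28, 7)]:
    on_event(t, v)

print(emitted)  # → {0: 15, 10: 3}
```

Note the late event stamped 3 still lands in the correct window [0, 10) because that window only closes once the watermark passes 10. In Flink, the `buffers` dict above would be operator state, snapshotted automatically in checkpoints, which is exactly the point Enrico makes next.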
Because in the end, if for every stream processing application you need to go ahead and define and implement the way your state gets stored, that makes for a very complex application, especially on the maintenance side. In Flink, you focus on the business logic. We actually did some tuning on the state backend, and that was necessary, but the tuning we did applies in the same way across all the applications that we've built. Users who want to build an application focus on the business logic they want, and the state is more declarative: you say you want this map, or you need this list as part of the state, and Flink will take care of actually making sure it gets into the checkpoint.

So the semantics of state management are built in at the compute layer, as opposed to going down to an API for a separate service, as in other implementations. Okay. All right, we have just a minute left. Tell us about some of the things you're looking forward to doing with Flink. Are they similar to what the dA Platform coming out from Data Artisans offers, or do you still have a whole bunch of things on the data pipeline that you want to accomplish with just the core functionality?

Yeah, definitely. I would say one of the features we are really excited about is streaming SQL; I see a lot of potential there for new applications. We actually use streaming SQL at Yelp. We deploy it as a service, so it makes it easier for users to deploy and develop stream processing applications. We are definitely planning to expand our Flink deployment and introduce new apps, and one of the things we try to do is build reusable components; the way we deploy those reusable components is very coupled with the way we think about our data pipeline.
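The "declarative state" idea, where you register the state you need and the framework owns storing and snapshotting it, can be caricatured in a few lines of plain Python. This only mimics the shape of Flink's state descriptors; every class and name here is illustrative, not Flink's actual API:

```python
import copy

class ListState:
    """Framework-managed list state: user code appends, framework snapshots."""
    def __init__(self) -> None:
        self.items = []
    def add(self, value) -> None:
        self.items.append(value)

class Runtime:
    """Stand-in for the framework: tracks every registered state object
    and snapshots all of them at checkpoint time."""
    def __init__(self) -> None:
        self._states = {}
    def get_list_state(self, name: str) -> ListState:
        # Declarative: the user only names the state; the runtime owns it.
        return self._states.setdefault(name, ListState())
    def checkpoint(self) -> dict:
        return {name: copy.deepcopy(s.items) for name, s in self._states.items()}
    def restore(self, snapshot: dict) -> None:
        for name, items in snapshot.items():
            self.get_list_state(name).items = list(items)

rt = Runtime()
seen = rt.get_list_state("seen-reviews")   # declare the state you need
seen.add("r1"); seen.add("r2")
snap = rt.checkpoint()                     # framework snapshots everything
seen.add("r3")                             # post-checkpoint update...
rt.restore(snap)                           # ...rolled back on recovery
print(seen.items)  # → ['r1', 'r2']
```

The business logic never touches serialization, storage, or recovery; that is the separation Enrico credits for the simpler applications, and it is why state-backend tuning done once could apply uniformly across all of Yelp's Flink jobs.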
Okay, so would it be fair to say, when you look at the dA Platform, that for companies that are not quite as sophisticated as you, it is going to make it easier for mainstream companies to build, deploy, and operate?

Yeah, I see good potential there. I was looking at the presentation this morning, and I like the integration with Kubernetes for sure, since that's where the current trend in application deployment is going. So yeah, I definitely see potential. For Yelp, we clearly have a complex enough deployment and service integration that it probably won't be a good fit for us, but companies that are starting on the road to Flink now, and that probably already have an existing Kubernetes deployment, may well give it a try.

Okay. All right, Enrico, we've got to end it there, but that was very helpful. Thanks for stopping by.

Thanks for having me here.

Okay. This is George Gilbert. We are at Flink Forward, the Data Artisans conference for the Apache Flink community, and we will be right back after this short break.