Live from San Francisco, it's theCUBE, covering Flink Forward, brought to you by Data Artisans.

Hi, this is George Gilbert. We are at Flink Forward, the user conference for the Apache Flink community, sponsored by Data Artisans. We are in San Francisco, and this is the second Flink Forward conference held here. We have a very eminent guest with a long pedigree: Holden Karau, formerly of IBM and of Apache Spark fame, having put Apache Spark and Python together. Now Holden is at Google, focused on the Beam API, which makes it possible to write portable stream processing applications across Google's Dataflow as well as Flink and other stream processors. And Holden has been working on integrating it with Google's TensorFlow framework, which is also open source. So Holden, tell us about the objective of putting these together. What type of use cases?

So I think it's really exciting, and it's still very early days. I want to be clear: if you go out there and run this code, you are going to get a lot of really weird errors, but please tell us about the errors you get. The goal, and we see this in Spark with the pipeline APIs, is that most of our time in machine learning is spent doing data preparation. We have to get our data into a format where we can do our machine learning on top of it. And the tricky thing about data preparation is that we often have to have a lot of the same preparation code available when we're making our predictions. What this means is that a lot of people essentially end up having to write a stream processing job to do their data preparation, and then they have to write a corresponding online serving job to do similar data preparation when they want to make real predictions.
By integrating tf.Transform and things like this into the Beam ecosystem, the idea is that people can write their data preparation in a simple, uniform way that can be carried from training time into online serving time without having to rewrite their code. That removes the potential for mistakes, where we change one variable slightly in one place and forget to update it in another, and it really simplifies the deployment process for these models.

Okay, so help us tie that back to Flink in this case.

Yes.

And also to clarify: my impression was that data prep was a different activity, something done at design time, while serving was runtime.

Totally.

But you're saying that they can be better integrated?

So there are different types of data prep. Some types would be things like removing invalid records, and if I'm doing that, I don't have to do it at serving time. But one of the classic examples of data prep would be tokenizing my inputs or performing some type of hashing transformation. If I do that, then when I get new records to predict, they won't be in a pre-tokenized form, or they won't be hashed correctly, and my model won't be able to serve on these raw inputs. So I have to recreate the data prep logic that I created for training at serving time.

So with a common Beam API, a common provider underneath it like Flink, and TensorFlow, it's the repeatable activities for transforming data to make it ready to feed to a machine learning model. It would be ideal to have those transformation activities be common in your prep work and then in production serving.

Yes, very true, very true.

Okay. And so tell us, what type of customers want to write to the Beam API and have that portability?

Yeah, so that's a really good question. There are a lot of people who want portability outside of Google Cloud.
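The tokenizing-and-hashing case Holden describes can be sketched in plain Python. This is an illustrative analogy only, with hypothetical names; in practice this role is played by tf.Transform preprocessing functions running on Beam. The key point it shows is that the same transformation code serves both the training pipeline and the online prediction path.

```python
import zlib

def prepare_features(record, num_buckets=32):
    """Tokenize the text field and hash each token into a fixed bucket space.

    A stable hash (crc32) is used rather than Python's built-in hash(),
    which is randomized across processes; real feature hashing needs the
    training job and the serving job to agree on the hash function.
    """
    tokens = record["text"].lower().split()
    return [zlib.crc32(tok.encode("utf-8")) % num_buckets for tok in tokens]

# Training time: the function runs over the historical dataset.
training_data = [{"text": "Flink is great"}, {"text": "Beam is portable"}]
training_features = [prepare_features(r) for r in training_data]

# Serving time: a new, raw record goes through the identical code path,
# so it is tokenized and hashed exactly as the model saw during training.
new_record = {"text": "Flink is great"}
serving_features = prepare_features(new_record)

assert serving_features == training_features[0]
```

If the serving job instead re-implemented the tokenizer slightly differently, the hashed features would silently diverge from training, which is exactly the class of mistake a shared definition avoids.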
That's one group: essentially people who want to adopt different Google Cloud technologies but don't want to be locked into Google Cloud forever, which is completely understandable. There are other people who are more interested in being able to switch streaming engines, like switching between Spark and Flink. Those are people who want to try out different streaming engines without having to rewrite their entire jobs.

Does Spark Structured Streaming support the Beam API?

So right now the Spark support for Beam is limited. It's on the old DStream API, not on top of the Structured Streaming API. It's something we're actively discussing on the mailing list, how to go about doing it, because there are a lot of intricacies involved in bringing new APIs in line. And since it already works there, there's less pressure, but it's something we should look at more. Where was I going with this?

So the other one that I see is that Flink is a wonderful API, but it's very Java-focused, right? Java's great, everyone loves it, but a lot of the really cool things being done nowadays are built in Python, like TensorFlow. There's a lot of really interesting machine learning and deep learning work happening in Python, and Beam gives people a way to work with Python across these different engines. Flink supports Python, but it's maybe not a first-class citizen. And the Beam Python support is still a work in progress. We're working to make it better, but, I mean, you can see the demo this afternoon, although I guess if you're not here, you can't see the demo, but you can see the work happening on GitHub. And there's also work being done to support Go.

To support Go?

Go, which is a little out of left field.
So would it be fair to say that the value of Beam for potential Flink customers is that they can start on Google Cloud Platform, or on any one of several stream processors, and move to another one later, and they also inherit the better language support, or bindings, from the Beam API?

I think that's very true. The better language support, it's better for some languages and probably not as good for others. It's somewhat subjective what better language support is, but I think for Go it's pretty clear. This is all stuff that's in the master branch; it's not released today. But if people are looking to play with it, I think it's really exciting, and they can go check it out from GitHub and build it locally.

So then, what type of customers do you see who have moved into production with machine learning and the streaming pipelines?

The biggest customer that's in production is obviously Spotify, or not obviously, but Spotify. One of them is Spotify; they give a lot of talks about it. Because I didn't know we were going to be talking today, I didn't have a chance to go through my customer list and see who's okay with us mentioning them publicly, so I'll just stick with Spotify.

Without the names, maybe the use cases and the general industry?

I don't want to get in trouble. I'm just gonna, sorry.

Okay, okay. So then, let's talk about: does Google view Dataflow as its strategic successor to MapReduce?

I mean, yes, so...

And is that a competitor then to Flink?

So I think Flink and Dataflow can be used in some of the same cases, but I think they're more complementary. Flink is something you can run on-prem; you can run it with different Hadoop vendors. And Dataflow is very much, "I can run this on Google Cloud."
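The portability idea discussed above can be illustrated with a toy sketch. This is not the real Beam API (the names and structure here are invented for illustration): the point is only that when a pipeline is described once as data, any conforming "runner" can execute it, analogous to handing the same Beam pipeline to Dataflow, Flink, or Spark.

```python
# A pipeline described as data: an ordered list of (operation, function) steps.
pipeline = [
    ("map", lambda x: x * x),          # square each element
    ("filter", lambda x: x % 2 == 0),  # keep only even squares
]

def batch_runner(steps, source):
    """Toy engine that materializes every stage, like a simple batch system."""
    data = list(source)
    for kind, fn in steps:
        if kind == "map":
            data = [fn(x) for x in data]
        else:  # filter
            data = [x for x in data if fn(x)]
    return data

def streaming_runner(steps, source):
    """Toy engine that pushes elements through one at a time, like a stream."""
    out = []
    for x in source:
        keep = True
        for kind, fn in steps:
            if kind == "map":
                x = fn(x)
            elif not fn(x):  # filter step rejected the element
                keep = False
                break
        if keep:
            out.append(x)
    return out

# The same pipeline definition runs unchanged on both engines.
nums = range(1, 6)
assert batch_runner(pipeline, nums) == streaming_runner(pipeline, nums) == [4, 16]
```

Swapping engines means changing only the runner call, never the pipeline definition, which is the guarantee that makes it safe to start on one engine and move to another later.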
And part of the idea with Beam is to make it so that people who want to write Dataflow jobs, but maybe want the flexibility to go back to something else later, can still have that.

And so, yeah, we could swap in Flink or Dataflow as execution engines if we're on Google Cloud?

I mean, we're not, how do I put it nicely? Provided people are running this stuff, they're burning CPU cycles; I don't really care if they're running Dataflow or Flink as the execution engine. Either way, it's a party for me, right? It's probably one of those friendly competitions, where we both push each other to do better and add more of the features that the respective projects have.

Okay, thirty-second question.

Cool.

Do you see people building stream processing applications with machine learning as part of it to extend existing apps, or for ground-up new apps?

Totally. I mostly see it as extending existing apps. This is possibly a bias of just the people that I talk to, but going ground-up with both streaming and machine learning at the same time, starting both of those projects fresh, is a really big hurdle to get over, right?

For skills.

For skills. It's really hard to pick up both of those at the same time. It's not impossible, but it's much more likely you'll build something first: maybe you'll build a batch machine learning system and realize you want to productionize your results more quickly, or you'll build a streaming system and then want to add some machine learning on top of it. Those are the two paths that I see. I don't see people jumping headfirst into both at the same time, but this could change. Batch has been king for a long time, and streaming is getting its day in the sun, so we could start seeing people become more adventurous and do both at the same time.

Okay. Holden, on that note, we'll have to call it a day, but that was most informative. It's really good to see you again, as always.
So, this is George Gilbert. We're on the ground at Flink Forward, the Apache Flink user conference sponsored by Data Artisans, and we will be back in a few minutes after this short break.