Welcome, everybody. We're at the Flink Forward conference in San Francisco at the Kabuki Hotel. Flink Forward US is the first US user conference for the Flink community, sponsored by data Artisans, the creators of Flink. And we're here with special guest Kenneth Knowles, who works for Google and heads up the Apache Beam team. Just to set context: Beam is the API, or SDK, on which developers can build stream processing apps that can run on Google Cloud Dataflow, Apache Flink, Spark, and Apex, among other engines that will come along. Ken, why don't you tell us: what was the genesis of Beam, and why did Google open up the API?

Speaking as an Apache Beam PMC member, the genesis came from a combined code donation to Apache: the Google Cloud Dataflow SDK from Google, a Flink runner for that SDK, already written by data Artisans, which included some portability hooks, and a runner for Spark written by some folks at PayPal. Those three efforts pointed out that it was a good time to have a unified model for these DAG-based computations.

Okay, I want to pause you for a moment. Generally we try to avoid being rude and cutting off our guests, but in this case, help us understand: what is a DAG, and why is it so important?
Okay, so a DAG is a directed acyclic graph. If you draw a boxes-and-arrows diagram of your computation — I read some data from here, it goes through some filters, then I do a join, and then I write it somewhere — it ends up having this structure we call a DAG. Essentially all computation can be modeled this way, and massively parallel computations in particular profit a lot from being modeled this way, as opposed to MapReduce. Because the engine has access to the entire DAG, it can perform transformations and optimizations, and it has more opportunities for executing the computation in different ways.

In other words, because you can see the big picture, you can find something like the shortest path, as opposed to "I have to do this step, then this step, then this step."

Yes, it's exactly like that. The person writing the program says what it is they want to compute, and then very smart people writing the optimizer and the execution engine may execute it in an entirely different way. For example, if you're doing a summation, rather than shuffling all your data to one place and summing it there, you might do some partial summations, shuffle just the accumulators to one place, and finish the summation there.

Okay, now let me bump you up a couple of levels. MapReduce was a trees-within-the-forest approach — lots of seeing just what's a couple of feet ahead of you — and now we have the big picture that lets you find the best path, perhaps, as one way of saying it. Tell us, though: for Google, or for others using Beam-compatible applications, what new class of solutions can they build that they couldn't have built with MapReduce?
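The partial-summation optimization Ken describes can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Beam's actual implementation, and the partition layout is invented for the example:

```python
# Sketch of combiner-style partial aggregation: each "worker" reduces its own
# partition to a single accumulator, and only those small accumulators are
# shuffled to one place and merged, instead of shuffling every raw record.
def distributed_sum(partitions):
    # Phase 1 (conceptually runs in parallel, one worker per partition):
    # reduce each partition to one partial-sum accumulator.
    accumulators = [sum(partition) for partition in partitions]
    # Phase 2 (single worker): merge the accumulators into the final result.
    return sum(accumulators)

print(distributed_sum([[1, 2, 3], [4, 5], [6]]))  # → 21
```

Because the optimizer sees the whole DAG, it can rewrite a naive "shuffle everything, then sum" plan into this two-phase form automatically.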
Well, there are, I guess, two main aspects of Beam that I would emphasize. There's the portability: you can write your application without having to commit to which backend you're going to run it on. And there's the unification of streaming and batch, which is not present in a number of backends; Beam, as this layer, makes it very easy to use batch-style computation and streaming-style computation in the same pipeline. Actually, I said there were two things, but a third thing that really opens things up is that Beam is not just a portability layer across backends — it's also a portability layer across languages. A language that's often second-class on a lot of these systems, for example, is Python. Beam has a Python SDK where you write a DAG description of your computation in Python, and via Beam's portability APIs, one of these usually Java-centric engines can run that Python pipeline. Okay — did I answer your question?

Yes, yes, but let's go one level deeper. If MapReduce's sweet spot was web-crawl indexing in batch mode, what are some of the things that are now possible with a platform that supports Beam underneath and can do this directed acyclic graph processing?

I'm still learning all the different things you can do with this style of computation, and the truth is it's just extremely general. You can set up a DAG, and there are a lot of talks here at Flink Forward about using a stream processor to do high-frequency trading or fraud detection, and those are completely different applications, even though they're in the same model of computation. You would still use it for things like crawling the web and doing PageRank over the result — though actually, at the moment we don't have iterative computation, so you wouldn't do PageRank today.
So is it considered a complete replacement, plus new use cases, for older-style frameworks like MapReduce? Or is it a complement for cases where you want to do more with data in motion, or at lower latency?

It is absolutely intended as a full replacement for MapReduce, yes. If you're thinking about writing a MapReduce pipeline, you should instead write a Beam pipeline and then benchmark it on different Beam backends.

And working with Spark, working with Flink — where are they in terms of implementing the full richness of the Beam interface, relative to the Google product, Cloud Dataflow, from which I assume Beam was derived?

All of the different backends exist in different states as far as implementing the full model. One thing I really want to emphasize is that Beam is not trying to take the intersection of all of these — and I think your question already shows you know this. We keep a capability matrix on our website where we say: here are all the different features you might want, here are all the backends you might want to run on, and for each combination, can you do it, can you do it sometimes, with notes. We want that whole matrix to be "yes": you can use all of the model on Flink, all of it on Spark, all of it on Google Cloud Dataflow. They all have some gaps today, and we're really welcoming contributors in that space.

For someone who's been around for a long time, you might think of it like an ODBC driver, where the capabilities of the databases behind it differ, and so a driver can only support some subset of the full capability.

I'm not familiar enough with ODBC to say absolutely yes or absolutely no, but yes, it's that sort of thing — the way the JVM has many languages on it, or ODBC provides a generic database abstraction.
Is Google's goal with the Beam API to make it so that customers demand a level of portability that goes not just to the on-prem products but to products in other public clouds — to sort of pry open the API lock-in?

I can't say what Google's goals are, but I can certainly say that Beam's goal is that nobody is locked into a particular backend. Actually, I can't even say what Beam's goals are — sorry, those are my goals; I can speak for myself.

Is Beam seeing adoption so far by the big consumer internet companies? Has it started to spread to mainstream enterprises, or is it still a little immature?

I think Beam is still a little bit less mature than that, though we're heading into our first stable release. We began incubating as an Apache project about a year ago, and then right at the end of 2016 we graduated to an Apache top-level project. So right now we're on the road from becoming a top-level project, we're seeing contributions ramp up dramatically, and we're aiming for a stable release as soon as possible. Our next release we expect to be a stable API that we would encourage users and enterprises to adopt.

Okay, and that's when we would see it in production form on the Google Cloud Platform?

Well, the code and the backends behind it are all very mature, but right now we're still — I don't know how to say it — polishing the edges. It still has a lot of rough edges, and you might encounter them if you're trying it out right now; things might change out from under you before we make our stable release.

Understood. All right, Kenneth, thank you for joining us and for the update on the Apache Beam project. We'll be watching its progress over the next few months.
With that, I'm George Gilbert, here with Kenneth Knowles at the data Artisans Flink Forward user conference in San Francisco at the Kabuki Hotel, and we'll be back in a few minutes.