Live from the Fairmont Hotel in San Jose, California, it's theCUBE at Big Data SV 2015.

Hello, and welcome back to theCUBE. We're live here in Silicon Valley. I'm John Furrier, the founder of SiliconANGLE, and this is theCUBE. We extract the signal from the noise, go out to the events, and talk to all the folks we can talk to out there: entrepreneurs, engineers, CEOs of big companies. We are here at Big Data SV, our event, in conjunction with the Strata conference and Hadoop World. Our next guest is from Google: Eric Schmidt, the product manager, not the former CEO of Google. Love the name. Big smile over there. So, Eric Schmidt on the schedule; I just couldn't lie. Welcome to theCUBE. Eric Schmidt, product manager of Google's Cloud Dataflow product, obviously in the context of Big Data Week here. Welcome to theCUBE.

Thank you. Thanks.

So obviously we're big fans of what you guys do in the cloud. Love the scale of Google. Love the developer traction. Kubernetes is awesome, a hot product. And obviously in the stack is data, right? You've got to do something with data. A lot of hot stuff going on in memory, in streaming; all this stuff is hot. Application developers want the data. A lot of stuff is being architected. Tell us what's going on with you guys with Dataflow, and how does this all connect in with Google Cloud and what you're doing here?

Sure. So, a little bit of history. We announced at the past Google I/O that we were building a fully managed service for parallelized data processing, called Dataflow. Dataflow is a synthesis of really two main efforts inside of Google. Back in 2004, we released our thoughts on this concept called MapReduce, and then internally continued to build an implementation.

If you had only open sourced MapReduce, Cloudera wouldn't exist. That's a whole other story; we'll go down there in a second.

I was really excited to see just how much momentum, excitement, and innovation has happened in the community. It's been great. We obviously helped create some of that DNA, but the community at large, through the open source process, has really proven itself valuable in producing the broader Hadoop ecosystem.

Yeah, and just for folks watching who don't know the history: Google's DNA is all over the Hadoop world, as is Yahoo's, and Yahoo obviously worked on it more than anyone else. But even before Cloudera became a household name, there was a ton of work done in the open source community within Apache, so that's something that needs to be footnoted, I think.

Yeah. We spent some time in the session earlier this week basically charting that history: our work in MapReduce focused on batch, and then we later released a paper on MillWheel, which is a real-time stream processing system. Dataflow as a product is really a synthesis of these two models. It's basically bridging the batch world with the stream processing world, both as a fully managed service and as a unified SDK. So a data engineer can sit down and not necessarily have to think about whether the data is at rest or in flight, or how to deal with the temporal aspects of the data; we're basically bringing to market a unified development model and a unified, fully managed service.
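To make that unified model concrete, here is a minimal sketch of a pipeline, assuming the Dataflow Java SDK as open sourced on GitHub at the time; the bucket paths are hypothetical placeholders. The same transforms apply whether the input is a bounded file on GCS or an unbounded streaming source.

```java
// Minimal Dataflow pipeline sketch (early Dataflow Java SDK style; paths are hypothetical).
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

public class MinimalPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.Read.from("gs://my-bucket/logs/*"))   // data at rest; a streaming source could sit here instead
     .apply(ParDo.of(new DoFn<String, String>() {        // element-wise transform
       @Override
       public void processElement(ProcessContext c) {
         for (String word : c.element().split("\\s+")) {
           c.output(word);
         }
       }
     }))
     .apply(Count.<String>perElement())                  // aggregation; same code in batch or streaming
     .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().getKey() + ": " + c.element().getValue());
       }
     }))
     .apply(TextIO.Write.to("gs://my-bucket/output/counts"));

    p.run();
  }
}
```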
So, talk about that unification piece. I mean, data in flight, data at rest: these are all well-known concepts, but relative to open source, one of the things Hadoop is growing up fast on is automation and orchestration. It's still kind of a hassle, but just a few years ago a lot of work had to be done just to stand up and manage Hadoop. Then you add in the complexity of data flowing around north, south, east, west, whatever you want to call it. That automation is critical, and at scale. So what are you guys doing to help with that? Is that part of the piece?

Yes. The service aspects enable developers to, in essence, specify the type of boundaries that they would like on their cluster to manage their data. Think of a typical on-premise Hadoop implementation, where you do a lot of architecture design, figure out what type of node structure you want, then buy, acquire, and rack up that cluster. Which is great, but it's not necessarily the best solution if you have a lot of volatile usage, or your business counterparts are constantly calling up the next day and saying, hey, we need more capacity, we need you to be able to expand. So the service aspects, the elastic aspects of the service, enable a developer to specify a set of boundaries: this is the minimum size, and potentially the maximum size, of the cluster. You deploy that cluster, or ask Dataflow to deploy it on your behalf, and we will modulate, expand, and shrink the size of that cluster to keep up with the data rates.

Versus dealing with all the hassles of managing elastic things like the control plane and all the other plumbing.

Exactly, yeah. How do I say it? Hassles.

For developers, that's DevOps; that's a DevOps concept.

It is, but whenever you get into talking to hardcore, traditional MapReduce developers, they're doing that development, but they're also having to put on the DevOps hat. The operations person, or persons, responsible for their NAS or their storage system, those guys are usually pretty close buddies, because they're having to figure out different sharding techniques for different data sizes, and volatile changes in key structure and data. Our goal is to help offset a lot of that work, and in some cases completely remove it. The concept of auto-scaling those clusters is also a benefit in the batch sense, so you can in essence improve clock time. Your boss calls up tomorrow and says, hey, I need this answer, but I need it three hours faster; we give you the ability to expand your cluster easily so that you can get answers faster. In streaming mode, it's really about dealing with ingest rates of data. If you can't process fast enough, then you're going to have lag. Lag then introduces either inaccuracy or unavailability of data.

So who does Dataflow target? Is it targeted at developers? Is it targeted at the DevOps guy? Obviously you're at alpha, in terms of deployment and the development kit. Is it for data jockeys and wranglers? Or all of the above?

The sweet spot right now is really the data processing engineer: the guy or girl whose task it is to do traditional ETL, getting data moved from disparate sources and temporary storage locations.

Like a pipeline kind of thing.

Exactly. Getting that data into their data warehouse. Maybe they're using BigQuery, as an example, to do a lot of interactive analysis, and they have terabytes or petabytes worth of data on GCS. How do you get that data moved?
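A hedged sketch of that classic ETL case: read raw records from GCS, shape them, and load them into BigQuery, with the elastic worker boundaries set on the service. The project, bucket, and table names are hypothetical, and the worker-bound option names are an assumption based on the Dataflow SDK of that era; other required service options, like a staging location, are assumed to arrive on the command line.

```java
// GCS-to-BigQuery ETL sketch with worker size boundaries (all names are hypothetical).
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import java.util.Arrays;

public class GcsToBigQuery {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setProject("my-project");   // hypothetical project
    options.setNumWorkers(3);           // minimum / starting cluster size
    options.setMaxNumWorkers(50);       // upper boundary the service may scale to (assumed option name)

    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("user").setType("STRING"),
        new TableFieldSchema().setName("bytes").setType("INTEGER")));

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.from("gs://my-bucket/raw/*.csv"))
     .apply(ParDo.of(new DoFn<String, TableRow>() {      // shaping step: parse CSV into rows
       @Override
       public void processElement(ProcessContext c) {
         String[] fields = c.element().split(",");
         c.output(new TableRow()
             .set("user", fields[0])
             .set("bytes", Long.parseLong(fields[1])));
       }
     }))
     .apply(BigQueryIO.Write.to("my-project:my_dataset.events").withSchema(schema));
    p.run();
  }
}
```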
Then you get into more advanced movement: filtering, enrichment, and shaping of data. So that's kind of one stack. The second stack is doing analysis, either batch or spot analysis, or doing continuous computation over that data. A use case would be: maybe you are doing classic ETL into your system, but at the same time you have some in-flight data that you would like to continuously monitor. Maybe you want to push it out to a monitoring agent so your broader network operations people can see: this is what's happening with my inbound control plane data, and this is what's happening in real time with user-based data. So you can start correlating those two aspects. The third use case would be someone, typically sitting above those data processing engineers, who is trying to synthesize or orchestrate data processing logic across multiple flows. Maybe you have one process that's all about ingestion of data type A, another process for data type B, data type C, et cetera. Being able to fuse those flows together into something meaningful for the business. So it's really three different personas.

So talk about the open source piece. You guys have an open source component. And also talk about the alpha component; it's application driven, you're just submitting applications, and you approve all of them. Is there a criteria? It seems like there's a black box there. Tell us what's behind the curtain.

So on the open source side, we open sourced our Java 7 implementation. We pushed that out to GitHub right before Christmas; December 17th, I think, was the day. The build that's in GitHub is our build. It's the same build that we are building and promoting with the product. We also have a Python 2 implementation in the works, and whenever we have it feature complete and stabilized, we will open source that as well. Subsequently, several companies have produced Scala ports, which is a relatively straightforward thing to do because they can wrap on top of our Java 7 implementation. We are also seeing some traction on alternate runners. The idea is that you would write your code in Dataflow, but instead of executing it on our managed service, you could execute it on your own self-managed Spark cluster. So Josh Wills over at Cloudera, who was the inventor of Crunch, jumped into the mix and said, hey, I like Dataflow as a programming model, but I'd really like to have it run on Spark, so he's contributed a Spark runner. And the folks over at Data Artisans are building a Flink runner that I think will be out in the next week or two. So that's where we're at, two and a half months in, from an open source perspective. We'd like to see a lot more extensions, input and output support, and different types of transforms that people are sharing.
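That runner portability might look something like this in practice: the pipeline code stays the same, and only the runner selection differs. The direct and Dataflow runner classes are from Google's SDK; the Spark runner class name from Cloudera's spark-dataflow project is an assumption and may differ by release.

```java
// Runner selection sketch: same pipeline, different execution engine.
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner;

public class RunnerChoice {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

    // Local, in-process execution for development and tests:
    options.setRunner(DirectPipelineRunner.class);

    // Google's fully managed service:
    // options.setRunner(com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner.class);

    // A self-managed Spark cluster via Cloudera's community runner
    // (class name assumed): options.setRunner(SparkPipelineRunner.class);

    Pipeline p = Pipeline.create(options);
    // ...the same transforms as in the earlier sketches...
    p.run();
  }
}
```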
I think your other question was about how you get in. If you go to cloud.google.com, you can sign up and apply for the alpha; type in a little bit of information, and myself and some other folks process that on a daily, or every-other-day, basis.

So is there an algorithm you guys have, some filter? Are you looking for keywords? Is it a simple vetting process, or a complex vetting process?

In the beginning it's a controlled alpha; essentially it's directed availability. When we released last summer for EAP, and I think this is common for most projects like this, you are selective, mainly because you're trying to control the pressure back on your engineering resources and the team managing it. You don't want to open the kimono too fast. But we're in the process now of pretty much onboarding everyone that comes through.

We did that with our CrowdChat app for a good year. We limited who could use it, just to see patterns. We didn't want to get killed by a thousand paper cuts of support.

Exactly, and then you also start clustering around issues, and then you can solve them. On the product management side, we feel pretty good about where we're at right now in terms of supportability; we've hammered out most of the usability issues. So yeah, we're actively onboarding.

And what's the beta timetable? Are you looking at Google I/O, or the next event? Any roadmap there? I'm not going to hold you to numbers; it's not like you're going to get punished for not making the number. I mean, come on, is it this year, second half of the year, you don't know?

So I don't have a specific time answer that I can give you. The joke that I made yesterday...

You're working on it.

No, the joke that I made yesterday is: if you look at the timeline, we were doing MapReduce, Flume, Dremel, MillWheel, et cetera, and then we have basically been taking lots of that technology to cloud. We released Dataflow in June of 2014 in EAP mode, and then we went to alpha on December 17th. So if you want to do some estimation on the historical pattern...

You know, I'll work on that tonight. I'll be up all night working on it; I'll break this big story.

We are working as fast...

Compute Engine did the same thing, you guys. It's the standard operating procedure.

We're working as fast as possible to get to market.

So what's the coolest thing that you think is going on? Where do you see the excitement? Is it the Spark piece, is it the pipeline, is it the unification of the programming model, is it open source? And what are people saying? What's the conversation around this?

There are several cool things, or things that really excite me. One is our sophistication, or intelligence, around windowing support. Other programming models have the concept of windows, basically being able to bucket data into windows. We have fixed windows, we have sliding windows, and we've also implemented the concept of session-based windows. So if you don't necessarily know a specific key, you can ask the system to look at streams of data and attempt to infer a start point and an end point for a particular session. Which is great for log processing: you may not necessarily know what the user ID is, but you can look at a pattern. So there's some sophistication there. We've also implemented the concepts of triggers and watermarks, which enable a developer to specify custom timestamps for input data. This is something that Spark developers are challenged with a lot today in streaming mode, where Spark looks at input data and timestamps it as it arrives in the system. Which is fine for a lot of systems, but if you want to correlate back to the actual event time, say on a mobile device or something like that, you potentially have some time drift there.
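A sketch of the windowing, trigger, and watermark concepts just described, in the style of the Dataflow Java SDK. The log format, field position, and durations are hypothetical; the point is assigning event-time (source) timestamps rather than arrival-time stamps, inferring session windows from gaps in activity, and tolerating late-arriving data.

```java
// Event-time windowing sketch: session windows plus a watermark trigger with allowed lateness.
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.windowing.AfterWatermark;
import com.google.cloud.dataflow.sdk.transforms.windowing.Sessions;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.PCollection;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class WindowingSketch {
  static PCollection<String> windowByEventTime(PCollection<String> logLines) {
    return logLines
        // Stamp each element with the time the event happened at the source
        // (say, on the mobile device), not the time it arrived in the system.
        // Note: moving timestamps backwards in the real SDK may also require
        // overriding DoFn.getAllowedTimestampSkew().
        .apply(ParDo.of(new DoFn<String, String>() {
          @Override
          public void processElement(ProcessContext c) {
            long eventMillis = Long.parseLong(c.element().split(",")[0]); // hypothetical log layout
            c.outputWithTimestamp(c.element(), new Instant(eventMillis));
          }
        }))
        // Session windows: infer start and end points from gaps in activity,
        // useful when logs carry no explicit user or session key.
        // FixedWindows.of(...) and SlidingWindows.of(...) cover the other cases mentioned.
        .apply(Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(10)))
            // Fire when the watermark says the window should be complete...
            .triggering(AfterWatermark.pastEndOfWindow())
            // ...but keep accepting data that arrives up to an hour late.
            .withAllowedLateness(Duration.standardHours(1))
            .accumulatingFiredPanes());
  }
}
```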
I'm super excited that we have this ability to deal with arrival time and/or source time, and then give the developer the ability to control how they want to deal with late-arriving data. It begins with the windowing support, and then we have these concepts of triggers and watermarks. You wrap them all together, and at the end of the day we provide a high level of tuning to dial in the correctness of data. Real-time streaming is hard, because you never know whether you have all of your data, and sometimes you may never get all of it. So then the question is, how do you deal with late-arriving data? That's something I'm really excited about.

It reminds me of the old days of TCP/IP and packets moving through networks. It's a network packet problem; you've got to know the flows. It's also the contextual data instrumentation you need. It's really complex. Well, Eric, thanks for coming on theCUBE. We really appreciate it. We've got to wrap up the segment; we're getting the hook here from the folks on the timetable. Final word: what do you think about the show here this week? What's the vibe? What have you learned? What's the big aha, in terms of the vector we're on, the navigation, the path to the future? What's happening here?

I see a massive shift of people moving as fast as possible from batch to real time, and then the extension of ML applied to real time. I'm extremely excited about that. I've played in the batch processing space for a long time, but always wanted to be in real time.

I have to ask this, because you know that I hate the term data lake. I think it symbolizes slowness, and batch; a lake is a batch concept. The ocean, like the Pacific, is always moving, always changing, and there are different currents, different ecosystems. That's really the big data world of the future; you're going to have that kind of current. It might not stick, not that I care. But in terms of data, it's complex. In real time you're dealing with that kind of unknown; at any given time, a rip current of new things could be streaming. What's your take on that? Do you see it that way? You don't have to agree with me, but that kind of real-time complexity can arrive just as fast as the benefits.

Exactly. And we're seeing that from our customer base, whether it's small orgs, start-ups, or very large orgs; they're all dealing with these types of problems. One day they feel like they have some type of consistency around their data processing, and boom, the next day another part of the organization launches a new game that exponentially increases their data rates. Or they want to process new data coming from their marketing organization, et cetera. This isn't going to stop. And in order to get real intelligence out of their ML processes, they're going to need to continue to aggregate.

In an ocean, which feeds hurricanes, there are tsunamis, there are currents; a lot of unknown circumstances can develop. So in your Dataflow vision, what do we need to build, from an infrastructure, intelligence, and software standpoint, to make that predictable? So that at any given time circumstances can change, and the data has to be adaptive and reactive in real time.

Yes.

Is there any vision on what needs to be in place?

I think what you're talking about is actually the core of our fully managed service. The faster you can get away from having to deal with the operational aspects of scaling, the more time you can spend actually dealing with the data. You're still going to have to shape it.
You're still going to have to process it. You're still going to have to run it through different algorithms to pull out the types of data that you want, and then potentially train some other ML implementation. That's where I think the real magic is going to happen.

That's a lot of new code ideas in there. AI, all this new stuff that people are talking about, really good automation. It's a boon for computer science. It's just going to be magical. Using an analogy: you can't put your arms around the ocean, and even if you could, could you hold on to it long enough as more water comes into it? But this is what's happening with these entities, whether big or small. If you can do that, then you have control over it, and you can start doing something with it. You can know what not to do. If it's a tsunami coming, or a big storm, you don't play in those waters, or you adjust appropriately. So to me, I look at currents and streaming and the river, these kinds of concepts that have been network concepts, and it's always moving. The lake is batch. Again, I'm ranting on my data oceans. Eric Schmidt here, validating my data ocean vision. Thank you; you're the first person to support my data ocean, at least in principle.

I'll put it on my next slide.

This is theCUBE; we'll be right back with the next guest. Eric Schmidt from Google Dataflow. Great product, great company in terms of scale, really leading the way. Again, the DNA is all over Hadoop, and it's well known, so congratulations. Always good stuff coming out of Google, always fun. We'll be right back after this short break.