From San Jose in the heart of Silicon Valley, it's theCUBE, covering Big Data SV 2016.

Hi everyone, welcome back. This is George Gilbert. We're at Big Data Silicon Valley, which is running alongside Strata and Hadoop World. And we have a very special guest with us today, Kostas Tzoumas. For those in the know, he's the key guy behind Data Artisans, which is the company behind Flink, which is on everyone's tongue, sort of where Spark was a couple of years ago. Welcome, Kostas Tzoumas.

Hi. Nice to be here, George.

So tell us, give us a little background on the company. How it got started, the mission, where you are in funding, so we get a sense of where you are in the life cycle.

Yeah. So the company is actually pretty new. We started Data Artisans about one and a half years ago, when we saw the need for a very high-performance stream processor in the market. There was a certain gap, at least in the open source space. So we started one and a half years ago with a seed financing round from a European VC, b-to-v Partners. And actually, yesterday, we announced our Series A round led by Intel Capital.

Intel Capital. Okay, they don't get much more blue chip than that. So, all right, tell us. Stream processing has been on the top of everyone's mind.

Absolutely.

Let's start at the top. Why is stream processing so topical now? One, what's made it possible for it to be such a focus, and then why is it so relevant?

So if you ask me, actually, I wonder why stream processing hasn't been around for a long time. To me, stream processing enables the obvious, which is continuous analytics on data that is produced continuously. If you look at most interesting data sets out there, they're not static. They don't have a beginning and an end, right? They are unbounded, with new records added and new events coming in all the time. Look at user behavior, click streams, logs, readings from sensors, et cetera.
So the very logical thing is to do the analytics on these continuous data sets continuously. Now, the funny thing is that until recently, the assumption of all the tools and the infrastructure was that data has a beginning and an end, what people used to call batch. I think what is different now is that, number one, the rate of change is much higher, especially with new data sets coming in like the Internet of Things and lots of user activity, and a lot of companies want to move to real-time decisions on demand. They really want to give answers to their customers immediately, without delay.

Is it fair to say that we've sort of had two modes of computing for 50, 60 years, like batch or request-response, meaning online?

That's right.

So this need for continuous processing, it's been around for a while. What made it really take off? Was there some catalyst for it that you can identify?

Like every technology, a lot of companies have been doing this for a while, so a lot of this comes from companies like Google or LinkedIn, Twitter, Amazon, et cetera, that have embraced this for a while. Banks and telcos have been doing this for a while, and now the rest of the world is following, so stream processing is becoming mainstream across the enterprise. And I think one of the key factors is the maturity of the open source software for stream processing. Until now, the open source tools were not as mature, if you wish, as their batch processing counterparts, but this is actually changing. There are a lot of open source projects out there, including Apache Flink, which we're developing, Apache Kafka, and Apache Beam, which recently came out of Google, that are pushing this level of maturity to the point that they can cover all data processing with streaming.
So, okay, give us a nice concise example of how you would go about using stream processing in conjunction with a broader application, and then what that might look like if you go about it with batch processing, because programmers would know that, but maybe not all of our audience would.

Sure. I would like to take an extremely simple example. I will actually be talking about this at the Strata conference right after I leave here, and that is counting. We want to count things, things like social media analytics, user behavior, aggregations on infrastructure metrics like logs, et cetera. So let's say that we want to do continuous counting. Doing that with batch tools is possible, and people have been doing this with Hadoop and Spark and other tools, but it's very problematic. You need to glue together a bunch of systems, you need to write out files, and then you schedule a job that hopefully starts when this file is ready and hopefully finishes before the next file is ready, et cetera. So it's very brittle. There are too many moving parts. The boundaries between the batches that you schedule are very rigid, and it's unclear what belongs in the first batch and what belongs in the second. This becomes just natural and simple with stream processing, because the only thing you need is a stream processor, and all the important code about the application is in the stream processing program. It's not scattered around a bunch of systems and DevOps tooling, et cetera.

Okay, so you've set us up with a good example. Now tell us, the whole world's now focused on Spark. It's the shiny new toy, and within Spark, everyone's focused on Spark Streaming as the answer to this need for continuous computation and analysis. Help position Flink against Spark. What are the sweet spots? When should you use one? When should you use the other?
Or is Flink so much better that people should sort of migrate to it?

Well, I'm obviously biased, right?

That's okay.

So the way I see it is the following. Both Spark and Flink share a commonality in that they're both very broad. They want to cover a lot of types of analytics. In Flink, we're not doing only stream processing. We're also doing batch processing, a bit of machine learning, and graph analytics. Spark, the same thing. So this is the commonality. But that said, each project has its really strong points, and the market usually focuses on the strong points. Flink is a true stream processor in the sense that you can do event-at-a-time analytics with very low latency, a few milliseconds. You can get an event and immediately publish an alert on it. You can do very, very rich stream analytics that actually take into account that the streams you are getting do not arrive in order. These are things that you cannot easily do with other stream processors. So I would say the really strong point for Flink, and where we see most of the adoption, is companies for which streaming is not just a little add-on. If you're doing a lot of Spark and you want to do a little bit of ingestion, then you probably would like to do that in the same framework. But if stream processing is really at the core of your infrastructure, then you really need a proper high-performance stream processor.

If people are really going to take advantage of continuous processing and stream processing, do they need to rethink how they build their applications end to end? Or is it just a matter of how one part communicates with another part?

So there is a level of rethinking in the infrastructure, from storage to compute. There are two main differences. With streaming, you embrace the fact that, let's say, the ground truth is the history of the world.
So it is all the events that have come in since the beginning until now. It is not, let's say, the state of the world. You're not trying to capture the state of your whole business right now. Your ground truth is the history, and then every isolated application, microservice, et cetera, builds its own local state. That is the first thing. The second thing you need to embrace with streaming is the explicit management of time. A lot of these things are time series, so inside the application code you actually need to define things like time windows and say, I would like to aggregate the last 10 minutes' worth of data. This is now part of your application code; you're managing it in the application. I would say that these are the two main things. So it will take a little bit of education for people to get there, but it is happening. And once people do that, they realize that it is actually a very natural way to interact with data.

Okay. So, in the last 20 seconds that we have, tell us what we should expect from Flink and Data Artisans over the next 18 to 24 months.

All right. Yeah, that's a very good question. So we're working on a lot of stuff in Flink. For example, we're now finally working on adding SQL on both static and continuous data sets. We're working on making Flink programs scale dynamically and respond to what is happening in the input streams, in the world. You can see all that stuff on the roadmap; there are a lot of features. And, you know, from Data Artisans, we're growing a lot. We're growing the team. Our primary location is in Berlin, Germany, and we have people here as well, so we're growing both locations, Berlin and San Francisco. We are going to offer a lot more to Flink users and generally take the project way, way forward.

All right. Kostas Tzoumas, great to have you here. Flink and Data Artisans, I'm sure we'll be hearing a lot more from you.
We're in San Jose at Big Data Silicon Valley, running concurrently with Strata and Hadoop World. This is George Gilbert. We'll be right back after a short break.
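The continuous counting and time-window ideas discussed in the interview can be sketched in a few lines of plain Python. This is illustrative only, not Flink's API; the function name and the click data are made up for the example. The point it demonstrates is the event-at-a-time model: each event updates a running aggregate the moment it arrives, and because windows are assigned by the timestamp carried in the event (event time) rather than by arrival order, an out-of-order event still lands in the correct window.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Continuously count events per key in fixed event-time windows.

    `events` is any iterable of (timestamp, key) pairs -- an unbounded
    stream works the same as a finite list, because each event is
    processed as it arrives rather than after the whole set is collected.
    Emits (window_start, key, running_count) after every event.
    """
    counts = defaultdict(int)  # (window_start, key) -> running count
    for ts, key in events:
        # Assign the event to its window by event time, not arrival order.
        window_start = ts - (ts % window_seconds)
        counts[(window_start, key)] += 1
        yield (window_start, key, counts[(window_start, key)])

# Hypothetical click events, counted per user in 600-second (10-minute)
# windows. The last event arrives out of order but carries timestamp 7,
# so it is still counted in the first window.
clicks = [(5, "alice"), (42, "bob"), (615, "alice"), (7, "alice")]
results = list(tumbling_window_counts(clicks, 600))
```

Doing the same with batch tools would mean writing events to files, scheduling a job per window, and hoping the file boundaries line up with the window boundaries; here the window logic lives entirely in the application code, which is the point Tzoumas makes about time becoming explicit in the program.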