Live from San Francisco, it's theCUBE. Covering Flink Forward, brought to you by Data Artisans.

Hi, this is George Gilbert. We are at Flink Forward, on the ground in San Francisco. This is the user conference for the Apache Flink community. It's the second one in the US, and it's sponsored by Data Artisans. We have with us Greg Benson, who's Chief Scientist at SnapLogic and also Professor of Computer Science at the University of San Francisco.

Yeah, that's great. Thanks for having me.

Good to have you. So Greg, tell us a little bit about how SnapLogic currently builds its technology to connect different applications, and then talk a little bit about where you're headed and what you're trying to do.

Sure. SnapLogic is a data and app integration cloud platform. We provide a graphical interface that lets you drag and drop components that we call snaps, and you put them together like Lego pieces to define relatively sophisticated tasks, so that you don't have to write Java code. We use machine learning to help you build out these pipelines quickly: we can anticipate, based on your data sources, what you're going to need next, and that lends itself to rapid building of these pipelines.

We have a couple of different ways to execute these pipelines. You can think of a pipeline as a specification of what the processing is supposed to do. We have a proprietary engine that we can execute on single nodes, either in the cloud or behind your firewall in your data center. We also have a mode that can translate these pipelines into Spark code and then execute them at scale, so you can go from small, low-latency processing up to batch processing on very large data sets.

Okay. And you were telling me before that you're evaluating Flink, or doing research with Flink, as another option. Tell us what use cases that would address that the first two don't.

Yeah, good question. I'll just back up a little bit.
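The translation Greg describes, from a declarative pipeline specification down to executable dataflow code, can be sketched roughly as follows. This is a toy illustration in plain Python; the spec format, operator names, and `compile_pipeline` helper are hypothetical, not SnapLogic's actual format or its Spark/Flink code generation.

```python
# Hypothetical sketch: "compiling" a declarative pipeline spec (a list
# of snap-like stages) into one executable function. Illustrative only;
# not SnapLogic's real spec format or backend code generation.

OPS = {
    # each entry: config dict -> a stage that transforms an iterable of rows
    "filter": lambda cfg: lambda rows: (r for r in rows if r[cfg["field"]] == cfg["value"]),
    "select": lambda cfg: lambda rows: ({k: r[k] for k in cfg["fields"]} for r in rows),
}

def compile_pipeline(spec):
    """Turn a spec (list of {op, ...config} dicts) into a single callable."""
    stages = [OPS[stage["op"]](stage) for stage in spec]
    def run(rows):
        for stage in stages:      # chain the stages lazily, like a dataflow
            rows = stage(rows)
        return list(rows)
    return run

spec = [
    {"op": "filter", "field": "country", "value": "US"},
    {"op": "select", "fields": ["name"]},
]
pipeline = compile_pipeline(spec)
data = [{"name": "a", "country": "US"}, {"name": "b", "country": "DE"}]
print(pipeline(data))  # [{'name': 'a'}]
```

The same spec could, in principle, be translated into a different target (a proprietary engine, Spark code, or Flink code), which is the point Greg makes about multiple execution modes.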
Because I have this dual role of chief scientist and professor of computer science, I'm able to get graduate students to work on research projects for credit and then eventually as interns at SnapLogic. A recent project, which we started last fall, so we've been working on it for about six or seven months now, is investigating Flink as a possible new backend for the SnapLogic platform. This allows us to explore and prototype and figure out whether there's going to be a good match between an emerging technology and our platform.

So, to go back to your question, what would this address? Without going into too much of the technical difference between Flink and Spark, which I imagine has come up in some of your conversations here because they can solve similar use cases: our experience with Flink is that the code base has been quite easy to work with, both for taking our specification of pipelines and for converting it into Flink code that can run.

But there's another benefit we see from Flink. Whenever any product, whether it's our product or anybody else's, uses something like Spark or Flink as a backend, there's a challenge, because you're converting something that your users understand into this target, the Spark API code or Flink API code. And the challenge is, if something goes wrong, how do you propagate that back to the user, so the user doesn't have to read log files or get into the nuts and bolts of how Spark really works?

It's almost like you've compiled the code, and now if something doesn't work right, you need to work at the source level.

That's exactly right. And that's what we don't want our users to do. So one promising thing about Flink is that we're able to integrate with the code base in such a way that we have a better understanding of what's happening and of the failure conditions that occur.
And we're working on ways to propagate those back to the user so they can take actionable steps to remedy them, without having to understand the Flink API code itself.

And what is it about Flink or its API that gives you that feedback about errors or operational status, that gives you better visibility than you would get in something else like Spark?

Yeah. Without getting too deep on the subject: one nice thing we have found about the Flink code base is that while the core is written in Scala, all the I/O and memory handling is written in Java, and that's where we need to do our primary interfacing. In the core building blocks that take something you build with, for example, the DataSet API through to execution, we have found it easier to follow the transformation steps that Flink takes to end up with the resulting optimized Flink pipeline. By understanding that transformation, like you were saying, the compilation step, we can work backwards and understand, when something happens, how to trace it back to what the user was originally trying to specify: the GUI specification.

So, help me understand, though. It sounds like you're the one essentially building a compiler from a graphical specification language down to Spark, as the sort of pseudo-compiled code, or Flink. But if you're the one doing that compilation, I'm still struggling to understand why you would have better reverse-engineering capabilities with one.

It's just a matter of getting visibility into the steps that the underlying frameworks are taking. I'm not saying this is impossible to do in Spark, but we have found that it's been easier for us to get into the transformational steps that Flink is taking.

Almost like, for someone who's had as much programming as one semester in night school, a variable inspector that's already there.

Yeah, there you go.
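The error-propagation idea Greg describes, tracing a backend failure back to the GUI-level component the user actually built, can be sketched like this. The `tag` wrapper, `SnapError` class, and snap id are hypothetical illustrations, not SnapLogic's or Flink's actual mechanism.

```python
# Hypothetical sketch: when the pipeline is compiled, each generated
# operator is tagged with the id of the GUI component ("snap") it came
# from, so a runtime failure can be reported against that snap instead
# of forcing the user to read engine logs. Names are illustrative.

class SnapError(Exception):
    def __init__(self, snap_id, cause):
        super().__init__(f"snap '{snap_id}' failed: {cause}")
        self.snap_id = snap_id   # the GUI-level component to highlight

def tag(snap_id, fn):
    """Wrap a compiled operator so any failure carries its snap id."""
    def wrapped(row):
        try:
            return fn(row)
        except Exception as e:
            raise SnapError(snap_id, e) from e
    return wrapped

# A compiled operator that came from a hypothetical "parse-price" snap:
parse_price = tag("parse-price", lambda row: float(row["price"]))

try:
    parse_price({"price": "not-a-number"})
except SnapError as e:
    print(e.snap_id)  # parse-price
```

The design point is that the mapping from generated code back to the user's specification is recorded at compile time, which is only practical if the backend's transformation steps are transparent enough to follow, the property Greg attributes to Flink.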
So you don't have to go add it yourself, and you don't have to infer it from all this log data.

Right. Now, I should add there's another potential benefit of Flink. You were asking about use cases and what Flink addresses. As you know, Flink is a streaming platform in addition to being a batch platform, and Flink does streaming differently than Spark does: Spark takes a micro-batch approach. What we're also looking at in my research effort is how to take advantage of Flink's streaming approach to allow the SnapLogic GUI to be used to specify streaming Flink applications. Initially we're just focused on the batch mode, but now we're also looking at the potential to convert these graphical pipelines into streaming Flink applications, which would be a great benefit to customers who want real-time integration. They want to do what Alibaba and all the other companies are doing, but take advantage of it without having to get into the nuts and bolts of the programming: do it through the GUI.

Wow. So it's almost like Flink, then Beam, in terms of abstraction layers, and then SnapLogic.

Sure, yes.

And not that you would compile to Beam, but the idea that you would have per-event processing and a real-time pipeline. Okay, so that's actually interesting. So that would open up a whole new set of capabilities.

Yeah, and it follows our company's vision of allowing lots of users to do very sophisticated things without being Hadoop developers or Spark developers or even Flink developers. We do a lot of the hard work of trying to give you a representation that's easier to work with, but we also allow you to evolve it and debug it, and eventually get the performance out of these systems.
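The micro-batch versus per-event distinction raised above can be made concrete with a toy running-count example. This is conceptual Python, not the Spark or Flink APIs: micro-batch emits results once per batch, while per-event processing updates on every record, which is what enables lower-latency, real-time pipelines.

```python
# Conceptual contrast (toy code, not Spark/Flink APIs) between
# micro-batch and per-event stream processing of a running count.

def micro_batch_counts(events, batch_size):
    """Emit the running count once per completed batch (Spark-style)."""
    out, count = [], 0
    for i, _ in enumerate(events, 1):
        count += 1
        if i % batch_size == 0:   # results only surface at batch boundaries
            out.append(count)
    return out

def per_event_counts(events):
    """Emit an updated count after every single event (Flink-style)."""
    out, count = [], 0
    for _ in events:
        count += 1
        out.append(count)         # state is updated and visible per record
    return out

events = ["e1", "e2", "e3", "e4"]
print(micro_batch_counts(events, 2))  # [2, 4]
print(per_event_counts(events))       # [1, 2, 3, 4]
```

The trade-off: micro-batching amortizes overhead across a batch, while per-event processing delivers results with per-record latency, the property Greg wants to expose through the GUI for real-time integration.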
One of the challenges, of course, with Spark and Flink is that they have to be tuned. So what we're trying to do, using some of our machine learning, is eventually gather information that can help us identify how to tune different types of workflows in different environments. If we're able to do that in its entirety, then we take out a lot of the really hard work that goes into making a lot of these streaming applications both scalable and performant.

But to do that, you would probably have to collect, well, what's the term, data from the operations of many customers.

Right.

As training data. As the developer alone, you won't really have enough.

Absolutely. You have to bootstrap that. For the machine learning that we currently use today, we leverage the thousands of pipelines and the trillions of documents that we now process on a monthly basis, and that allows us to provide good recommendations when you're building pipelines, because we have a lot of information.

So you are serving these runtime compilations?

Yes.

Oh, they're not hosted on the customer premises?

Oh no, we do both. It's interesting, we do both. You can deploy completely in the cloud, where we're a complete SaaS provider for you. Most of our customers, though, banks, healthcare, want to run our engine behind their firewalls. Even when we do that, though, we still have anonymized metadata that gives us introspection into how things are behaving.

Okay, that's very interesting. All right, Greg, we're going to have to end it on that note, but everyone stay tuned. That sounds like a big step forward in the specification of real-time pipelines at a graphical level.

Yeah, well, I hope to be talking to you again soon with more results.

Looking forward to it. With that, this is George Gilbert.
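The anonymized-metadata idea Greg mentions, keeping structural run statistics while stripping customer identity, could look roughly like this. Every name here (`anonymize_run`, `recommend_parallelism`, the stats fields, the one-million-records-per-task heuristic) is a hypothetical illustration, not SnapLogic's actual telemetry or tuning model.

```python
# Hypothetical sketch: collect only anonymized, structural metadata
# from pipeline runs (operator types, volumes, durations; never the
# customer's data), and use it to drive a toy tuning recommendation.

import hashlib

def anonymize_run(customer_id, pipeline_stats):
    """Reduce identity to a one-way hash; keep only structural stats."""
    return {
        "tenant": hashlib.sha256(customer_id.encode()).hexdigest()[:12],
        "operators": pipeline_stats["operators"],    # e.g. ["filter", "join"]
        "records_in": pipeline_stats["records_in"],
        "runtime_sec": pipeline_stats["runtime_sec"],
    }

def recommend_parallelism(run, records_per_task=1_000_000):
    """Toy heuristic standing in for a learned model: scale with volume."""
    return max(1, run["records_in"] // records_per_task)

run = anonymize_run("acme-bank", {
    "operators": ["filter", "join"],
    "records_in": 5_000_000,
    "runtime_sec": 420,
})
print(recommend_parallelism(run))  # 5
```

With enough such records across many customers, the heuristic could be replaced by a model trained on observed workload-to-configuration outcomes, which is the bootstrapping point made in the conversation.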
We are at Flink Forward, the user conference for the Apache Flink community, sponsored by Data Artisans. We will be back shortly.