Live from Boston, Massachusetts, it's theCUBE, covering Spark Summit East 2017, brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert.

Welcome back to snowy Boston, everybody. This is theCUBE, the leader in live tech coverage. This is Spark Summit, Spark Summit East, hashtag Spark Summit. Brian Duxbury is here, he's the vice president of engineering at StreamSets. Cleveland boy, welcome to theCUBE.

Thanks for having me.

You're very welcome. Let's start with StreamSets. We're going to talk about Spark, some of the use cases it's enabling, and some of the integrations you're doing, but first, what does StreamSets do?

Sure. StreamSets is data movement software. I like to think of it as either the first mile or the last mile of a lot of different analytical or data movement workflows. We build a product that lets you build a data pipeline without having to code: a graphical user interface for dropping an origin, several destinations, and lightweight transformations onto a canvas. You click play and it runs. That's different from a lot of the market today, which is programming tools or command-line tools that still require your systems engineers, or your unfortunate data scientists pretending to be systems engineers, to do a science project to figure out how to move data. The challenge of data movement is often underplayed, but it's extremely tedious work. You have to connect to dozens or hundreds of different data sources with totally different schemas, different database drivers, or different systems altogether, and it breaks all the time. So the home-built stuff is really challenging to keep online, and when it goes down, you're not moving data, and you can't get the insights you built the thing for in the first place.

I remember when I broke into this industry, in the days of mainframes, you used to read about the high-speed data mover. It was this key component: it had to be integrated, and it had to be able to move what was, back then, a large amount of data, fast. Today, especially with the advent of Hadoop, people say, okay, don't move the data, keep it in place. Now, that's not always practical. So talk about the business case for starting a company that basically moves data.

Well, we handle basically the one step before Hadoop. I agree with you completely: in many analytical situations today, where you're doing the true business-oriented ETL, where you're actually analyzing data and producing value, you can do it in place, which is to say in your Hadoop cluster, in your Spark cluster, in Vertica, in all the different environments you can imagine. The problem is that if the data isn't there already, it's a pretty monumental effort to get it there. A lot of people think, oh, I can just write a SQL script, and that works for the first two to 20 tables you want to deploy. But, for instance, in my background, I used to work at Square; I ran the data platform there. We had 500 tables that we had to move on a regular basis, coupled with a whole variety of other data sources. At some point, it becomes really impractical to hand-code these solutions. And even when you build your own framework, when you start to build tools internally, it's not really your job at these companies to build a world-class data movement tool.
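To make the hand-coding pain concrete: below is a minimal Python sketch of the kind of one-off table mover teams end up writing. The table names, paths, and driver are hypothetical stand-ins; a real version would also need per-source drivers, type mapping, retries, and schema-change handling, which is exactly where the tedium compounds.

```python
# A hypothetical, hand-rolled table mover: one script per source, per quirk.
import csv
import sqlite3  # stand-in for whichever driver each real source needs

TABLES = ["payments", "refunds", "merchants"]  # in practice: hundreds

def export_table(conn, table, out_path):
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]  # silently drifts if columns change
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(cols)
        writer.writerows(cur)  # a cursor iterates as rows of tuples

def main():
    conn = sqlite3.connect("source.db")  # every new source: new driver, new auth
    for table in TABLES:
        export_table(conn, table, f"/data/landing/{table}.csv")

if __name__ == "__main__":
    main()
```

Multiply this by hundreds of tables and dozens of source systems, each with its own quirks and failure modes, and the maintenance burden quickly dwarfs the original script.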
It's your job to make the data valuable, right? Data movement is like a utility, and to provide that utility, the thing you really need to be is productive and cost-effective. So the reason we built StreamSets, the reason this thing exists in the first place, is that we think people shouldn't be in the business of building data movement tools. They should be in the business of moving their data and then getting on with it, right? Does that make sense?

Yeah, absolutely. Now, talk about how it all fits in with Spark generally, and specifically with Spark coming to the enterprise.

Well, in terms of how StreamSets connects to things, we deploy in every way you can imagine, whether you want to run it on premises, on your own machines, or in the cloud; it's up to you to deploy it however you like. We're not prescriptive about that. We often get deployed on the edge of clusters, whether it's your Hadoop cluster or your Spark cluster, and basically we try not to get in the way of the analysis tools. There are many great analytical tools out there; Spark is a great example. We focus really on the moving of data. So what you'll see is someone will build a Spark Streaming application or some big Spark SQL job that actually produces their reports, and we plug in ahead of that. If your data is being collected from edge web logs or some Kafka topic or a third-party API, or you're scraping a website, we do the first collection, and then it's usually picked up from there by the next tool, whether that's Spark or something else. I think people who write Spark should focus on the part that's the business value for them. They should be doing the thing that actually applies the machine learning model or produces the report the CEO or CTO wants to see, and move away from the ingest part of the business. Does that make sense?

Yeah. The Spark guys sort of aspire to that by saying you don't have to worry about exactly-once delivery; you've got guarantees that data will get from point A to point B, things like that. But all those sources of data and all those targets, writing all those adapters, I mean, that's been a La Brea tar pit for many companies over time.

In essence, that is our business, right? You touch on a good point: Spark can actually do some of these things. There's not complete but significant overlap in some cases. But the important difference is that Spark is a cluster tool for working with cluster data, and we're not going to beat you writing a Spark application for consuming from Kafka to do your analysis. But do you want to use Spark for reading local files? Do you want to use Spark for reading from a mainframe? These are the things StreamSets is built for, and that library of connectors you're talking about is our bread and butter. It's not your job as a data scientist applying Spark to build a library of connectors. And actually, the challenge is not the difficulty of building any one connector, because we have that down to an art now; it's that we can afford to invest. We can build a portfolio of connectors, but you as a user of Spark can only afford to do it on demand, reactively. So the turnaround time, the cost it might take you to build that connector, is pretty significant. And actually, I often see the flip side.
This is a problem I faced at Square: people asked me to integrate new data sources, and I had to say no, because it was too rare, too unusual for what we had to do; we had other things to support. The problem with that is that I have no idea what kind of opportunity cost that left behind, what kind of data we didn't get, what kind of analysis we couldn't do. With an approach like StreamSets, you can solve that problem up front.

So, two follow-ups. One, it would seem to be an evergreen effort to maintain the existing connectors?

Certainly.

And two, is there a way to leverage connectors that others have built, like the Kafka Connect type of stuff?

Well, truthfully, we are a heavy-duty user of open source software. Our actual product, if you dig into it, is a framework for executing pipelines and an SDK for connecting other software into our product. So it's not like when we integrate Kafka we build a brand-new, blue-sky Kafka connector; we actually integrate what's out there. Our idea is to bring as much of that in as we can and really be part of the community. Our product is also open source, so we play well with the community. We have had people contribute connectors, people who say, we love the product, we need to connect to this other database, and then they do it for us. So this has been a pretty exciting situation.

We were talking earlier off camera, and George and I have been talking all week, about batch workloads, interactive workloads, and now the new emerging workloads, continuous streaming workloads, which is in the name. What are you seeing there, and what kind of use cases is that enabling?

Yeah, so we're focused mostly on the continuous delivery workload; we also do a little bit of the batch stuff. We're finding people are moving farther and farther away from batch in general, because batch was never the goal; it was a means to an end. People wanted to get their data into their environment so they could do their analysis, run their daily reports, things like that. But ask any data scientist: they would rather the data show up immediately. So we're definitely seeing a lot of customers who want to do things like moving data live from a log file into Hadoop so they can read it in Impala almost immediately, on the order of minutes. We're trying to do our best to enable those kinds of use cases. In particular, we're seeing a lot of interest in the Spark arena; obviously, that's kind of why we're here today. People want to add their complex event processing or their aggregation and analysis, a Spark job, especially Spark SQL, and they want that to be happening almost at the time of ingest, not once the data has landed but while it's happening. So we're starting to build integration. We have our initial foot in the door there with our Spark processor, which allows you to put a Spark workflow right in the middle of your data pipeline, or as many of them as you want, in fact, and we'll manage the lifecycle of that and make all the connections required for your pipeline to have a Spark processor in the middle.
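As an aside, here is a minimal PySpark sketch of the kind of in-flight aggregation such a mid-pipeline Spark stage might perform. This is plain Spark SQL, not the StreamSets Spark processor API, which isn't spelled out in the conversation; the input path, field names, and windowing are all hypothetical.

```python
# A small Spark SQL job of the sort you might run against data in flight:
# per-minute request counts and error rates, computed before the data lands.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("inflight-aggregation").getOrCreate()

# In a real pipeline, the ingest tool hands Spark each batch of records;
# a JSON file of web-log events stands in for that handoff here.
events = (
    spark.read.json("/data/landing/events.json")
    .withColumn("ts", F.col("timestamp").cast("timestamp"))
)

summary = (
    events
    .groupBy(F.window("ts", "1 minute"), "endpoint")
    .agg(
        F.count("*").alias("requests"),
        F.avg(F.when(F.col("status") >= 500, 1).otherwise(0)).alias("error_rate"),
    )
)

# Landed summaries are immediately queryable by downstream reporting tools.
summary.write.mode("append").parquet("/data/analytics/endpoint_summary")
```

The point of embedding something like this mid-pipeline, as described above, is that the aggregate exists the moment the raw data arrives, rather than after a nightly batch.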
We really think that with that kind of workload you can do your ingest but also capture your real-time analytics along the way. That doesn't replace batch reporting per se, which will happen after the fact, your daily reports or what have you, but it makes it that much easier for your data scientists to add a piece of intelligence in flight.

I love talking to somebody who's a practitioner who's now working for a company that's selling technology. From both perspectives, what do you see Spark being good at? What's the best fit, and what's it not good at?

Well, I think Spark is following the arc of Hadoop, basically. It started out as infrastructure for engineers building really big, scary things, but it's becoming more and more a productivity tool for analysts, data scientists, and machine learning experts. We see that popping up all the time, and frankly, it's really exciting to think about these streaming analytics, this scoring of machine learning models, bringing a lot more power into the hands of people who are not engineers, people who are much more focused on the semantic value of the data and not the garbage-in, garbage-out value of the data.

You were talking before about how hard data movement is, and the data's not always right, you know? Data quality continues to be a challenge. Maybe comment on that: the state of data quality and how the industry is dealing with that problem.

It is hard, it is hard. I think the traditional approach to data quality is to try to specify quality up front, and we take the opposite approach. We basically say it's impossible to know that your data will be correct at all times, so we have what we call schema drift tools. We take more of an intent-driven approach to interacting with your data, rather than a schema-driven approach. Your data has an implicit schema as it's passing through the pipeline, so rather than saying, transform column three, we want you to use the name. We want you to be aware of what it is you're actually trying to change, and then the rest just flows along with it. There's no magic bullet for every kind of data quality issue or schema change that could possibly come into your pipeline; we try our best to make it easy for you to do the best practice, the thing that will survive in the future, and build robust data pipelines. And this is one of the biggest challenges with homegrown solutions: it's really easy to build something that works. It's not easy to build something that works all the time. It's very easy not to imagine the edge cases, because it might take you a year until you actually encounter the first big problem, the gotcha you didn't consider when you were building your own thing. Those of us at StreamSets who have been in industry on the user side have had some of these experiences, and we're trying to export that knowledge into the product.

Who do you guys sell to?

Everybody. We see a lot of success today with what we call Hadoop re-platforming: people who are moving from their huge variety of data sources into a Hadoop-style data lake kind of environment.
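Going back to the schema drift point for a moment: here is a toy Python sketch contrasting positional field access, which breaks when a column is added, with name-based access, which survives the drift. The record layouts are hypothetical, and this is a generic illustration of the intent-driven idea, not StreamSets code.

```python
# Toy records: the source added a "currency" column between versions,
# shifting every later field by one position.
v1 = ["2017-02-08", "42.50", "boston"]          # date, amount, city
v2 = ["2017-02-08", "USD", "42.50", "boston"]   # date, currency, amount, city

def amount_by_position(row):
    return float(row[1])  # keyed to a position, so drift breaks it

# Intent-driven access: name the field you mean, not where it sits.
v2_named = {"date": "2017-02-08", "currency": "USD",
            "amount": "42.50", "city": "boston"}

def amount_by_name(record):
    return float(record["amount"])  # unaffected by added or reordered fields

print(amount_by_position(v1))       # 42.5, works on the old layout
try:
    print(amount_by_position(v2))   # float("USD") raises ValueError
except ValueError as err:
    print("positional access broke after drift:", err)
print(amount_by_name(v2_named))     # 42.5, survives the schema change
```

The failure here is at least loud; the nastier version is when the shifted position still parses and you quietly load wrong numbers for a year, which is the "gotcha" described above.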
Also cloud: people who are moving into the cloud need a way for their data to get from wherever it is to where they want it to be. Certainly, people could script these things manually; they could build their own tools for this. But it's just so much more productive to do it quickly in a UI.

Is it an architect who's buying your product? Is it a developer?

It's a variety. Our product resonates greatly with the developer, but also with people higher up the chain, people who are trying to design their whole topology. The thing I love to talk about is that everyone, when they start a data project, sits down and draws this beautiful diagram with boxes and arrows that says, here's where the data's going to go, right? But a month later, it works, kind of, but it's never that thing, right? And so...

Yeah, because the data is just everywhere.

Exactly. And the reality of what you have to do to make it work correctly, within SLA guidelines and things like that, is so not what you imagined. But then you can almost never go backward. You can never say, based on what I have, give me the boxes and arrows, because that's a systems analysis effort that no one has the time to engage in. But since StreamSets actually instruments every step of the pipeline, and we have a view into how all your pipelines fit together, we can give you that. We can just generate it. So beyond the StreamSets Data Collector we've been talking about, which is the core data movement product, we have our enterprise edition, called Dataflow Performance Manager, or DPM. It gives you collaboration, enterprise-grade authentication and access control, and command and control features. It aggregates all your metrics across all of your data collectors and helps you visualize your topology. So for people like your director of analytics or your CIO, who want to know, is everything okay? We have a dashboard for them now, and that's really powerful. It's a beautiful UI, and it's really a platform for us to build visualizations and more intelligence that looks across your whole infrastructure.

Green is good.

Yeah. And the thing is, this is strangely kind of unprecedented, because again, the engineer who wants to build this themselves will say, well, I can just deploy Graphite and all of a sudden I've got graphs, it's fine, right? But they're missing the details. What about the systems that aren't under your control? What about the failure cases? These are the things we tackle, and because it's our business, we can afford to invest massively and make this a really first-class data engineering environment.

Would it be fair to say that Kafka, as it exists today, is just data movement built on a log, but that it doesn't do the analytics, and it's only beginning to do some of the monitoring with a dashboard, or that's a statement of direction? Would it be fair to say that you can layer on top of that, or substitute on top of it, with all the analytics, and when you want the really fancy analytics, you call out to Spark?

Well, sure. I would say, for one thing, we definitely want to stay out of the analytics space; we think there are many great analytics tools out there, like Spark. We're also not a storage tool.
In fact, we're queue-like, but we view ourselves more like this: if there's a pipe and a pump, we're the pump, and Kafka is the pipe. From a monitoring perspective, we monitor Kafka indirectly: if we know what's going in and what's coming out later, we can give you the stats, and that's actually what's important. This is one of the challenges of having a homegrown or disconnected solution: stitching it together so you understand the end-to-end is extremely difficult. If you have a relational database and a Kafka and a Hadoop and a Spark job, sure, you can monitor all those things; they all have their own UIs. But if you can't understand what the throughput is on the whole system, you're left with four windows open, trying to figure out where things connect, and it's just too difficult.

So, just from a positioning point of view, for someone who's trying to make sense of all the choices they have: to what extent would you call yourself a management framework for someone who's building these pipelines, whether from scratch or buying components? And to what extent is it, I guess, well, when you talk about a pump, that would be almost the runtime part of it. So there's a control plane, and then there's, I guess, a data plane. What's the mix?

Yeah, well, we do both, for sure. The data plane for us is the StreamSets Data Collector: we move data, we physically move the data. We have our own internal pipeline execution engine, so it doesn't presuppose any other existing technology; it's not dependent on Hadoop or Spark or Kafka or anything. To some degree, Data Collector is also the control plane for small deployments, because it gives you start and stop, command and control, some metrics monitoring, things like that. Now, when people need to expand beyond the realm of a single data collector, when they have enterprises with more than one business unit or data center or security zone, you don't just deploy one data collector, you deploy a bunch, dozens or hundreds. In that case, that's where Dataflow Performance Manager again comes in as the control plane. Dataflow Performance Manager has no data in it; it does not pass your actual business data. But it does aggregate up all of your metrics from all your data collectors and give you a unified view across your whole enterprise.

And one more follow-up along those lines: when you have a multi-vendor stack, a multi-vendor pipeline, what gives you the meta view?

Well, we're at the ins and outs, so we see the interfaces. In theory, if someone were to consume data out of Kafka, do something, write it to HDFS, and then there's another job later, like a Spark job, we don't have automatic visibility into that. But our plan for the future is to expand Dataflow Performance Manager to take third-party metric sources, effectively, to broaden the view of your entire enterprise.

You've got a bunch of stuff on your website here, which is kind of interesting, talking about some of the things we talked about. Taming data drift is one of your papers, the silent killer of data integrity, and some other good resources. So, just in closing, how do we learn more? What would you suggest?

Sure, yeah, please visit the website. The product is open source and free to download. Data Collector is free to download.
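As a rough illustration of the unified-view idea discussed above, here is a hedged Python sketch that derives end-to-end throughput and loss from per-stage record counters reported by each pipeline. The pipeline names, counter fields, and numbers are all hypothetical; this is a conceptual sketch of metric aggregation, not the DPM API.

```python
# Hypothetical per-stage counters reported by each data collector.
# Aggregating them yields an end-to-end view no single tool's UI shows.
pipelines = {
    "weblogs-to-hadoop": [
        {"stage": "kafka-origin", "records_out": 1_000_000},
        {"stage": "mask-pii",     "records_out": 1_000_000},
        {"stage": "hdfs-dest",    "records_out": 999_400},
    ],
    "mainframe-to-cloud": [
        {"stage": "mainframe-origin", "records_out": 250_000},
        {"stage": "s3-dest",          "records_out": 250_000},
    ],
}

for name, stages in pipelines.items():
    first, last = stages[0], stages[-1]
    lost = first["records_out"] - last["records_out"]
    pct = 100.0 * last["records_out"] / first["records_out"]
    print(f"{name}: {last['records_out']} of {first['records_out']} "
          f"delivered ({pct:.2f}%), {lost} dropped or errored")
```

This is the "four windows open" problem in miniature: each stage's counter lives in a different system's UI, and the delivered percentage only exists once something collects them into one place.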
I would encourage people to try it out; it's really easy to take for a spin. And if you love it, check out our community. We have a very active Slack channel and a Google Group, which you can find from the website as well, and there's also a blog full of tutorials.

Yeah, and you're solving gnarly problems that a lot of companies just don't want to deal with, so that's good. Thanks for doing the dirty work, we appreciate it.

Yeah, my pleasure.

All right, Brian, thanks for coming on theCUBE.

Thanks for having me.

You're welcome. All right, keep right there, everybody. We'll be back with our next guest. This is theCUBE. We're live from Boston, Spark Summit, Spark Summit East, hashtag Spark Summit. Right back.