Live from San Jose, California, in the heart of Silicon Valley, it's theCUBE, covering Hadoop Summit 2016. Brought to you by Hortonworks. Here's your host, George Gilbert.

Good afternoon, this is George Gilbert. I'm your host right now at Hadoop Summit, San Jose, the premier Hadoop conference, in the middle of the summer. So I'm scoping it down a little bit. My guests for this segment are Tendu Yogurtcu (Hi, George) of Syncsort, and Scott Gnau of Hortonworks. They announced something a couple of months ago at Hadoop Summit in Europe, and they're going to give us an update on the partnership between the two companies. Tendu, why don't you start?

Sure. We are very excited that we have a new agreement with Hortonworks, tightening our partnership. Hortonworks will be reselling Syncsort DMX-h for ETL and data onboarding on the Hortonworks Data Platform: basically bringing all data from enterprise data sources, from legacy mainframe to relational data stores to emerging data repositories like Hadoop, as well as streaming, and integrating and onboarding users easily on Hadoop.

Scott, you're obviously the champion on the Hortonworks side. When you looked at the field of competitors, which seems fairly large, what jumped out at you?

Sure. Well, there are a couple of things, George. One, we're highlighting the messaging here at Hadoop Summit about this notion of connected data platforms, and the fact that businesses really require access to all of the data, all of the time: whether it's data created in legacy mainframe systems, data created at the edge on new devices, or anywhere in between. All of the data, at the right time, so that they can make meaningful decisions at the right point in the customer interaction experience. As part of that, it's obviously really important to us to find ways to make it easier for our customers to get access to all of their data in their data lakes and their implementations of the larger Hadoop ecosystem. So we looked around, and we came upon the product from Syncsort, DMX-h. We really like what it enables. It's kind of an easy button for onboarding, especially for legacy data, where a lot of the work on provenance, structure, and metadata has already been done. Being able to pick that up seamlessly and get it into the data lake, so it can be accessed and integrated into other kinds of analytics, that easy button is really what it creates. It runs natively on the cluster, it's certified to work with Hortonworks, and our customers have been asking us for these kinds of solutions to make their journey a little bit easier. So it was kind of a win-win.

So, a couple of things to unpack in there. One thing that sounded important: in the source data repositories or streams, there is metadata that matters for governance and provenance. How is that brought forward in a way that fits with whatever tools the customer is going to use in Hadoop?

One of the value propositions that Syncsort brings is really making that metadata mapping simple. If organizations are accessing, let's say, VSAM data or DB2 on the mainframe, and Oracle tables, we map the schemas from these data sources automatically to the schemas on Hive or any other data target on the Hadoop side. So we simplify that.
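To make that schema mapping concrete, here is a minimal, hypothetical sketch of the kind of translation such a tool performs; the type table and helper function below are invented for illustration and are not Syncsort's actual implementation.

```python
# Hypothetical sketch: map DB2 column types to Hive types and emit DDL.
# The mapping table and function names are illustrative only.

DB2_TO_HIVE = {
    "INTEGER": "INT",
    "SMALLINT": "SMALLINT",
    "BIGINT": "BIGINT",
    "DECIMAL": "DECIMAL(18,2)",
    "VARCHAR": "STRING",
    "CHAR": "STRING",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
}

def hive_ddl(table, columns):
    """Build a Hive CREATE TABLE statement from (name, db2_type) pairs."""
    cols = ",\n  ".join(
        f"{name} {DB2_TO_HIVE.get(db2_type, 'STRING')}"  # default: STRING
        for name, db2_type in columns
    )
    return f"CREATE TABLE {table} (\n  {cols}\n) STORED AS ORC;"

print(hive_ddl("customer", [("cust_id", "INTEGER"),
                            ("name", "VARCHAR"),
                            ("created", "TIMESTAMP")]))
```

A real product also carries forward constraints, code pages (EBCDIC to ASCII for mainframe sources), and COBOL copybook layouts; the point here is only the shape of the source-to-target mapping step.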
The user basically just has to point to a data source, and regardless of whether that source is a Kafka queue or a mainframe, we are really operating on a single data pipeline and making access to all data a single step.

But would it be another step beyond that to include data about who touched what and how things were transformed?

Yes. Because we have the knowledge of where that data originates, cross-platform knowledge, whether it's from mainframes or from open systems, we can expose that metadata to the Hadoop-based repositories, like Atlas, for example, and make it available to our joint customers.

Okay, so those tags can be passed along, and they can be integrated through any tool.

Certainly. Since we announced our partnership, we've also made some really big enhancements to Apache Atlas from a governance perspective. So all these pieces start to fit together nicely in a seamless, end-to-end process flow. And I think it's also really important to keep in mind the product advantage that Syncsort brings in being able to speak natively to those legacy systems. That's a very complicated problem to solve for, and creating optimizations around pipelining that data very efficiently is something that comes in the box with the product. So our customers don't have to think about that; with the appropriate credentials and the appropriate IP addresses, they can connect up and just start moving stuff around.

So we're hearing so much noise in what was the ETL space. Well, let me ask you: tell me the design patterns we were moving from, with the operational databases and data warehouses, and what the design pattern looks like now. And if design pattern is the wrong word, put the right one in.

Certainly, Scott will be the expert here, coming from Teradata, to speak on that. What we are seeing is really how the data warehouse architecture evolved: at some point, ETL data integration products just became very expensive schedulers. That was really one of the business cases that made Hadoop so attractive to these large enterprises trying to leverage their data assets and make data really available for advanced analytics. ETL and ELT workloads became 40, 60, 80 percent of the data warehouse load, which became unscalable and very costly. So we started seeing that data integration vendors were really expensive schedulers, just pushing everything to the data warehouse and doing ELT. Now, Syncsort has always been in the ETL space, because we transform the data in our engine. We thought Hadoop was very complementary for that, because we didn't actually have a business to cannibalize, to kill, in that very large data integration space. We just took that lightweight, dynamic optimization and integrated the engine into the Hadoop ecosystem, through our open source contributions as well. And that has several advantages. One, ETL really runs in Hadoop, in our engine, so everything around the technology stack and interoperability comes organically. When we have Ranger security, certification is out of the box. Deployment with Ambari is out of the box. So our ETL engine is really now doing ETL onboarding on Hadoop for us. But I will let Scott speak to that.

Let me ask you to step beyond that and tell us how people manage this process from OLTP databases to data warehouses, and what that pattern looks like now.
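As a rough sketch of what exposing that lineage to Apache Atlas can look like: this uses Atlas's v2 REST API and its built-in Process type, but the host, credentials, and qualified names are invented, and real deployments register concrete entity types rather than the abstract DataSet referenced here.

```python
import requests

ATLAS = "http://atlas-host:21000/api/atlas/v2"  # hypothetical endpoint

# In Atlas, lineage is modeled as a Process entity whose inputs and
# outputs reference already-registered datasets. The references below
# assume those entities exist; the qualified names are made up.
lineage = {
    "entity": {
        "typeName": "Process",
        "attributes": {
            "qualifiedName": "etl.vsam_to_hive.customer@cluster1",
            "name": "vsam_to_hive_customer",
            "inputs": [{
                "typeName": "DataSet",  # a concrete subtype in practice
                "uniqueAttributes": {"qualifiedName": "vsam://PROD.CUSTOMER"},
            }],
            "outputs": [{
                "typeName": "hive_table",
                "uniqueAttributes": {"qualifiedName": "default.customer@cluster1"},
            }],
        },
    }
}

resp = requests.post(f"{ATLAS}/entity", json=lineage,
                     auth=("admin", "admin"))  # demo credentials
resp.raise_for_status()
print(resp.json())
```

Once such Process edges are registered, Atlas can answer "who touched what" questions by walking the lineage graph between datasets.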
Well, I think there actually are multiple patterns, and the problem has grown beyond just traditional ETL, which is one piece of what's required to do data movement inside an enterprise. But it's an important piece, right? It's important because if you think about transaction-based databases, ERP systems, and the other things that run the business operationally, they can't go down. A lot of work went into standardizing the process, understanding the schema, understanding the metadata and the business rules around the data. So there's no reason to regress from that. You want to maintain and capture that and pull it forward into any other architecture you move the data to. And that's certainly where this tool fits very nicely; we talked about the product features and its capabilities for doing that. But I think overall, ETL is a very small portion of the larger problem that's hitting all of us now, all the data all the time, where we've got streaming data, device data, IoT data, web-related data. Some of that data is never going to go through an ETL process, right? Because an ETL process generally assumes the data are structured, the data are known, and the data are owned by a company and are going to be used for managing a system. Device data isn't happening inside the firewall, and it's not necessarily owned by a company, but it's important to capture that data and take advantage of its analytic content. So that's a completely different process than ETL, but it's data movement. So, just as we talk about the Hadoop ecosystem, the community, and the ecosystem of partners, I think in this space it's partnerships and an ecosystem of technologies that will go address that problem, and this partnership represents a really important and critical part of the operational aspect of what's going to happen in data movement.

So let me ask you where NiFi, or Onyara, fits in. I'm not sure I'm naming the...

Hortonworks DataFlow.

Yeah, Hortonworks DataFlow, that was it. Rolls off the tongue. How many of the other pieces of the puzzle does that provide?

That covers a really big piece of the puzzle. You can think of Hortonworks DataFlow, based on Apache NiFi, as kind of the traffic cop, managing all of the different data flows, into which you might plug some Kafka, into which you might plug some ETL tools, and understanding at a macro level how data are flowing inside an enterprise. In addition to being the traffic cop, managing and helping you understand all of those mega-flows, it can also be used to push processing to the edge for simple event processing, and it can be used for prioritization of messages over fixed bandwidth. So there are a lot of other things that NiFi addresses, but again, it's part of an ecosystem of solutions, where the appropriate technologies need to fit together in a robust ecosystem.

So it sounds like, and maybe I'm borrowing too liberally, but it sounds like that's the orchestrator, and then there are going to be several execution engines. Maybe Kafka is one. And, I'm having early-onset Alzheimer's...

Syncsort DMX.

Yes, Syncsort DMX, which is running in, actually leveraging, the compute frameworks in the ecosystem. But then would it be fair to say that Hortonworks DataFlow is the sort of traffic cop above all of these?

There are orchestration points that you can pick and choose based on the requirements, yes.
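NiFi flows are normally assembled in its visual interface rather than in code, but as a rough code-level analogy to the traffic-cop pattern described above, here is a hypothetical sketch using the kafka-python client; the topic names and the priority rule are invented for illustration.

```python
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

# Hypothetical topics: one inbound firehose, two outbound lanes.
consumer = KafkaConsumer("ingest.raw",
                         bootstrap_servers="broker:9092",
                         group_id="traffic-cop")
producer = KafkaProducer(bootstrap_servers="broker:9092")

for msg in consumer:
    # Invented rule: messages flagged with a leading '!' are urgent.
    # In NiFi this routing would be a RouteOnContent/RouteOnAttribute
    # processor; prioritization over fixed bandwidth is a queue setting.
    topic = "events.priority" if msg.value.startswith(b"!") else "events.bulk"
    producer.send(topic, msg.value)
```

The point of the analogy is only the routing role: one component sees every flow and decides which lane each message takes, which is the macro-level view Hortonworks DataFlow provides across an enterprise.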
And in addition, obviously, since we announced our partnership in April in Dublin, we've hit the ground running. We've gotten a lot of interest from our joint customers; we have a lot of customers together already who, I think, are really happy to hear about the relationship. But we now have the opportunity to work even more closely together on integration for data governance, integration for security, integration at the orchestration layer, and some of those other key pieces of technology that will fit together in that data fabric.

So even managing the governance across these.

Yeah, governance, provenance: some of the big issues that are out there for us as a community to solve, and we can start working on these things together.

Would that explain why, at the Teradata influencer summit a couple of weeks ago, they kept saying Hortonworks DataFlow is, and I won't name any vendors of the previous generation, sort of the new way of doing things?

And it is a new way of doing things, in that it was built specifically for the use case we're describing, where data streams bi-directionally, point-to-point and peer-to-peer, in an IoT environment, and it manages that problem, into which you can plug a lot of other data movement problems as well.

Okay. With that, Tendu and Scott, we'll be back. This is George Gilbert at Hortonworks Hadoop Summit in San Jose; we'll be right back in a few minutes. Thanks.