Live from the San Jose Convention Center, extracting the signal from the noise, it's theCUBE, covering Hadoop Summit 2015. Brought to you by headline sponsor Hortonworks, and by EMC, Pivotal, IBM, Pentaho, Teradata, Syncsort, and by Attunity. Now your hosts, John Furrier and George Gilbert.

Okay, welcome back everyone. We are here live in Silicon Valley for Hadoop Summit 2015. This is theCUBE, our flagship program. We go out to the events and extract the signal from the noise. Winding down day one, we've got a lot of great wall-to-wall coverage, three days of coverage here. I'm John Furrier with my co-host George Gilbert, Wikibon.com, our new big data analyst. Our next guest, Josh Rogers, president of Syncsort. Welcome back. And Armando Acosta, senior product consultant at Dell. Guys, welcome to theCUBE. Thank you for having us. Thanks for having us.

Great innovation with Dell and Syncsort, because it really hits the nerve of what's going on in the market and this industry right now, and that is the mainstreaming of Hadoop. You're starting to see this come together where, hey, it's not just a science project anymore, as it was three years ago. It's not just bolting in some new technologies. It's up and running. It's in the mainstream conversation. So the conversations are different. It's not about MapReduce and Spark; it's, I have an investment, I have technology, I have business problems to solve. So we're back to the business outcomes in this transformation. So what is that conversation between you guys at the product level as it relates to outcomes?

Yeah, yeah. So I think, I mean, Armando, if you could lead off with how Dell got into the Hadoop business. I think it's a really interesting evolution of how that conversation has turned from technology to use cases. Maybe we can start off.

So just to give you a little context and a little history of where we started out with Hadoop: we were actually the first tier-one vendor to release a reference architecture on Hadoop. We did that back in 2011, but that's really not the starting point. Back in 2007, 2008, within Dell, we created a group called the Data Center Solutions Group, and that group was tasked to go after the top 20 data centers of the world. So I'm talking about the Googles, Facebooks, and Yahoos of the world. And as we started to go after these top 20 data centers of the world, one workload kept popping up more than others. It was, hey, we want 5,000 of these, we want 10,000 of these. Okay, what workload are you running? And every time: it's Hadoop, it's Hadoop, it's Hadoop. So a light bulb went on for us and we said, hey, instead of just selling the server, let's actually build a solution around this. Let's sell not only the server but the Hadoop distribution, wrap services around it, and build a full end-to-end solution. And so that's what we did in 2011. But now, fast forward to today and where we're at with Syncsort: what our customers have told us is, okay, I've already kicked the tires, I've already done my proof of concept, but now I want to make it real for my organization. So instead of Dell just giving me a general reference architecture,
I want you to actually give me a reference architecture that's defined around a use case. That way you show me: if I'm going to do Hadoop, here's the first step I take, and oh, by the way, here's how you do it in the simplest format possible, and here's how you do it with a Dell, Syncsort, and Cloudera solution, so that you don't have to do the hard work of designing and architecting it. What configuration should I go with? What do my servers look like? What do my data nodes look like? What do my infrastructure nodes look like? How do you do all that test and development? Dell and Syncsort and Cloudera have done all that hard work for you, and now you bring it to your solution and it's plug and play; you don't have to do that hard work.

Yeah, a lot of people, and I'll just clarify this because you guys deserve props on it, Dell: they call it industry-standard servers, or "commodity" if you were in the Cloudera ecosystem in the early days. But commodity? Amazon makes their own stuff, right? That's commodity in my mind. But industry-standard servers, low-cost, high-performance machines, is what startups were buying, right? The early versions of Hadoop and HBase you really couldn't run in the cloud. Uncloudable, if that's a word. You had to rack and stack some drives and some servers. So you guys were already winning at the start; you were early in this. Just to give you some props on that. Thank you. And we were talking earlier about how the early adopters were playing with bare metal, basically, in this case servers. So props to Dell.

Going to the next level is now, when you kind of graduate to the mainstream: we have a mainframe and we have all this other stuff. Now you're in the Dell services world. Is that kind of how it played out? I mean, Dell's got great market share on the server side, right? When you went to that next level of conversation with Syncsort, was that because of the mainframe or because of other legacy stuff? Take us through that piece.

I think it was mostly actually about the use case, right? You know, what we've done is a lot of validation, a lot of market research on what's that first step you take with Hadoop, what's that first step in your big data journey. And we've actually had it validated by Cloudera, Hortonworks, and MapR. And the number one use case today is ETL offload. So it's not like we had this grand idea, you know. It's pretty well known; the writing was on the wall. But what we did was, we were the first to execute on it, right? And where we come in with Dell is, we've been doing this since 2011. So we have the validation and certification process, the engineering process, down like the back of our hand. Now, the beautiful thing about this is you bring in a product like Syncsort and now you bring that full end-to-end solution, because people can talk about ETL offload, but in the past it was: here's this general reference architecture, now go buy your ETL tool, and now, Mr. Customer, you have to make that work together. You have to do the hard work of integrating. You have to do the hard work of testing. What we're saying is no, let Dell and Syncsort and Cloudera do that work for you, and you can focus on your data, you can focus on your data transformation jobs, so that you actually get value out of that data instead of having to start from square one. We solve that for you.
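A reference architecture ultimately answers those sizing questions with arithmetic. As a rough illustration of the kind of capacity math a validated configuration encodes, here is a minimal Python sketch; the node specs, overhead factors, and function name are hypothetical and are not Dell's published reference architecture.

```python
# Hypothetical Hadoop cluster sizing sketch; illustrative numbers only,
# not Dell's published reference architecture.

REPLICATION_FACTOR = 3      # HDFS default: each block is stored three times
TEMP_SPACE_FRACTION = 0.25  # scratch space reserved for MapReduce spills

def data_nodes_needed(raw_data_tb, growth_factor, disks_per_node, disk_tb):
    """Estimate the data node count for a target dataset size."""
    # Usable capacity per node after replication and temp-space overhead.
    node_raw_tb = disks_per_node * disk_tb
    node_usable_tb = node_raw_tb * (1 - TEMP_SPACE_FRACTION) / REPLICATION_FACTOR
    required_tb = raw_data_tb * growth_factor
    # Round up: partial nodes don't exist.
    return int(-(-required_tb // node_usable_tb))

if __name__ == "__main__":
    # Example: 100 TB of source data, 2x growth headroom,
    # 12 x 4 TB drives per data node -> 17 data nodes.
    nodes = data_nodes_needed(raw_data_tb=100, growth_factor=2,
                              disks_per_node=12, disk_tb=4)
    print(f"Estimated data nodes: {nodes}")
```

The value of a pre-validated architecture is that these numbers, and the far harder tuning questions behind them, arrive already tested.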
Let's explore that for a minute, because Syncsort's heritage was on the mainframe, and there's a ton of legacy data there that needed to be liberated. But you went so much further, and with Dell, essentially you're taking what would have been constrained in the legacy ETL tools, and now you've got, as you can probably better explain it, a rather future-proof architecture where you're independent of the execution engine underneath. You can bring the data and start from there. Let's drill in.

Yeah, it's probably helpful to step back to what our philosophy is. Our philosophy when we came into the big data market, the Hadoop market, was that it seemed pretty clear from an architectural perspective that Hadoop was a thing that needed to exist, that it was clearly going to be the foundational component of information management going forward. And if you start with that level of conviction, that leads to a series of choices. The first choice is: okay, if Hadoop is going to represent that big data operating system, then what I'm going to want is applications that run on top of that operating system that allow me to deliver the functionality I need. And if you look at what was the most common application people were writing using the projects available in Hadoop, it was ETL. They were writing Hive jobs or Pig jobs, writing these by hand, with specialized skills, et cetera. There were a few different problems with that. One is, I don't have that expertise: I have a bunch of people that know how to build ETL jobs, but I don't have anybody that knows how to build Hive and Pig. Another issue was the constant tuning and tweaking that happens to that code as your data elements change and you bring new data into the system. And so it was obvious to us very early on that a bespoke, tightly integrated ETL application was going to be an important component for Hadoop to be able to suck up both data and workload in the future and live up to its potential.

So we set out to build that application, and we took the approach that says: we're going to take advantage, wherever possible, of the services the Hadoop operating system provides. So if it provides metadata management capabilities, we're going to leverage those. If it provides... To be specific on that, it's the, so, what data is there, where is it, and you know... Everything that you can do in HCatalog and the Hive Metastore. And by the way, a few things that you couldn't quite do in the beginning. So that brings me to the second point of the philosophy, which is: where those services aren't yet mature enough for us to have the application behave the way the user would want, we're going to contribute to the operating system. That's the beauty; it's an open source platform. And so we've made a string of contributions to projects like MapReduce, to Sqoop, to HCatalog. We've got a backlog for things like Oozie that are adding additional functionality, APIs, and parallelization to those projects that will help us better support ETL workloads in our application. And what you get is an application that leverages your investment. It doesn't require you to stand up another infrastructure that sits kind of nearby Hadoop. It's pure. Like legacy ETL. Exactly.
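To see why hand-writing that ETL logic demanded specialized skills, consider a minimal sketch of a Hadoop Streaming style aggregation, with the mapper and reducer reading standard input. The record layout and the sales-by-region logic are invented for illustration.

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming style aggregation: a sketch of the kind of
# ETL logic teams hand-coded before tooling existed. The input layout
# (comma-separated: region, amount) is hypothetical.
import sys

def mapper():
    # Emit "region<TAB>amount" for each input record.
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) < 2:
            continue  # hand-rolled code must handle bad records itself
        region, amount = fields[0], fields[1]
        print(f"{region}\t{amount}")

def reducer():
    # Hadoop Streaming delivers mapper output sorted by key,
    # so a running total per key works.
    current_key, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{total}")
            current_key, total = key, 0.0
        total += float(value)
    if current_key is not None:
        print(f"{current_key}\t{total}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper()
    else:
        reducer()
```

Every such script carried its own record validation, and the logic had to be reworked whenever the schema changed, which is exactly the maintenance burden Rogers describes.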
The other thing is it gives you the best performance and manageability. So if my jobs run as native MapReduce jobs, and I can surface all of the log data in Ambari or Cloudera Manager or the management utilities, then it gives me best-in-class manageability. Where that performance and scale and manageability become really important is when I'm actually putting all that processing power to work for mission-critical workloads. Offload is that use case, right? I've got existing business logic that gets executed on a routine basis that business users count on. And if I'm going to move that out of my data warehouse, or out of my legacy ETL environment, and run it in Hadoop, I need to make sure it's going to work. I need to make sure it's going to scale. I need to make sure I can count on it. And we believe that taking a native, tightly integrated approach to the architecture of that ETL application is absolutely critical.

As we've learned more about the use cases and as Apache Hadoop has evolved, we've learned that customers are confronted with a lot of change, right? There's a lot of new functionality available, but there's also a lot of change that gets incorporated into their existing applications. And what we've done is develop this intelligent execution capability that allows people to design once and then run that job in Hadoop, and our software will actually make the decision about which engine to use. Should it use MR1 or MR2, or in the future, Spark? We believe that allows you to seamlessly leverage your existing ETL development skills, and then not just run that workload today, but run it without disruption in the future. And if I'm making the bet that I'm going to offload this mission-critical workload onto my Hadoop cluster, I need that future-proofing capability.

So just to turn that into dollars and cents: are there some customers that were saying, we were bumping up against the capacity of our data warehouse, and we were going to have to cut a check for $35,000 a terabyte, whatever, to add capacity, but this workload was consuming, say, 35% of our capacity on the data warehouse, and we moved it? Any examples like that?

Yeah, sure. One of our early customers, who's been in production for a year and a half at this point: their internal estimates said that it was close to $100,000 a terabyte to manage data in their warehouse, and that the ELT in their warehouse was consuming about 42% of the capacity. They estimate that it's about $1,000 a terabyte to manage that data in Hadoop. And they had service-level issues with users, because that ELT workload was crowding out the day job of ad hoc queries and reporting workloads. So for them, it was: I can dramatically lower my cost structure by deferring these upgrades. I can get a whole lot more flexibility in terms of how I leverage the data if I have it in Hadoop. I can learn all the skills I need to manage this environment, not just for storage purposes, but for actual processing, running applications that process data.
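Those figures are easy to turn into a worked example. Assuming a hypothetical 100 TB warehouse, with the per-terabyte costs and the 42% ELT share quoted above:

```python
# Worked example of the offload economics quoted in the interview.
# The 100 TB warehouse size is hypothetical; the per-TB costs and the
# 42% ELT share are the figures Rogers cites.
warehouse_tb = 100.0
warehouse_cost_per_tb = 100_000   # dollars, customer's internal estimate
hadoop_cost_per_tb = 1_000        # dollars
elt_share = 0.42                  # fraction of warehouse capacity used by ELT

elt_tb = warehouse_tb * elt_share
warehouse_elt_cost = elt_tb * warehouse_cost_per_tb   # $4,200,000
hadoop_elt_cost = elt_tb * hadoop_cost_per_tb         #    $42,000

print(f"ELT footprint: {elt_tb:.0f} TB")
print(f"In the warehouse: ${warehouse_elt_cost:,.0f}")
print(f"In Hadoop:        ${hadoop_elt_cost:,.0f}")
print(f"Ratio: {warehouse_elt_cost / hadoop_elt_cost:.0f}x")
```

On top of the roughly 100x gap in carrying cost, freeing 42% of warehouse capacity is what defers the upgrade check entirely.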
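The intelligent execution idea, design a job once and let the software choose the engine at run time, can be sketched as a simple dispatch decision. The following is a hypothetical illustration of the pattern only; the engine names, job attributes, and selection rules are invented and are not Syncsort's actual logic.

```python
# Hypothetical sketch of "design once, pick the engine at run time".
# Illustrates the pattern described in the interview, not Syncsort's
# actual decision logic; names and rules are invented.
from dataclasses import dataclass

@dataclass
class EtlJob:
    name: str
    iterative: bool   # multi-pass logic benefits from in-memory engines

def choose_engine(job: EtlJob, cluster: dict) -> str:
    """Pick an execution engine for a logically defined ETL job."""
    if job.iterative and cluster.get("spark_available"):
        return "spark"            # keep intermediate data in memory
    if cluster.get("yarn_available"):
        return "mapreduce_v2"     # MR2 on YARN: shared-cluster scheduling
    return "mapreduce_v1"         # fall back to classic MapReduce

job = EtlJob(name="warehouse_offload", iterative=False)
cluster = {"yarn_available": True, "spark_available": False}
print(choose_engine(job, cluster))  # -> mapreduce_v2
```

The point is that the job definition never changes when the engine does, which is the future-proofing claim in a nutshell.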
Well, dig into that a little bit. When you say the skills, do you mean because it generates MapReduce or Spark, or do you mean because you're at a higher level and you don't have to drop into the... That's a statement on both development and operations, right? So, you know, if I'm going to participate in Hadoop, I need tooling that allows my existing skills to be productive in Hadoop, which is the approach we've taken, or I need to either train my existing people or get new people that know how to write low-level code. We've taken the approach that says: you have existing ETL developers, they know your data, and you should make them productive in this new environment that's better suited from a cost-structure and performance perspective.

And maybe that process flow, you know, like where you can reverse-engineer. Oh, sure. So that speaks a little more to the migration component. To execute that type of project, you have to take the SQL that has been written over the past couple of decades. It's running in your data warehouse. It's not documented. That is not simple. The equivalent of your old COBOL. Absolutely. And you've got to understand it. You've got to figure out how to replicate that business logic as a MapReduce job. That's a big project. And by the way, what we've seen is, if you're going to do that, you're going to have to build a bunch of user-defined functions in Java to fill in the gaps that Hadoop has, in terms of bankers' rounding or whatever the specific feature is. Using our ETL application that runs natively on top of Hadoop, you can move that and run it confidently. We have a tool called SILQ that will actually scan that SQL. It will visualize it for you, show you in a graphical way: we're grabbing this table and this table, we're doing a join, and then we're doing an aggregation. And we will actually generate a DMX-h job that you can run on Hadoop. And that saves a lot of hand-coded bugs on MapReduce or Spark. And once it's a DMX-h job, it can run on whatever engine you think is most suitable. Today, that's MapReduce, MR1 or MR2. We're architecting and working closely with the community on Spark support.

Okay. So have you seen customers who tried to do this data warehouse offload without your help? And what sort of walls did they slam into? It was kind of what I was describing before. It's: oh, I now have to manually parse through a thousand-line SQL script to understand what it's doing. I've got to figure out how to construct that as a MapReduce job. I've got to fill in the gaps with user-defined functions. And by the time I get it all running, it's not all that performant. So what we've done is streamline that entire process.
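Bankers' rounding, mentioned above, is a concrete example of those gaps: warehouse SQL engines typically round ties to the nearest even digit, while naive reimplementations round ties up, silently shifting financial totals. A minimal illustration using Python's decimal module:

```python
# Why "bankers' rounding" matters when re-implementing warehouse SQL:
# round-half-even and round-half-up disagree on .5 cases, which
# silently changes financial aggregates.
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

for v in [Decimal("2.5"), Decimal("3.5"), Decimal("4.5")]:
    bankers = v.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)
    naive = v.quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    print(f"{v}: bankers -> {bankers}, half-up -> {naive}")

# Output:
# 2.5: bankers -> 2, half-up -> 3
# 3.5: bankers -> 4, half-up -> 4
# 4.5: bankers -> 4, half-up -> 5
```

Reproducing that behavior by hand, per function, per engine, is the kind of gap-filling a hand-rolled migration has to absorb.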
Okay, before we go, because we only have a couple of minutes: you talked in the beginning about participation in the community, where you want to add to the standards rather than keep some secret sauce that limits interoperability. So your tool, if I'm understanding correctly, takes you to the point where it generates the job that takes the ELT from the data warehouse and puts it into Hadoop. But there are other tools that take the data further in its life cycle. How do you make that easy, where traditionally there's value in end-to-end integration, versus your depth and specialization? How do you get the best of both?

Yeah, well, we've built a whole series of connectivity options for customers so that they can take data into our tool from a variety of sources. And that could be, obviously, sources outside of Hadoop, such as mainframe or relational databases, but also sources inside of Hadoop. So if you have data stored in JSON, or data stored in Avro or Parquet, we can ingest those sources and we can do transformations. And then we can write out to a whole variety of targets. But I'm also thinking downstream, after you've got the data transformed and, you know, normalized or denormalized. Right. So we can publish those datasets into consuming repositories, whether that be within Hadoop, like Impala or HBase, or outside of Hadoop, like your warehouse or Mongo or any of those other consuming platforms. So I think we've done a really good job of giving people all of the capabilities they need to execute real ETL workloads on Hadoop, at scale, in a manageable way, as well as push that data downstream to the various components, whether that's on or off Hadoop.

And the downstream could be, okay, I'm going to work on it with Impala or Hive, and then from there I'm actually going to add MicroStrategy or Tableau. For example, you can write out a Tableau data extract, a TDE, right out of our tool. So I could grab data right off of HDFS, write a TDE, and someone can go directly against Hadoop and visualize data in Hadoop. I can do the same thing with Qlik, with QlikView. So we're continually adding those capabilities as we understand customer demand.

Okay, that's a strong story. Final question, because, George, good job, you've dominated the product conversation, but as an analyst I want to get the perspective: as the market matures, what does it mean to the average user out there? What's the pain point? ETL offload, I get that, but the Syncsort and Dell relationship beyond that. When does the use case kick in? Is there an indicator for folks watching now: hey, I need to get involved, make a phone call? It sounds like there were seams or cracks in the process and you've closed them. So should ETL offload and, you know, the enablement of business intelligence on Hadoop now take off?

Yeah, I certainly think it's going to take off. I think you're seeing that in the stats from this show, in terms of the attendees, the quality of the attendees, the number of attendees from large enterprises. Certainly this joint partnership gives people, for the first time ever, an end-to-end, soup-to-nuts capability to deliver against the offload use case. But look, I still think people need to realize it's early. This is an industry that is going to take shape not over the next three years, but over the next 15 years. And if you look at the pace of innovation that's happened over the last 24 months, it's kind of staggering to think about the possibilities, and we're really excited to be a part of it. And really excited to be working with folks like Dell.

Okay, Syncsort and Dell, a great partnership announced here at Hadoop Summit 2015 on theCUBE. We'll be right back with more action live here in Silicon Valley as we break down day one, bringing the party in here: all the exhibitors, a lot of action, a growing ecosystem. As Rob Bearden said, the inflection point is coming. It's theCUBE. We'll be right back after this short break.
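As a closing illustration of the connectivity pattern Rogers describes (ingest from multiple formats, transform, publish downstream for a BI tool), here is a minimal pandas sketch; the file names, column names, and the CSV hand-off standing in for a native TDE writer are all assumptions, not Syncsort's tooling.

```python
# Minimal sketch of the ingest -> transform -> publish pattern discussed
# above, using pandas. File and column names are hypothetical, and the
# hand-off to the BI tool is a plain CSV rather than a native TDE writer.
import pandas as pd

# Ingest from two of the source formats mentioned: JSON and Parquet.
orders = pd.read_json("orders.json", lines=True)   # newline-delimited JSON
customers = pd.read_parquet("customers.parquet")   # requires pyarrow

# Transform: join, then aggregate revenue per customer segment.
joined = orders.merge(customers, on="customer_id", how="inner")
summary = (joined.groupby("segment", as_index=False)["amount"].sum()
                 .rename(columns={"amount": "revenue"}))

# Publish downstream for a visualization tool to pick up.
summary.to_csv("revenue_by_segment.csv", index=False)
```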