Live from New York, it's theCUBE. Covering Big Data NYC 2015. Brought to you by Hortonworks, IBM, EMC, and Pivotal. Now your hosts, Dave Vellante and George Gilbert. Welcome back to New York City, everybody. This is theCUBE. We're here at Strata + Hadoop World, Big Data NYC. Tendu Yogurtcu is here. She's the general manager of Big Data at SyncSort. Tendu, good to see you again. Welcome to theCUBE. Hi, Dave. Hi, George. Thank you for having me. You're welcome. So SyncSort, you guys are at the heart of this data transformation, this data tsunami that's going on. Give us the update on SyncSort. What's going on at Strata + Hadoop World? Yes, of course. This is actually an exciting time for us; our third announcement today came out, around data processing. Today we announced our integration with Apache Spark and Kafka, among the most active projects in open source, as you know. And all of our innovation is really driven by the business use cases. In 2015, our main focus was around offloading expensive workloads from the legacy data warehouse, legacy ETL tools, and mainframe to Hadoop, and the partnership we announced with Dell, Cloudera, and Intel is also around that. However, we see increasing demand for real-time and streaming use cases: fraud detection in finance and healthcare, digital devices, internet of things, telemetry data collection. All of these use cases are driving streaming and real-time workloads. And organizations are really interested in having a single software environment they can utilize for batch and real-time workloads, taking advantage of powerful compute frameworks like Apache Spark and resilient messaging frameworks like Kafka. The timing is perfect, George, right? Yes, this is a theme of ours, which is the move from batch to real-time, but also the presence of two pipelines: one to improve the decision-making process, and then one to execute those decisions really fast.
Let's back up a little for our viewers, who may not be as familiar with DMX-h, and tell us what the tool does, and then, you know, let's dive into how we went from batch to real-time. Of course. Basically, our product, DMX-h, is a data integration product. It allows you to define your data processing and transformation pipeline. And now, with our dynamic optimization and intelligent execution, you can take that data pipeline and you can run it whether it's in MapReduce, in YARN, or in Spark, and on-premise, or in the cloud, or on your Windows workstation. So this is really a powerful story for the user, because you are defining a single data pipeline, a single data processing job, and the product is running in real-time if the data source is real-time, like Kafka streaming, or in batch if the data is coming from HDFS, for example. And the author of this pipeline, the one who's designing it the first time, what are they working with? Are they working down in code? Are they working with a drag-and-drop GUI environment? How does it look? They are working with a drag-and-drop graphical user interface. They are just basically defining their data sources, whether it's a database, a mainframe, or Kafka streaming, the same way, and they can basically derive the metadata, sample their data very easily, and define their data transformation. So in terms of skill set, it really simplifies things, and it reduces operational expenses and maintenance costs for organizations. You don't have to go and hire a person who is specialized in Scala or Python, or have to understand the difference between Spark versus MapReduce. And just to be clear, so you've got this development environment that's a GUI development environment for your pipeline. Yes. You're doing the sourcing of the data that a data engineer might do. You're doing the blending and enrichment that maybe a data scientist might do. Data scientist.
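The "define once, run on any source" idea described here can be sketched in a few lines of plain Python. This is a hypothetical illustration of the concept, not SyncSort's DMX-h API: one pipeline definition runs against a batch source (a list, standing in for HDFS files) or a streaming source (a generator, standing in for a Kafka topic).

```python
# Hypothetical sketch: a source-agnostic pipeline. The transformation steps
# are defined once and applied to whatever source feeds them, batch or stream.

def build_pipeline(*steps):
    """Compose transformation steps into a single callable pipeline."""
    def run(source):
        for record in source:          # works for lists and generators alike
            for step in steps:
                record = step(record)
            yield record
    return run

# Two toy transformation steps: parse a CSV line, then enrich the record.
parse = lambda line: dict(zip(("id", "amount"), line.split(",")))
enrich = lambda rec: {**rec, "amount": float(rec["amount"]) * 2}

pipeline = build_pipeline(parse, enrich)

# Batch source: a list of records, as if read from HDFS.
batch = ["1,100.0", "2,200.0"]

# Streaming source: a generator, as if consuming a Kafka topic.
def stream():
    for line in ["3,300.0"]:
        yield line

batch_out = list(pipeline(batch))      # same pipeline, batch input
stream_out = list(pipeline(stream()))  # same pipeline, streaming input
```

The point of the sketch is that nothing in the pipeline definition knows whether the source is batch or streaming; in a real product that dispatch decision (MapReduce versus Spark, HDFS versus Kafka) happens at runtime.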
And defining what's relevant. But you're not focusing on the analytics. You drop it off where the analytics tools take on the next part of the job. Is that correct? That's correct. So basically we make the data available, whether it's in the data lake, in HDFS, or any other optimized storage the user may choose. And we integrate that data, blend it, enrich it, and prepare it for analytics or visualization tools like Qlik or Tableau. And the way that Spark has the promise of real time and batch is also driving our interest going into Spark. And one differentiation we have from some of the legacy ETL vendors is that our engine runs natively as part of the data flows. That's due to our continuous open source contributions. We have contributed to MapReduce and YARN significantly. So our engine can actually be a native YARN engine running as part of the MapReduce flow. Likewise, we started contributing to Spark. A couple of weeks ago we contributed the mainframe connector to Spark Packages. So just to be clear, you're not just taking the pipeline that's defined in a graphical environment and then spitting out Spark code or MapReduce code. There are actually functions within those that are contributions you make to the open source community to make those run better when defined in your tool. That's correct. So instead of generating code that may not have one-to-one correspondence, we are actually contributing to the open source projects, using the open source APIs, and having an intelligent execution that decides how to run a job, in MapReduce fashion or in Spark, and that happens at execution time, at runtime. I want to ask you about how you choose where to focus your development dollars, because everybody's development dollars are limited. We just did a survey. We got the results Friday; George and I were looking at it, and one of the questions we asked was about the workloads that they're using for big data analytics.
We asked them which ones involve real-time or near-real-time capabilities. Fraud detection came up, workflow optimization, IT equipment operation support, good news for Splunk, data transformation. And then we asked them, okay, what analytic tools are you using for real time? Now Kafka came up, but Kafka's relatively new, especially since the guys behind it spun out Confluent, and that was like late last year. DataTorrent came up big, but then of course Spark Streaming and Storm. How do you choose where to put your bets? So it's really driven by the adoption, and what we are seeing as the hard problems and challenges our customers have. So there are three types of investments we focus on. One, organic innovations. And in organic innovations, we try to have something that's going to decouple the compute framework and reduce complexity, and make this very rapidly evolving big data technology stack easily adopted, because that's the challenge. That's the challenge for the enterprise: to be able to find people who understand all of the new projects and make decisions. So when we go after organic innovations, we basically say, okay, at this time half of our customers are in production, which is great. I'm really excited to be able to say that. And 75, 80% of those production workloads are still offloading expensive workloads from the data warehouse. So that's our bread and butter. And operational efficiency use cases are bread and butter for many of the Hadoop vendors as well. However, we have to keep an eye on the strategic direction: what is going to happen in 2016 and 2017 with the internet of things use cases and streaming applications. A lot of financial services, healthcare, and telecommunications companies are our customers, and they are interested in getting real-time insights. Fraud detection is big. Digital devices in healthcare are big use cases. So they have this question.
The same way that I made a choice of a tool for MapReduce and YARN, I do not want to have developers developing in Scala; I want to have something that can be easily adopted. So that kind of challenge that our customers are having is driving one type of innovation. And the second one is open source contributions. We always continue those, so we can have a differentiated value that's complementary to the Hadoop vendors as well as big data vendors like Databricks, so the stack is accelerated. And the third one is inorganic, acquisitions basically. Okay, so any of those you want to tell us about? No, thank you. We are always actively engaged and looking. Have you, as it relates to offload, you talked about that earlier, have you seen that movement attenuate at all, or are you seeing it accelerate? What's the status? I mean, last year at this time it was the big sucking sound, I call it: take data off the traditional data warehouse. Are customers comfortable with that? Have they gone too far? Are they slowing that down, or is it accelerating? I think it's still accelerating. That's still a very valid use case, because only about one third of the analytics that happens in the data warehouse is really interactive, real-time analytics, and two thirds of the data warehouse is used for batch types of workloads. And those batch workloads are very good candidates for scalable big data architectures. And pretty much everybody's open-minded, adopting new technologies, and open source is well received. So we don't see that slowing down yet. I think that's very much complemented by the customer-360 kind of analysis. And at this point, they start thinking about what are the transformative applications that I should be investing in, not just offloading workloads, and those transformative applications are driving the innovation.
And what are some of those transformative workloads, and what are the kinks the customers have to work through in terms of skills, technical maturity, data source access? The challenges are around bringing all of the different types of data together, because there are tools, a stack of tools, that you can use for one type of data: Sqoop for databases, Flume for log streams, Kafka for streaming. So there are many tools; however, people are really trying to find a single interface that can simplify that for them. And the technologies are Storm and Kafka for real time. And as you said, Kafka is still going to mature. There's a lot of interest because of the highly resilient, distributed nature of Kafka. And Spark has the promise of being the single platform with interactive, batch, and real-time compute frameworks. These are all going to mature, and we will see how the adoption is going to happen in the enterprise. You mentioned the open source connector to the mainframe before. Yeah, people don't like to talk about the mainframe, but the mainframe is interesting to us because of its ability to bring traditional OLTP and analytic workloads together. We were at an announcement in January at Jazz at Lincoln Center; IBM announced the new z13 mainframe, and it was interesting talking to the customers there. Once you get through the IBM messaging, which is strong, you talk to the customers about what they're doing in terms of actually bringing those together. We feel like it's a harbinger of what the open community is going to do, but I wonder if you could talk about that mainframe business and the parallels to the open source business; it seems to give you visibility on what's going to occur. Well, I wonder if you could comment on that. Yeah, absolutely. The mainframe is a big interest for us also, as you know. We have a very successful mainframe business. So the mainframe is very important, and the mainframe is not going to go anywhere soon.
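The "single interface over many ingestion tools" point above can be sketched as a small dispatch layer. All names here are hypothetical illustrations; a real integration product would wrap actual Sqoop, Flume, or Kafka clients behind such an interface rather than these stubs.

```python
# Hypothetical sketch: one read_source() entry point dispatching to
# per-source readers, the way an integration tool hides the differences
# between Sqoop, Flume, and Kafka behind a single user-facing interface.

READERS = {}

def register(source_type):
    """Decorator that registers a reader function for a source type."""
    def wrap(fn):
        READERS[source_type] = fn
        return fn
    return wrap

@register("database")
def read_database(uri):
    return [f"row from {uri}"]          # stand-in for a Sqoop-style import

@register("stream")
def read_stream(topic):
    return [f"event from {topic}"]      # stand-in for a Kafka consumer

def read_source(source_type, location):
    """The single interface users see, regardless of the backing tool."""
    try:
        return READERS[source_type](location)
    except KeyError:
        raise ValueError(f"no reader registered for {source_type!r}")

rows = read_source("database", "db://orders")
events = read_source("stream", "payments-topic")
```

The design choice being illustrated: users pick a source type and location, and the tool, not the user, decides which underlying technology does the work.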
It's very well suited for transactional types of workloads. So while the mainframe is very suited for transactional types of workloads, Hadoop and Spark, the new big data stack, are very suited for batch types of workloads. However, our mainframe open source connector is basically making Spark's interactive Spark SQL queries available for mainframe access. So you have the mainframe data available for interactive querying in Spark SQL. That is a big advantage, because now you can actually have some of the more affordable real-time interactive queries against your legacy data and against your transactional data. So that's critical. We also, at the Splunk conference, announced collecting network and security data and making the mainframe logs available for Splunk analytics. So we will continue to have that bridge between the mainframe and the open source and big data technologies, because we are in a very good position as a company. So I wonder if you could also comment: something else that came up in the survey related to Spark. A significant percentage were using Spark today, almost half, and everybody was, of course, evaluating Spark; you're asleep if you're not evaluating Spark. But a very large proportion, the majority of people, said that they plan to, or actually are, substituting Spark for new workloads that would have gone to Hadoop. Now, there seems to be a debate around that, right? The Hadoop guys go, oh, no, no, it's all complementary. The Spark guys are just like, oh, Hadoop is dead. What do you see? I mean, you're kind of agnostic to that whole argument, right? You want to go where the business is and where the customers want you to go. So what do you actually see as a technologist, somebody who's quasi-independent in that debate? In terms of Hadoop and Spark adoption?
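The value of exposing legacy record data to interactive SQL, which is what the Spark mainframe connector enables, can be illustrated with a small sketch. This is not the connector itself: fixed-width text records stand in for mainframe copybook layouts, and SQLite stands in for Spark SQL, purely to show the parse-then-query pattern.

```python
# Hypothetical sketch: load fixed-width legacy records into a SQL engine
# so they become available to interactive queries. SQLite stands in here
# for Spark SQL; the record layout is a toy stand-in for a copybook.
import sqlite3

# Fixed-width records: columns 0-4 hold the account id, the rest the balance.
records = [
    "00001  100.50",
    "00002  250.00",
]

def parse(rec):
    """Slice a fixed-width record into typed fields."""
    return (rec[0:5], float(rec[5:].strip()))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [parse(r) for r in records])

# Once loaded, the legacy data is open to ad hoc interactive SQL.
total = conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]
```

The pattern is the same one the interview describes: data locked in a transactional, record-oriented format becomes queryable alongside everything else once a connector maps it into the SQL engine's tables.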
Here is what we are seeing: if we look at the last four or five years, it has been such an exciting time for Hadoop and the disruption around Hadoop. And pretty much every single organization is looking at revamping their enterprise data warehouse architecture and redefining it with Hadoop at the center. And it took a while for Hadoop to mature. A lot of vendors had to come into the picture, a lot of applications had to come into the picture, building the ecosystem. Now, just as Hadoop is maturing as a platform, Spark is appearing. Spark running on Hadoop is an easy way for organizations to adopt the capabilities of Spark, because they have just matured their platform. More online businesses, or more evangelist types of businesses, can directly go and jump on the Spark bandwagon. And that's fine. However, for the enterprise, we see Hadoop actually helping Spark acceleration. So that makes sense for the guys who have the ability to digest and manage all the Hadoop tool complexity. But of course others have said to us, yeah, well, Spark is not without its complexities as well. Maybe it's simpler than Hadoop, but you still have to know specific programming languages like Python, even if it's maybe simplified. What are you seeing there? Do you agree with the premise that they're both somewhat heavy lifting, or do you see Spark as simplified? They both have different challenges. In the Hadoop stack, the initial challenge was really maturing the platform, securing the platform. Now, this year is big in terms of having data governance, for example, right? People started talking about data governance. With Spark, maturing Spark as a platform is going to be the next step. So that will be the challenge for Spark. However, Spark is a very good tool for exploration and interactive queries, getting something done quickly by data scientists. So those use cases are very suitable.
And you don't have to worry about some of the things in the enterprise data warehouse architectures, like governance and security, in some of those use cases. And that adoption can happen from the data scientist teams. They have different challenges at this stage of the maturity cycle. But you see them as complementary, is what I'm hearing. At this point, I see them as complementary. Of course, the Databricks survey, George, I think they threw out the number, 46% of the... 48? 48%, okay. We're splitting hairs here, who's counting? So 48% said they were doing Spark independent of Hadoop. Now that could have been pilots, it could have been tire kicking, and so that could be a marketing number. But, Tendu, I'm inferring from your comments that you would expect those two worlds to coexist for quite some time. Yes, that's what I expect. Would it be fair to say Hadoop has a lot of moving parts? The pro is that there's incredible innovation; all the parts are sort of evolving independently of each other, not held back. But the downside of that is there's a fair amount of complexity in development and operations. Whereas Spark is one framework with an increasingly integrated set of APIs. So from a development perspective, it seems to be simplifying, especially as we get these notebook tools in front of it, and operationally, if someone's going to run it as a service for you, it would seem to remove some of the operational complexity. That's correct, yes. And what we're wondering about is whether running it as a service may be sort of for the tire kickers right now, and whether Hadoop has a lot more work to do to get to that level of development and operational simplicity. Databricks is doing a great job, basically, with that offering as a hosted environment, simplifying the operational part. They are definitely doing a great job.
I think in terms of combining different types of workloads, having Spark running batch in addition to real-time types of workloads, all of those have to go through the enterprise maturity cycle. We see a lot of pilot projects, a lot of proofs of concept happening with Spark. I think the Databricks survey announced 600 contributors to the project, which is great, and we will be supporting the project, and we will definitely be bringing diverse data sets for processing in Spark. I think with Hadoop, at this point, because of where it is in the maturity cycle and adoption in the enterprise, the Hadoop vendors like Cloudera, Hortonworks, and MapR are really focusing on being ready for production. They are in production already, and our announcement with Dell, Intel, and Cloudera last week is also about that: an appliance where SyncSort, Cloudera, and Dell jointly offer a simple solution for offloading expensive workloads, augmenting the data warehouse, for example. That was SyncSort, Cloudera, and Dell? Yes, Dell, Cloudera, Intel, and SyncSort. Okay, Tendu, we're out of time, but last question: what's the way forward for SyncSort, what's ahead for you guys? Yes, be ready to hear from us. A lot of innovation is coming at the end of this year and early in 2016. We will continue to focus on bridging the gap between mainframe and open source, and we will continue to innovate on making life easier for organizations and reducing their operational and business costs. All right, Tendu Yogurtcu, thank you very much for coming on theCUBE and sharing SyncSort's vision, your vision. Awesome, really appreciate it. Thank you. All right, keep right there, we'll be back with our next guest. This is theCUBE, we're live from Big Data NYC at Strata + Hadoop World. Be right back.