Live from Midtown Manhattan, it's theCUBE, covering Big Data New York City 2017. Brought to you by SiliconANGLE Media and its ecosystem sponsors. Hello everyone, welcome back to theCUBE's special Big Data NYC coverage here in Manhattan, in Hell's Kitchen. I'm John Furrier with my co-host Jim Kobielus, Wikibon's analyst for big data. In conjunction with Strata Data going on right around the corner, this is our annual event where we break down big data, AI, the cloud, all the goodness of what's going on in big data. Our next guest is Tendu Yogurtcu, Chief Technology Officer at Syncsort. Great to see you again, CUBE alumni, been on multiple times. Always great to have you on, get the perspective, the CTO perspective and the Syncsort update, so good to see you. Good seeing you, John and Jim. It's a pleasure being here too. Again, the pulse of big data is in New York and it's a great week with a lot happening. I always borrow the quote from Pat Gelsinger, the CEO of VMware, who said on theCUBE in 2011, before he joined VMware as CEO, when he was at EMC: if you're not out in front of that next wave, you're driftwood. And the key to being successful is to ride the waves, and the big waves are coming in now with AI. Certainly big data has been a rising tide that floats all boats, but now the aperture, the scale of data, is larger. Syncsort has been riding the wave with us, we've been having you guys on multiple times, and you know what's important, from the mainframe in the very early days. But now Syncsort just keeps on adding more and more capabilities and you're riding the wave, the big data wave. What's the update now with you guys? Where are you now in the context of today's emerging data landscape?
Absolutely. As organizations progress with their modern data architectures and build next-generation analytics platforms, leveraging machine learning, leveraging cloud elasticity, we have observed that data quality and data governance have become more critical than ever. For a couple of years we have been seeing this trend: organizations creating the data lake, offering data as a service, and enabling bigger insights from the data. And this year, really, every enterprise is trying to have that trusted data set created, because data lakes are turning into the data swamps that Dave Vellante refers to. And the collection of these diverse data sets, whether it's mainframe, whether it's messaging queues, whether it's relational data warehouse environments, is challenging for customers. We can take one simple use case like customer 360, which we have been talking about for decades now, right? Yet it's still a complex problem. Everybody is trying to get that trusted single view of their customers so that they can serve customer needs in a better way, offer better solutions and products to the customers, and get better insights about customer behavior, whether leveraging deep learning, machine learning, et cetera. However, in order to do that, the data has to be in a clean, trusted, valid format. And every business is going global. You have data sets coming from Asia, from Europe, from Latin America, and many different places in different formats, and it's becoming a challenge. We acquired Trillium Software in December 2016, and our vision was really to bring that world-leading, enterprise-grade data quality into big data environments. So last week we announced our Trillium Quality for Big Data product. This product brings unmatched capabilities of data validation, cleansing, enrichment, and matching, fuzzy matching, to the data lake. We are also leveraging the intelligent execution engine that we developed for our data integration product, DMX-h.
So we are enabling organizations to take this data quality offering, whether it's in Hadoop MapReduce or Apache Spark, or whichever compute framework it's going to be in the future. So we are very excited about that. And congratulations. You mentioned the data lake being a swamp that Dave Vellante referred to. It's interesting, because how does it become a swamp if not by being a silo, right? We're seeing data silos being the antithesis of governance, and challenges, certainly with IoT. Then you've got the complication of geopolitical borders, which you mentioned earlier. So you still have to integrate the data. You need data quality, which has been around for a while, but now it's more complex. What specifically about the cleansing and the quality of the data is more important now in the landscape? Is it those factors? Are those the drivers of the challenges, and what's the opportunity for customers? How do they figure this out? The complexity comes from many different factors, some of it from being global. Every business is trying to have a global presence, and the data is originating from web, from mobile, from many different data sets. If we just take a simple address, address formats are different in every single country. With Trillium Quality for Big Data, we support postal data from over 150 countries, and data enrichment with that data. So it becomes really complex, because you have to deal with different types of data from different countries. And the matching also becomes very difficult. Whether it's John Furrier, J. Furrier, or Jon Currier, you have to be open. Those are all my handles on Twitter, they know me. All the handles you have. Every business is trying to do better targeting in terms of offering products and understanding that the single, one and only John Furrier is a customer. That creates complexity.
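The name-variant matching problem Tendu describes can be illustrated with a minimal sketch. This is not Trillium's matching algorithm, just a toy similarity check using Python's standard `difflib`; the names, record list, and threshold are illustrative assumptions:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two normalized strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_candidates(name, records, threshold=0.7):
    """Return record names whose similarity to `name` meets the threshold."""
    return [r for r in records if similarity(name, r) >= threshold]

# Hypothetical customer records containing variant spellings of one person.
records = ["John Furrier", "J. Furrier", "Jon Currier", "Jane Smith"]
print(match_candidates("John Furrier", records))
# → ['John Furrier', 'J. Furrier', 'Jon Currier']
```

Production matching engines use far richer techniques (phonetic codes, token reordering, reference data), but the core idea is the same: score candidate pairs and accept those above a tuned threshold.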
And as in any data management and data processing challenge, the variety of data and the speed at which data is being generated are higher than we have ever observed. Hold on, Jim, I want to get Jim involved in this conversation; just make sure you guys get settled in on your microphone there. Jim, she's bringing up a good point. I want you to weigh in, add to the conversation, and take it in the direction of where the automation's happening. Because if you look at what Tendu's saying, the complexity is going to create an opportunity in software. Machine learning, root-level cleanliness, can be automated, because Facebook and others have shown that you can apply machine learning techniques to that volume of data; no human can get at all the nuances. How is that impacting the data platforms and some of the tooling out there, in your opinion? Yeah, well, one of the core issues is where do you place the data matching and data cleansing logic or execution in this distributed infrastructure: at the source, in the cloud, at the consumer level, in terms of rolling up the disparate versions of data into a common view. So by acquiring a very strong, well-established, reputable brand in data cleansing, Trillium, Syncsort has done a great service to your portfolio and to your customers. Trillium is well known for offering lots of options in terms of where to configure the logic and where to deploy it within distributed hybrid architectures. Give us a sense of the range of options you're going to provide for customers on where to place the cleansing and matching logic, and how you're going to support flexible workflows in terms of curation of the data and so forth, because the curation cycle for data is critically important, the stewardship. So how do you plan to address all of that going forward in your product portfolio?
Thank you for asking that question, Jim, because that's exactly the challenge we hear from our customers, especially larger enterprises in financial services, banking, and insurance. So our next release, actually coming at the end of the year, is targeting very flexible deployment. Flexible deployment in the sense that once you understand the data and create the business rules, setting what kind of matching and enrichment you'll be performing on the data sets, you can actually have those business rules executed at the source of the data, or in the data lake, or switch between the source and the enterprise data lake you are creating. That flexibility is what we are targeting; that's one area. On the data curation side, we see these percentages: 80% of data stewards' time is spent on data curation and data cleansing, and that is really a very high percentage. From our customers, we see this still being a challenge. One area we have started investing in is using machine learning to understand the data, and using the data discovery capabilities we currently have to make recommendations about what those business rules can be, or what kind of data validation, cleansing, and matching might be required. So that's an area we will be investing in. Are you planning, in terms of your product portfolio, to use machine learning to drive what I like to call a recommendation engine, one that presents recommendations to the data stewards, the human beings, about different data schemas, different ways of matching the data, or the optimal way of reconciling different versions of customer data? Is there going to be a recommendation engine of that sort, in line with what you're talking about?
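The idea of profiling data to recommend validation rules can be sketched in miniature. This is not Syncsort's actual approach; the format patterns, sample values, and dominance threshold below are hypothetical, but they show how a profiler might test a sampled column against known formats and suggest a rule when one format dominates:

```python
import re
from collections import Counter

# Hypothetical format patterns a profiler might test a column against.
PATTERNS = {
    "us_zip": re.compile(r"^\d{5}(-\d{4})?$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def recommend_rule(sample, dominance=0.8):
    """Recommend a validation rule if one pattern matches most sampled values."""
    counts = Counter(
        name
        for value in sample
        for name, pattern in PATTERNS.items()
        if pattern.match(value.strip())
    )
    if not counts:
        return None
    name, hits = counts.most_common(1)[0]
    return name if hits / len(sample) >= dominance else None

sample = ["10036", "10019", "07030", "1001", "10001-1234"]
print(recommend_rule(sample))  # → us_zip (4 of 5 values match)
```

A steward would then accept, reject, or modify the suggested rule, which matches the human-in-the-loop workflow discussed in the interview.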
That's our plan: recommendations that users can opt to apply, reject, or modify, because sometimes when you go too far with the automation you still need some human intervention in making these decisions. You might be operating on a sample of data versus the full data set, and you may have to infuse some human understanding and insight as well. So our plan is to make these recommendations in the first phase, at least; that's what we are planning. And when we look at the portfolio of products, and our CEO Josh, who was actually on theCUBE today as part of Splunk .conf, we have acquisitions happening, we have organic innovation happening, and we really try to stay focused on how we create more value from your data and how we increase business serviceability. Whether it's with our Ironstream product, where we made an announcement this week, Ironstream transaction tracing, to create more visibility into application performance and more visibility for IT operations. For example, when you make a payment with your mobile, you might be having a problem and you want to be able to trace back to the back end, which is usually a legacy mainframe environment. Or whether you are populating the data lake and you want to keep the data in sync and fresh with the data source, applying the changes via CDC. Or whether you are taking that data from a raw data set to more consumable data by creating a trusted, high-quality data set. We are very much focused on creating more value and bigger insights out of the data. And Josh will be on tomorrow, so folks watching, we're going to get the business perspective. I have some pointed questions I'm going to ask him, but I'll take one of those questions now and get your response from a technical perspective as CTO.
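Keeping a data lake in sync with its source by applying changes via CDC (change data capture) can be sketched minimally. This is not Ironstream's or any Syncsort product's mechanics; the record shape and operations below are assumptions for illustration:

```python
def apply_cdc(target, changes):
    """Apply ordered CDC records (insert/update/delete) to a keyed target store."""
    for change in changes:
        key, op = change["key"], change["op"]
        if op in ("insert", "update"):
            # Upsert: the change record carries the new row image.
            target[key] = change["row"]
        elif op == "delete":
            target.pop(key, None)

# A tiny keyed store standing in for a data lake table.
accounts = {"A1": {"balance": 100}}
changes = [
    {"op": "update", "key": "A1", "row": {"balance": 250}},
    {"op": "insert", "key": "A2", "row": {"balance": 40}},
    {"op": "delete", "key": "A1"},
]
apply_cdc(accounts, changes)
print(accounts)  # → {'A2': {'balance': 40}}
```

The essential property is that changes are applied in source order, so the target converges on the source's current state without a full reload.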
As Syncsort continues your journey, you keep on adding more and more things; it's been quite impressive. You guys have done a great job and we enjoy covering the success there, watching you really evolve. What is the value proposition for Syncsort today? Technically, if you go in and talk to a customer, a prospective new customer, why Syncsort? What's the enabling value that you're providing under the hood, technically, for customers? We are enabling our customers to access and integrate data sets in a trusted manner. So we are ultimately liberating the data from all of the enterprise data sources and making that data consumable in a trusted manner. And everything we provide in that data management stack is about making data available, making data accessible, and integrating with the modern data architecture, bridging the gap between those legacy environments and the modern data architecture. It becomes a really big challenge because this is a cross-platform play; it is not a single environment that enterprises are working with. Hadoop is real now, right? Hadoop is in the center of the data warehouse architecture, whether it's on-premise or in the cloud, and there's also a big trend toward the cloud. And certainly batch, they own the batch thing. Yeah, and as part of that it becomes very important to be able to leverage the existing data assets in the enterprise, and that requires an understanding of the legacy data sources, the existing infrastructure, and the existing data warehouse attributes. And you guys say you provide that? We provide that, that's our value, and we provide it in an enterprise-grade manner. Hold on Jim, one second, I'm just going to finish the thought. Okay, so given that, okay cool, you got that out there. What's the problem that you're solving for customers today? What's the big problem in the enterprise and in the data world today that you address?
I want to have a single view of my data, whether that data is originating on mobile, on the mainframe, or in the legacy data warehouse, and we provide that single view in a trusted manner. When you mentioned Ironstream, that reminded me that one of the core things we're seeing at Wikibon is that IT operations is increasingly being automated through AI; some call it AIOps and whatnot, and we're going deeper on the research there. Ironstream, by bringing mainframe and transactional data, and like the use case you brought up, IT operations data, into a data lake alongside machine data that you might source from the internet of things and so forth, seems to me a great enabler, potentially, for Syncsort, if it wished to position its solutions into IT operations, leveraging your machine learning investments to build more automated capabilities like anomaly detection and remediation. What are your thoughts? Is that where you're going, or do you see it as an opportunity, AI for IT ops, for Syncsort going forward? Absolutely, we target use cases around IT operations and application performance. We integrate with Splunk ITSI, and we also make this data available in the big data analytics platforms. So application performance and IT operations are really the main use cases we target. And as part of the advanced analytics platform, for example, we can correlate that data set with other machine data originating on other platforms in the enterprise. Nobody wants to look only at what's happening on the mainframe, or what's happening in my Hadoop cluster, or what's happening in my VMware environment, right? They want to correlate the data cross-platform, and that's one of the biggest values we bring, whether it's on the machine data or the application data. Yeah, that's quite a differentiator for you. Tendu, thanks for coming on theCUBE.
Great to see you, congratulations on your success. Thanks for sharing. Thank you. Okay, CUBE coverage here in Big Data NYC, exclusive coverage of our event, Big Data NYC, in conjunction with Strata Data right around the corner. It's our annual event for SiliconANGLE, theCUBE, and Wikibon. I'm John Furrier, with Jim Kobielus, our analyst at Wikibon on big data. Peter Burris has been on theCUBE; he's here as well. Three big days of wall-to-wall coverage on what's happening in the data world. This is theCUBE. Thanks for watching. We'll be right back with more after this short break.