Live from the Fairmont Hotel in San Jose, California, it's theCUBE at Big Data SV 2015. Welcome back everybody, this is theCUBE. We are live at Big Data SV in San Jose, California. We're winding down here on day one of a full day of live interviews on theCUBE. One of the things we've been talking about all day today, and frankly for the last couple of years in the Big Data space, is the partnership ecosystem. Big Data is much more than a single technology. It requires a lot of different parts and a lot of cooperation among partners. We're going to continue on that theme in this segment. We've got John Kreisa, who's the VP of Strategic Marketing at Hortonworks and a frequent CUBE guest. Welcome back. Thanks, Jeff. And John Haddad, Senior Director of Product Marketing at Informatica. Thanks for joining us on theCUBE. Thank you. So let's just start at the top. Tell us a little bit about your relationship, John. Let's start with you. How do the two companies, Informatica and Hortonworks, come together to establish this partnership? Right. We've been working with Hortonworks for several years now, actually since you formed as a company. Yeah, very early on. And, you know, we've had lots of customers using both Hortonworks and Informatica together for a variety of use cases. And I think what's propelled this partnership to be successful is our joint customers. You know, when Hortonworks goes into an account, they'll say, oh, we're using Informatica for data integration and data quality, and now we'd like to be able to do that on Hadoop. And with Hortonworks being the, you know, pure open source vendor, it's a natural marriage of having the best of open source with the best of data management together.
And so that's why our customers, our joint customers, are using us: to leverage a lot of the skills they have in place today to take advantage of the new technologies and innovation we're seeing in the open source community. And, you know, it's interesting, when we talk to practitioners, one of the key pain points is integrating Hadoop with the rest of their infrastructure. So obviously you need partnerships among the different players to make that happen. But specifically when you're talking about integration, and you're talking about data integration, you're actually talking about connecting systems, and that's exactly the role you play. Right, yeah. So, John, from the Hortonworks perspective, talk a little bit about, we've heard a lot about your partnership strategy, but specifically when it comes to the data integration space and how important that is to enabling what Hortonworks can bring to an organization. Sure, great. And I'll echo what John said, it's been a great partnership. One that's been based on engineering and on looking to mutually solve the problems that customers have around data, to your point. And, you know, Hadoop as a platform runs on data. I mean, it must consume data, and the value really comes out once data is loaded into it for transformation and analytics and the things that can be done there. So it's been a great partnership with Informatica, because there is a natural synergy between what Informatica's skills and expertise are and what our platform and skills are. So it's a great benefit for the customer to use these two technologies together, and for us to work together to help them understand how they work together and how they can get benefit. So talk about some of those joint customers.
What are some of the more interesting things you're seeing in terms of leveraging both technologies to build out this more modern data architecture, one that brings in some of the more traditional tools and warehouses and things that are gonna be here for a long time, along with some of the things you do? John, we'll start with you. Sure, so I think there's a very wide range of how the technologies work together and how they're being applied. And I'll start kind of top down: the customers ultimately are trying to get to a data lake, and it's something that we've been talking about with Informatica for some time now. But if you look at where they start, they usually start with one or two simple use cases and build up to that more complex deployment. One would be trying to get a 360 degree view of the customer, right? They have all these different data sources that they need to bring together to get that single unified view of the customer and break down the existing silos. That's one of the places where working with Informatica is so great, because they can help bring those data sources from those different components into the platform so that the customers can really achieve that 360 degree view of the customer. Mm-hmm, mm-hmm. And from your perspective, I mean, John mentioned the data lake, and we hear that a lot as kind of the foundation that companies are starting to build out using Hadoop. But of course an important part of that is things like compliance and governance and data quality, which is obviously where Informatica plays as well. What does Informatica bring to the table in terms of helping make sure that data lake doesn't turn into a data swamp, just a lot of data you don't have any kind of handle on? That's right, yeah. So like John was saying, they're using us for a variety of different use cases.
Some of these use cases are things that they've been doing before but want to do much better. Now, in the traditional world, you have data governance, and when I say data governance, what does that include? It includes data quality. It includes mastering your data, managing your master data, customer entities, customer relationships, product relationships. It includes data security, all the things that help you manage data as an asset. That's true in the new world too. That doesn't go away, right? And the conversation has shifted over the last year from the basics, right? Storing data at scale, processing it at scale. We've gotten through those basic fundamentals to, oh, now we're using this at an enterprise-wide scale, not just for one project and one use case, but for multiple projects and multiple use cases. It's the data lake, right? We started off doing data warehouse optimization and offloading to kind of control costs. Then you do the one project, the second project, and then that gets popular, that gains momentum. So you need to bring those data governance disciplines into the new world. And so we're helping our customers jointly drive that. You guys rolled out the Data Governance Initiative, you can talk more about that. And we've been doing data governance for many, many years. Data quality, MDM, data security, and so on. And, once again, getting back to the skills aspect of it, you can leverage those same skills now in the Hadoop world, and you don't necessarily have to retrain or hire new people to do those types of things. And you can also leverage a lot of the work you've already done in the traditional world. You talked about integration, integrating Hadoop into the current infrastructure and ecosystem. Well, MDM today still runs on traditional technology, right?
And we've created that golden record, so to speak, managing all the relationships in a household or between employees and customers and organizations. That golden record can serve as a way to join disparate data sets that are sitting in the data lake. And when you discover all the information related to customers and certain demographics, you can use that information to enrich the master data. So the information flowing between the data lake and the MDM system, for example, is bi-directional. Well, and I think you touched on something really important around the data governance, because for some of the practitioners we've talked to, some of the challenges have been: well, okay, we started with this concept of a data lake, we're going to offload data from our existing systems for some cost savings, but they didn't think about the data governance at that point. And then you say, well, don't I have a foundation here, can't I now build applications on top of it? Well, you don't unless you've got the governance, because then you start building prototypes and you get the compliance team involved, you get the business side, the lawyers: well, you can't do that. Where did this data come from? Can you prove where it came from? And those questions can very easily derail big data projects if you don't address them upfront. Are you noticing, either of you, customers recognizing that earlier in the process? Yeah, I mean, I would say yes, for all the reasons that you talked about and that John talked about. The requirements from the enterprise don't change once they start to deploy applications on a new platform, right? They still have to have the governance and the security and the other components that go around that. I think as Hadoop matures and gets used for broader and broader use cases, it's natural that those things are going to be required. So that is one reason we started the Data Governance Initiative.
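To make the golden-record idea described above concrete, here's a minimal, purely illustrative sketch in Python. The record layouts, IDs, and helper function are all hypothetical, not Informatica's actual MDM model or API; it just shows the two directions of flow mentioned: the master ID joining silos into a 360 degree view, and a lake-derived attribute flowing back to enrich the master data.

```python
# Hypothetical sketch of the bi-directional flow between an MDM "golden
# record" and data sets in the data lake. All names and fields are
# illustrative assumptions, not a real product's data model.
from collections import Counter

# Golden records: one deduplicated entry per customer, keyed by master ID.
master = {
    "C001": {"name": "Ada Lovelace", "household": "H10"},
    "C002": {"name": "Alan Turing", "household": "H11"},
}

# Two disparate sources sitting in the data lake, already matched to IDs.
web_clicks = [
    {"master_id": "C001", "page": "/home"},
    {"master_id": "C001", "page": "/pricing"},
    {"master_id": "C002", "page": "/docs"},
]
support_tickets = [{"master_id": "C002", "issue": "login failure"}]

def customer_360(master_id):
    """Direction 1: the golden record is the join key across silos."""
    view = dict(master[master_id], master_id=master_id)
    view["pages"] = [c["page"] for c in web_clicks
                     if c["master_id"] == master_id]
    view["issues"] = [t["issue"] for t in support_tickets
                      if t["master_id"] == master_id]
    return view

# Direction 2: enrich the master data with an attribute derived in the lake.
click_counts = Counter(c["master_id"] for c in web_clicks)
for mid, record in master.items():
    record["click_count"] = click_counts.get(mid, 0)
```

The point of the sketch is only that the same master key is used in both directions: outward to assemble the unified view, inward to attach what the lake discovered.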
It's a very common way that Hortonworks tries to rally the ecosystem and the community around solving a problem in open source for the enterprise. So that's very representative. This one happens to be around governance, making sure that the platform can solve the governance requirements while working with the ecosystem and helping to drive that forward. So that's really very true in terms of how we're evolving the platform. And John, I want to turn back to Informatica and kind of your role in the big data landscape. Because frankly, I was a little naive about some of the things you've been doing with your customers for a long time around not just the structured data, which is what you're known for, but also a lot of multi-structured data work, and in fact, quite a bit of the data under management, if you will, in the Informatica customer base is this multi-structured kind of data. Talk a little bit about what you've been doing in that space, not just in the last six months but over the last several years. That's right, yeah. So I can't remember the exact statistic, but something like 80% of the data is unstructured or semi-structured or complex formats of data. Machine log files, clickstream data, or just industry standard data, right? Like in healthcare, you have HL7 and HIPAA; in financial services, you've got FIX and SWIFT; in insurance, you've got NACHA and ACORD and all these different standards. You've got market data that's streaming in, and it's coming in at different latencies, right? And so you don't want an impedance mismatch between the rate at which the data's being generated and the rate at which you can ingest it into the data lake. So one thing that we've been doing over many years, although unfortunately we haven't broadcast the word as widely as we'd like, is the ability to stream data directly into Hadoop.
This is, once again, machine data, real-time data, sensor data. But also, once you bring the data in, how do you parse and extract out the elements that you need, right? Because it is complex or multi-structured formatted data. And so we have pre-built parsers to do that so that you don't have to reinvent the wheel. So for example, we have pre-built parsers for HL7 in healthcare, for FIX and SWIFT in financial services, EDI in manufacturing, and you can build custom parsers using a visual development environment. That's one thing that we bring to increase productivity, and we've done some benchmarks with Hortonworks showing that we've increased productivity over hand-coding. Not that there's anything wrong with hand-coding in certain cases, but for things like ETL or data quality where there are pre-built transforms, and once again, we've been doing this over many years, why reinvent the wheel, right? Leverage those capabilities. It's the same for the parsing of the multi-structured data. And then you need to integrate that with the structured data. So what we see very commonly is, how do I do transformations and integration on these different types of data sets? Well, the first thing you need to do is not just parse and extract, but normalize and standardize all the different codes that you have. In a traditional, more structured schema, that's already been done for you. You don't have that with all the messy types of data. Now, that's not to say that you don't always want some messy data to look for outliers and anomalies. And this is another concept in the data lake, right, being able to go from the swamp to the sandbox, from the sandbox to the more refined, as it were. So there are these different stages and categories of data, and you just have to recognize that data needs to be fit for use or fit for purpose. Well, there's a spectrum of use cases. Exactly, yeah. It's not a simple single application.
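The parse, extract, and normalize steps described above can be sketched with a toy example. The log format, regular expression, and code table below are entirely made up for illustration; real parsers for industry standards like HL7 or FIX are far more involved, and this is not how any particular product implements them.

```python
# Toy parse -> extract -> normalize pipeline for a made-up machine-log
# format. Everything here (format, fields, code table) is an assumption
# for illustration only.
import re

RAW_LOGS = [
    "2015-02-18T10:01:07|dev=sensor-42|st=OK|temp=71.6F",
    "2015-02-18T10:01:09|dev=sensor-07|st=fail|temp=22.0C",
]

# Normalize the different status codes the sources emit into one vocabulary.
STATUS_CODES = {"OK": "ok", "ok": "ok", "fail": "error", "FAIL": "error"}

LINE_RE = re.compile(
    r"(?P<ts>[^|]+)\|dev=(?P<device>[^|]+)\|st=(?P<status>[^|]+)"
    r"\|temp=(?P<temp>[\d.]+)(?P<unit>[FC])"
)

def parse_record(line):
    """Extract the fields we need and standardize units and codes."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # route unparseable lines aside for later inspection
    temp = float(m.group("temp"))
    if m.group("unit") == "F":  # standardize all temperatures on Celsius
        temp = round((temp - 32) * 5 / 9, 1)
    return {
        "ts": m.group("ts"),
        "device": m.group("device"),
        "status": STATUS_CODES.get(m.group("status"), "unknown"),
        "temp_c": temp,
    }

records = [r for r in (parse_record(line) for line in RAW_LOGS) if r]
```

Even in this tiny sketch the two distinct concerns show up: extraction (pulling fields out of a raw line) and standardization (mapping divergent codes and units onto one schema) so the result can then be joined with structured data.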
And that's what the data lake is all about, the whole concept, and what YARN has enabled: running different types of applications on the same corpus of data where you can now reuse that data. And the value of that data goes up the more you reuse it. That's right. So talk a little bit about how you're helping companies consume this faster, lower that time to insight. I know you've worked together on this trial download. Talk a little bit about that, what you guys are doing together there. Yeah, you want to go ahead? I mean, I would say, you know, one of the big things John touched on early on was just the reuse of skills, right? It's very important for the practitioners and users to be able to not necessarily have to learn new skill sets, but take advantage of what they have already. And I think that's one of the things that we both commonly preach together: hey, you can reuse your skills on this new infrastructure for new kinds of value. And that's where the integration has come along. So what we've done is we've worked together to make it so that users can now, with a single download, get a sense of how to use the Informatica tools on top of the Hortonworks data platform, with a single virtual machine download that they can run on their platform with built-in tutorials. So we're pretty excited about that. It's something we've worked on together. Do you want to give details on how they can use it? Yeah, I mean, just like Hortonworks has a wealth of tutorials for learning Hadoop and how to do certain things on Hadoop, we have included some tutorials for web log processing or dealing with change data, some common problems as you move from the old world to the new world. So I think that's a...
And we need to add more of those types of tutorials, because I think that's what helps bridge the gap between the traditional skills required to do the work and some of the new skills, just trying to make it easier for people to do that. Yeah, well, I mean, I think it's a combination: you need to build up the skill set of practitioners, and you also have to look to the vendor community to build software and tools that just make it easier, lower the bar. It sounds like you have to attack it from both angles. Right, that's right. That's right, not kind of one or the other. That's right. Yeah, and the reason I keep bringing that up is because that's probably the number one reason why customers talk to us: we can scale up the storage, we can scale up the processing on Hadoop, but we can't scale up the skills to do the work, right? So, yeah. Yeah, thank you. So we're running close on time, so I want to get your take on what's going on here this week. We've kind of dubbed it Big Data Week in San Jose. Strata + Hadoop World is kind of kicking off, I think officially any minute now. What are you looking for this week? Beyond what you guys might be announcing, just in general, what are you expecting to see at the show, maybe from customers? What's kind of top of mind for you? John, I want to start with you. Sure, yeah, I'm looking to hear more exciting use cases, how kind of the state of the state has been advanced in terms of how it's being used by the enterprise.
I mean, I think just given where I sit in the Hadoop ecosystem, I have a pretty good idea of what's happening in the technology, so I'd like to hear how companies and also partners are integrating with Hadoop to help drive more value out of it and help with the acceleration. We see Hadoop adoption accelerating, and I want to understand how these other partners are helping with that, and what are some of the interesting use cases that might be emerging. I look at it both from a technology perspective and a business perspective. From a technology perspective, I'm looking at new projects, new things that are emerging in the open source community, because like I said at the beginning, we feel that the best thing for the community and the best thing for our customers is taking the innovations that are occurring in the open source community and combining that with the innovations that we've created over the last 20 years. And so just like we use MapReduce and Hive and YARN and Tez and Spark and all these different things that have come out of the community, we want to help kind of hide some of that complexity, because there are new things coming out all the time, and we want to help the community adopt those innovations faster by hiding some of that complexity with our development environment. So I'm always looking at some of the new technologies that are emerging and working with R&D to say, well, should we be leveraging this for these types of transformations or for these types of processing? That's what we've done, so that's one. And then the second is, like John said, the different types of use cases. I'm always interested in where companies are in their journey, in their big data journey. A lot of them start with data warehouse optimization, move to the data lake, move to the 360 customer analytics, then to real-time operational intelligence. They have a vision just like we have a vision, and I'm always curious to see where
they are in that journey and that evolution. Yeah, I think customers can sometimes be unpredictable, and it's fun to watch where they might take this market. There are things we can prognosticate, but you never really know where customers are gonna take it, which is really interesting, especially in this space, because of some of the open source nature. You've got practitioners who are also creating technology and innovating, so it's a really interesting market to cover, market to watch. So guys, John from Hortonworks, John from Informatica, thanks for joining us on theCUBE, appreciate it. Guys, thanks for watching, and we will be right back after this to wrap up day one.