Live from New York City, it's theCUBE at Big Data NYC 2014. Brought to you by headline sponsor WANdisco, with support from EMC, MarkLogic, and Teradata. With hosts Dave Vellante and Jeff Kelly.

We're back. We are here with Todd Goldman, who's the Vice President of Enterprise Data Integration at Informatica. We're talking data, we're talking data integration. Todd, welcome to theCUBE. Good to have you.

Yeah, it's great to be here.

So, it's a big data week.

Yes, big data, big data week. And it's crazy over at the Javits Center. We understand it's pretty packed.

Yeah, there are 5,000 people there. A lot of action.

And I'm going to get right into it. We were talking offline at our event yesterday, where Jeff Kelly presented some information. We asked our customers, what's the number one tool set you have for your big data initiatives? The number one answer that came up was data integration, off the charts. And you were saying, well, a year ago it would have been different. What's your take on that?

Yeah, I think what's happened is people have gone from sandbox environments, where they're playing with this stuff, maybe using one or two sources that they put into Hadoop, and they can do all that with hand coding. And now it's gotten a little bit more difficult. I make an analogy to my teenage daughters, who leave their clothes all over the house. They've got their clothes all over the house, and I tried the strategy where I said, okay, they're not putting the clothes away. I'm going to take all the clothes, this mess that they've created that's extended all over my house and into the car. I'm going to take all that stuff, dump it onto their beds in their bedrooms, and then they're going to put it away. They're going to organize it and put it away. And of course, I went from having a distributed mess to a co-located mess. It was not put away. It wasn't cleaned up.
And the same thing's true, I think, with Hadoop. And that's what they realized: there's a big advantage with Hadoop, where I don't have to have a schema. I don't have to worry about the data structure going into Hadoop. But that doesn't mean it's integrated. And I think they confused those ideas. In the old days, you'd take your data out of your operational database and put it in the warehouse, and you'd have to restructure it. So you needed the E, to extract it. You needed the T, to transform it. And you needed the L, to load it. You had to transform it just to load it, to get a star schema, a structure that was highly readable for doing reporting. Well, with Hadoop, you don't have that. So everyone said, oh, you don't need data integration anymore. You don't need ETL. Well, then they put 10 data sources in there and they realized, wait a minute. In order for me to actually start querying this and get analytics out of it, a simple example, I have one system that's got the social security number in three columns and another system that has it in one column. Well, before I combine all that data together, I need to normalize that so the social security number's all in the same form, so I can combine the data and start analyzing it. And you multiply that by every attribute in every system, and it gets complicated really fast. So they're realizing, oh yeah, I've got to apply some of the old thinking to this new environment.

I love the analogy, not only because I have teenage daughters, but because it's so true. Did that work for you?

You got it, it didn't. Because I'm picturing, well, I know my daughter. She's got to fold the clothes, and she goes to the drawer, and her drawer has to be reorganized because it's all a mess, and I don't have time for this. So it just stays there; it goes on the floor. And take that metaphor into data. I might not have a place to put it. I don't have the structure. I don't have time for that.
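The social security normalization Todd describes, one source splitting the number across three columns and another keeping it in one, can be sketched in a few lines of Python. This is an illustrative sketch only; the field names and record shapes are invented, not taken from any Informatica product.

```python
import re

def normalize_ssn(*parts):
    """Join SSN fragments, strip separators, and return the canonical
    NNN-NN-NNNN form, or None if nine digits can't be recovered."""
    digits = re.sub(r"\D", "", "".join(str(p) for p in parts))
    if len(digits) != 9:
        return None
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"

# Source A keeps the number split across three columns...
record_a = {"ssn_area": "123", "ssn_group": "45", "ssn_serial": "6789"}
# ...while source B stores it as one formatted string.
record_b = {"ssn": "123-45-6789"}

ssn_a = normalize_ssn(record_a["ssn_area"], record_a["ssn_group"], record_a["ssn_serial"])
ssn_b = normalize_ssn(record_b["ssn"])

# Only after normalization can the two records be matched and combined.
assert ssn_a == ssn_b == "123-45-6789"
```

Multiply this kind of per-attribute rule across every attribute and every system, and the need for tooling over hand coding becomes clear.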
The bombs are dropping all around me. The boss is coming down. Something just went wrong. I don't have time to organize this.

And I think that's the whole thing, though: people are moving from this ad hoc world to, how do I operationalize this process? And to operationalize the process means I've got to know, well, what data do I have where? Are my sweaters in the sweater drawer? Are the pants in the pants drawer? If mom comes in, or her sister comes in and wants to borrow a sweater, but it's on the floor, she's got to go through all that stuff and find it. And the same thing's true with data, but a thousand, a hundred thousand times worse, because there's so much of it. So if there's lineage, if there's metadata to track where the information sitting in my Hadoop repository is, then somebody else can come in and share that information. And also, if I've taken five data sources, combined them together in some interesting way, and cleansed the data along the way, and then Jeff over here comes along and wants to use that, he can see that, oh, Todd's cleaned it all up. I'm not going to take all that raw data. I'm going to start from Todd's nice clean repository: he's taken my data swamp and created a nice clean data lagoon. I'm going to take something from the data lagoon, and maybe I'll add to it. But for you to decide that you trust it, you're going to look at, well, what's the lineage of the data in that lagoon? Do I trust that lineage? Has Todd cleaned it? Can I see that process? And these are the kinds of things that, if you hand code everything, are very hard to figure out. And this is where tooling comes in, and visualization of these environments comes in. This was true with classic data warehouses too; the issue there was that the process to prototype and build took too long. With Hadoop, that process to prototype has just completely shrunk.
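The lineage question, can Jeff trust Todd's data lagoon, comes down to recording, for each derived dataset, where it came from and what was done to it. A toy sketch in Python follows; real metadata catalogs record far more, and the dataset and field names here are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageEntry:
    """One hop in a dataset's history: where it came from and what was done."""
    dataset: str
    sources: list
    transformation: str
    author: str
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Todd combines five raw feeds into a cleansed "lagoon" dataset
# and records the hop so others can judge whether to trust it.
lagoon = LineageEntry(
    dataset="customer_lagoon",
    sources=["crm_raw", "billing_raw", "web_raw", "support_raw", "orders_raw"],
    transformation="deduplicated; SSNs normalized; addresses standardized",
    author="todd",
)

# Jeff inspects the lineage instead of starting over from the raw data.
assert lagoon.author == "todd"
assert "billing_raw" in lagoon.sources
```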
And so I can prototype much more quickly, but then I have to think about, well, how am I going to make that prototype repeatable? How am I going to make it so that other people can use that information?

Well, Todd, we were talking earlier a little bit about what you're seeing in terms of this transition, from a lot of the code jockeys at Hadoop World to more business conversations now, and conversations around things like governance. So are you seeing that shift? Crossing the chasm is one way to put it, but are we starting to move, do you think, from that early adopter POC phase to more of these POCs moving to production?

Yeah, so I think there are a couple of shifts that are happening. One shift has to do with how development is happening. You had a lot of hand coding being done, and the skill set required to do that work was pretty high. I was just at dinner last night with a number of customers, and one major credit card company said, wow, this Hadoop stuff's really hard. It was fine for our very core small team to hand code this, but when we start pushing this out to our business units, these guys aren't capable of this. So they're looking now at, how do we have graphical tools make development more accessible? I mean, think of it like this: how many people would like to use a Macintosh versus actually going to the Unix operating system running underneath? Everyone uses the graphical interface.

The same thing with Hadoop. Now I get it, actually.

Yeah, and by the way, that's not going to end here either, right? That's still going to be true. So there's an increase in accessibility, so there are more developers who can do this work. And in our case, we hide that complexity of data integration. We have over 100,000 Informatica developers that are out there doing data integration in our classic environment.
All those people are already Hadoop developers; they just don't know it. They can use that same graphical development environment and say, oh, I want to run that integration logic, I want to run that directly on Hadoop. So that's one part: making it more accessible. And the other part is the operationalization. Once I've made it more accessible, how do I handle things like change management? Who's affected downstream by this change? Is there a report sitting in MicroStrategy or Tableau or BusinessObjects? Is there a report that the CEO is depending on? If I change some attribute that's feeding it, I'd better make sure I know what the downstream impact of making a change in some upstream piece of data is. And once again, hand coding that creates a problem. So tools like lineage, the ability to have metadata management and clear lineage from the source all the way to the target consumer, become more important.

And trust is a big part of that: do I trust this information? Because Hadoop has a lot of advantages, but we can very quickly end up where we did with data warehouses, where people throw a lot of junk in there. It's like the clothing in my kids' room again. I don't know what clothing is clean, and what clothing they just looked at, tried on, said, I don't want to wear this today, and threw on the floor. There's some organization that's required there.

Well, so let's squint through this a little bit. In this space, in the big data and Hadoop world, we hear that it's no longer ETL; the letters are turned around, it's ELT. So talk about that a little bit.
Is it just a matter of rearranging the chess pieces a little differently, or is there a fundamentally different approach to the T, the transformation, when you're doing it in a system like Hadoop, versus the old model where you would, you know, bring the data out to an Informatica environment, do the T, and then load it into the data warehouse? How does the T change, not just in the acronym, but in reality?

Yeah, so the logical part of the T doesn't change, right? I've got five different data sources, and I want to combine them together to do something interesting. I still have to make sure there's consistency in that data: that dates are all the same format, that social security numbers are all the same format, that data is formatted in a way that's consistent. So that's one part. But then, what's the compute engine that's being used to do the T? In the old days, the compute engine was Informatica's own engine that ran on its own compute farm. You'd bring the data to the compute farm, and then you'd load it somewhere. Now the data's already sitting on the compute farm; it's sitting on Hadoop. You've got the storage and the computation tied together, so you're pushing down that transformation logic to run on Hadoop. Now, this idea has existed before. People talk about pushdown optimization, where even Informatica would take some of our logic and push it down to an Oracle system, or push it down to Teradata. The difference is that there we were pushing down SQL, and the language of SQL is much more confined than what we can do on Hadoop. So maybe we could push down 30% of our instruction set to run on Oracle or on Teradata. With Hadoop, we can push down about 95% of the instruction set of things we can do in our own system, on our own engine. We can push 95% of that down onto Hadoop, and we think eventually we'll be able to get to 100%.
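The pushdown idea can be illustrated with a single logical transformation, standardizing a date format, expressed two ways: run locally by the integration engine, or emitted as query text to run where the data already lives. This is a hypothetical sketch in Python; the HiveQL string is only an example of what generated pushdown code might look like, not Informatica's actual output.

```python
from datetime import datetime

# The logical T, defined once: standardize a US-style date to ISO format.
def to_iso(date_str, fmt="%m/%d/%Y"):
    return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")

rows = [{"order_date": "09/17/2014"}, {"order_date": "01/05/2013"}]

# Old model: bring the data to the integration engine and transform it there.
local_result = [{**r, "order_date": to_iso(r["order_date"])} for r in rows]

# Pushdown model: emit equivalent query text to run where the data already
# lives (shown here as HiveQL; the string is generated, never executed here).
pushdown_sql = (
    "SELECT from_unixtime(unix_timestamp(order_date, 'MM/dd/yyyy'), "
    "'yyyy-MM-dd') AS order_date FROM orders"
)

assert local_result[0]["order_date"] == "2014-09-17"
```

Either way the logical T is identical; only the engine that executes it changes, which is why pushing the processing to the data avoids moving the data at all.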
Because once it's in Hadoop, you really don't want to move the data out of it; you want to push the processing to the data, not the other way around. So the big advantage there is that I don't have to move the data, and I get that distributed compute power, so I get the same kind of scaling.

Well, yeah, you made a very good point. One of the central tenets is that big data's heavy and you don't want to move it around; there should be as little movement as possible. And so, taking a step back, how does that impact a vendor like Informatica, whose business is about moving data from point A to point B? It sounds like the approach is, we'll push down some of that capability into the repository.

So that's a fallacy I want to correct. Our business is partly about moving data from point A to point B, but that's only one part of it. You still have to extract data from the original source systems, and that's something that's still going to be important; legacy systems never go away. So the E will still exist, and it turns out to be harder than most people think, because getting to data in an SAP system is actually a non-trivial problem; you have to understand something about the data structure. So that part still exists. You've got to move the data in. But the other part that's been really critical for our business has been the T. And the T is, how do I once again get data consistency, but then how do I get things like the cleansing of the data? How do I profile what's in there? How do I identify, let's say, missing values? Where are those missing values? How do I then append information to that data to fill out the blank values? Names and addresses are a classic example, but it's data quality in general. I might have a set of data quality rules.
So in financial services, they'll have certain expectations of how certain financial instruments should relate to each other, and they'll set up a set of data quality rules that will say, well, if this value is more than 5x that value, that should raise a flag and somebody should get alerted. They can code all of that in our system in a graphical way, without having to, once again, get into actual physical coding. So anything that involves manipulation and transformation of that data, short of actually doing analytic-type queries, people have been doing in Informatica for years. It's just that now, once again, you're loading first and then doing the transformation. But the transformation's still happening, and you want to make it so that you don't have to have a Pig or Hive developer do it, which is the other big issue, because there are not enough smart people in the world to do this work. And there are new technologies emerging every day in this space, and that's going to continue to happen.

So the goal, it sounds like, is to allow Informatica professionals, people that have used Informatica in the more traditional world, to migrate their skills to this new world, again through this kind of graphical user interface approach.

Right. And even for people who aren't Informatica users, I would argue it's still a lot easier to use this graphical approach than to learn Pig and Hive and Spark and whatever's going to come next.
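The declarative rule Todd describes, flag a record when one value is more than 5x another, is simple to express as code, which is exactly what the graphical tools generate so the business user never has to. A minimal sketch; the function name, values, and alert format are illustrative, not from any real rule engine.

```python
def five_x_rule(value_a, value_b, ratio=5.0):
    """Return an alert string when value_a exceeds ratio * value_b,
    mirroring the 'more than 5x' rule described above; None otherwise."""
    if value_b != 0 and value_a > ratio * value_b:
        return f"ALERT: {value_a} is more than {ratio}x {value_b}"
    return None

# A pair of related financial instruments that violates the rule...
assert five_x_rule(600, 100) == "ALERT: 600 is more than 5.0x 100"
# ...and a pair that passes.
assert five_x_rule(400, 100) is None
```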
Absolutely. And the data quality component, I think, is really interesting, and one that gets overlooked a lot, because some people think, well, okay, it's big data, and the volume of the data can make up for some of the poor quality; you have enough volume that you overcome some little discrepancies. But the reality is, if you're trying to get down to, in the big data space, what we talked about last night on our panel discussion here, a segment of one, you want to treat a customer as specifically that customer; you want to know everything about that customer. If you have a data quality problem there, that can be a major problem.

Well, if you have a data quality problem and it happens to be in my $10 million account, and it's off by a million dollars, I'm going to be pretty upset about that, right? So I think this idea that your data quality problem disappears because it just becomes noise is true in a very limited set of cases. I used to work for America Online, and you have this very long tail of searches. So that's true for Google, right? You may not care as much about that long tail. But for the rest of industry, where I have customers who have problems and issues and accounts, and who are going to be upset if I lose their transaction, I can't afford a data quality problem there.

Yeah. A big theme I hear at the event this year is that Hadoop comes to the enterprise, or the enterprise comes to Hadoop is probably a better way to say it.

Yes.

And what I'm hearing anecdotally, and we hear this in the Wikibon community too, is the Hadoop guys are saying, this is the technology that's used, here's how you do it, the DevOps guys are here, follow us. And the enterprise IT folks are saying, well, wait a minute, hold on. We have processes, we have governance, we have data quality edicts. Let's collaborate on this. And now sometimes it's...
Right, but you know, there's a great report about this that MIT Sloan and Capgemini did, where they talk exactly about that. And it turns out that if you do just one or the other, you'll do only marginally better. If you do governance only but don't invest in new technology like Hadoop, you will improve your profitability, but your revenue will go down. And if you do just Hadoop but you don't govern it, your revenue goes up but your profitability goes down. If you do both, you get the combination. And when I look at our customers that are the most successful, the ones who just do Hadoop and don't figure out how to operationalize it and create repeatable projects get a little bit of a bump, oh, we have this great project, but then they've got to roll it out to the rest of the organization and they can't.

So when IT... They can't because there's no governance, there's no data quality.

Well, and they're stuck with this little tiny team that has to be a super duper set of experts as well. So it's the combination.

And it doesn't scale.

It doesn't scale. So you've got that combination. And the opposite's true too: where you over-govern, you completely hamper the creativity of the guys who are trying to do Hadoop. This is true for lots of organizations. The ones who have a better culture, where they can combine innovation with good governance, get the best benefit. And it's a hard trick to pull off, but it's really just turning the mindset around, from thinking those guys are idiots because they don't get my Hadoop thing, or those guys are idiots because they don't understand the governance, they're irresponsible. Instead, you look at it and say, okay, I see that point of view. I understand why they're saying what they're saying. How can I at least take a step towards them, because I can see the benefit for my organization by doing that?
And those companies that do that, that start looking at the glass as half full instead of half empty with respect to what the other guy is saying, end up getting better benefit from the technology faster.

There's so much dissonance in large organizations, obviously, many, many agendas. You don't have that nearly as much, or at all sometimes, in smaller organizations or startups. So what we're trying to do is identify those areas where there is alignment between IT and the business, where there is alignment even within IT, between, let's say, the DevOps crowd and the traditional governance crowd, et cetera. And those are the folks that seem to be really driving value in this business.

Yeah. Well, for Hadoop to really take off, it can't just sit in the startup community. And if you look at, you know what, have you ever heard of CORBA or DCE?

Yeah, DCE.

Okay, but if I had a thousand people in this room and I were to ask that question, maybe 2% of the people would raise their hands and know what they were. And the reason is that these are distributed technologies that died on the vine because they were too hard. The difference with Hadoop is that there are billions of dollars of investment going in to hide that complexity, to make it the Unix operating system underneath a graphical interface, whether it's for data integration, data prep, data analytics, or building new kinds of applications. We're going to see that complexity get hidden away by lots of investment done by the startups, which then gets consumed into the mainstream enterprise.

Well, the whole big data theme is a real tailwind for Informatica; it's interesting. You guys have been in the right place for a while, and now the right time has come. So we're really interested to follow you guys and the contributions you're making to this community. We've got to end it there, Todd. Thanks very much for coming to theCUBE.
Thank you very much, it's been great.

All right, keep it right there, everybody. Jeff and I will be back with our next guest right after this. We're live from Big Data NYC. This is theCUBE.