Okay, we're back. We are in Cambridge, Massachusetts. This is Dave Vellante. I'm here with Jeff Kelly. We're with Wikibon. This is theCUBE, SiliconANGLE's production. We're here at the MIT Information Quality Symposium in the heart of database design and development. We've had some great guests on. Scott Howser is here. He's the head of marketing at Hadapt, a company that we introduced to our community quite some time ago, really bringing multiple channels into the Hadoop ecosystem and helping make sense out of all this data, bringing insights to this data. Scott, welcome back to theCUBE. Thanks for having me. It's good to be here. So this notion of data quality, the reason why we asked you to be on here today is, first of all, you're a practitioner. You've been in the data warehousing world for a long, long time, so you've struggled with this issue. The people here today are really from the world of, hey, we've been doing big data for a long time; this whole big data theme is nothing new to us. But there's a lot that is new. So take us back to your days as a data practitioner, data warehousing, business intelligence. What were some of the data quality issues that you faced and how did you deal with them? So I think there are a couple of points to raise in that area. One of the things that we liked to do was try and triangulate on a user to engage them. Every channel we wanted to bring into the fold created a unique dimension of how do we validate that this is the same person, right? Because each channel that you engage with has potentially different requirements of user accreditation or a guarantee of a single user, if you will. So I think the holy grail used to be, in a lot of ways, like single sign-on, a way to triangulate across disparate systems on one common identity or person to make that world simple. I don't think that's a reality, in the sense that when you look at a product provider or a solution provider and a customer that's external, those two worlds are very disparate, and there are a lot of channels and potentially even third-party means that I might want to engage that individual by, and every time I want to bring another one of those channels online, it further complicates validating who that person may be. Okay, so when you were doing your data warehouse thing, again, as an IT practitioner, you tried to expand the channels, but every time you did that it complexified the data source. So how did you deal with that problem? Did you just create another database and stovepipe everything? Well, unfortunately, it absolutely creates this notion of islands of information throughout the enterprise because, as you mentioned, you define a schema effectively and you place data elements into that schema of how you identify, how you engage, and how you rate that person's behaviors or engagement, et cetera. And I think what you'd see is, as you bring on new sources, the time to actually merge those things together wasn't on the order of days or weeks, it was on the order of months and years. And so with every new channel that became interesting, you further complicate the problem, and effectively what you do is you end up creating these pools of information. You take extracts and you try to do something to munge the data and put it in a place where you give access to an analyst to say, okay, here's another sample set of data; try and figure out if these things align, and can you try and create effectively a new schema that includes all the additional data that we just added?
So it's interesting, because again, one of the themes that we've been hearing a lot at this conference, and you hear it a lot at many conferences, is that it's not the technology, it's the people and process around the technology. Certainly any IT person would agree with that. But at the same time, the technology historically has been problematic; in particular, the data warehouse technology has been challenging. So you've had to keep databases relatively small and disparate, and you've had to build business processes around those databases. So you've not only got deficient technology, if you will, no offense to some of my data warehousing friends, but you've got process creep. That's absolutely fair. That's occurred. And I think what ends up happening is that it's one of the things that's led to the revolution that's occurring in the market right now, whether it's the Hadoop ecosystem or all the tangential technologies around it, because what's bound some of the technology issues in the past has been the schema. And as important as that is, because it gives people a very easy way to interact with the data, it also creates significant challenges when you want to bring on these unique sources of information, because as you look at things that have happened over the last decade, the engagement processes for a consumer, a prospect or a customer have changed pretty dramatically, and they don't all have the same stringent requirements about providing information to become engaged that way. So I think where the schema has value, obviously, in the enterprise, it also has a lot of historical challenges that it brings along with it. Yeah, so this Hadoop movement is very disruptive to the traditional market space. Many folks say it isn't, a lot of the traditional guys say it isn't, but it clearly is, particularly as you go omni-channel. You threw that word out earlier; omni-channel is a discussion that we had at the Hadoop Summit, myself, John Furrier, Abhi Mehta. And this is something that you guys are doing, bringing in data to allow your customers to go omni-channel. As you do that, you start to, again, increase the complexity of the corpus of data. At the same time, a lot of times in Hadoop, you hear about schema-lite, schema-less. All right, so how do you reconcile the omni-channel, the schema-less or schema-lite, and the data quality problems? So I think, particularly speaking about Hadoop, one of the things that we do is we give customers the ability to effectively dump all that data into one common repository, that is HDFS in Hadoop, and leverage some of those open source tools and even their own inventions, if you will, with MapReduce code, Pig, whatever, and allow them to effectively normalize the data through iterations in Hadoop and then push that into tables, effectively, that now we can give access to via a SQL interface. So I think for us, you're absolutely right. The more channels you can give access to, so this concept of an omni-channel where, irrespective of what way we engage with a customer or what way they touch us in some way, being able to provide those dimensions of data in one common repository gives the marketer, if you will, incredible flexibility and insights that were previously undiscoverable. Assuming the data quality is there. Assuming data quality is there. So that would be my question.
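To make the workflow Scott describes a bit more concrete (land raw per-channel records in one repository, normalize them through iterations, then expose the result through a SQL interface), here is a minimal Python sketch. It is not Hadapt's pipeline: the channel formats, field names, and the use of an in-memory SQLite table standing in for the SQL-on-Hadoop layer are all assumptions for illustration.

```python
import sqlite3

# Hypothetical raw records as they might land in the "raw zone" (HDFS in the
# scenario Scott describes); each channel has its own shape and identifiers.
web_events = [
    {"visitor_email": "ANA@EXAMPLE.COM", "page": "/pricing", "ts": "2013-07-17T10:02"},
    {"visitor_email": "bob@example.com", "page": "/signup",  "ts": "2013-07-17T10:05"},
]
email_events = [
    {"recipient": "ana@example.com", "campaign": "q3-launch", "opened": True},
]

def normalize(record, channel):
    """Map a channel-specific record onto one common shape.

    In practice this is where iterative MapReduce/Pig passes would do the
    cleansing and ID resolution; here, lower-casing the email stands in for
    deriving a 'universal ID'.
    """
    email = (record.get("visitor_email") or record.get("recipient") or "").lower()
    return {"user_id": email, "channel": channel, "detail": str(record)}

rows = [normalize(r, "web") for r in web_events] + \
       [normalize(r, "email") for r in email_events]

# Push the normalized records into a table an analyst can hit with plain SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE engagement (user_id TEXT, channel TEXT, detail TEXT)")
db.executemany("INSERT INTO engagement VALUES (?, ?, ?)",
               [(r["user_id"], r["channel"], r["detail"]) for r in rows])

for row in db.execute(
        "SELECT user_id, COUNT(*), GROUP_CONCAT(DISTINCT channel) "
        "FROM engagement GROUP BY user_id"):
    print(row)
```

The point of the pattern is the ordering: structure is applied after the data has landed, so adding another channel means adding another `normalize` branch rather than redesigning a schema up front.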
So what are the data quality implications of using something like HDFS where you're essentially schema-less, you're just dumping data in its raw format? Now you've got to reconcile all these different types of data from different sources and build out that kind of single view of a customer, of a product, whatever it is you're looking at. So how do you go about doing that in that kind of scenario? So I think the repository in Hadoop, HDFS itself, gives you that one common ground to work in, because you've got no implications of schema or any other preconceived notions about how you're going to massage the data, if you will, and it's about applying logic and looking for those universal IDs. There are a bunch of tools around that are focused on this, but it's about applying those tools in a way that doesn't handicap them from the start by predisposing them to some structure, enabling them to decipher or cull that out, whether it's, again, homegrown-type scripts or tools that might be upstairs here, and then effectively normalizing the data and moving it into some structure where you can then interact with it in a meaningful way. So that's really different from the old way of trying to bring snippets of the data from different sources into yet another database where you've got to apply structure. That takes time, months and years in some cases. And so Hadoop really allows you to speed up that process significantly by basically eliminating that part of the equation. Yeah, I think there are a bunch of dimensions we can talk about, things like even pricing exercises, right, and the quality of triangulating on what that pricing should be per product, per geography, per engagement, et cetera. I think you see that a lot of those types of workloads have transitioned from mainframe-type environments, distributed environments of legacy, to the Hadoop ecosystem, and we've seen cases where people talk about going from eight-month exercises to a week. And I think that's where the value of this ecosystem and the commodity scalability really provides you with flexibility that was just previously unachievable. So could you provide some examples, either from your own career or from some customers you're seeing, in terms of the data quality implications of the type of work they're doing? One of our kind of theses is that the data quality measures required for any given use case vary. In some cases, depending on the type of use case and depending on the speed that you need the analysis done, the level of data quality required is going to vary. Are you seeing that? And if so, can you give some examples of the different ways data quality manifests itself in big data workloads? Sure, so I think that's absolutely fair. And obviously there's going to be some trade-off between accuracy and performance, right? So you have to create some sort of confidence coefficient, if you will, that says within some degree of probability, this is good enough, right? And there's got to be some sort of balance between that accuracy and time. One of the things that I've seen a lot of customers being interested in is this sort of market emerging around providing tools for authenticity of engagement. So as an example, I may be a large brand and I have very open channels that I engage somebody with; it might be email, might be some web portal, et cetera. And there's a lot of phishing that goes on out there, right?
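As a rough picture of what "applying logic and looking for those universal IDs" combined with a "confidence coefficient" might look like, here is a small record-linkage sketch. The fields, weights, and the 0.8 threshold are invented for illustration and are not Hadapt's method; a real system would use far more signals.

```python
from difflib import SequenceMatcher

def field_sim(a, b):
    """Crude string similarity in [0, 1]; empty or missing fields score 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_confidence(rec_a, rec_b):
    """Weighted score that two channel records describe the same person.

    The weights and fields are illustrative assumptions only.
    """
    return (0.6 * field_sim(rec_a.get("email"), rec_b.get("email")) +
            0.3 * field_sim(rec_a.get("name"),  rec_b.get("name")) +
            0.1 * field_sim(rec_a.get("phone"), rec_b.get("phone")))

crm   = {"name": "Ana Lopes", "email": "ana@example.com", "phone": "617-555-0101"}
web   = {"name": "A. Lopes",  "email": "ana@example.com", "phone": ""}
other = {"name": "Bob Chen",  "email": "bob@corp.test",   "phone": ""}

THRESHOLD = 0.8  # "good enough" confidence for this hypothetical use case
for candidate in (web, other):
    score = match_confidence(crm, candidate)
    verdict = "same person" if score >= THRESHOLD else "no link"
    print(f"{candidate['name']:<10} score={score:.2f} -> {verdict}")
```

Raising or lowering `THRESHOLD` is exactly the accuracy-versus-time dial Scott mentions: a stricter cutoff links fewer records with higher confidence, a looser one links more but risks false matches.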
And so people are phishing, whether it's misrepresenting themselves as brands, et cetera. And there's a lot of desire to try and triangulate on data quality of who is effectively positioning themselves as me and who's really not me, and to be able to take sort of a cybersecurity spin, lock those things down, and alleviate those sorts of nefarious activities. So we've seen a lot of people using our tool to effectively understand and be able to pinpoint those activities based upon behaviors, based upon outliers, and looking at examples of where the engagement's coming from that aren't authentic. So if that makes sense, I'm trying to be somewhat nebulous, but. Right, so using analytics essentially to determine the authenticity of a person, of an entity, of an engagement, rather than kind of looking at the data itself, using pattern detection to determine it. But also taking, you know, there's a bunch of raw data that exists out there that, when you put it together, again, back to this notion of this sort of landing zone, if you will, or data lake or whatever you want to call it, putting all of this data into one repository where now I can start to do analytics against it without any sort of predetermined schema and start to understand, are these people who are purporting to be firm XYZ really firm XYZ? And if they're not, where are these things originating, and how can we start to put filters or put things in place to alleviate those sorts of activities? And that could apply, it sounds like, certainly to private industry, but it sounds like something government would be very interested in as well, in terms of what's in the news about different foreign countries potentially being the source of attacks on US corporations or parts of our infrastructure, and trying to determine where that's coming from and who these people are. And of course, it gets complicated because people are trying to cover up their tracks, right? Certainly, but I think the most important thing in this context is that it's not necessarily about being able to look at it after the fact; it's about being able to look at a set of conditions that occur before these things happen, identify those conditions, and put controls in place to stop the action from taking place. I think that's where, when you look at what is happening with the acceleration of these models and the acceleration of the quality of the data that you're gathering, being able to put those things into place and put effective controls in place beforehand is changing the loss prevention side of the business. That's one example, but you're absolutely right; from what I see and from what our customers are doing, it's multi-dimensional, in that cybersecurity is one example, pricing could be another example, engagements from a funnel analysis or a conversion ratio could be yet another example. So I think you're right in that it is ubiquitous. So when you think about the historical role of the, well, historical, we had Stuart on earlier, he was saying the first known chief data officer we could find was 2003. So I guess that gives us a decade of history, but if you look back at the whole, I mean data quality, we've been talking about that for many, many decades.
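The behavior- and outlier-based flagging Scott alludes to can be pictured with a very simple statistical rule. The sender addresses, counts, and the two-sigma cutoff below are invented for illustration; a real system would use far richer features and models than a single z-score.

```python
from statistics import mean, stdev

# Hypothetical hourly send counts per purported "brand" sender.
senders = {f"team{i}@brand.example": n
           for i, n in enumerate([40, 42, 38, 45, 41, 39, 43, 44, 37, 40])}
senders["brand-offers@lookalike.biz"] = 410   # behaves nothing like the rest

mu = mean(senders.values())
sigma = stdev(senders.values())

# Flag senders whose behavior sits far outside the norm; the 2-sigma cutoff
# is an arbitrary stand-in for whatever model a real system would use.
for sender, count in senders.items():
    z = (count - mu) / sigma
    if abs(z) > 2:
        print(f"suspicious: {sender} (z-score {z:.1f}, {count} sends/hour)")
```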
So if you think about the traditional role of an organization in trying to achieve data quality, single version of the truth, information quality, information value, and you inject it with this disruption of Hadoop. To me anyway, that whole notion of data quality is changing, because in certain use cases inference is just fine and false positives are tolerable; who cares if you're analyzing Twitter data in some cases. In others, like healthcare and financial services, it's critical. So how do you see the notion of data quality evolving and adapting to this new world? Well, I think one of the things you mentioned, this single version of the truth, was something that, when I was on the other side of the table, they were beating you over the head with. Very much, you know, we can do this, we can do this, and it's something that sounds great on paper, but when you look at the practical implications of trying to do it in a very finite or stringent, controlled way, it's not practical for the business. Because you're saying that the portions of your data that you can give a single version of the truth on are so small because of the elapsed time lag. That's right, I think there's that dimension, but there's also this element of time, right? The time that it takes to define something that can be that rigid and that structured is months, and by that time, a lot of the innovation the business is trying to accomplish has moved on. The KPIs have changed, the initiatives have changed. You lost the sale, hey, but we got the data. Yeah, but look here, yeah. And I think that's what's evolving, and I think there's this idea that, you know what, let's fail fast, let's do a lot of iterations, and the flexibility that's being provided out in that ecosystem today gives people an opportunity to iterate and fail fast, and, you're right, you set some sort of confidence, so that for this particular application we're happy with an 80% confidence coefficient, right? Or something a little higher. It's good enough. So having said that, what can we learn from the traditional data quality, chief data officer practitioners, those who've been very dogmatic, particularly in certain industries? What can we learn from them and take into this new world? I think from my point of view, and what my experience has always been, is that those individuals have an unparalleled command of the business and an appreciation for the end goal that the business is trying to accomplish, and it's taking that instinct, that knowledge, and applying it to the emergence of what's happening in the technology world, bringing those two things together, right? It's not so much, here are the technology options that we have to do these desired engagements, whether it's, again, the pricing engagement, the cybersecurity or whatever. It's more, how can we accelerate what the business is trying to accomplish by applying this technology that's out there to the business problem? I think in a lot of ways, in the past it's always been, hey, I've got this really neat technology, how can I make it fit somewhere? And now, I think those folks bring a lot of relevance to the technology, to say, hey, here's a problem we're trying to solve. Legacy methodologies haven't been effective, haven't been timely, haven't been scalable or whatever. How can we apply what's happening in the market today to these problems?
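One generic way to picture the "80% is good enough" trade-off between accuracy and time discussed here (purely an illustration, not anything Hadapt specifically does) is estimating a metric from a sample instead of scanning the full data set:

```python
import random
import time

random.seed(7)

# A made-up "full" engagement table: 1 means the contact converted.
full_table = [1 if random.random() < 0.031 else 0 for _ in range(2_000_000)]

def conversion_rate(rows):
    return sum(rows) / len(rows)

# Exact answer over everything (the slow, "single version of the truth" path).
t0 = time.perf_counter()
exact = conversion_rate(full_table)
t_exact = time.perf_counter() - t0

# Fail-fast answer over a 1% sample (the "good enough" path).
t0 = time.perf_counter()
sample = random.sample(full_table, k=len(full_table) // 100)
approx = conversion_rate(sample)
t_sample = time.perf_counter() - t0

print(f"exact  {exact:.4f}  in {t_exact * 1000:.0f} ms")
print(f"sample {approx:.4f}  in {t_sample * 1000:.0f} ms")
```

The sample answer lands within a fraction of a percentage point of the exact one in a fraction of the time, which is the kind of iteration speed the fail-fast approach is after.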
You guys at Hadapt in particular are, to me anyway, a good signal of the maturity of Hadoop; it's starting to grow up pretty rapidly, we're seeing Hadoop 2.0. So where are we at? What do you see as the progression and where are we going? So, I mentioned it on theCUBE the last time at Summit, and I said I believe that Hadoop is the operating system of big data and that there's a huge transition taking place. There was some interesting response to that on Twitter and some of the other channels, but I stand behind it; I think that's really what's happening. I look at what people are engaging us to do, and it's really to start to transition away from the legacy methodologies; they're looking at these not just as lower-cost alternatives but also for more flexibility. We talked at Summit about the notion of that revenue curve: cost takeout's great on one side of the coin, or one side of the fence here, but I think equally or even more important is the change in the revenue curve and the insights that people are finding because of these unique channels, or the omni-channels you describe. Being able to look at all these dimensions of data in one unified place is really changing the way that they can go to market, engage consumers, and provide access to the analysts, because ultimately that's the most important thing. We had Stuart Madnick on, who's written textbooks on operating systems; we probably used them, I know I did. Maybe they were gone by the time you got there, but you're not that young. But the point being, Hadoop as an operating system, the notion of a platform, is really changing dramatically, so I think you're right on about that. Okay, so what's next for you guys? We talked about customer traction and proof points; you're working hard on that, I know. You guys have got great tech, an amazing team. What's next for you? So I think it's continuing to look at the market and being flexible with the market as these use cases develop. Obviously, as a startup we're focused in a couple of key areas where we see a lot of early adoption and a lot of pain around the problems that we can solve, but I think it's really about continuing to develop those use cases and expand in the market to become more of a holistic provider of analytics solutions on top of Hadoop. How's Cambridge working out for you? I mean the company moved up, the founders moved up from New Haven and chose the East Coast, chose Cambridge. We were obviously really happy about that as East Coast people. You don't live there full time but you might as well. So how's that working out? The talent pool, the vibrancy of the community, the young people that you're able to tap. How's that all going? So there are a bunch of dimensions around that. One, it's hot, it's really, really hot. Inhuman. Yes, but it's been actually fantastic, and if you look at not just the talent inside the team but also around the team. So if you look at our board, right? Jit Saxena, Chris Lynch, right? Folks who've been very successful in the database community over decades of experience, and getting folks like that onto the board; Felda Hardymon has been in this space as well for a long time. Having folks like that as advisors providing guidance to the team is absolutely incredible. HackReduce is a great facility where we do things like hackathons and meetups and get the community together. So I think there's been a lot of positive momentum around the company just being here in Cambridge.
From a development or resource or recruiting point of view, it's also been great, because you've got some really exceptional database companies in this area, and history will show you there's been a lot of success here, not only in incubating technology but in building real database companies. We're a startup on the block that people are very interested in, and I think we reflect a lot of the dynamics that are changing in the market and the way the market's moving. So the ability for us to recruit talent is exceptional. We've got a lot of great people to pick from. We've had a lot of people join from other previously very successful database companies. The team's growing significantly in the engineering space right now. But I just, I can't say enough good things about the community, HackReduce, and all the resources that we get access to because we're here in Cambridge. Because HackReduce is cool. So you guys are obviously leveraging that. You do how-tos, bring people in, so HackReduce is essentially, it's not an incubator, it's really more of an idea cloud. It's a resource cloud, really. Started by Fred Lalonde and Chris Lynch, and essentially people come in, they share ideas. You guys, I know, have hosted a number of how-tos, and it's basically open. We've done some stuff there; it's very cool. Yeah, and I think even for us, it's also a great place to recruit, right? We meet a lot of talented people there, and with the university participation as well, we get a lot of talent coming in and participating in these activities. And we do things that aren't just Hadapt related. I mean, we've had people come in to teach Hadoop sessions and just sort of evangelize what's happening in the ecosystem around us. And like I said, it's just been a great resource pool to engage with. And I think it's been as beneficial to the community as it has been to us. Very grateful for that. All right Scott, as always, awesome seeing you. I knew you were going to have some good practitioner perspectives on data quality. So really appreciate you stopping by. My pleasure, thanks for having me. Great to see you again. Take care. All right, keep it right there everybody. We'll be right back with our next guest. This is Dave Vellante with Jeff Kelly. This is theCUBE, we're live here at the MIT Information Quality Symposium. We'll be right back.