From Cambridge, Massachusetts, it's theCUBE, covering MIT Chief Data Officer and Information Quality Symposium 2019, brought to you by SiliconANGLE Media.
Welcome back to Cambridge, Massachusetts, everybody. This is theCUBE, the leader in live tech coverage. We go out to the events and we extract the signal from the noise, and we're here at the MIT CDO IQ, the Chief Data Officer conference. I'm Dave Vellante with my co-host, Paul Gillin. Day two of our wall-to-wall coverage. Aaron Kalb is here. He's the co-founder and Chief Data Officer of Alation. Aaron, thanks for making the time to come on.
Thanks so much, Dave and Paul, for having me.
You're welcome. So words matter, you know, and we've been talking about data and big data and the 3Vs and "data is the new oil" and all this stuff. You gave a talk this week about how, you know, we're maybe not talking the right language when it comes to data. What did you mean by all that?
Absolutely. So I get a little bit frustrated by some of these cliches we hear at conference after conference. The one I sort of took aim at in this talk is "data is the new oil." I think what people want to invoke with that is to say, in the same way that oil powered the industrial age, data's powering the information age. If it's just saying data's really cool and trendy and important, that's true, but there are a lot of other associations and connotations people have with oil, and some of them don't really apply as well to data.
So is data more valuable than oil?
Well, I think they're each valuable in different ways, but I think there's a couple of issues with the metaphor. One is that oil is scarce and dwindling, and part of the value comes from the fact that it's so rare, whereas the experience with data is that it's so plentiful and abundant, we're almost drowning in it. And so what I contend is, instead of talking about data as compared to oil, we should talk about data compared to water.
And the idea is, you know, water is very plentiful on the planet, but sometimes, you know, if you have salt water or contaminated water, you can't drink it. Water is good for different purposes depending on its form, and so it's all about getting the right data for the right purpose.
Well, we've certainly, at least in my opinion, fought wars, Paul, over oil.
And over water.
And certainly conflicts over water. Do you think we'll be fighting wars over data, or are we already?
You know, we might be. One of my favorite talks from the sessions here was a keynote by the CDO for the Department of Defense, who was talking about, you know, the civic duty of transparency, but was observing that actually more IP addresses from China and Russia are looking at our public data sets than from within the country. So, you know, it's definitely a resource that can be very powerful.
So what was the reaction to your premise from the audience? What kind of questions did you get?
You know, people actually responded very favorably, including some folks from the oil and gas industry, which I was pleased to find. We have a lot of customers in energy, so that was cool. But what was nice was being here at MIT and just really geeking out about language and linguistics and data with a bunch of CDOs and other people who are kind of data intellectuals.
All right, so if data is not the new oil...
And water isn't really a good analogy either, because the supply of water is finite.
So that's true.
What is data?
Yeah.
Space?
Yeah, it's a good point. Maybe it is like the universe and it's always expanding, right? Somehow, right, because anything physical on the planet probably won't be growing at that exponential speed.
So give us the punchline.
Well, so I would say that water, while imperfect, is actually a really good metaphor that helps for a lot of things. It has properties like the fact that if there's a data quality issue, it flows downstream like pollution in a river.
It's the fact that it can come in different forms useful for different purposes. You might have gray water, right, which is good enough for irrigation or industrial purposes but not safe to drink. And so you rely on metadata to get the data that's in the right form. And you know, the talk is more fun with a lot of visual examples that make this clear. I actually had someone in the audience say that he uses a similar analogy in his own company, so it was fun to trade notes.
So Chief Data Officer is a relatively new title for you, is it not, in terms of your role at Alation?
Yeah, that's right. And the most fun thing about my job is being able to interact with all the other CDOs and CDAOs at a conference like this. And it was cool to see, I believe this conference doubled since last year, is that right?
No, it's up about 100 people, though.
Well, it was about double from three years ago.
And when we first started in 2013, 130 people.
Yeah, it was like a very small and intimate event, so.
Yeah, here we are outgrowing this building.
Yeah, they're kicking us out.
I think what's interesting is, if we do a little bit of analysis, and this is small data within our own company, at our biggest and most visionary customers that bought Alation, the buyer champion typically either was a CDO, or wasn't a CDO when they bought the software and has since been promoted to CDO. And so seeing this trend of more and more CDOs cropping up is really exciting for us. And also just hearing all the people at the conference, there are two trends we're hearing: a move from sort of infrastructure and technology to driving business value, and a move from defense and governance to sort of playing offense and doing revenue generation with data. So those trends are really exciting.
So don't hate me for asking this question. Because what a lot of companies will do is they'll give somebody a CDO title and it's kind of a, it's a little bit of a gimmick, right?
To go to market, and they'll drag you into sales, as I'm sure they do as a co-founder. But as well, I know CDOs at tech companies that are actually trying to apply new techniques to figure out how data contributes to their business, how they can cut costs, raise revenue. Do you have an internal role as well?
Oh, absolutely.
Yeah. Explain that.
So Alation, you know, we're about 250 people. So we're not at the same scale as many of the attendees here. But we want to learn from the best and always apply everything that we learn internally as well. So obviously analytics and data science play a huge role in our internal operations.
And so what kinds of initiatives are you driving internally? Is it sort of cost initiatives, efficiency, innovation?
Yeah, I think it's all of the above, right? Every single division, both on the sort of operational efficiency and cost-cutting side as well as figuring out the next big bet to make, can be informed by data. You know, our goal is to empower a curious and rational world and have every decision be based not on the highest-paid person's opinion but on the best evidence possible. And so, you know, the goal of my function is largely to enable that, both centrally and within each business unit.
I want to talk to you about data catalogs a bit, because it's a topic close to my heart. I've talked to a lot of data catalog companies over the last couple of years. And it seems like, for one thing, the market's very crowded right now. Would you agree? There are a lot of options out there.
Yeah, you know, it's been interesting, because when we started, we were basically the first company to make this technology and to kind of use this term "data catalog" in this way. And it's been validating to see, you know, a lot of big players and other startups even kind of coming to that terminology. But yeah, it has gotten more crowded.
And I think our customers and prospects, who used to ask us, you know, "What is it that you do? Explain this catalog metaphor to me," are now saying, "Oh yeah, catalogs. What about that? Which one should I pick? Why you?"
What distinguishes one product from another? What are the major differentiation points?
Yeah, I think one thing that's interesting is, you know, my talk was about how the metaphors we use shape the way we think. And I think there's a sense in which kind of the history of each company shapes their philosophy and their approach. So we've always been a data catalog company. That's our one product. Some of the other catalog vendors come from an ETL background. So they're a lot more focused on technical metadata and infrastructure. Some of the catalog products grew out of governance. And so it's sort of governance first, defense first, and then offense secondary. So I think that's one of the things we encourage our prospects to look at. It's kind of the soul of the company and how that affects their decisions. The other thing is, of course, technology, and that's what we at Alation are really excited about. It's been validating to hear Gartner and others and a lot of the people here, like the GSK keynote speaker yesterday, talking about the importance of comprehensiveness and of taking a behavioral approach. We have our Behavior I/O technology that really says, let's not look at all the bits and the bytes but at how people are using the data to drive results, and that's a great differentiator.
Do your customers generally standardize on one data catalog, or might they have multiple catalogs for multiple purposes?
Yeah, we heard a term more last season, "catalog of catalogs." You know, for people who can get arbitrarily meta, meta-metadata, we'd like to go there. I think the customers we see be most successful tend to have one catalog that serves this function of the single source of reference.
Many of our customers will say that their catalog serves as sort of their internal Google for data, or the one-stop shop where you can find everything. Even though they may have many, many different sources, typically you don't want to have siloed catalogs. It makes it harder to find what you're looking for.
Let's play a little word association with some metaphors. Data lake.
Data lake's another one that I sort of hate. If you think about it, people had data warehouses and didn't love them, but at least when you put something into a warehouse you can get it out, right? If you throw something into a lake, you know, there's really no hope you're going to find it, and if you do, it's not going to be in great shape. And we're not surprised to find that many folks who are building data lakes are now having to invest in a layer over it to make it comprehensible and searchable and so forth.
Yeah, the lake is where we hide the stolen cars. Data swamp.
Yeah, I mean, I think if your point is it's worse than lake, it works. But I think we can do better than lake, right?
How about data ocean?
You know, out of respect for John Furrier, I'll say it's fantastic. But to us, we think it isn't really about the size. People think, oh, the more data, the better. It's actually the more data the worse, unless you have a mechanism for finding the little bit of data that is relevant and useful for your task and putting it to use.
So that's the setup: enter the catalog. So how, technically, does the catalog solve that problem?
Totally. So if we think about, maybe let's go to the warehouse for example, but it works just as well on a data lake in practice. You know, the catalog starts with the inventory, you know, what's on every single shelf.
But if you think about what Amazon has done, they have the inventory warehouse in the back, but what you see as a consumer is a simple search interface where you type in the name of the product you're looking for and then you see ranked suggestions for different items, you know, toasters, lamps, whatever books I want to buy. Same thing for data. I can type in, you know, if I'm at the DOD, information about aircraft, or information about, you know, drug discovery if I'm at GSK. And I should therefore be able to see all the different datasets that I have. And that's true in almost any catalog, where you can do some search over the curated datasets there. With Alation in particular, what I can see is who's using it, how are they using it, what are they joining it with, what results do they find in that process. And that can really accelerate the pace of discovery.
Sorry, to what degree can you automate some of that detail, like who's using it and what it's being used for? I mean, doesn't that rely on people curating the catalog, or to what degree can you automate that?
Yeah, so it's a great question. I think sometimes there's a sense with AI or ML that it's like the computer is making the decisions or making things up, which is obviously very scary. Usually the training data comes from humans. So our goal is to learn from humans in two ways. There's learning from humans where humans explicitly teach you, like somebody goes and says this is gold-standard data versus this is, you know, low-quality data. And they do that manually. But there's also learning implicitly from people. So in the same way on amazon.com, if I buy one item and then buy another, I'm doing that for my own purposes. But Amazon can do collaborative filtering over all of these trends and say you might want to buy this item.
We can do a similar thing where we parse the query logs, parse the usage logs in BI tools, and can basically watch what people are doing for their own purposes, not doing extra work on top of their job to help us. We can learn from that and make everybody more effective.
Aaron, is data classification a part of all this? Again, when we started in the industry, data classification was a manual exercise. It's always been a challenge. Certainly people have applied math to it. You've seen support vector machines and probabilistic latent semantic indexing being used to classify data. Have we solved that problem as an industry? Can you automate the classification of data on creation or use at this point in time?
Well, one thing that came up in a few talks about AI and ML here is, regardless of the algorithm you're using, whether it's, you know, an SVM or something really modern and exciting with deep learning...
Stuff that's been around forever or, like you say, some new stuff.
You know, actually I think it was said best by Michael Conlin at the DOD, that data is more important than the algorithm, because even the best algorithm is useless without really good training data, plus the algorithms, kind of everyone's got them. So really often training data is the limiting reactant in getting really good classification. One thing we try to do at Alation is create an upward spiral where maybe some data is curated manually, and then we can use that as a seed to make some suggestions about how to label other data, and then it's easier to just do a confirm or a deny of a guess than to actually manually label everything. So then you get more training data faster and it kind of accelerates that way instead of being a big burden.
So that's really the advancement in the last five, six years, where you're able to use machine intelligence to sort of solve that problem as opposed to brute forcing it with some algorithm. Is that fair?
Yeah, I think that's right.
And I think what gets me very excited is when you can have these interactive loops where the human helps the computer, which helps the human, and you get again this upward spiral, instead of saying, oh, we have to have all this, you know, manual step done before we can even do the first step, or trying to have an algorithm brute force it without any human intervention.
It's kind of like no schema on write, right? Except that it actually works. I'm just kidding, to my Hadoop friends. All right, Aaron, hey, thanks very much for coming on theCUBE. I'll give you the last word on the event. Is this your first one?
This is our first time here.
Okay, so what are your thoughts?
I think we'll be back. It's just so exciting to get people who are thinking really big about data but are also practitioners who are solving real business problems, and just the exchange of ideas and best practices is really inspiring for me.
That's great. Well, thank you for the support of the event and thanks for coming on theCUBE. It's great to see you again.
Thanks, Dave.
You're welcome. Thanks, Aaron. All right, keep it right there. We'll be back with our next guest right after this short break. You're watching theCUBE from MIT CDO IQ. Right back.
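The implicit learning from query logs described in the conversation (watching which datasets people use and what they join them with) can be illustrated with a minimal sketch. It assumes the log is just a list of SQL strings and uses a naive regular expression to spot table names. `QUERY_LOG`, `tables_in`, `usage_stats`, and `related_tables` are hypothetical names made up for this illustration, and nothing here is claimed to be how Alation's product works.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical query log: SQL that analysts ran for their own purposes.
QUERY_LOG = [
    "SELECT * FROM sales s JOIN customers c ON s.cust_id = c.id",
    "SELECT region, SUM(amount) FROM sales GROUP BY region",
    "SELECT * FROM sales s JOIN products p ON s.sku = p.sku",
    "SELECT * FROM customers c JOIN sales s ON c.id = s.cust_id",
]

# Naive assumption: the identifier right after FROM or JOIN is a table name.
TABLE_PATTERN = re.compile(
    r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_.]*)", re.IGNORECASE
)


def tables_in(query):
    """Return the sorted, de-duplicated table names referenced by one query."""
    return sorted({name.lower() for name in TABLE_PATTERN.findall(query)})


def usage_stats(queries):
    """Count how often each table is used and how often pairs appear together."""
    popularity = Counter()
    co_usage = Counter()
    for query in queries:
        tables = tables_in(query)
        popularity.update(tables)
        # Each unordered pair of tables in the same query counts as one co-use.
        co_usage.update(combinations(tables, 2))
    return popularity, co_usage


def related_tables(table, co_usage):
    """Rank other tables by how often they were queried together with `table`."""
    scores = Counter()
    for (first, second), count in co_usage.items():
        if first == table:
            scores[second] += count
        elif second == table:
            scores[first] += count
    return scores.most_common()


popularity, co_usage = usage_stats(QUERY_LOG)
print(popularity.most_common())
print(related_tables("sales", co_usage))
```

With this toy log, `sales` comes out as the most used table, and `related_tables("sales", co_usage)` ranks `customers` ahead of `products` because `customers` appears alongside `sales` in two of the logged queries. A real system would need a proper SQL parser, user identities, and BI-tool usage logs rather than a regular expression over raw strings.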