From Cambridge, Massachusetts, it's theCUBE, covering MIT Chief Data Officer and Information Quality Symposium 2019, brought to you by SiliconANGLE Media.
Welcome back to Cambridge, Massachusetts, everybody. You're watching theCUBE, the leader in live tech coverage, and we are covering the MIT CDO conference, MIT CDOIQ. My name is Dave Vellante, and I'm here with my co-host, Paul Gillin. Mike Stonebraker is here. A legend, founder and CTO of Tamr, as well as many other companies, inventor. Michael, thanks for coming back into theCUBE. Good to see you again.
Nice to be here.
So this is kind of a repeat pattern for all of us. We kind of gather here in August at the CDO conference. You're always a highlight of the show. You gave a talk this week on the top 10 big data mistakes. You and I are two of the few people who still use the term big data. I happen to like it. You know, I'm sad that it's out of vogue already, but people associate it with Hadoop. Hadoop is kind of waning, but regardless. So welcome, how'd the talk go? What were you talking about?
So I talk to a lot of people who are doing analytics, who are doing operational data at scale. And they always make, or most of them make, a collection of bad mistakes. And so the talk was a litany of the blunders that I've seen people make. The audience could relate to the blunders, and most of the enterprises represented make a bunch of them. So I think one blunder is not planning on moving most everything to the cloud.
So that's interesting, because a lot of people would love to debate that. And I would imagine you probably could have done this talk 10 years ago, and a lot of the blunders would be the same, but that's one that wouldn't have been there. But I tend to agree. I was one of the two hands that went up this morning in Gokul's talk when he asked, is the cloud cheaper? For us it is anyway. But so why should everybody move everything to the cloud?
Aren't there laws of physics, laws of economics, laws of the land that suggest maybe you shouldn't?
Well, I guess two things and then a comment. First thing is James Hamilton, who's a techie's techie, works for Amazon.
Yeah, we know James.
So he claims that he can stand up a server for 25% of your cost. And I have no reason to disbelieve him. That number has been pretty constant for a few years. So his cost is a quarter of your cost. Sooner or later prices are going to reflect costs, as there's a race to the bottom on cloud servers, so.
So can I just stop you there for a second, because there's some other data on that. All you have to do is look at AWS's operating margin and you'll see how profitable they are. They have software-like economics now with deploying servers. So sorry to interrupt, but carry on.
So anyway, sooner or later they're going to be wildly cheaper than you are. The second vignette is from Dave DeWitt, who's a database wizard. And here's the current technology that Microsoft Azure is using, as of 18 months ago. It's shipping containers in parking lots: chilled water in, power in, internet in, otherwise sealed. Roof and walls optional. So if you're doing raised flooring in Cambridge and I'm doing shipping containers in the Columbia River Valley, who's going to be a lot cheaper? And so the economies of scale mean that the big cloud guys are building data centers as fast as they can, using the cheapest technology around. You put up a data center every 10 years and you do it on raised flooring in Cambridge. So sooner or later the cloud guys are going to be a lot cheaper. And the only thing that will change that equation is, for example, you know, my lab is up the street in the Frank Gehry building. And we have an IT department who runs servers, you know, in Cambridge. And they claim they're cheaper than the cloud. And they don't pay rent for square footage and they don't pay for electricity.
So yeah, I think that's externalities. If there are no externalities, the cloud is assuredly going to be cheaper. And then the other thing is that most everybody that I talk to, including me, has very skewed resource demands. So in the cloud, I need three servers except for the last day of the month. And the last day of the month I need 20 servers. I just do it. If I'm doing on-prem, I've got to provision for peak load. And so again, I'm just way more expensive. So I think sooner or later this combination of effects is going to send everybody to the cloud for most everything.
And my point about the operating margins is the difference between price and cost. I think James Hamilton's right on it. If you look at their actual cost of deploying, it's even lower. Now they price where the market allows them to. They're growing at 40-plus percent a year and they're a 35 to 40 billion dollar run rate company. Sooner or later it's going to be a race to the bottom.
Yeah, right.
And the only guys who are going to win are the ones who have the best cost structure. Couple other highlights from your talk?
Sure, I think the second thing I'd like to stress is that machine learning is going to be a game changer for essentially everybody. And not only is it going to be autonomous vehicles, it's going to be automatic checkout, it's going to be drone delivery of most everything. And it's going to affect essentially everybody. I could sort of say categorically, any job that is easy to understand is going to get automated. And I think it's going to be majorly impactful to most everybody. So if you're an enterprise, you have two choices. You can be a disruptor or you can be a disruptee. And so you can either be a taxi company or you can be Uber. And it's going to be AI and machine learning that determine which side of that equation you're on. And so a big blunder that I see is people not taking ML incredibly seriously.
Do you see that in fact?
Everyone I talk to seems to be bought in that this is a, you've got to get on the bandwagon.
Yeah, I'm just pointing out the obvious.
Yeah, yeah.
I think, but one that's not quite so obvious is a lot of people I talk to say, I'm on top of data science. I've hired a group of 10 data scientists and they're doing great. And one vignette that's kind of fun is, I talked to a data scientist from iRobot, which is the guys that have the vacuum cleaner that runs around your living room. So she said, I spend 90% of my time locating the data I want to analyze, getting my hands on it and cleaning it, leaving me 10% to do the data science job for which I was hired. Of the 10%, I spend 90% fixing the data cleaning errors in my data so that my models work. So she spends 99% of her time on what you'd call data preparation, 1% of her time doing the job for which she was hired. So data science is not about data science. It's about data integration, data cleaning, data discovery.
Which is your latest venture.
So Tamr does that sort of stuff. And so that's the real data science problem. And a lot of people don't realize that yet. And they will.
I want to ask you, because you've been involved, by my count, in starting up at least a dozen companies.
Nine.
Nine? Okay. That's still a lot.
You must not overstate it now, Paul. You estimate it high, Paul.
How do you decide what challenge to move on to? Because you're not solving the same problems. You're moving on to new problems. How do you decide what's the next thing that interests you enough to actually start a company?
Okay, that's really easy. I'm on the faculty at MIT. My job is to think of new shit and investigate it. And I'm paid to come up with new ideas. Some of which have commercial value, some of which don't. And the ones that have commercial value, I commercialize. And so it's whatever I'm doing at the time. And it's why all the things I've commercialized are different.
So going back to Tamr, data integration platforms. A lot of companies out there claim to do data integration right now. What did you see was the deficit in the market that you could address?
Okay, great question. So the traditional data integration is extract, transform and load systems and so-called master data management systems, brought to you by IBM, Informatica, Talend, that class of folks. So a dirty little secret is that that technology does not scale, in the following sense. Well, ETL doesn't scale for a different reason than MDM. ETL doesn't scale because ETL is based on the premise that somebody really smart comes up with a global data model for all the data sources you want to put together. You then send a human out to interview each business unit to figure out exactly what data they've got, and then how to transform it into the global data model and how to load it into your data warehouse. That's very human intensive, and it doesn't scale because it's so human intensive. So the average data warehouse operator I talk to says they integrate less than 10 data sources. Some people do 20. If you twist my arm hard, I'll give you 50. So here's the real world problem, which is Toyota Motor Europe. Right now they have a distributor in Spain, another distributor in France. They have country by country distributors, sometimes canton by canton distribution. So if you buy a Toyota in Spain and you move to France, Toyota develops amnesia. The French guys know nothing about you. So they've got 250 separate customer databases with 40 million total records in 50 languages. And they are in the process of integrating that into a single customer database so that they can do the customer service we expect when you cross an EU boundary. I've never seen an ETL system capable of dealing with that kind of scale. So ETL doesn't scale to this level of problem.
So how do you solve that problem?
I'll tell you. They are a Tamr customer, I'll tell you all about it. Let me first tell you why MDM doesn't scale.
Oh yeah, okay, great.
So ETL says I now have all your data in one place in the same format. But now you've got the following problems. You've got to deduplicate it, because if I bought a Toyota in Spain and I bought another Toyota in France, I'm in both databases. So if you want to avoid double counting customers, you've got to dedupe 30 million records. And so MDM says okay, you write some rules. It's a rule-based technology. So here's my favorite example of a rule. I don't know if you guys like to do downhill skiing.
I love downhill skiing.
So ski areas are in all kinds of public databases. You assemble those all together and now you've got to figure out which ones are the same ski area. And they're called by different names and different addresses and so forth. However, if the vertical drop from the bottom to the top is the same, chances are they're the same ski area. So that's a rule that says how to put data together into clusters, and so I now have a cluster for Mount Sonate. So now I have a problem, which is one address says something or other, another address says something else. Which one is right, or are both right? So now you have what's called the golden record problem: to basically decide which data elements, among a variety that may all be associated with the same entity, are in fact correct. So again, MDM is a rule-based technology, and rule systems don't scale. The best example I can give you for why rule systems don't scale is Tamr has another customer, General Electric. You've probably heard of them. And GE wanted to do spend analytics. And so they had 20 million spend transactions for the year before last. And a spend transaction is: I paid $12 to take a cab from here to the airport and I charged it to cost center X, Y, Z. 20 million of those.
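The ski-area rule and the golden record problem described above can be sketched in a few lines. This is only an illustration under stated assumptions: the records, field names, and the majority-vote survivorship rule are all invented here, and nothing in the interview says this is how any MDM product or Tamr actually works.

```python
from collections import Counter, defaultdict

# Hypothetical records for illustration only; names, addresses and numbers are invented.
records = [
    {"name": "Alpine Peak Resort", "address": "1 Summit Rd", "vertical_drop_ft": 1510},
    {"name": "Alpine Peak", "address": "1 Summit Road", "vertical_drop_ft": 1510},
    {"name": "Alpine Pk Ski Area", "address": "1 Summit Rd", "vertical_drop_ft": 1510},
    {"name": "Birch Hollow", "address": "22 Valley Ln", "vertical_drop_ft": 840},
]


def cluster_by_rule(rows):
    """Match rule: records with the same vertical drop are assumed to be the same ski area."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[row["vertical_drop_ft"]].append(row)
    return list(clusters.values())


def golden_record(cluster):
    """Assumed survivorship rule: per field, keep the most common value in the cluster."""
    fields = cluster[0].keys()
    return {f: Counter(r[f] for r in cluster).most_common(1)[0][0] for f in fields}


clusters = cluster_by_rule(records)
golden = [golden_record(c) for c in clusters]

for g in golden:
    print(g)
```

Each such rule is easy to write on its own. The scaling problem raised in the interview is that real data needs a very large number of them, and no one can keep the whole rule set in their head.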
So GE has a pre-built classification system for spend. So they have parts; underneath parts are computers; underneath computers are memory, and so forth. So with a pre-existing classification for spend, they want to simply classify 20 million spend transactions into this pre-existing hierarchy. So the traditional technology is, let's write some rules. So GE wrote 500 rules, which is about the most any single human can get their arms around. That classified 2 million of the 20 million transactions. You've now got 18 million to go. And another 500 rules is not going to give you 2 million more. It's going to give you diminishing returns. So you're going to have to write a huge number of rules, and no one can possibly understand them. So the technology simply doesn't scale. So in the case of GE, they had Tamr help them solve this classification problem. Tamr used their 2 million rule-based tagged records as training data. They used an ML model to then work off the training data and classify the remaining 18 million. So the answer is machine learning. If you don't use machine learning, you're absolutely toast. So the answer to "MDM doesn't scale" is you've got to use ML. The answer to "ETL doesn't scale", where you're putting together disparate records, again, is ML. So you've got to replace humans by machine learning. And at least in this conference, that seems to be resonating, which is people are understanding that at scale, traditional data integration technologies just don't work.
Well, and you got a great shout-out yesterday from the former GSK.
Right, Mark Ramsey.
Jay Leader, Mark Ramsey. Exactly. Showed you guys how they sort of solved their problem. He basically laid it out. EDW didn't work, MDM didn't work. All right, I mean, the top-down data modeling didn't work, so he kicked the can to governance. That's not going to solve the problem. But Tamr did.
Yeah, along with some other tooling, obviously.
Yeah, of course, of course.
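The GE pattern described above (rules tag a subset, the tagged subset becomes training data, and a model classifies the rest) can be sketched as follows. This is a toy under stated assumptions: the transactions, keywords, and categories are invented, and the small naive Bayes classifier is a stand-in chosen for brevity, since the interview does not say which model Tamr used.

```python
import math
from collections import Counter, defaultdict

# Hypothetical spend descriptions and categories, invented for illustration.
transactions = [
    "taxi to airport",
    "cab fare downtown",
    "laptop memory upgrade",
    "server memory modules",
    "airport shuttle fare",
    "memory for workstation",
]

# Step 1: a handful of hand-written keyword rules tag whatever they can.
RULES = {"taxi": "travel", "cab": "travel", "laptop": "computers", "server": "computers"}


def apply_rules(text):
    words = text.split()
    for keyword, category in RULES.items():
        if keyword in words:
            return category
    return None  # no rule fired


labeled = [(t, apply_rules(t)) for t in transactions if apply_rules(t) is not None]
unlabeled = [t for t in transactions if apply_rules(t) is None]


# Step 2: train a tiny multinomial naive Bayes model on the rule-tagged records.
def train(examples):
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict(model, text):
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            # Laplace smoothing so unseen words do not zero out a class.
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(labeled)

# Step 3: the model classifies the records that no rule covered.
predictions = {t: predict(model, t) for t in unlabeled}
print(predictions)
```

The point of the design is that the rules are written once and then stop growing. Coverage of the remaining records comes from the model generalizing from words the rules never mentioned, such as "fare" or "memory" in this toy data.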
Well, the other thing is there's no one technology, there's no silver bullet here. It's going to be a bunch of technologies working together.
Right.
Mark Ramsey is a great example. He uses StreamSets and a bunch of other startup technology operating together. And the traditional guys, are we okay?
Yeah, we're good. I do want to ask you a question. I want to be sure we have time.
So, I mean, the traditional vendors by and large are 10 years behind the times. And if you want cutting edge stuff, you've got to go to startups.
Jump in.
I want to jump in. It's a different topic. You in the past were a critic of the NoSQL movement. And NoSQL isn't going away. It seems to be actually gaining steam right now. What are the flaws in NoSQL? And has your opinion changed at all?
No. So NoSQL originally meant "no SQL", don't use it. Then the marketing message changed to "not only SQL". So SQL's fine, but NoSQL does other stuff. And now it's all SQL, right? And my point of view is now NoSQL means "not yet SQL". Because, you know, high-level data languages are good. Mongo is inventing one, Cassandra is inventing one. Those, unless you squint, look like SQL. And so I think the answer is the NoSQL guys are drifting towards SQL. Meanwhile, JSON is a great idea if you've got irregular data. The SQL guys are saying, sure, let's have JSON as a data type. And I think the only place where there's a fair amount of argument is schema later versus schema first. And I pretty much think schema later is a bad idea, because schema later really means you're creating a data swamp.
Exactly, yeah.
And so say you're storing employees and salaries. So Paul's salary is recorded as dollars per month. Dave's salary is in euros per week with a lunch allowance, mine's... So if you don't deal with irregularities up front on data that you care about, you're going to create a mess.
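The salary example above can be made concrete with a small schema-first sketch. Everything specific here is an assumption for illustration: the `Salary` type, its field names, the amounts, and especially the exchange rate are hypothetical. The idea is only that units are declared and validated when a record is written, so irregularities surface up front rather than in a later query.

```python
from dataclasses import dataclass

# Assumed conversion factors for illustration only; a real system would source these.
PERIODS_PER_YEAR = {"month": 12, "week": 52}
USD_PER_UNIT = {"USD": 1.0, "EUR": 1.10}  # hypothetical exchange rate


@dataclass(frozen=True)
class Salary:
    """Schema-first record: amount, currency and pay period are explicit and validated."""

    employee: str
    amount: float
    currency: str
    period: str

    def __post_init__(self):
        # Reject irregular records at write time instead of storing them as-is.
        if self.currency not in USD_PER_UNIT:
            raise ValueError(f"unknown currency: {self.currency}")
        if self.period not in PERIODS_PER_YEAR:
            raise ValueError(f"unknown pay period: {self.period}")

    def annual_usd(self) -> float:
        """Normalize to a single comparable unit: US dollars per year."""
        return self.amount * PERIODS_PER_YEAR[self.period] * USD_PER_UNIT[self.currency]


paul = Salary("Paul", 5000, "USD", "month")
dave = Salary("Dave", 1000, "EUR", "week")

print(paul.annual_usd())
print(round(dave.annual_usd(), 2))
```

With schema later, the same two records would sit side by side as bare numbers, and whoever queries them would have to rediscover that one is monthly dollars and the other weekly euros.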
Yeah, no schema on write was convenient. It allowed you to store a lot of data cheaply, but then what? You know, it was hard to get value out of it, and it created data swamps.
So I think I'm not opposed to schema later, as long as you realize that you are kicking the can down the road and you're just going to give your successor a big mess.
Yeah, right. Michael, we've got to jump, but thank you so much.
Sure, that's fine.
Coming back to theCUBE, appreciate it. All right, keep it right there, everybody. We'll be back with our next guest right after this short break. You're watching theCUBE from MIT CDOIQ. Right back.