Live from Cambridge, Massachusetts, extracting the signal from the noise, it's theCUBE, covering the MIT Chief Data Officer and Information Quality Symposium. Now your hosts, Dave Vellante and Paul Gillin.

Hi everybody, welcome back to MIT, Cambridge, Massachusetts. We're here at the MIT Information Quality Conference, the CDO conference. Paul Gillin, my co-host, and I are very pleased to have Michael Stonebraker here. He's at MIT, an entrepreneur, co-founder and now CTO of a company called Tamr. Mike, welcome back, good to see you again.

Thank you. Thank you, Paul.

So, really amazing times we live in. Looking at your career, you just won the Turing Award, congratulations.

Thank you.

Unbelievable accomplishments that you've had.

I'm still pinching myself.

And, you know, a lot of the CUBE alums are former students of yours. Mike Olson's been on theCUBE a number of times, and Diane Greene and Joe Hellerstein. So it's great to have the master back. So what's up with you these days? I mean a lot, obviously, but just at the high level.

Well, I think in the last couple of months I've been distracted by something called the Turing Award. I got to write a Turing Award paper and give a Turing Award talk, so there's been a whole bunch of time preparing that.

What was the topic of that paper? Is this something I would understand?

Oh, sure, it's up online if you go to ACM.org. The title is "The Land Sharks Are on the Squawk Box." It basically talks about building Postgres, which took a decade and was a huge amount of slogging through the swamp, interspersed with a bicycle ride my wife and I did across the United States in 1988, which was also a lot of slogging through the swamp. It's mostly about how building system software is really hard, and commercializing it is even harder. So anyway, go listen; you can see the talking-heads version of the talk online. But that aside, I've been spending a lot of time on Tamr, which I think is a fabulous idea.

You're solving a data integration problem.

Well, I mean, data integration is the 800-pound gorilla in the corner, and everybody's got the problem in spades. So the whole idea is how to do it cost-effectively. A lot of the people at this conference think that all we have to do is have some standards, and we'll get everybody to adhere to the standards, and life will be great, and there won't be any data integration problem, and everybody will have global IDs for everything. To me, that's a nirvana which, even if you could construct it inside an enterprise, will only last until you buy your next company.

Right, and then it falls apart. I guess Esperanto was a version of that: everyone would speak the same language. We saw what happened with that. Our previous guest, actually, Nick Marco, brain surgeon and CDO, was talking about the challenges they have integrating unstructured data, like doctors' handwritten notes, with radiology and imaging data and genetic sequencing. Those are three vastly different data types. Is that a problem you can see a way to solve?

Well, if you look at medical data: we're in fact using a public-domain medical data set called MIMIC, which is the data from 26,000 intensive care unit patients at Beth Israel Deaconess Hospital here in Boston. They have real-time data from bedside monitoring devices, which is 125-hertz signal-processing stuff. They've got all the standard metadata, patient-record data. They've got doctors' notes, nurses' notes, and they've got prescription data. And they're in the process of trying to get imagery data. They absolutely want to put all this stuff together, and it's a huge data integration problem. And the minute you get this to work, then you want to do the same thing for Mass General Hospital and put the two data sets together. So I think the upside of producing composite medical records is just unbelievable societal good.

The gleam in my eye is that I go into the doctor and complain of chest pain, so he takes a chest x-ray. What he wants to do is run a query saying: find everybody on the planet whose chest x-rays look like mine, and what was their diagnosis and what was their mortality? And that's going to require us to put all this medical data together.
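To make that query concrete: one standard way to implement "find chest x-rays like mine" is to map each image to a feature vector and run a nearest-neighbor search over the embeddings. The sketch below is hypothetical: embed() stands in for a real trained vision model, the record layout is invented, and nothing here describes how MIMIC or any hospital system actually works.

```python
import numpy as np

def embed(image: np.ndarray) -> np.ndarray:
    # Stand-in for a trained vision model: a real system would produce a
    # learned feature vector. Here we just flatten a fixed-size image so
    # the sketch runs end to end.
    return image.astype(float).ravel()

def most_similar_xrays(query_image, records, k=5):
    """Rank stored (patient_id, image, diagnosis, outcome) records by
    cosine similarity of their embeddings to the query image."""
    q = embed(query_image)
    scored = []
    for patient_id, image, diagnosis, outcome in records:
        v = embed(image)
        sim = float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9)
        scored.append((sim, patient_id, diagnosis, outcome))
    scored.sort(reverse=True)
    return scored[:k]

# Hypothetical usage: 100 stored 64x64 x-rays, plus the new patient's.
rng = np.random.default_rng(0)
records = [(i, rng.random((64, 64)), "dx", "outcome") for i in range(100)]
print(most_similar_xrays(rng.random((64, 64)), records, k=3))
```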
It'll be a huge challenge. Personally, I think the biggest challenge by far is the privacy considerations that are going to keep it from happening.

You don't think the big problem is technical, then? The problem you just outlined sounds staggering.

There's a technical challenge, but I think Mass General Hospital is doing exactly this for the data they control. And that's across a whole bunch of practices and a whole bunch of hospitals; they're a pretty sizable thing these days. Yes, it's a lot of work. But the biggest problem is that they're not going to get to share data with Boston Medical Center, because they're in different political fiefdoms. That's a political issue. So I think the thing that's going to keep a national record system from happening is all political. Technical problems are expensive, but they're not insurmountable relative to the social good that would come out of such a system.

I think you'd agree, yeah. You're at the CDO conference; there are a lot of CDOs here. The last couple of years, a lot of the themes of this event have been data quality, data governance, data cleansing. There seems to be a desire to move toward innovation. So I wonder if I can get your take on the state of data governance and where this is all going from an innovation standpoint.

To me, the biggest opportunity is this: if you look at a snapshot of right now, every enterprise has thousands of silos. And yes, they can chip away at data governance to try and do better next time, but they've got a huge after-the-fact data integration problem. Let me give you one quick example. One of Tamr's customers is a Fortune 50 company that is a manufacturing conglomerate, and they are running 325 ERP systems, basically one per division. So if you want to buy paper clips and you're in division one, they have an ERP system that has the contract with Staples. You go to the next division, they have a separate ERP system. 325 of them.

So the gleam in the eye of the CFO might be to say, well, I can create a golden record of what an ERP ought to look like. But he says, that isn't my problem. My problem is I've got 325 of them. His immediate problem: presumably they have a contract with Staples 325 times, because everybody buys paper clips. So he ran a query asking, are the terms and conditions for the same supplier the same across all these 325 systems? The answer is obviously not. So he then said, suppose I had a system so that when the contract with Staples came up for renewal in system 17, I could let the purchasing guy in the 17th division know all the other terms and conditions that had been negotiated. He could then demand most-favored-nation status by running the race to the bottom: find out who negotiated the best terms and conditions, and have everybody else demand those. That would save this particular conglomerate $100 million a year. And the $100 million is all in the long tail. It's not in buying paper clips from Staples, which is the high-volume stuff; it's all in the 100,000 other suppliers. So he has a data integration problem that can save him $100 million a year if he can solve it, and he's using Tamr to do that.
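To make the CFO's query concrete, here is a minimal sketch in Python with pandas. It assumes entity resolution has already mapped each division's spelling of "Staples" onto a single supplier_id, which is precisely the hard part; the column names and figures are invented for illustration.

```python
import pandas as pd

# Hypothetical extract: one row per supplier contract per ERP system,
# after entity resolution has mapped "Staples", "Staples Inc.", etc.
# onto a single supplier_id.
contracts = pd.DataFrame([
    {"erp_system": 17,  "supplier_id": "staples", "payment_days": 30, "discount_pct": 2.0},
    {"erp_system": 42,  "supplier_id": "staples", "payment_days": 45, "discount_pct": 5.0},
    {"erp_system": 108, "supplier_id": "staples", "payment_days": 60, "discount_pct": 3.5},
])

# The CFO's question: are terms for the same supplier consistent?
by_supplier = contracts.groupby("supplier_id").agg(
    systems=("erp_system", "nunique"),
    best_discount=("discount_pct", "max"),
    worst_discount=("discount_pct", "min"),
    best_payment_days=("payment_days", "max"),
)
inconsistent = by_supplier[by_supplier.best_discount != by_supplier.worst_discount]
print(inconsistent)  # candidates for most-favored-nation renegotiation
```

Any supplier whose best and worst terms diverge is a renegotiation candidate. The point is that the query itself is trivial once records from the 325 systems have been matched.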
My view is that between now and when hell freezes over, which is when all this data governance stuff actually gets operationalized and works, there's a huge amount of value to be had from doing after-the-fact data integration. What's more, data governance actually only works until you buy your next company. A bunch of these 325 divisions are companies this conglomerate purchased. They have their own data and their own golden records, which you then have to integrate, ex post facto, with what you've got. So I'm much more tactically focused on knocking down the cost of after-the-fact data integration, so that you can capture value in the short run.

So data integration, compliance, governance: most line-of-business people say it gets in the way. And the explosion of big data, Hadoop, whatever catalyzed it, has made this golden-copy vision even more remote. You hear a lot of people say, well, we're spinning up these big data projects, and they have no governance attached to them, no data integration consistency. How does Tamr solve that problem?

Okay, so you said several different things. The first thing is, I have this big data project. Well, big data is a marketing buzzword that means all kinds of different things. To me it means I've either got too much data, which is a volume problem; or the data is coming at me too fast and my systems can't keep up, which is a velocity problem; or it's coming at me from too many places and I can't put the stuff together, which is a variety problem. Tamr is squarely focused on the third V, the variety problem. If you want to scale in the variety V, you can't do it using traditional techniques. The standard way of doing extract, transform, and load simply isn't going to work; it won't scale to any significant number of data sources. So Tamr's whole vision is to scale the variety piece of the big data issue. We're squarely focused on that.

How do we solve it? Let me give you a couple of simple examples. Back to this explosion of ERP systems that we talked about a minute ago. There are 325 types of golden records, because in each of these 325 systems, Staples does not have a unique ID. It's not identified in any common way, and there's huge value in putting these Staples records together. So the way Tamr works is: you give us source number one. We suck it in, and we then try to deduplicate source number one: do you have duplicate records in this data source? We try to find them using statistical techniques. When you set up a Tamr system, you have to tell us how accurate you want us to be in order to let us make automatic decisions. Because if we find two things that we think are duplicates, we can just coalesce them, or we can ask a human: are these really duplicates? So you can set a threshold for how accurate you want us to be to do it automatically, because the whole idea is that if you can't make some decisions automatically, it's going to be costly. So this is accuracy versus cost.

If we think something might be a duplicate but we can't figure it out, we ask a human. And the human we ask is a domain expert, not a programmer. Just to use an easy example: an ICU-50 is a genomics term, and an ICE-50 is another genomics term. Are those the same thing or different things? You and I have no clue. You need a domain expert. So fundamentally, Tamr has a crowdsourcing system that organizes domain experts to answer questions. We ask questions, and we start off being fairly stupid, but as we get answers, we build up a knowledge base. You then integrate the (i+1)-th data source, and we try to integrate it with the i sources you've already given us. When we can't figure stuff out, we ask a human for help. But over time we get smarter and smarter and make more and more decisions automatically. So the answer is: we use machine learning and statistics to automatically make decisions that humans make in traditional techniques, and when we have to ask a human, we ask a domain expert, not an ETL programmer. We organize human labor differently, and we make decisions automatically using machine learning techniques. And it works like a charm.
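To make the workflow concrete, here is a minimal sketch of the thresholded, human-in-the-loop matching loop described above. The similarity function, the thresholds, and the knowledge-base shape are invented stand-ins, not Tamr's actual statistical machinery.

```python
from difflib import SequenceMatcher

AUTO_MERGE = 0.90   # above this, merge without asking anyone
ASK_EXPERT = 0.60   # between the two thresholds, queue for a domain expert

def similarity(a: str, b: str) -> float:
    # Stand-in for Tamr's statistical matching; any pairwise scorer fits here.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify_pair(a, b, knowledge_base):
    """Decide automatically when confident; otherwise defer to a human.
    knowledge_base caches expert answers so a question is never asked twice."""
    if (a, b) in knowledge_base:
        return "merge" if knowledge_base[(a, b)] else "distinct"
    score = similarity(a, b)
    if score >= AUTO_MERGE:
        return "merge"
    if score >= ASK_EXPERT:
        return "ask_expert"  # routed to a domain expert, not an ETL programmer
    return "distinct"

kb = {}
print(classify_pair("Staples Inc.", "Staples, Inc", kb))  # "merge" (score ~0.92)
print(classify_pair("ICU-50", "ICE-50", kb))              # "ask_expert" (score ~0.83)
kb[("ICU-50", "ICE-50")] = False                          # expert rules: different things
print(classify_pair("ICU-50", "ICE-50", kb))              # "distinct", no human needed
```

The knowledge base is what makes the system get smarter: once a domain expert has ruled on a pair, that question never has to be asked again when the (i+1)-th source arrives.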
I want to switch up a little bit, because most people would know you as a pioneer in the development of relational databases. Today there's a profusion of options for how we process data, depending on the output or the outcome we want. What is the role of relational going forward?

Okay, so I wrote a paper in 2005 arguing that one size does not fit all, and I think that's even more true today. In every vertical market I can think of, there is a technology that will beat the traditional relational vendors by between one and two orders of magnitude. For instance, in the data warehouse market, column stores like Vertica are two orders of magnitude faster than row stores like Oracle. It's still a relational database system, but it's not the traditional implementation at all.
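A toy illustration, with invented sizes, of why column stores win for analytic scans: in a row layout, summing one attribute drags every other attribute of each record along with it, while a columnar layout keeps each attribute contiguous, so an aggregate touches only what it needs. This sketches the access pattern only; Vertica's actual engine adds compression, sort orders, and much more.

```python
import numpy as np

n_rows, n_cols = 100_000, 20

# Row store: each record's attributes live together.
row_store = np.random.rand(n_rows, n_cols)

# Column store: one contiguous array per attribute.
column_store = [row_store[:, j].copy() for j in range(n_cols)]

# "SELECT SUM(col_3) FROM t" in each layout:
total_rows = sum(record[3] for record in row_store)  # touches all 20 values per record
total_cols = column_store[3].sum()                   # streams one contiguous array

assert np.isclose(total_rows, total_cols)
```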
In every vertical market I can think of, the legacy relational database systems are no good at it; there's something way better.

Every vertical market you can think of?

Every vertical market I can think of. The historical legacy relational database systems are not good at anything anymore. They still generate a lot of revenue for Oracle, though. There's a great book by Clayton Christensen called The Innovator's Dilemma, and they are up against the innovator's dilemma in spades, because they're selling legacy technology that has been superseded by all kinds of other good ideas. So they've got to figure out how to morph to the new stuff without losing their customer base. We live in interesting times, but I think there are going to be somewhere between half a dozen and a dozen data technologies that survive, all different, and none of them are going to have the architecture of the existing legacy engines.

So you don't think the relational database is really optimal for any use anymore?

That's not what I said at all. I said the implementations from the Oracles, IBMs, and Microsofts of the world. Those are 40-year-old code lines that aren't good at anything.

So the theory is still sound.

Relational database systems, I think, are going to be fine. Go market by market. In the data warehouse world, it's the relational model; all the vendors are relational. If you look at transaction processing, what looks like it's going to happen is a main-memory SQL transaction processing world. So I think SQL will be the answer in OLTP, although in neither of these two cases is it going to be anything that looks like the legacy implementations. In the NoSQL world, you've got 150 different vendors, all of them with different data models, different implementations, different ideas about what to do. 150 are not going to survive; there's going to be some much smaller number, so there'll be a huge shakeout. What that market will end up looking like remains to be seen. If you look at the Hadoop market, look at what Cloudera is doing: they just released a system called Impala, and to a first approximation it looks like Vertica. It's a relational SQL database system, and it's a column store. So I see the Hadoop market and the data warehouse market coalescing.

And does Spark attack the transaction market as well, or not necessarily?

80% of the Spark market is Spark SQL, so Spark is a SQL market, and right now Spark has no persistence story. In the same way that Cloudera started off marketing MapReduce and is now marketing SQL, I think the Spark guys will morph into something that looks very much like a conventional database system. Where that ends up, we'll see.

But one of the most interesting things to me is that we've all heard that business analysis is going to give way to data science. Suppose you're Walmart. Walmart has a transaction-level record of everything that went under any wand anywhere in the Walmart system for some number of years, and all of that is in Bentonville, Arkansas. As a business analyst, you can run any query you want against this warehouse. So suppose you're the guy charged with provisioning Walmarts in Massachusetts around snowstorms. We had three of them this year. I don't know if you're local.

I am, though. Massive storms.

So there were three big snowstorms. You can have your business analyst run a query that says: what sold in the week before the snowstorm, and what sold in the week after? Compare that with same-store sales in Maryland. What happened? And you get a big table of numbers. That's what business analysts currently do, and the products let you draw pretty pictures from big tables of numbers. A data scientist wouldn't look at things that way at all. He'd say: if I'm charged with provisioning around snowstorms, I'm going to build a model of what snowstorms look like. You can then run my model, do anything you want with it, and it will produce the output you're interested in. So would you rather have a model or a big table of numbers? The answer is, everybody on the planet would take the model.
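A tiny sketch of that contrast: the analyst's answer is a printed table, while the data scientist's answer is a fitted model that can be queried for storms nobody has seen yet. The snowfall and uplift numbers below are invented, and the one-variable linear fit is the simplest possible stand-in for real predictive modeling.

```python
import numpy as np

# Invented history: snowfall (inches) for past storms and the sales
# uplift observed in the week before each one.
snowfall = [6, 12, 18, 24, 30]
uplift = [1.1, 1.4, 1.9, 2.3, 2.8]  # multiplier vs. a normal week

# The business analyst's answer: a table of numbers.
for inches, mult in zip(snowfall, uplift):
    print(f"{inches:2d} in. of snow -> {mult:.1f}x normal sales")

# The data scientist's answer: a model you can query for any forecast.
slope, intercept = np.polyfit(snowfall, uplift, deg=1)

def predicted_uplift(inches):
    return slope * inches + intercept

print(predicted_uplift(21))  # provisioning estimate for a 21-inch forecast
```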
The only thing that keeps us from getting there is that there aren't anywhere near enough trained data scientists to do the kind of predictive modeling we're talking about. MIT Sloan has a big program they're starting up to train data scientists. But I think over the next decade we're going to slowly morph from people doing business intelligence to people doing data science. And data science is regression, predictive modeling, machine learning, k-nearest neighbors, Bayesian classification, a whole bunch of stuff. These are techniques that data scientists know what to do with. However, they're all based on arrays, not tables. This is not a table world; it's an array world. So if you want to run any of this stuff, it doesn't naturally fit into relational database systems. The tools that data scientists use right now are mostly things like R, SAS, SPSS, and MATLAB, which are array-based tools. So will there be an array-based DBMS for the data science guys? Could be. Maybe not. But I think that's an open area. It will be interesting to watch how things unfold, but the changes that are occurring, and what people want to do, are going to drive continued innovation in data systems. I don't see one size fitting all ever being true again, so I think there is going to be a proliferation of data architectures. Maybe there'll be six, maybe there'll be a dozen, but there isn't going to be one.

Well, I see your point now about the business analysts. I'm hoping for a citizen data science movement.

Yes.

So we don't need the data science group. And I know you've got to go; we're getting the signal to wrap, but maybe we have time for one more question. You created the Ingres technology, and I'm old enough to remember when Ingres and Oracle were equally sized in the market. We saw what happened. And by most experts' reckoning, Ingres was the superior technology. What did you learn from that experience? What can others learn from it?

I think life is really very, very simple. Oracle started first. Ingres had much better technology in everyone's opinion, and in 1983 Ingres was growing faster than Oracle; they were almost neck and neck. Then what happened was that in 1984, IBM announced DB2, and in a single day SQL became the intergalactic standard. Oracle happened to be a SQL system. Everyone agrees QUEL was a better query language, but it was dead in a day. Ingres got SQL quicker than anybody else did, but it took them about a year, and in that intervening year Oracle leaped ahead. So Oracle was three times the size of Ingres at the end of 1984, when they had been roughly neck and neck at the end of 1983. And that was the end of the game, because then Larry Ellison very successfully adopted what's called the America's Cup strategy: if you're the boat ahead, you do exactly what the boat behind you is doing, and you stay ahead. So whatever Ingres did, Oracle copied, and Oracle brilliantly confused future tense with present tense and acted like they had invented it. I think Oracle does that to this day. But fundamentally, they were three times bigger than Ingres, and the game was over.

So the answer is: it was IBM. I mean, it was all SQL. Another dominant position IBM handed off to the competition.

Exactly.

Mike, unfortunately, you have to go. We could go on for a long, long time. Appreciate you coming on theCUBE. It's great to see you again.

Thank you very much.

Keep it right there, everybody. We'll be back on theCUBE with our next guest. Paul Gillin and myself, Dave Vellante. Right back after this.