Well, if I'd known I was going to get such a nice turnout, I would have put together a better presentation. You know how you can tell an extroverted actuary? He looks at your shoes when he's talking to you. I just want to mention a couple of things up front. We're going to get to the topic of the three baskets at about slide 25. No, I'll weave it in. But the point is there's a lot of background material I want to get through before we get to the rationale behind that. But let's just talk about Hadoop for a minute. Hadoop is not going to go away. Hadoop is very raw. Hadoop is not big data. It's part of big data. But there's a lot of other things involved in big data besides Hadoop. And I'm going to mention some of those, and hopefully get through; I have a lot of slides with a lot of information. Now, you all have the slides, right? They're on the CD. You won't have the same exact slides. I think I only added one slide. And I changed the order of a couple. But you'll have all of these slides, basically. I think the most important thing is that big data has no value at all without analytics. But analytics have no value at all without informed decision-making. And this is the biggest problem we have in this industry. I call it the Jordan River phenomenon. We get you to the Jordan River, but we can't get you across. That is, we do the data. We do the analytics. Maybe we even do it all right. But we haven't figured out what happens from the point someone is informed to decisions being made. And even worse, tracking those decisions in the same vein in which they were made. It's a difficult problem. So I would say for the next 15 years or so, until I fall over, that's something I'm going to spend a lot of time on. Because I think that's really crucial. Another thing is, I don't mind if you interrupt me with a question. And I'm not going to ask any kind of survey questions, because I know you never raise your hands anyway.
But I would interrupt you in a session. So feel free to interrupt me. Very quickly, I want to get through this notion of scarcity. Because this is part of the problem and part of the promise of big data. Everything we've done in my very long career has been managing from scarcity. We've never had enough CPU. We've never had enough storage. We've never had enough anything. So we always came up with solutions that manage from scarcity. And with big data, for the first time, we can relax that constraint. We can say, I don't care how many threads I have to use. I'll put together a cluster of 1,000 different machines to solve this problem, if that's what it takes. So when scarcity falls out of the equation, all of your approaches are different. And that, I think, is the essential difference in big data. And remember, it's not just Hadoop. There's a lot of other things involved in big data. So this chart, this graph, or whatever you want to call it, is representational, please. Don't say, oh, that happened in 1962, not 1963. But the point is that in my working life, we've had this division between operational enterprise systems and analytical systems. And they were separated for some reasons that were realistic and some reasons that were artificial. The artificial ones were, oh, you can't do that on my machine. It'll bring the machine to its knees. Oh, there was an element of truth to that. But what we're seeing now is analytics, and I'm going to have to define that for you, and I will, coming together with the operational processes as a whole. And that's going to unleash all sorts of opportunities for us, good and bad. Because don't forget, the bad guys can use this stuff too. So your job here, unless you're a bad guy, is to make sure that you use this for good. Now, in the old days, a classic data warehouse looked something like this. A lot of piece parts, a lot of latency, a lot of stuff moving around, and most importantly, refined, aggregated data.
We never really got down to the nits and bits and pieces. Sure, we built data warehouses that pulled pure transactions out of relational systems. But we're talking about other stuff now. We're talking about machine-generated data. We're talking about web logs and so forth. This is much lower detail and much sparser detail. So the data warehouse was, let's say, a well-defined approach. It didn't always work. Some of the reasons it didn't work are because of what I said. We came up with the data. We came up with the analytical tools, but we never figured out how to use them. In the world of big data, though, this big diagram looks something like this. It's not gone. I don't think the data warehouse is going to disappear. We talk about data virtualization, about pulling data in real time out of operational systems and out of streams and message queues and so forth. But if you've got terabytes and terabytes of historical data that you can report from, why in the world would you kill the data warehouse? What you would do is surround it. You would build these other things around it that complement it. Well, I suppose in this diagram, I could say the data warehouse complements the rest of it. Whoops, sorry. I switched to a Mac, so I no longer know how to use Windows. How the hell do I back up? Okay, there we go. Now, with all the hype around big data, I think that Tom Friedman, I know those of you who don't like liberals don't want to read this, but I think Tom Friedman really gets this, right? Because he talks about how you can innovate and manufacture more products and services, and we're gonna talk about that for a minute because there's a lot more than products and services, that make people's lives more healthy, educated, entertained, productive, and comfortable. That's the promise. Now, why in the world would a technology like big data allow these things to happen? That's one of the things we're gonna talk about. But how big is big data?
Well, that's a big fish. All right, I admit I photoshopped it, but the point is that anything larger than your current methodologies is big data, as far as I'm concerned. Whether it involves unstructured data or multi-structured data or semi-structured data, whether it's real time or not real time, if it stresses your existing methodology, and when I say methodology, I don't just mean your platform, but your whole way of designing and building and maintaining these systems, then it's big data. That means your existing way of doing things. Oh, let's face it, you can have a relational database, data warehouse, and so forth, and ETL tools and BI tools, and be doing what you do and say, ooh, now this is stressing my capacity, but there are other products out there that could take you 10 times or 100 times farther using the same kind of technology. So the point is it's big data if you really can't accommodate it anymore. But the most important thing about big data is what I call strange data, and in this case, diverse and different sources. The data isn't like the data you've normally seen. Most of the data that we've seen in organizations is tabular; in fact, most of it now is relational. So it's relatively easy to handle on a physical level. Obviously the semantics are a big problem, and they're probably gonna remain a big problem. But this data is different, it's ugly. Now one thing I will tell you is that when somebody tells you that they're loading video or images into their Hadoop cluster and doing analysis on them, it's a big fish story. Nobody is doing that except in research labs. We're probably five years away from that. So big data in Hadoop, for example, is, like I said, machine-generated data, sensor data, large streams of data, but mostly it's data that's not yours, okay? Now here's a given. When it comes to analytics, all data that's used in analytics is used data.
You're looking at data that was not designed for the purpose you're using it for. It was designed for the system it comes from. And within that system it may be complete and consistent and coherent, although most of the time it's not. But one attribute of used data that's universal is that used data doesn't like to live with other used data, and bringing different pieces of used data together is a mess. I like to say it's like making sausage or violin strings. You don't really want to know that much about it. So this is the big problem: in order to do analytics, we've barely scratched the surface in figuring out how to do this, and now we have all these crazy data sources we've never seen before, and we're trying to fit them in. And that's why Hadoop is so popular. But have any of you actually had your hands on Hadoop, written a MapReduce routine? One, two, three. So I said I wasn't gonna poll anybody. It's a mess. It's like a hacker's tool. I built my own cluster at home, and I started playing around with MapReduce, wrote some routines, and then of course I immediately went to Mahout because I wanted to see how it did machine learning. What a mess. I haven't written that much code since Lisp. No wonder they want data scientists. Nobody else could do this sort of thing. So, but one thing I will say about Hadoop: it's here to stay. It's not going away. It's not the only answer, but it's here to stay. And because of the way it's done, because it's open source and it's supported by the biggest, meanest data companies in the world, it's gonna improve very rapidly. So it's not gonna be in the same state it is today, but I have to tell you, today it's a mess.
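For those who haven't touched it, the shape of a MapReduce routine can be sketched in a few lines of plain Python. This is just the map / shuffle / reduce pattern as a toy word count, not the actual Hadoop Java API; the input lines are invented for illustration:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (key, value) pairs -- here, (word, 1) for a word count."""
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values -- here, summing the counts."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs analytics", "analytics needs decisions"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

On a real cluster the map and reduce functions run on many nodes and the shuffle moves data over the network; the logic, though, is no more than this.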
I guess this last point, maybe it's a little bit out of place here with how big is big, but I honestly believe that given the difficulty of moving these things around, and the immediacy and the short tenure of some of these data sources in your analysis, what you're gonna find is a lot of companies are gonna bypass Hadoop and go to third-party data aggregators who are doing that work. Why in the world would 500 Fortune 500 companies map the Facebook graph? It doesn't make any sense. Wouldn't you rather pay a penny a sip to get it on Amazon? So you're gonna see a huge opportunity for people to become data aggregators. And in a few minutes, I'm gonna talk to you about a whole other opportunity I call data markets. Really interesting things can come out of this technology. Now, big. Look at these two charts. You can see that the growth curve on both of them is about the same. But what you see is, oh my God, look at the growth in data warehouses from 2001 to 2012, from zero terabytes to 100 terabytes. That's big growth. But look at the perspective from 2000 to 2005. So now put yourself back seven years and say, I can't believe how fast data is growing in 2005. So this is not new. This has been happening all along. You could argue about the shape of the curve and the slope of the curve and so forth, but exponentially growing amounts of data to be analyzed have been around for, I don't know, 20 years. And it's not gonna stop. So what? I mean, who cares? So it gets bigger, we use more computers, faster computers, we use SSD drives instead of magnetic drives, and everything will be hunky-dory. And to a certain extent, that's true. But the other part about big data is that it creates all sorts of new opportunities we didn't have before. Because we can look at things we didn't see before. I mean, it's one thing to look at transactions out of your general ledger and your order management system.
It's another thing to look at telemetry data coming out of machines you manufacture, capturing it over time to look for issues in preventive maintenance or fraud or who knows what. A couple of days ago, I was talking to LexisNexis, and they have a product called HPCC, which I call the Better Hadoop. They've had it for 10 years. They've used it internally for their own data products, speaking of data aggregators and data markets. And one thing they said to me just really startled me. They said, well, one of the things we have is a database of 300 million disambiguated identities. Okay, that's 300 million people and stuff. That's interesting. Yeah, but what's interesting is we've created a graph, a relationship graph, that has five billion nodes in it. So what good is that? I want you to just think about that for a second. Suppose you're trying to figure out who's committing fraud by, say, flipping houses. And the way this works, this is a great scheme. Somebody buys a house at the asking price, gets a mortgage, sells it the next day to another person in the network for more money, who sells it the next day to somebody else for more money, and so forth, and then at some point somebody defaults on the mortgage and takes all the money. Now, if you started looking through records of different transactions and people, it might take you the rest of your life to figure that out. But in a graph, you can see it almost immediately. You can say, holy cow, these people are related, right? Just like that. And there are a million applications for this. We're not just talking about fraud; we're talking about all sorts of applications. So I believe that there are wonderful new opportunities in adopting these techniques, but there's also a lot of danger, and everybody knows the danger of losing our privacy, which of course we don't have anymore anyway. But it also creates all sorts of opportunities for mischief. I just read something today. Some guy at Google was, what did he do?
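The graph intuition behind that flipping scheme is easy to sketch. Assuming a toy table of property transfers (the property ids and names are invented), a ring that would be buried in flat transaction records falls out as soon as you treat the parties as nodes and the sales as edges and look for a party who reappears:

```python
from collections import defaultdict

# Each record is (property_id, seller, buyer); all data is invented.
def find_flip_rings(transfers):
    """Return property ids whose chain of sales revisits a party."""
    chains = defaultdict(list)
    for prop, seller, buyer in transfers:
        chains[prop].append((seller, buyer))
    suspicious = []
    for prop, edges in chains.items():
        seen = set()
        for seller, buyer in edges:
            if buyer in seen:          # a buyer reappears: the ring closes
                suspicious.append(prop)
                break
            seen.update([seller, buyer])
    return suspicious

transfers = [
    ("lot-1", "alice", "bob"),
    ("lot-1", "bob", "carol"),
    ("lot-1", "carol", "alice"),   # the house comes back to alice: a ring
    ("lot-2", "dave", "erin"),     # an ordinary one-off sale
]
```

Scanning flat records for this pattern means joining every sale against every other; walking the edges per property is one pass, which is the point the speaker is making about relationship graphs.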
It had something to do with identities. Well, whatever it was, it was downright criminal. And that's a big problem. And I don't know how we're gonna police that, because at least in the United States, there's not a lot of policing of this. In Europe, it's less of a problem, but in the United States, it's a big problem. So I'm gonna caution you and admonish you about two things. I'm gonna caution you to keep on the lookout for things that happen in this kind of technology that are crooked. I'm also going to admonish you to think seriously about how you can apply this technology to something other than selling shoes to fashionable ladies. Because that's about all we hear about. Oh, we're using big data to sell more shoes. To group our customers, to classify them, get closer to our customers and sell them more junk. Well, here's what I'd like to see this technology used for. Medicine, healthcare. And when I say healthcare, I don't mean helping WellPoint, with Watson, pay fewer claims. I mean healthcare, helping people have better health. Getting to the bottom of things. This was a wonderful book. I'm gonna pitch this book, actually, because I think it's so wonderful. It's from a friend of mine. His name is Eric Topol, MD. He's a cardiologist and a geneticist. And he's just written a book called The Creative Destruction of Medicine. And let me tell you how I met him. He was standing up giving a keynote speech at a conference. And he pulled out his iPhone and slapped a little gizmo on it and pulled something out of his pocket about the size of a credit card. And he said, now, I'm supposed to squeeze gel on this, but I forgot to bring it. So I just have some hand lotion from the hotel room. He rubs it on. He opens up his shirt and he slaps it on his chest. And up on the screen is his EKG. He says, wait, I'm not finished. He brings out another gizmo. This is a little thicker. Slaps some gel on it, well, lotion. Puts it on his chest.
And there's a 3D sonogram of his heart beating, right? And he said, there are 10 cardiologists around the world right now looking at this. And if they wanna come back and look at it tomorrow or the next day, they can do that too. He said, you can, at this point, digitize the entire human being. You can take all of these devices and save that data and not only look at a person as an individual. And we're talking about sequencing their genome, which you can do now for 500 bucks. And not only treat people as individuals, but take all of that data together and begin to study things at the real level of detail. Because folks, if you look at clinical trials that the FDA sponsors in this country, what you're gonna see is there's no medicine involved, there's no science involved, just statistics. They put some people together and they measure things. I saw one recently where they had 2,000 people. This was for a drug, of course. 1,000 people got the drug and 1,000 got the placebo. And 23 of the people on the placebo died, and 12 people died on the drug. And it was trumpeted as a 50% increase in life expectancy. I said, how are 12 people out of 2,000 a 50% increase? This was in the New York Times. This was reported in the New England Journal of Medicine. Yeah, this is where we are. And part of it, of course, is the profit motive above people's health. But the other part of it is, we couldn't study all the variables because we didn't have the capacity to do it. But now we do. We can look at a 1,000-variable problem and we can sort through it and figure out what affects what. Where the collinearity is, where the causal effects are. Or, God forbid, maybe they'll even go into the laboratory and do some experiments. I don't know. It's possible. Then there's this issue with the three Vs. Every technology, of course, has its buzzwords. Volume, velocity and value. Mostly, everybody talks about the volume.
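As an aside, the arithmetic behind that trial headline is worth pinning down. Using the numbers quoted above (1,000 patients per arm, 23 placebo deaths, 12 drug deaths), the "roughly 50%" is a relative risk reduction in deaths over the trial period, not an increase in life expectancy:

```python
# Trial numbers quoted above: 1,000 patients per arm,
# 23 deaths on placebo, 12 on the drug.
n = 1000
placebo_rate = 23 / n          # 2.3% died on placebo
drug_rate = 12 / n             # 1.2% died on the drug

# The headline figure: relative risk reduction in deaths.
relative_reduction = (placebo_rate - drug_rate) / placebo_rate

# The figure that was not reported: the absolute difference.
absolute_reduction = placebo_rate - drug_rate
```

About 48% relative, but barely one percentage point absolute across 2,000 people, and nothing about life expectancy at all, which is exactly the speaker's complaint.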
Big data is about how many terabytes or petabytes. eBay has a 39-petabyte cluster, and that's really interesting. But if you know anything about Hadoop, you know that 39 petabytes isn't all it looks like. Because when you bring data in, can anybody answer this question: how many copies of your data does Hadoop's HDFS make when you load it? Three. It triples the size of your data. And it does that because it has no real fault tolerance. So if a node fails, it uses a copy. And they figure, well, three copies is pretty good; the probability of having it fail three times in a job is low. And if it does, then they just restart the job. Right? I'm trying to tell you, this is not enterprise-class software. But it's open source and it's supported by some huge companies. And it will get better. But right now, it's kind of silly. Velocity is a very interesting thing to say about big data. But when you look at Hadoop, it's a batch system. You don't get an answer in 10 seconds out of Hadoop. It may take 10 hours, right? You have to load it. Also, you have to prepare the data. Well, you can load anything you want in Hadoop, right? You can put pictures. You can put unstructured data, anything you want. But you can't process it till you clean it up. And the cleaning-up process takes most of the effort. Which makes me wonder why we need so many data scientists if 80% of the work isn't data science, okay? But that's, anyway, that's another story. But it is a force, right? I believe today the most useful application of Hadoop in the enterprise is as some form of ETL. And I use the term generically, whether ETL or ELT or some sort of extraction or some sort of transformation. It's good for preparing the data for something else. But it is definitely not the place to be doing your analytics. It's definitely not the place. Now, I will say this. There is a real-time API in Hadoop.
And if you use a product like Cassandra or something like that, you can get data flowing through Hadoop and out to you for real-time decision-making. But I can't imagine why you would do that when you have mature CEP products that already do that. And they do it better. No one has a complaint here. No one wants to argue about any of this. Okay, okay. I guess it's after lunch and you're all half asleep. Main reason for using Hadoop? Pretty obvious: mining data. I mean, this is a survey. So, like all surveys, it depends on how you ask the question. But I think it is interesting that the really complex kind of mathematical stuff is all down there at the bottom. They're still doing the basic kind of business intelligence and analytics and trying to get their hands around the data. Next question is, is Hadoop a competitive advantage or is it a Red Queen experience? Is there anyone here who doesn't understand what I mean by Red Queen? Well, that's always a tough question because nobody wants to say, no, I don't understand. All right. You don't understand Red Queen? Okay, Red Queen is a term from evolutionary biology. And what it means is the Red Queen in Alice Through the Looking-Glass. Remember when Alice was running with the Red Queen and the landscape was moving with them and they couldn't get ahead? So in business, a Red Queen experience simply means that you're not doing this to get a competitive advantage. You're doing this not to fall behind. And you know, that's noble. It's not embarrassing to say you're doing something to not fall behind. Of course, nobody gets quoted in Forbes or something for that. But I believe that big data also has Red Queen aspects to it. Wow, that should have been red, not blue, but, well, anyway, she was the Red Queen. And I think that's important to consider too for your own businesses, if you say, well, you know, we don't see a competitive advantage here. We don't see how we're gonna get ahead of our competitors.
But you know, it's like the old joke about the two guys being chased by the bear, and one guy stops to tie his shoes, and the other says, why are you stopping? We have to outrun the bear. He says, no, I just have to outrun you. I love this sign. You know, I took a picture of this when I drove by it. And I don't even know the context of it. It was just so funny. But unfortunately, it's true sometimes. Anyway, I've coined a phrase. I call them dupe marts. If you know the derogatory phrase spreadmart, it means data marts that were made out of spreadsheets. Well, Hadoop has the same potential to create these things I call dupe marts, where you have a couple of programmers who run Hadoop and they build these silos of data and they become the backstop. They become the go-to guys. And if you want access to that data, you have to go see them. And I've visited quite a few firms where this is already happening. So it's something you have to watch out for. I already told you what I thought about the value in these things. Decision making is not magic. It's not something that just happens. It's like, we're going to do all this work. We're going to have our analytical tools. We're going to do data mining. We're going to do predictive modeling. And we're going to do standard BI and so forth. And then a miracle is going to happen. And we're going to make good decisions. It just isn't like that. Let me digress for a second. There's this thing called the Abilene paradox. Has anybody ever heard of this? This is actually a true story. So there are four people sitting in this house about 60 miles from Abilene, Texas, in the middle of a dust storm. They have no air conditioning. It's blazingly hot. And they're sitting around. Two guys are sitting on the porch playing dominoes and somebody's in the kitchen and somebody's out back, you know, I don't know what.
And one person says, let's get in the car and drive to Abilene and have lunch at the Walgreens counter. And everybody says, okay, let's go do that. And they hop into the car, which of course has no air conditioning. And they drive through this dust storm and they have lunch and then they come back and they're all filthy dirty from the dust and the sweat and everything else. And one person says, that wasn't a very good lunch. And another person says, well, I didn't even want to go. Neither did I. And the other two people in the other room say, well, I thought that was your idea. I didn't want to go. This is the Abilene paradox. And there are hundreds of variations of this: people in groups don't make logical, rational decisions just because they have a decision to make. They do all kinds of goofy things. And we in our business need to understand some of this. We don't. We had this hopeful, naive idea that if we gave them, has anybody ever heard that idiotic phrase, get the right data to the right person at the right time so they can make better decisions? Not gonna happen. Just not gonna happen. All right. Well, I think that people working alone do good work. But when people get in groups without certain kinds of controls, there's the potential for chaos. And if you've worked in a medium- or big-sized organization, I'm sure you've seen it. Yeah, yeah. Well, you know what? I hate talking about IT versus the business. And I'll tell you why. Because if you went to anybody in an organization who wasn't in IT and asked them what they did, they wouldn't say, I'm the business. IT invented that term, right? It's us and them. And that's the reason you have these problems: because there's an us-and-them mentality. Now, I agree that in a lot of organizations, IT is kind of a specialized skill and group compared to the rest of the company, because God knows engineering and sales and marketing and finance, they don't matter, right?
But IT people are just like everybody else, only more so. And data doesn't speak for itself. I can't say this enough. I think we're lulled into a sense of security that if we have the data, we have the knowledge. Not true, absolutely not true. You can't figure something out just by looking at the data. What else do you need? Can anybody volunteer an answer? Context and meaning help. What else? A model, a theory. What's this business all about? What's this organization all about? What are our problems? What's the first thing? What needs our attention, right? So it's not just the data. And that's what I mean by it's not a magic eight ball. Okay, so I'm gonna gather up a couple of petabytes of data, I'm gonna throw it into a cluster, and now I'm gonna know the answer to everything. Not gonna happen, okay? You need a lot more discipline than that. Oh, didn't I just say data doesn't speak for itself? Son of a gun. Well, I have to get through a lot of slides. I think you understand where I'm coming from with this. So what are the lessons from business intelligence that are relevant to big data? And I think this will now be the third time that I've said this, and maybe you think that's because I wanna emphasize it. But it's true. Big data has no value without analytics. Big data doesn't speak for itself, but analytics have to lead to informed decision-making or the loop is incomplete. Pervasive BI never happened. It was a bad idea to begin with. Provisioning data for people to use, along with good technique, that's important. I already said that. Overpromising? Oh, wow, I've overpromised for years. But that's because it was a business. And the biggest problem is we need to disambiguate, I love that word, the term analytics. What does analytics really mean? Well, I've given this a lot of thought, and I may be wrong, but I have a theory. And if I'm lucky, that'll be the next slide. Yes, okay. So look, George Box.
Anybody know George Box? He invented the Box-Jenkins method. You know Box-Jenkins, right? Seasonality, statistical modeling, Box-Jenkins. George Box said, all models are wrong, some are useful. A quote, by the way, that's misattributed to W. Edwards Deming. He didn't say it; George Box did. So this is my model, and it's not meant to be prescriptive. It's meant to add some clarity to what we mean by the word analytics. So as far as I'm concerned, at the very top of the heap, and it's not really a top of the heap in terms of importance or prestige or anything else, it really descends in terms of, maybe, complexity. But you have the true data scientists, and these are the people who create theory. They're in academics or they're in think tanks or they're in, you know, intelligence agencies and that sort of thing. And they don't usually work in the enterprise. But they create this stuff, right? Like Judea Pearl at UCLA, who created Bayesian belief networks. He would be an example of this. Also, unfortunately, the father of Daniel Pearl. Did you know that? But below that are what the industry is calling data scientists. And I don't think they're really scientists, frankly. But the name has stuck, so there's not really anything I can do about it. They may have an advanced degree in math and statistics. They don't necessarily have to have a PhD. This is, by the way, a real breaking point between me and Tom Davenport. We argue about this all the time. He thinks these people have to have PhDs, that they have to be mathematical geniuses. Not necessary, because they have the background and they have the solid business and domain background. But what they're basically doing is applying the models that are developed by the true data scientists, okay? By the guys like Judea Pearl. So I can go out, I'm speaking hypothetically, I can go out and develop a Bayesian belief network, and that would make me a data scientist.
But it's Judea Pearl who's the real data scientist. But forget that, because nobody's calling them that. They're using type two as the term for data scientists now. Now, below that are what I call operational analysts. And these are the people, like in the marketing department and so forth, or in the portfolio department at one of the investment banks that ruined the world's financial system, who took these beautiful models that were built by the quants and applied crazy parameters to them and got us into this mess that we're in. But those are the operational analysts. That's what I call them. And they're not really writing code. They don't really know what the moment generating function of the Poisson distribution is, and they don't care. These models are developed; I've got a predictive model that these guys are using, and maybe it changes once an hour, maybe it changes once a year. But we're using this model, we're supplying the parameters. And we're maybe even doing some A/B testing or champion-challenger, or testing after the fact, to try to test the validity of our assumptions and the parameters and so forth. And then down below are the type fours, the people we know, having done analytics all these years using business intelligence and even spreadsheets and whatever. So to me, this is a good way to look at analytics. It's a pretty broad spectrum of skills and roles. And it's much more diverse, I think, than just saying, oh, that person does analytics. And you wanna fight with me over this? No? Huh, you guys are easy. Okay. So here's the data scientist. I probably shouldn't even say this because I'm being taped, but what I normally say is it was a term invented by a bunch of young punks at Yahoo who think they discovered analytics and poetry and love and everything. Now, the current definition of a data scientist, we've already talked about that. The people who were, let me separate this for a minute.
There are, like, two kinds of organizations in the world, which reminds me of a great line by Mark Twain. He said there are only two kinds of people in the world: the kind who think there are only two kinds of people in the world, and the kind who know better. I had to give a very short introduction at a conference a couple of days ago, and I started it by saying, I'm gonna quote from Mark Twain: I was gonna write you a short note, but I didn't have time, so I wrote you a long one. The current concept of a data scientist is someone who has the skills to program, to manage the data, and a rather deep understanding of the mathematics and the quantitative methods. Where are you gonna find those? Well, I'll tell you one place you can find them. Actuaries. If you don't know what actuaries go through to become an actuary, I'm gonna tell you. A lot of the people I worked with when I was an actuary had bachelor's degrees in math, or sometimes in physics or even biology or something like that, but they had a quantitative capability, and they'd managed to get through a few semesters of calculus and linear algebra. And then you start taking the exams. And the first two exams are really easy, because they're about calculus and linear algebra and statistics and probability. And then when you get to the third one, they get kinda messy. And the failure rate is tremendous. When I was going through the fellowship exams, there were 10 of them; now there are many more, they've broken them up into smaller pieces. But the sixth exam for the Casualty Actuarial Society was a three-day exam, multiple choice in the morning, essay in the afternoon. You study like hell for six months for the exam. The pass rate is 15%. And you're talking about people who've already passed five exams, right? The idea is to keep the numbers small. I mean, you'd probably do better if you weren't so tense about the damn thing.
But when you get through those fellowship exams, they're no longer about mathematics. They're about the mathematics applied to risk, underwriting, pricing, claims, everything about the insurance business. So that when you become a fellow of the Casualty Society or a fellow of the Life Society, you are actually certified, just like a CPA, to sign the statutory annual report of the company. You know how to program, you know quantitative methods, you know everything about the business. Is that not a data scientist? Now, I'm not trying to brag about having lived through that. What I'm saying is that you're not gonna get data scientists out of college. You're gonna get people with the same kind of capabilities, and you, as organizations, have got to set up an internal training program and a mentoring program to bring those people along. Because if you go hire PhDs, sorry for those of you who have PhDs, you're gonna get people who have specialized in one small thing, but they don't have that breadth of information either. So this whole problem of data scientists has got to be solved by the companies that are gonna be employing them. That's my firm belief. And again, if somebody wants to argue with me about that, see me afterwards. No, I'm kidding, I'm kidding. Now, there's another way to become a data scientist. I didn't make this up. You can't make this stuff up, you know? I better watch the time. What time do we finish? I'm sorry? 2:20? Ooh, ooh, okay, no more jokes. Okay, hey, the three baskets. All right, all of that I set up until now leads up to this. And the point is that you have to have the right thing for the right purpose. Relational databases aren't dead. Relational databases have tremendous capabilities. They have years and years of development behind them, not just in data structures, but in security and load management and survivability and all these other things, right? But if you're doing this, you probably want to be in a high-end relational database.
And whether that's Teradata, or an appliance from Teradata, or you get it from Vertica or Netezza or any of those, I'm not gonna pump one over the other. Frankly, I think they're all fantastic. And some are row-based and some are column-based and some are both, okay? Doesn't matter, you know, pick the one you want. But if you're gonna be doing this kind of work, this is where you need to be, because these capabilities don't exist in Hadoop. I mean, they do, but they exist out in the open-source forges, in things you can download and try to understand the code. And they probably don't even work. Well, probably not the way you want them to, anyway. Of course, Hadoop is particularly useful for this stuff, okay? Archival. See, nobody talks about that. If you've got a few petabytes of stuff to dump, that's a good place to put it. It's probably the cheapest place to put it, because if you look at that guy on the right, he's already using SSD and high-priced memory and all sorts of really good stuff. Whereas in Hadoop, you're using the cheapest junk you can put together, and you just lash it together and try to do the work, right? And remember, Hadoop is not a massively parallel processing system. It's a distributed processing system. Each node is completely unaware of what the other nodes are doing. And there's just one control node that understands the whole job. And if that guy fails, the whole job fails, okay? It's really not a robust system. It's a big, fast, number-crunching system. Now, what if the kind of work you want to do is really MapReduce kind of work? Well, then there are all those things in the middle, the hybrid systems that have just come out and are continuing to come out. I think the first one, and I may be wrong about this, but the first one was probably Aster Data Systems, which is now part of Teradata. And what Aster does is it loads all that data into its own file structures and then lets you run SQL against it to generate MapReduce jobs.
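Just to make "MapReduce kind of work" concrete, here's a minimal single-process sketch of the map, shuffle, and reduce phases in plain Python, doing the classic word count. This is illustrative only; on real Hadoop the mappers and reducers would typically be written in Java, and the framework would distribute the input splits across nodes and handle the shuffle.

```python
from collections import defaultdict

def map_phase(document):
    # Mapper: for one input split, emit a (key, value) pair per word.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(pairs):
    # Shuffle groups pairs by key; the reducer then sums each group.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["Hadoop is not big data", "Hadoop is part of big data"]
pairs = [kv for doc in docs for kv in map_phase(doc)]
print(reduce_phase(pairs))  # {'hadoop': 2, 'is': 2, 'not': 1, 'big': 2, ...}
```

Note that nothing in this programming model itself survives a node failure; it's the framework around it (re-running failed tasks, replicating blocks) that supplies what robustness there is.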
So you don't have to fuss with Hadoop or MapReduce. Would you use that instead of Hadoop? Not necessarily. You'd use it if you want to do interactive or ad hoc analysis of that data, instead of running batch jobs that you write in Java and submit across the cluster, okay? Other systems that are doing that: I guess Greenplum, EMC Greenplum, Vertica. I'll probably get shot for not mentioning a couple of the others. Oh, Daniel Abadi. You guys know Daniel Abadi, at Yale? He's started a company called Hadapt, H-A-D-A-P-T. It's a true hybrid database that uses both relational and Hadoop underneath the covers. Pretty cool. There are other things out there. There are products like RainStor, which takes the Hadoop Distributed File System, which is a horrible mess, and replaces it with a much better structure; it just snaps in, snaps out, and gives you much better performance and so forth. I don't know why everybody isn't using it, frankly. So let's talk about the market itself. Maturity level: I probably already mentioned all these things. The clusters, by the way, in most companies I've surveyed, are running at about 20% of capacity. So they've got all this hardware, but they're not really using it. My guess is because they're trying to figure out how to clean up the data. There's a skill shortage. It is very weak at real-time and ad hoc work. It's great for sifting through data, but in the real-time and ad hoc area, it's not gonna be very helpful to you, at least not right now. But there's a version two coming out later this year, and it may change everything. I don't know. I really don't. What's the biggest segment? It's services, okay? I think the real money in big data is gonna be services, because the hardware is cheap, the software is open source, even the distributions are very inexpensive. It's the services that are gonna drive this.
So if you're looking for investment opportunities, if you're looking for a job or something like that, I think services is where this is really gonna happen. Open source. Other pieces of software that wrap around Hadoop also feel the drag-down, because of the low cost of Hadoop itself. You can't get a Hadoop distribution for a $20,000 license and then spend $300,000 for a little utility that works with it. It's not gonna happen. And I think there are lots of other opportunities besides Hadoop. And I'm rushing because I have some other slides I wanna show you. The industry: relational is not dead yet. We talked about that. You'll forgive me for a minute. I believe a lot of the investments in big data, not just Hadoop, are gonna fail. And I think the reason is the same reason that data warehousing and BI weren't very successful either. I know that isn't fair to say, but it never met its real objectives. And I think the reason for that is we're still not at the point of understanding what it takes to build systems to help people with decision-making. We only know how to inform them, at best. I don't know why this sentence is in there, but: it doesn't make any sense to do analytics in the cloud, it's too slow. Oh yeah, I'm sorry, sensor data. You know, sensor data... I have one client that makes extremely expensive medical equipment, and this stuff sends out 200 pieces of information every 10 seconds back to headquarters, and there are thousands of these machines out there. So you can imagine over time how much data this is. The engineers look at it on a current basis, I think using MySQL, and just take some of it to look at for preventative-maintenance kind of stuff, and then they throw it away. Then they said, why don't we do this in the cloud? We don't have the hardware. But by the time you gather the data, move it to the cloud, analyze it there, and move it back to another platform, you know, you're nuts.
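A quick back-of-the-envelope on that sensor stream: the 200-readings-every-10-seconds rate is from the example above, but the fleet size and record size below are assumptions made up purely to show the scale.

```python
readings_per_burst = 200       # "200 pieces of information" (from the example)
seconds_between_bursts = 10    # every 10 seconds (from the example)
machines = 2_000               # "thousands of these machines" -- assumed
bytes_per_reading = 100        # hypothetical average record size

bursts_per_day = 86_400 // seconds_between_bursts  # 8,640 bursts per machine per day
readings_per_day = bursts_per_day * readings_per_burst * machines
tb_per_year = readings_per_day * bytes_per_reading * 365 / 1e12

print(f"{readings_per_day:,} readings/day across the fleet")
print(f"{tb_per_year:,.0f} TB/year at the assumed record size")
```

Even at these toy assumptions you're into billions of readings a day and hundreds of terabytes a year, which is why today they look at a slice and throw the rest away.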
So I don't see the cloud as being a good opportunity for real-time kinds of analytics. Okay, let's talk about healthcare. This is what I really wanted to get to, and that is: what are the industries that can really use this stuff for advantage? Definitely better outcomes. Genomics and personalized medicine is really the future. Did I pump that book already? I did say The Creative Destruction of Medicine. Now, reduced in-hospital bed-stays. This is one of those things where I scratch my head and say, well, I guess that's a good idea if people are spending too long in the hospital, but I don't know if that's the case, right? Maybe they're just kicking them out early. Better-informed patients would be a nice thing. Yes, sir? It's actually re-hospitalization? Yeah, re-hospitalization. When they talk about outcomes and how they're gonna compensate doctors and hospitals, one of the things they measure is readmissions to the hospital. Talk about a silly surrogate endpoint to evaluate, right? Well, what if somebody needs to be readmitted to the hospital and it doesn't have anything to do with the care they were getting, right? So they never look at enough variables. They just say, oh, we're gonna measure that one, and you're gonna be 3.5, so you don't get any money. Anyway, another thing that's very interesting is clinical trial selection. Now, when you look at a clinical trial, they have to figure out who they're gonna put in the clinical trial to avoid as much bias as possible. But oftentimes those decisions... and this is usually done by, what are they called, CPOs or CBOs? David, do you know? What's that? CROs? Yes, CROs, that's it, CROs. These are outsourced companies that actually run the clinical trials, and they have a lot of techniques for admitting people to the trial. But it turns out... I mean, look at the Women's Health Initiative. They studied hormones.
Most of the women were 65 years or older, and a lot of them were already pretty ill. So it's like a useless study. But it only cost us $700 million, so it wasn't so bad. The future of retail: I mean, you can read the slide as well as I can. Everybody in the world is talking about using this for retail, so you don't need to hear me say anything about it. But manufacturing, I think, is a really interesting thing too. In any kind of analytics, and in BI, we so seldom talk about manufacturing. You know, we're always talking about all these other things. But there are all sorts of things you can do with manufacturing if you can actually quantify it. Let me give you an example. And this isn't really manufacturing, but it's kind of the same. When a commercial airliner flies from New York to Los Angeles, it generates four terabytes of telemetry data during that flight. Now, it's analyzed in real time, in the plane's computers and on the ground. But if the plane lands without incident, it's flushed. Because if you're running 35,000 flights a day, times four terabytes, times 365 days a year, I don't know if any real quick arithmetic genius is here, but that's a lot of data (about 140 petabytes a day, on the order of 50 exabytes a year). And what are you gonna do with that? But the point is, now you can. You can save it on these incredibly cheap drives and these cheap machines that you lash together. And then you can start sifting through that data and say, you know, we think there may be a relationship between this particular engine failure and some other things. Or do some clustering and say, you know, what are the variables here? Do some... you know, sorry, it's getting late in the day. Oh, God, Jamie, come on. Reducing the number of variables. Component? Yeah, but it's like three letters. What? What did he say? No, no, no, no, no. It's something-component... a critical component analysis? Oh, God, I forgot the phrase. Variables. Yeah, you had to simplify the calculation. Right, what the hell is it anyway?
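(For the record, the three-letter phrase being groped for is PCA, principal component analysis, which does exactly what's described: reduces the number of variables.) A minimal sketch with numpy's SVD, on synthetic data invented purely for illustration:

```python
import numpy as np

def pca_reduce(X, k):
    # Center the variables, then project onto the top-k principal
    # components (right singular vectors of the centered data matrix).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T  # n samples, k derived variables

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                    # 3 independent variables
X = np.column_stack([X, X[:, 0] + 0.01 * rng.normal(size=100)])  # 4th nearly copies the 1st
Z = pca_reduce(X, 3)  # 4 correlated variables reduced to 3 components
print(Z.shape)        # (100, 3)
```

The idea in the engine-failure example would be the same: collapse many correlated telemetry variables into a handful of components before you go looking for relationships.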
First time I swore in the whole 45 minutes. All right, this is another thing you should know. I know it's funny, but it's real. A lot of us, and a lot of the ways we do things, are based on this model. But this is where we are now, okay? So you have to think about how people are using data and making decisions, and where they are with this model, because this is what's happening now. And, you know, we're probably still back at the other one. And here, I looked for a picture of someone who was horrified. I heard someone at an IBM conference say, it's the onset of a new age of enlightenment. I almost lost my lunch. You know, I approach these things with my usual ennui and world-weariness. But even this... anyway, I firmly believe that no one can predict individual behavior at a point in time. You can predict propensity. You can predict the behavior of a group to a certain extent. But you cannot determine what one single person is gonna do in the next single incident. Impossible. One of the nice things about being human. I'm worried about Hadoop marts. I think a lot of people are worried about this because they're terrified of math, which, if you notice my tie, I'm not. Actually, I wear this to keep math vibes away from me. No, that isn't true. Here's another problem to think about. Data scientists, if they're really data scientists, have to spend their lives reporting to people who have no idea what they're talking about. Think how miserable that is, right? And Hadoop has lots of shortcomings. So here's the promise, right? It goes beyond cool new algorithms and selling shoes to fashionable ladies. That's my term, thank you. Medicine, science, environment, poverty, and disease. Hey, how about this? Let's use big data to minimize the number of civilian casualties in a war. That's a big data problem. Why don't we do that, you know?
Instead of Facebook trying to figure out what we're up to next. Now, this is all real cool, but. But. But. Now, unfortunately, my time is up, and I have a whole bunch of other slides. I knew I wasn't gonna get to them, but I also knew you'd have them. And they're texts that you can read. But, you know, even though we're late, I wouldn't mind taking a couple of questions. So, anyone? Oh, great. Come on, Jamie. Well, yeah, Jamie's question was, isn't there something below type four? Someone who just doesn't even use a tool, just goes, you know, whatever. Yes, but that wouldn't be any fun to talk about. Right, yeah, that's it. Yeah, yes, sir. Stop that. Okay, you guys are free to go. Thank you, you've been a great audience.