Live from the Fairmont Hotel in San Jose, California, it's theCUBE at Big Data SV 2015. Okay, welcome back everyone. You're watching theCUBE. We're live in Silicon Valley at Big Data SV. This is our second event here in conjunction with the Strata + Hadoop World conference, and we're covering all the action in Big Data. It's theCUBE, our flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, the founder of SiliconANGLE, with my co-host Jeff Kelly. Big Data SV 2015. Wikibon.org published the Big Data survey and actually called it right about practitioners creating all the value, and certainly that's the theme of this show. Follow the money. The money is in computer science, it's in machine learning, it's in all the greatness of Big Data, and certainly customer acceleration is the theme. Our next guest is Sri Ambati, CEO of H2O. Welcome back to theCUBE. Good to see you again. I'm excited. Machine learning is fun, as the slogan of what you guys do says, and Big Data is an interesting market these days, because depending on who you talk to, you're talking to an analytics guy, a database guy, computer science, math, or some other discipline, right? So you guys are doing some compelling work. IBM was on earlier and I asked them a direct question: what is the hottest wave that no one sees that's going to be a real competitive advantage for companies? The answer was machine learning. So obviously you agree with that. What's your take on this piece of Big Data? Where's the action at? What's your perspective? So one of the key ways we look at it is that ML is the new SQL, right? SQL defined databases back in the 80s and 90s, and everything we know today is a query. Everything we will do in the future with data will be pattern recognition. I have done X, what else can I do? I want a recommendation for X. I want to do this, this, this. What else will fall out of it?
It's prediction. It's the future: not looking backwards at what happened, but looking forward at what can happen, what the possibilities are, and what the probabilities of those possibilities are, based on patterns in your data. So Big Data has arrived; you've saved all the Big Data. Now you're looking for patterns, sifting for patterns, whether it's fraud detection or churn. You want to know which customer is more likely to pay. You want better pipeline prediction from real data. All of this is machine learning, and that's kind of where the crux of the matter is. To drive value out of your data, you need machine learning. So let's take a step back. We're super hot on machine learning. We love Big Data. We just really get intoxicated by all the possibilities around what computer science and all these new disciplines are doing. But let's take a step back. Machine learning is a big, important component because... why? Data is at the center of the value proposition. What is it about the data trends now that makes machine learning so popular? Is it the tsunami of data? Is it that software hasn't really dealt with data like this in the past? What's the big deal? Why is machine learning so hot? So it's a very good point. When Alan Turing came up with the Turing machine in the 40s and 50s, cracking the code, he was peering into the initial pieces of machine learning and AI. AI has been with us for a long time and will be with us for a long time. And in the 80s and the 90s, that's when the real innovation of deep learning and neural networks came about, with this promise that we can understand how the brain thinks and apply that to our businesses. And that never really came to fruition, for lack of sufficient hardware, efficient distributed systems, cheap, inexpensive hardware and memory, and data and networks.
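The "I have done X, what else can I do?" recommendation idea can be sketched with a toy co-occurrence recommender. This is a minimal illustration, not anything from the interview; the baskets and item names are made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories (illustrative data only).
baskets = [
    {"laptop", "mouse", "bag"},
    {"laptop", "mouse"},
    {"phone", "case"},
    {"laptop", "bag"},
    {"phone", "case", "charger"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1

def recommend(item, k=2):
    """Items most often co-purchased with `item`: 'I bought X, what else?'"""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common(k)]

print(recommend("laptop"))  # the items most often bought alongside a laptop
```

Real recommenders replace the raw counts with learned models, but the shape is the same: mine patterns from past behavior, then score candidates against them.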
So Moore's Law brought us into a space where computing became cheaper, as you said, and data collection became cheap too. And what happens with statistics and machine learning algorithms is that they follow Shannon's information theory: after enough data, the entropy actually starts peaking. You can actually derive causality from data, not just correlations. So now you can actually say, this weather is driving my business, my retail business. The weather, the rain conditions, are what define Bordeaux prices, not much else. So you can now really drill down, because you have enough data. And so it's a combination of the power of data, the unreasonable effectiveness of large amounts of data, which Google has demonstrated to us. The second aspect of it is that people built the static internet, then they built the data-driven internet, and now they want to build a smarter internet. And the internet has also been dispersed. All the intelligence is at the edge. All the smarts have gone to the edge. Sensing is happening at the edge. So you have almost a three-way mix between data, IoT, and intelligence. And what powers that intelligence is machine learning. Dave Vellante and I always talk about the magic juju. It's like the magic underneath. You put machine learning in any presentation and it sounds impressive. I mean, certainly machine learning is a technology. So I have to ask you, can you talk about the difference between unsupervised machine learning and supervised machine learning, and the role of the human in this, whether the coder or the data scientist? I think this is where machines can add interesting value. Certainly the internet of things brings this deluge of data. So what does this all mean? Unsupervised versus supervised, and the role of the human.
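The point about entropy "peaking" after enough data can be illustrated loosely: the empirical Shannon entropy of a stream levels off near the source's true entropy once the sample is large enough, after which more data adds little new information about the distribution. The toy weather source below is invented for the example.

```python
import math
import random

random.seed(0)

def empirical_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of `samples`."""
    counts = {}
    for s in samples:
        counts[s] = counts.get(s, 0) + 1
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A biased four-outcome "weather" source: p = 0.5, 0.3, 0.1, 0.1.
population = ["sun"] * 5 + ["rain"] * 3 + ["fog"] * 1 + ["snow"] * 1
true_entropy = empirical_entropy(population)  # entropy of the true distribution

for n in (10, 100, 10_000):
    est = empirical_entropy(random.choices(population, k=n))
    print(n, round(est, 3))
# The estimates settle near the true value (~1.685 bits) as n grows;
# past that point, more samples stop changing the picture.
```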
So if you think about it, really bringing it down to a very simple point: you're taking each data point, every event that's coming at you in life, and you're kind of giving it a score. Is it a positive thing? Is it a happy thing? Is it not a happy thing? You're trying to make that call. Smiley face versus sad face, for everything. And how did you come up with that? Because you've experienced life in different forms, in unique ways. You built a model for the world. And as an event is coming at you, you're scoring it as good, bad, or happy, continuously. This is what humans and living systems have been doing for a long time. That model building is a historical model. So historically, with data warehousing and data mining, we've kind of relinquished intelligence to an offline process all along. What we are seeing now is the advent of speed, and technologies like H2O and others have made it very fast. You can actually do this modeling and learning online. In other words, as the data is coming at you, you're learning the patterns behind it. So now you can actually learn from the new patterns. And when you have to learn from new patterns, you have no labeled results. You don't know what is good, you don't know what is bad yet. As a result, it's unsupervised learning. You're learning off the data as it's coming at you: outliers, anomalies. Is the signal a heart attack, or is the signal a blip for this person because he runs very fast, he cycles a lot? You're trying to learn on the fly, doing unsupervised learning and building models on the fly. Then you have a supervised model as your back end, as the big center of intelligence. You connect the two models, correlate, and improve the old model. If this was a happy face, you want to say, yeah, I want to see this director's movies as I go forward in my life. I thought this was going to be a horrible experience.
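The online, label-free side of this (learn what "normal" looks like as events stream in, then score each new event against it) can be sketched with a running mean/variance model, using Welford's streaming algorithm. This is a generic illustration of the idea, not H2O's implementation, and the heart-rate numbers are made up.

```python
import math

class OnlineAnomalyScorer:
    """Unsupervised, online scoring: maintain a streaming model of 'normal'
    (Welford's algorithm for running mean/variance) and score each new event
    by how far it falls from everything seen so far. No labels required."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def score(self, x):
        # z-score of x against the model *before* folding x in
        z = 0.0
        if self.n > 1:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0:
                z = abs(x - self.mean) / std
        # update the running model with the new observation
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return z

# Heart-rate stream: steady readings, then one spike (illustrative numbers).
scorer = OnlineAnomalyScorer()
readings = [62, 64, 63, 61, 65, 62, 63, 150]
scores = [scorer.score(r) for r in readings]
print(scores[-1])  # the spike scores far outside the learned 'normal'
```

Whether that spike is a heart attack or just a sprinter is exactly where the supervised back-end model comes in: the online scorer flags, the historical model interprets.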
Well, look, this movie is actually beautiful. So you try to really learn, improve the models, and then complete the loop very fast. Let's talk about, and you kind of alluded to this, essentially operationalizing Big Data versus that offline mode, which is very analogous to the data warehouse world. I think a lot of Big Data practitioners have been looking at Big Data as just a bigger data warehouse. But in fact, the real value is going to be in operationalizing those insights and learning as more data comes in. It sounds like that's where machine learning really can play a role. I think it's a combination. If you think about the amount of data stored historically, let's say weather data, we do need the historical data to predict what might happen, as well as changes. And so you combine what you've learned, taking a kind of data assessment of all the data you have. Most of the time when we talk to customers, they don't even know what data sets they have. And once they bring that data to the table, they want to look at what data they can bring in, combine all those different data blocks, and flatten them in a denormalized fashion, which is the kind of thing that is happening in the NoSQL space with denormalized flat data. You're already actually getting the flat data sets on which you can run analytics. But that's the kind of thing people are doing in the beginning. Then comes cleaning data, which actually consumes a large part of the job; most of our data scientists spend 80% of their time cleaning data. And then, post cleaning and munging, this data is now ready-to-go-to-the-prom kind of data. That's when you're running algorithms. That's the beautiful part, which gives you all the value. You're now beginning to derive insights and make decisions from a reasonably clean data set. So what Hadoop has done, for the most part, is get us that lake of data, a kind of data lake.
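The cleaning-and-munging step that eats so much of the job looks, in miniature, like this. A hedged, stdlib-only sketch; the raw export, column names, and values are invented for the example.

```python
import csv
import io

# Hypothetical raw export: mixed case, stray whitespace, missing values,
# near-duplicates -- the kind of munging that eats 80% of the time.
raw = """customer,region,spend
 Alice ,WEST,120.50
bob,west,
alice,West,120.50
Carol,EAST,88
"""

rows = list(csv.DictReader(io.StringIO(raw)))

cleaned, seen = [], set()
for row in rows:
    name = row["customer"].strip().lower()
    region = row["region"].strip().lower()
    spend = row["spend"].strip() if row["spend"] else ""
    if not spend:            # drop records with missing values
        continue
    key = (name, region)
    if key in seen:          # drop duplicates that surface after normalization
        continue
    seen.add(key)
    cleaned.append({"customer": name, "region": region, "spend": float(spend)})

print(cleaned)  # two clean rows: alice/west/120.5 and carol/east/88.0
```

Only after a pass like this does the data become the "ready to go to the prom" kind that algorithms can derive real insight from.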
Now once you put the data lakes into play, every data lake needs a predictive modeling factory. Some of our customers build 60,000 models twice a quarter to predict their quarterly demand and sales. And if they could do that not every six weeks but every two days, every half a day, they could iterate their business weekly, not twice a quarter. Then they can actually feed back and have just-in-time inventory to reach their quarterly goals, which has substantial benefits for whether they're profitable or not. So these are real business transformations that are going to happen. But that's really old-school modeling, in a factory model. The next phase that we are seeing is the onset of embedding machine learning into applications. And this is not something out of the ordinary. If you think about why this is happening: more data is being created on an hourly basis than was created a few years ago in a month. So data velocity is causing data growth to be very fast, and where the growth is that high, we actually have to assess data on the fly. In other words, data is not coming from the ether; it's coming from applications. And there are more applications now for the ordinary person than ever before. So embedding analytics into applications, data, analytics, and applications, that's the DNA of your business. And when you apply that, you have created a very closed-loop cycle where you can make quick decisions on data as it's changing. You can impact your customer while he's on your website, not after he's gone home, trying to get him back. Personalization. It's all about user experience. I mean, eventually you want to deliver machine learning through better user experiences. What's your take on Spark and all this in-memory stuff? Obviously we're going to hear from the silicon folks about analytics in the silicon. I mean, the trends are riding Moore's Law. Absolutely.
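The "predictive modeling factory" idea, one small model per business segment, refit on every batch of fresh data, can be sketched like this. A hedged toy version: the segments, sales numbers, and the choice of a simple least-squares line are all illustrative, not the customer's actual setup.

```python
# One tiny demand model per (product, region) segment, refit per batch.
# Fitting tens of thousands of these per run is what makes a data lake
# iterable weekly instead of twice a quarter.

def fit_line(points):
    """Least-squares fit y = a + b*x for a list of (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# (week, units_sold) history per segment -- made-up numbers.
history = {
    ("widget", "west"): [(1, 100), (2, 110), (3, 120), (4, 130)],
    ("widget", "east"): [(1, 80), (2, 78), (3, 76), (4, 74)],
}

# The "factory": refit every segment's model from the latest history.
factory = {segment: fit_line(points) for segment, points in history.items()}

def forecast(segment, week):
    a, b = factory[segment]
    return a + b * week

print(round(forecast(("widget", "west"), 5)))  # rising segment -> 140
print(round(forecast(("widget", "east"), 5)))  # declining segment -> 72
```

Shrinking the refit cycle of this loop from six weeks to half a day is the business transformation being described: the forecasts feed back into inventory while the quarter is still in play.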
That's goodness for you guys, right? Absolutely. We've seen that for voice recognition. Hello, Google. I mean, there's a deep learning neural network predicting it, translating it to "Hello, Google" every time someone says that to their Android phone today. So there's deep learning. Then there's the other big trend in the space, Spark; we'll get to that in a quick second. The focus we've had is on getting sophisticated algorithms out there, not just simple algorithms. People were always able to run a mean at scale on all their data. The problem with a mean on all the data is that it's not getting better after a million samples, right? The big difference between before and after H2O is that with H2O, you're able to run sophisticated algorithms on all of your data. You can run logistic regression, GLM, on all of your data, gradient boosting machines on all of your data, in the same time it took to run on a small sampled dataset. So we killed sampling, and that helped: more data, more accuracy. More accuracy. It's really about speed, accuracy, and adding more data. And the latency piece gets solved partly by external factors, Moore's Law and compute, and partly by better implementation. What we've done is align our data along the cache lines to make it really fast. So it's a fast implementation. The output of the modeling is a scoring engine. The scoring engine is nothing but simple scoring, judging; it's a rule engine in the old world. Historically, you built apps based on rules, pre-existing logic. With machine learning, you build apps that can change dynamically; the app is learning and changing the rules as well. So that rule engine that we produce, called a scoring engine in our space, is nanosecond fast. And when it's very fast, you can run hundreds of these models, so you don't have to pick the best model. You automate the heck out of it, so it becomes an ensemble model. No one model defines even one person.
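The "hundreds of models, don't pick the best one" idea is ensembling. Here is a hedged, stdlib-only sketch: many cheap models (decision stumps) trained on bootstrap resamples, with the ensemble's vote fraction acting as the fast per-event scoring step. The data and model choice are invented for illustration; production systems use stronger base learners.

```python
import random

random.seed(1)

# Toy labeled data: one feature in [0, 9.9], label 1 when the feature is high.
data = [(x / 10, 1 if x >= 50 else 0) for x in range(100)]

def fit_stump(sample):
    """Pick the threshold that best separates the two labels in `sample`."""
    best_t, best_acc = 0.0, -1.0
    for t in [x / 10 for x in range(0, 100, 5)]:
        acc = sum((x >= t) == bool(y) for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Hundreds of cheap models, each fit on a bootstrap resample of the data.
stumps = []
for _ in range(200):
    sample = [random.choice(data) for _ in data]
    stumps.append(fit_stump(sample))

def score(x):
    """Ensemble score: the fraction of models voting 'positive'. This tiny
    lookup is the whole scoring step, cheap enough to run on every event."""
    return sum(x >= t for t in stumps) / len(stumps)

print(score(9.0), score(1.0))  # near 1.0 for a high value, near 0.0 for a low one
```

Averaging over the population of models is what makes the prediction robust: no single model's quirks dominate, which is the point of "no one model defines even one person."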
So you kind of have to own up and say, I can put hundreds of models on my population, and then you have a much more robust prediction that you can stand behind. But going back to your other question about the in-memory trend: the in-memory trend is huge, of course. All our customers are now able to use Sparkling Water. We came up with a very good way to use Spark and H2O together, where you can go back and forth with the killer app on top of H2O. You can pull data with Spark SQL, do all the pipelines, and then with a simple H2O RDD start running H2O algorithms, run MLlib algorithms, and then combine the results of the two and start scoring. So we want to capture the end-to-end user experience. Eventually this is all going to boil down to having better workflows, better ways of sharing your results and how you got to the result, and improving the overall data science knowledge across companies. What we see in our customers is massive silos of smart people trapped solving their own data problems in similar ways. So when we actually work within a big company, we manage to connect them all together, and then suddenly they're sharing: oh, I used these three things and I got a better result to create my customer journey. I am using a gradient boosting machine as opposed to an old-school generalized linear model for this particular problem. And suddenly they're exchanging results with each other. So that's what we're doing. We're getting the hook here, but I want to ask one final question, because it brings up the data science debate we've had before. Not everyone can be a data scientist, and not everyone can be a machine learning magician and coder and implementer. It's hard. But it's getting better. I mean, more people take the machine learning course from Stanford online than probably... H2O has quite a bit of that, by the way. We walk into customers who say, you lowered the bar in ease-of-use terms. So this is my point. This is my point.
What do people want? How do you help me? So what is it? How do you guys help developers? We've come up with an application program for artificial intelligence where it's all about applications. The last mile: applications solve problems. Tools help one or two people; applications help hundreds of people in one shot. So what we're doing now is a first-Friday hackathon where everybody sits at the table like the rest of us, builds applications, uses data science, and makes applications smarter. So you come with an idea. You come there with Lending Club data; we build a better lending system. You come there with the Bordeaux data; you predict which wine you want. We actually predicted what move Belichick would make at the tail end of the game. Things like that. You actually can predict each move. Three cornerbacks and a goal-line stand. Exactly. Can you predict and prescribe? Yes. That's the last mile. But the crux of it is that it's a hard problem, because when innovation happens, it happens at the border of two different domains. You know we're Patriots fans, so you got on our good side, right? Absolutely. Absolutely. Absolutely. So the crux of it is math and engineering. If software is eating everything, software is going to need data science as well. So you have to let developers learn data science, and data scientists learn software engineering, so they can build better applications. The next phase is data science and business. Business people are looking to make money. Data scientists are looking to find the truth. So you kind of have to blend those two together and come up with beautiful applications. That's where the rubber meets the road.
Tomorrow, when you're going to find where you want to eat lunch, or who you want to go out to dinner with, or whom you're going to meet in your life, whom you're going to build a company with or build a team with and invent with, all of this is going to be defined by machine learning. Machine learning is going to define human experience in a way that... And data is critical in the development process and all this. Absolutely. I mean, think about Hurricane Katrina and how we can save lives, right? Walmart's machine learning algorithms predicted demand, and they stocked staples long before Katrina arrived. And they were better prepared than FEMA. So algorithms will actually save lives. And the more data in, the less sampling, the more accuracy. That's the innovation to me, right? You agree? No sampling of the data. Outliers are no longer things you remove; they're actually what define the rest of the data, in some sense. So the game that's changed, if I hear you correctly, is speed, scale, data, and accuracy. How you get the data, how you act on it, and how you report it is critical. Absolutely. That's what's changing. The last mile is that after you've done all the analysis, you present it in a way that evokes a response. Data-driven decision making still needs courage, right? You have to evoke an emotional response to your data science. It's actually data art, not data science, where you can come up with... Data is the new clay, right? So you've got to come up with... We could talk for an hour. People have to react to it. We love this topic. We've got to break, we're getting the hook from the folks here, but great conversation. We love this. The causation and entropy message is totally right on. You're looking at a whole new game changing here. What we are seeing is nothing short of a revolution. Over the last year, we've had 100x growth in adoption.
We're seeing 2,000 corporations that have installed our product, and we are now beginning to see large customers becoming partners in building a great ecosystem around us. It's changing the software development lifecycle, certainly, and developers need this new asset. Absolutely. This is going to transform so many businesses and so many applications. It saves time, too, and reduces the steps it takes to get things done. We talk about so much innovation in the valley, and then you see Apple, year after year, quarter after quarter, making things simple, easy to use, and getting them to market. That's a form of success. It saves time, money, and steps to do things. It's a winning formula for innovation. Time is the only non-renewable resource. Well, the exciting thing from a market standpoint is that the software market's changing. Certainly open source, as we were talking about earlier, is everywhere. It's the new standard. New things like data, machine learning, the science behind it. It's really compelling. Congratulations to you on all the great success. This is theCUBE, bringing our entropy and data science to the table by extracting the signal and sharing it with you. Thanks for watching. Really appreciate it, and thanks for coming on. Thank you for having me. And good job. Appreciate it. Thanks.