Hi everybody, we're back. This is Dave Vellante live from Santa Clara, California. We're at Strata, the Strata Conference, the O'Reilly Media big data event, making data work. This is day two for us. It's day two for the conference. Yesterday was a lot of deep dives, a lot of practitioners. Today really kicked off the morning sessions, the big tent, the keynotes. We've had a number of those morning speakers on here. I'm here with my colleague, Jeff Kelly, who's a big data analyst. This is SiliconANGLE's continuous coverage, theCUBE, where we bring you all the smartest people, the knowledge, the technology people, the practitioners, the bloggers, the opinion makers, and we're here with Mike Hoskins, who's the CTO of Pervasive Software. Mike, welcome to theCUBE.

Thank you very much, Dave.

Pervasive: making Hadoop faster and easier. Two things that we need, and that's what you guys are all about. Is that right?

Yeah, in a nutshell, actually, you've captured it very well. We just announced our first product from the fledgling Pervasive Big Data startup division that we have inside Pervasive Software. It's called Rush Analyzer, and we're showing it inside in the booth, and it's exactly about making big data processing, in this case predictive analytics and data mining, faster and easier. You nailed it.

Yeah, so why is Hadoop slow and hard?

Well, it's not slow and hard.

Would you characterize it as fast and easy?

No, but I think it's safe to say... I'll just go to the extremes. To be fair, I think immature is maybe...

My words, folks, not Mike's. Slow and hard. We love Hadoop.

...a good designation. So I'm CTO of Pervasive. We're a public company. We do about $50 million a year in sales. Been around a long time. Headquartered in Austin, Texas. We started doing something that we call the Pervasive innovation dividend. We invest some of our money in startups that are organically growing inside Pervasive. We've got one around cloud computing called the Pervasive Data Cloud.
And then I just took over a new division that we created from scratch inside Pervasive, only 15 people, called Pervasive Big Data, and it's kind of a merger of our DataRush product, which is our next-generation massively parallel, scalable, fine-grained, thread-level processing engine, and Hadoop. And for several years now, we've been investing heavily in DataRush and that parallel processing technology. But for three years I've been running something called the Innovation Lab inside Pervasive, where I get to experiment with any technology I want. And I picked Hadoop early, three years ago, as something to keep my eyes on. And the more I got into it, the more I was really, really impressed with it. And so I think it is fair to say it's immature. I think we're in a highly disruptive period, actually a fascinating period for those of us who are into data and data management, as we're kind of bridging from the old to something new. We don't know what it's going to be yet, but Hadoop's going to be a big part of it. A lot of what I'm saying is tongue-in-cheek, right? Not that long ago we were in the "what is Hadoop?" mode, and now we're really beginning to solve some pretty serious business problems. Obviously the internet giants are using it, and at conferences like this we're seeing enterprises everywhere. We had Nokia on earlier today, and we've had many other customers on. So it's real.

So talk about your business model with DataRush. It's actually quite interesting here; you talk about Pervasive. It was a big theme in our community about how the big whales just don't innovate anymore. They buy companies to innovate. You guys, at well under $100 million, are still very innovative, and yet you feel it's necessary to have these little skunkworks going on to really stay ahead of the curve. So that's quite an interesting model. What's your model with DataRush and big data? Is it partly open source? Do you have a community edition?
How do you go to market with this stuff?

So some of this is being formulated as we speak, on the fly. We're, again, only five months old, and thank you for recognizing the attempt at innovation. We're a 30-year-old company, and it's hard sometimes. So we did do this skunkworks model. We literally put people away and said, you've got to use new technologies and new IDEs and new tools. DataRush kind of sprung out of that. It's a platform. We saw the multi-core revolution very early, earlier than almost everybody. DataRush fully exploits all the cores in your servers. It does very fine-grained thread-level programming. Using a data-flow technology, it can process huge volumes of data at extremely low latencies, at just CPU and memory speeds. We don't do coarse-grained parallelism. We're not doing inter-process communication. We're not spilling the data onto the disk and off the disk. There's a lot of inherent advantage in DataRush. So we think that's our secret sauce. We've got it. We've used it for several years to build data quality tools. For example, we ship some data profiling and data matching tools. But the big idea here is to blend DataRush into the Hadoop infrastructure and try and bring some value there. So first a little bit about the technology, and then I'll come back to your business model.

Yeah, let's stay on the technology for a bit, because I've got some follow-ups. We'll come back to the business model. This is really interesting. It's your sweet spot. Our audience loves this stuff.

I think we have the secret sauce in DataRush. It gives us some really game-changing price-performance advantages. I mean, you bought the hardware; you ought to use it. Server utilization rates are shockingly low in the industry in general. If you're getting 15-20% utilization, you pat yourself on the back. Maybe we have virtualization that chops up an intelligent machine into several less intelligent machines so that we can get some higher utilization on the server.

Yeah, I get 25-30.
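The data-flow idea Mike describes, fine-grained operators pipelined through memory rather than spilling intermediate results to disk, can be sketched roughly like this. This is a toy single-machine illustration in Python, not Pervasive's actual API; all names here are mine:

```python
import threading
import queue

# Toy data-flow pipeline: each operator runs on its own thread and
# streams records downstream through in-memory queues, so the stages
# overlap in time instead of materializing results on disk.
SENTINEL = object()

def operator(fn, inq, outq):
    """Apply fn to each record from inq, pass results to outq."""
    while True:
        rec = inq.get()
        if rec is SENTINEL:
            outq.put(SENTINEL)  # propagate end-of-stream downstream
            return
        outq.put(fn(rec))

def run_pipeline(records, *fns):
    """Run records through a chain of operators, one thread each."""
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [
        threading.Thread(target=operator, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for t in threads:
        t.start()
    for rec in records:
        queues[0].put(rec)
    queues[0].put(SENTINEL)
    out = []
    while True:
        rec = queues[-1].get()
        if rec is SENTINEL:
            break
        out.append(rec)
    for t in threads:
        t.join()
    return out

if __name__ == "__main__":
    # parse -> transform -> enrich, all at memory speed
    result = run_pipeline(["1", "2", "3"],
                          int,                # parse
                          lambda x: x * 10,   # transform
                          lambda x: x + 1)    # enrich
    print(result)  # [11, 21, 31]
```

Because each stage has one thread and the queues are FIFO, record order is preserved end to end; a real engine like the one described would schedule many such operators across all the cores in the box.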
Exactly. But it's still tragically low. And so the idea here, if you're going to crunch big data and you bought the hardware, you might as well use it. And so the big idea behind DataRush is fully exploiting this magnificent multi-core gift that we've been given and, in a scale-up kind of way, fully wringing every ounce of performance out of that server. And of course then with Hadoop, you can spread that goodness across N number of nodes in your cluster. But I think it behooves all of us to worry about efficiency. And I think we're going to see a lot of investing. And you just heard Eric talk about next-gen MapReduce and the opportunity for alternative computational engines to drop into the API. MapReduce becomes an API, not a single go-to pattern and model. And DataRush, in fact, as a data-flow engine and a programming paradigm, is an ideal candidate. And we're actively working with Hortonworks to try and fit that hand in glove and get DataRush in as another alternative computational model and engine to live inside Hadoop. So we're very excited about the opportunity for DataRush and our secret sauce to give us continued scaling advantages and game-changing price-performance advantages.

We're excited about what you're saying. We've been covering in the Wikibon community... our CTO, David Floyer, has written a number of pieces on the whole changing I/O infrastructure, I/O architecture. In particular, you've got so much data coming out of these cores now, and the notion of having flash on the other side of the channel, to be able to capture some of that and write to a persistent resource, changes the way in which companies are really looking at developing applications, and the functionality and the value that's coming out of that is enormous. Does flash play into this?

It's not relevant to us.
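The "MapReduce becomes an API" point rests on how simple the MapReduce pattern itself is. A toy, single-process sketch of the pattern in Python (the function names are mine, not Hadoop's API; Hadoop distributes these same phases across a cluster):

```python
from collections import defaultdict

# Toy MapReduce: map each input to (key, value) pairs, group by key
# (the "shuffle"), then reduce each key's values to a single result.
def map_reduce(inputs, mapper, reducer):
    groups = defaultdict(list)
    for item in inputs:                  # map phase
        for key, value in mapper(item):
            groups[key].append(value)    # shuffle: group by key
    return {key: reducer(key, values)    # reduce phase
            for key, values in groups.items()}

if __name__ == "__main__":
    # The classic word-count example.
    lines = ["big data", "big fast data"]
    counts = map_reduce(
        lines,
        mapper=lambda line: [(word, 1) for word in line.split()],
        reducer=lambda word, ones: sum(ones),
    )
    print(counts)  # {'big': 2, 'data': 2, 'fast': 1}
```

Once the pattern is an API rather than the only execution model, an alternative engine (a data-flow engine, say) can run the same map and reduce functions on a different runtime, which is exactly the pluggability being described.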
I mean, if our customers adopt that, they do that for their own workloads, and workloads that are a little more I/O intense, a little more random-I/O, I think are able to really exploit flash. But the truth is, an enormous amount of what goes on in Hadoop is large, batch-style sequential processing, and for that, HDFS is pretty amazing. We've got tests in our labs where we're getting two to three gigabytes a second on a single node out of commodity spinning disk, and of course you get enormous terabytes at very reasonable dollars. This is one of the interesting things that's happening, I think, with the advent of Hadoop and a lot of what you see out here. It used to be one database, one hammer, everything's a nail, and whether you had OLTP or analytic workloads, you used the same database. And analytic workloads are actually different than OLTP transactional workloads. And that's where we have focused a lot. So we're really focusing on streaming, large-data-acquisition kinds of applications. You start with terabytes coming in, tens of millions, hundreds of millions, billions and tens of billions of rows streaming in from the left, going into something like HBase, for example. We're big fans of that column store that's inside Hadoop. So I think analytic workloads at scale, being able to exploit the distributed file system, the locality of the disk near the compute, really interesting columnar and parallel and resilient and temporal databases like HBase, that's the exciting thing to us in the Hadoop ecosystem.

So how about, I mean, two questions. Where do you play in open source? Where do you fit and tuck in, and how do you make money at this?

That's actually a good question, a vexing question for many people in the industry. I can tell you what we're doing. We toyed originally with thinking about open source; we're not. We're a closed-source shop, and we toyed a little bit with a community edition.
I think we're not going to go that way. I'll talk a bit about the product, and I think it helps you understand maybe the go-to-market then, for how we might make that a business. So we have the DataRush engine. We wanted to offer that to the marketplace. Of course we can sell it as a raw SDK, but people like to consume things at a higher order of tooling. We're familiar with SQL tools. It's a little bit like the SQL database wars in the late 70s and early 80s.

I was there, I remember.

It was Wild West, kind of like it is now. There was no obvious winner. There is now, when we look back.

So true. I was there too. You couldn't have predicted what happened.

No way. There were debates over which SQL dialect, and people were even debating whether SQL was the right standard. Now, 30 years later, it's cast in stone, but it took 30 years to get there. In those early days, like Hadoop, it was very immature. There was no tooling. There were no loaders and unloaders and reorgs, and the BI thing hadn't happened. So I kind of see that as the opportunity now. So what we're doing is taking the goodness of Hadoop and its scalability, taking the goodness of DataRush to get even more game-changing price performance out of that, and building a layer of tooling that is familiar to us, in the sense that we can use traditional SQL clients, but is radically more scalable and high-performing. What we're doing is taking a tool that we find, typically an open source tool, that we think is really good. We found a great data mining tool, for example. It has a beautiful UI, a great plug-and-play framework. It doesn't scale that well for really big data. Most things don't scale that well for really big data. So we kind of extracted their runtime, dropped in our high-performing runtime engine, and added our native bindings to Hadoop, HDFS, and HBase.
And so now we're showing in our booth over here a completely big-data-scalable, end-to-end pipeline for extracting and collecting and doing data prep and pre-computes and deep data mining, machine learning, and predictive analytics on data at scale, living inside or outside Hadoop. A product like that, which we call Rush Analyzer, we're going to go to market with by making it free to developers. So I think the game in software has changed. I've been in it a long time. I know that. And one of the ideas that I like is that developers are your friend. We eventually expect some kind of monetization off of people who go into production, but while they're kicking the tires and evaluating, we'd like to be as friendly as possible. So we will be a completely free model for developers, so they can download our raw DataRush SDK platform. They can begin to get some of this performance if they want, or they can adopt the higher-order tooling. So far we've just done Rush Analyzer, but we're looking at extremely advanced scalable business intelligence, monitoring applications, and all of them will be available free for developers. So we hope to get some adoption with that model and then find a production licensing model where we can embed our technology.

So just to summarize: it's put it out there, make it free for developers, make it sticky, hopefully, and then sort of figure out downstream how you're going to monetize it.

Yeah, I mean, at the end of the day they have to like it. You can't force it on them. So we want to make it easy to adopt and experiment with, and hopefully proliferate.

And you're betting that if they like it and it proliferates, you'll be able to make money off of it.

Exactly.

I buy it. That's good philosophy. It works. Jeff, go ahead.

Yeah, I'd like to... I wonder if you could talk a little bit about use cases. What kind of use cases do you think this is really going to support? What kind of new business models could it support?
Really take it from the technology to, okay, here's what it's going to do for my organization.

So that's a deep question, and I'll go sort of short-term first, and then get to what I think the real long-term implications are, because I think they're impactful in a way that not everybody appreciates. So you often hear the phrase big data, big data, and it's sort of a testosterone test: how big is your data? And that's sort of cool, but you're increasingly seeing big data and analytics stitched together. And I think data science and data analytics are in some ways the real promise here. We've had it for quite a while, but with the volumes of data from the different disciplines and domains that we're now collecting, I think we can do predictive analytics in a way that we haven't done before. So why are people excited? Because they can predict the future. I think we sometimes throw around the words data science and predictive analytics, but what does it mean? It means that, in a way that we've never been able to do before, we will attempt to predict the future, and get better and better at it with each succeeding generation. That's what predictive modeling is: it's passing through mountains and mountains of data to establish a model of behavior that you can use to predict the future. Not any different than the FICO score that we all have, to predict whether we're deadbeats or not, whether we're going to pay our mortgage or not. And I think that is the big idea here. Why is it society-changing? The investment everybody's making is, yeah, there's big data and there's undiscovered patterns, and we can begin to discover the real models and we can predict. What's really going on here is the societal shift from human-based decisioning to machine-based decisioning. I don't know if you saw the movie Moneyball; apart from being a great book and a great movie, it's really a great illustration of that shift away from the old system of decision making. How did we make decisions?
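The predictive-modeling idea described here, combing through historical data to fit a model of behavior and then scoring new cases FICO-style, can be sketched minimally. This is a pure-Python toy logistic regression; the features and data are made up for illustration and have nothing to do with real credit scoring:

```python
import math

# Minimal "FICO-style" predictive model: fit a logistic regression on
# historical outcomes with stochastic gradient descent, then score new
# cases as a probability. Illustrative only.
def train(rows, labels, epochs=2000, lr=0.1):
    w = [0.0] * (len(rows[0]) + 1)          # bias + one weight per feature
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability
            err = y - p                     # gradient of the log-loss
            w[0] += lr * err
            for i, xi in enumerate(x):
                w[i + 1] += lr * err * xi
    return w

def score(w, x):
    """Probability of the positive outcome for a new case."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

if __name__ == "__main__":
    # Toy history: (late_payments, debt_ratio) -> defaulted?
    history = [(0, 0.1), (1, 0.2), (0, 0.3), (5, 0.9), (4, 0.8), (6, 0.7)]
    defaulted = [0, 0, 0, 1, 1, 1]
    w = train(history, defaulted)
    print(score(w, (0, 0.2)))  # low probability of default
    print(score(w, (5, 0.8)))  # high probability of default
```

The "mountains of data" point is that with more history, models like this generalize better; the training loop is the same, the data volume is what changes.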
Well, you know, we scratched our chins; a wizened, grizzled veteran looked at something and said, yeah, I think that's a baseball player. Human intuition drives an enormous amount of the decision making that's going on out there. Even the most advanced domains, lawyering and doctoring, are still dominated by human intuition to a large degree. Nothing wrong with having that in the game, but what if we could augment that with predictive models? What if the gathering of data allowed us to build better and better predictors? Why did Netflix pay a million dollars to improve their recommender system by ten percent? Because it really makes a difference, and so businesses of all sizes will want to predict their own future. I think that's the big idea here. And if I can close the loop: if you're building these predictive models, you can tie them back into the real-time decisioning systems eventually, and so the big dream here is that we can get better and better at building our models, and then apply those models in our day-to-day decisioning systems, which are mixed man and machine, and increasingly have machines help us, augment us, make better and better decisions.

The challenge with predictive models in the past has always been that the assumptions really dictate the quality of the outcomes, and many predictive modeling initiatives didn't live up to the hype and the promise because the assumptions weren't all they were cracked up to be. So now it's a function of the data; we're not doing a top-down approach anymore, we're doing bottom-up: no sampling, the entire data set.

I could not agree more. You made a point that is the religion inside Pervasive Big Data. There have been enormous compromises made in data science in the past. We make compromises: we sample the data, because if you ran all the data the job would run for six hours.
That just means you have bad hardware and software.

We have to compromise and shrink the data, miss black swans, and fail to make better predictions. We have to make compromises around the science: we can't use the best algorithms, we have to use fuzzy approximation algorithms, because the true science would take too long. We're doing some work with the University of Texas around genetic pipelines and analytics, and the best genome sequencing algorithm isn't used because it would run for 20 hours, so they use an approximation that isn't as good. And so compromises have crept into our science, maybe because we can't process all the data. As the data volumes grow, it's a more demanding problem, and so the investment we're making is so that you can actually scale, at a reasonable price point, for all the data and your best science. And I think, as Google has proven, more and more data makes better decisions; in fact, they famously said bigger data trumps better algorithms. I think the right answer is both: no compromises on the data side, no compromises on the science side. And I think we'll see predictive modeling come back from the black eye that it had, as you pointed out.

And you can actually simplify the algorithms if you have a lot of data, and normal people can actually conceive of things that you could do to the data that could have valuable outcomes. And I think the difference this time around is, even though we might not get it right the first time, you can iterate very quickly, because the data stack has been commoditized and technology is not the gate anymore. Do you agree with that?

Oh, that word iterate is very important to us. The essence of science is experimentation and iteration, and we talk to scientists regularly who are frustrated because in a two-hour or three-hour run time they just can't really do their science. They have to make compromises.
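The sampling compromise mentioned above, shrinking the data and missing the black swans, is easy to demonstrate. A small synthetic illustration in Python (the numbers are invented for the demo):

```python
import random

# Demonstrate the "sampling misses black swans" compromise: a rare
# extreme event present in the full data set can vanish entirely
# from a modest random sample.
random.seed(42)

# Full data: 100,000 ordinary values plus a handful of extreme outliers.
full = [random.gauss(100.0, 10.0) for _ in range(100_000)]
full.extend([10_000.0] * 5)          # the black swans: 5 in 100,005 rows

sample = random.sample(full, 1_000)  # a ~1% sample, the classic shortcut

print(max(full))    # 10000.0: the full data always sees the black swan
print(max(sample))  # very likely a value near 100; the swans are missed
```

The expected number of swans in a 1,000-row sample here is about 0.05, so the sample almost always reports a world with no extreme events at all, which is exactly the "shrink the data, miss black swans" failure mode.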
So if we can collapse the run time, if we can give them scaling that goes linearly regardless of the data volumes that they have, then they can begin to improve their science in ways that they couldn't even imagine so far. So no compromises on your data volumes and your science allows you to iterate that much more frequently, and science results come from that constant loop of iteration and improvement.

What about... you mentioned that today the vast majority of business decisions are made on intuition. Changing that is going to take technology, some of the things that you're talking about and doing at Pervasive, but it's also a cultural shift. Can you talk about that other side? How do you get people to change their behavior? To stop relying on their gut and start believing in the data? Some people have been at their job 25 years doing it the same way, making decisions the same way, with intuition, with their gut, and it's not easy to just make a switch just because you now have a piece of technology that can help you. How do you address that?

These, as you know, are the hardest problems; sometimes technology is easier than culture in terms of orientation. Take that movie again, Moneyball. He chose it out of desperation. His budget is $40 million and he's competing against $170 million. He can't compete, and he has to look someplace else. I think we're going to see innovation from start-ups. We're going to see innovation from companies coming in from the side. The Innovator's Dilemma is the famous example: maybe the large players don't necessarily adopt the latest technologies as fast. Maybe it comes from left field. And so I think you'll see newer companies lead; certainly if you look at internet businesses, they've adopted this from the very beginning.
Every kind of marketing and web marketing analytic process is essentially inviting more data in, and inviting better processes and models, so that I can make better predictions about what display ads to show, because they can directly prove the monetization. I'm a little less worried about business adopting this. Business adopted BI in a big way. They like performance monitoring. They like dashboards. Most businesses are very motivated to make better decisions, and I think if you show them concrete results... The culture takes time, and it won't work every place, but I'm pretty optimistic about this important change, bringing this advanced science of data mining into day-to-day business processes. It's a multi-year process, but I'm optimistic that we can do it.

Mike Hoskins, a very creative software guy. Exciting times. It's great to see a 30-year-old company innovating by spinning off pieces of the organization, focusing people on the problems, and solving them. Thanks very much for coming on theCUBE. It was great to have you.

My pleasure. Thank you.

We're going to take a quick break. Please take a look at this ad from 1010Data, and we will be right back.