Live from New York, extracting the signal from the noise. It's theCUBE, covering RapidMiner Wisdom 2016, brought to you by RapidMiner. Now, your hosts, Dave Vellante and Jeff Frick. Welcome back to New York City, everybody. We're here at RapidMiner Wisdom 2016. Ingo Mierswa is here, the CTO and founder of RapidMiner, the heart and soul of this company, the star morning speaker. Great to see you. Thanks for coming back on theCUBE. Oh, thanks for having me here. So I want to talk about all the things that you didn't want to talk about: vision, why you're awesome, why you're special, roadmap, all that cool stuff that your audience, you know, the core data science crew, knows very well, but our audience may not. Anyway, welcome back. It's great to see you. We can talk about aliens if you want. We can do that as well. First question: why did you start the company, RapidMiner? It's actually really interesting, because I was a data scientist myself, though it wasn't called that back then, this is years ago. I was on a consulting project with a telco in Europe, solving data science problems, mainly around churn, and I tried to communicate those results. So I was trying to explain to business people: well, this is what the model is doing, and this is what you should do next to actually get to lower churn rates. And they didn't get it. I don't blame them. I made all the mistakes you could imagine in communicating those results, but at the same time, you also need the right communication device to some degree. Code is not a great communication device; you can't explain to a business person how a model works by showing them lines of code. So I needed to build something visual, and at the same time I wanted to take the chance to build as many reusable things as possible to make our lives simpler. So I looked into the market, I didn't really like what I saw there, and at one point we decided: let's build it. And that's what happened.
You just answered my next question. We heard a lot this morning about the dissonance between the data, what it implies, and the conclusions that people make in reality. You used the great example of the big spike in alien sightings on July 4th. Because of course, aliens love America. Yeah, exactly, that's the reason. So you really wanted to close that gap. That's exactly the point. And it's so easy to make all kinds of mistakes around data, around data science really. That's what I tried to bring across this morning as well. If you just look at the number of UFO sightings over the year, you will see that around July 4th, on that evening, there are more than at any other point of the year. So now if you think about this: is this really because the aliens are coming down to Earth to look at the fireworks? Probably not. Probably the real reason is that people look at the fireworks and think, wait a second, isn't that a UFO? And then they report it. So there's this good old topic of correlation versus causal relationships: is there causality or is there not? And it's sometimes difficult to figure out. But if the way to get there is already difficult, you never even reach the question that's actually important. So you need to get quickly to the point where you can actually extract those insights and findings. Then you need to interpret the data, and the more you know about the world, the better results you will get. And then the last, most important step: you need to do something with those insights. You asked for the reason; the two reasons were communicating the results, the machine learning models, and reusing things to become more efficient.
And the last thing, really, why we built this whole RapidMiner story, was because we wanted to build a platform where you can take those models, those insights, and transform them immediately into something that actually triggers an action, something that taps into your business process and does something to optimize it. Dr. Weissman this morning showed that nice quadrant, and of course you always want to be in the upper right, where the digerati were. But most practitioners complain about the complexity of doing predictive analytics. They spend all their time cleansing data, getting the data into a shape where it can fit into the algorithm, while data scientists want to bend the algorithm so that it can better fit the data. So where are we in that cycle? Dave, I think the most important thing is that this problem is unfortunately not going away, because there will always be new data sets, new data sources, new data structures. A couple of years ago nobody was really analyzing text data. Now we all do, well, not all, but many people are doing this. Now we have new kinds of information: sometimes you analyze social media data, you started with the tweets, and now you realize maybe you can take something out of the images. So you add more and more data sources, and this will never stop. So unfortunately I can't tell you this kind of fairy tale where, a couple of years from now, everything happens automatically; I don't believe in that. We can support people as much as we can. One of the elements we came up with, one of the basic ideas of RapidMiner, was the concept of the wisdom of crowds: we actually look into what other users are doing, how they solve their data prep problems, how they set up their machine learning models, and then we share those insights with all the other users, so you don't need to go through all the steps yourself.
You can pick things up from the other people who have been there already. That's a great idea, but at the same time, every organization is special, every data source is special. So you will need to adapt things, and there's actually even a scientific result, the so-called no free lunch theorem, which says there is no single best algorithm for every problem in the world. You will always find a data set or a problem where your best algorithm so far completely fails, so there's no way around manual optimization, and sometimes even manual trial and error. But that's okay; let's just make that experience as simple and joyful as possible. It means we need humans for a while anyway. I agree. So that says that future improvements in predictive analytics solutions are going to come both from data sources, more data sources, internal, external, different types of data, and from improved algorithms. Is that right? Absolutely, and I'm not even sure which is progressing faster. I think it's almost a race, it's amazing. And then think about the whole big data explosion. What it really did for us as data scientists is not so much about storage. It's actually about having a distributed compute platform. So now, all of a sudden, those algorithms we have known sometimes for decades, let's keep in mind machine learning has been around for decades now, all of a sudden we have the platforms where we can compute much more in parallel, but then we need to rethink those algorithms. We can't just take the old stuff and run it on a distributed platform; we sometimes need to redevelop it. Well, there's an analogy here. For decades this industry has marched to the cadence of Moore's law, but our computers don't feel any faster because we keep putting more applications on them, so everything moves in lockstep.
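The no free lunch point can be made concrete with a tiny, self-contained sketch. Everything here, the two toy classifiers, the two synthetic data sets, and the leave-one-out evaluation, is invented for illustration, not RapidMiner code: a simple axis-threshold learner is perfect on linearly separable data yet drops to chance on XOR-style data, where a nearest-neighbor learner succeeds.

```python
import random

def linear_clf(train, point):
    # A deliberately simple learner: classify by the sign of x alone.
    return 1 if point[0] > 0 else 0

def nn_clf(train, point):
    # 1-nearest-neighbor: copy the label of the closest training point.
    nearest = min(train, key=lambda t: (t[0][0] - point[0]) ** 2
                                       + (t[0][1] - point[1]) ** 2)
    return nearest[1]

def loo_accuracy(clf, data):
    # Leave-one-out evaluation: predict each point from all the others.
    hits = sum(clf([d for d in data if d is not held], held[0]) == held[1]
               for held in data)
    return hits / len(data)

random.seed(0)
# Dataset A: the label depends only on the sign of x (linearly separable).
sep = [((x, random.uniform(-1, 1)), 1 if x > 0 else 0)
       for x in [random.uniform(-1, 1) for _ in range(40)]]
# Dataset B: XOR-style labels; no single-axis threshold can separate them.
xor = [((sx * random.uniform(0.5, 1), sy * random.uniform(0.5, 1)),
        1 if sx * sy > 0 else 0)
       for sx in (-1, 1) for sy in (-1, 1) for _ in range(10)]

print("linear on separable:", loo_accuracy(linear_clf, sep))  # 1.0
print("linear on XOR:", loo_accuracy(linear_clf, xor))        # 0.5
print("1-NN on XOR:", loo_accuracy(nn_clf, xor))              # 1.0
```

No single learner wins both data sets for free, which is exactly why some manual trial and error with algorithms remains unavoidable.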
So the engine of innovation, at least in this world, is really the data sources and the algorithms moving in lockstep. Exactly, exactly. And that's really this race we see. I sometimes compare it to what happened with Deep Blue, when Deep Blue beat Kasparov in the chess championship. What happened back then was not that IBM came up with smarter artificial intelligence algorithms. They had the same algorithms as before, but they just threw more horsepower at it. More computation power. Just enough to outperform Kasparov. But Kasparov was actually the better chess player, and probably still is today. Well, the interesting part of that story is that the best chess player in the world is not a machine. Yeah. Right, it's humans plus machines. Yeah, I agree. Kasparov started that kind of competition, right? That is absolutely true. So I think around data science we're in a similar situation right now. Right now we get a lot of improvements thanks to distributed computing, more horsepower, more computation power, which is absolutely great. But at the same time, we of course also need to come up with new algorithms, with completely new, innovative ideas we might not even know about today. But isn't that part of the promise of the citizen data scientist right now? Yeah. If you're exposing that capability, some version of those algorithms, to a broader set of people, then just by pure numbers you should get some new variety of outcomes, new ways of looking at problems, and really start to not just have faster computers or better algorithms, but actually twist the lens through which you perceive what you're looking at. I honestly believe that, yes. I think just a couple of years ago we maybe had, globally, maybe 5,000 people who really did machine learning in practice, and maybe 5,000 more who did it in research, and that's it.
Today we already have hundreds of thousands of people. I think in a very short amount of time we will have millions of people. And I don't think that every Excel user in this world who works with data will become a machine learning expert or a data scientist. But we will push the envelope here to quite a large degree. So the first result of that is going to be more use cases. People will solve new problems. Then the next step is: how can we solve them? The first step, I would expect, is by combining algorithms we already have. And combining existing things is a good thing, a normal thing. That's actually how we've developed over the past centuries. We're not inventing something completely new every single time: we already had the wheel, now we have an engine, well, let's build a car. So that's a very normal progression, and that's exactly what I expect. First use cases, then combinations of existing algorithms, and those will become the new generation of algorithms. And that's driven by this explosion of people doing this, yes, I believe that. We heard some discussion this morning about companies being data driven. We saw some data on what percentage of companies are data driven, and an implication that the digerati who are data driven are more profitable and so forth. One of the challenges that we hear a lot in our community is that people are trying to balance the focus on improving a business outcome, reducing churn as you talked about, maybe reducing fraud, versus becoming data driven, improving their analytics capabilities. What are you seeing in the community in terms of how people are balancing those challenges? Probably almost every company you would ask today, almost every CEO you would ask today, will tell you: absolutely, we are data driven. Yeah, yeah, yeah, yeah. We know that. Well, it is unfortunately not exactly true. I do believe that.
So now I think people have at least realized the need for really looking into the facts, looking into the data, and for stopping making gut-based decisions in every single situation. Because frankly, there will always be room for gut-based decisions. There has to be. But in situations where the decision is small but happens thousands, maybe millions of times per day, you can't ask people. Organizations realize this more and more now. Wait a second: maybe I can't use all those data-driven approaches for making the big strategic decision of whether I should acquire this company or that company. There is some data going into that, but at the end of the day there's also chemistry, culture, and so many other aspects you're not even measuring. So at the end of the day, there will be human beings making that kind of call. But on the other hand: maybe I should offer a discount to a customer of ours as an incentive to stay with us, to reduce our churn rate. Well, is a discount the right way? Should it be 3% or 5%? Is the price maybe not the issue at all, but rather the quality of the customer service? If you go down to that level, and that's another thing big data brought us, really having a lot of raw data at a very fine level, if you go down there, gut feeling would still work in theory, but first of all it's too complex, and second, there are too many decisions. And that's what I really call data-driven: automating decisions where you can, typically getting to a better aggregated outcome. And that's something I don't see often enough. I think we will end up there, because there will be competitive pressure. Some are doing this, and all the others will need to follow. But as of today, most people try to support their gut-based decisions with more data. That's why we saw BI reports first. Now we will move more towards predictive analytics.
And I think the natural next step is what we can call prescriptive analytics, where it's really all about: great, now I know what's going to happen. Now tell me please, dear algorithm, what do I need to do to get to the best outcome? And then the next logical step is: oh great, then just do it. That's what I expect, and that's truly data-driven. It's interesting, because you just described ad tech. It's interesting that ads and serving ads, because of the competitive nature, because of the demands and the speed, are really pushing the edge on, as you said, a lot of very quickly made decisions that there's just no way a person can get involved in. Exactly, this is a perfect example. Ad tech is a use case where it's immediately clear: no human being could make the millions, hundreds of millions of decisions every day needed to deliver the right ad to the right person at the right time. Exactly. And what's actually the value of this ad? Exactly, doing the bidding, everything. So you need to automate this. But think about larger enterprises. Think about an AT&T. I don't even know exactly how many customers they have, but let's say 50 million, just for the sake of the argument. You can't make the decision for every single one of your 50 million customers: ooh, is there a risk that this customer is churning tomorrow? There's no way. So even if you're not going down to this very deep technical level, just thinking about customers, at scale it no longer works. And this is exactly the thing we are really interested in: how can we create models, and how can we, what we call, operationalize them, so that you can actually embed them into your business processes? Fraud detection is another perfect example. I think it's 2.5 trillion credit card transactions every year. Who's looking into them? The machine; there's no other way.
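The prescriptive step Ingo describes, going from "this customer might churn" to "here is the action to take", can be sketched as a per-customer expected-value rule. All the numbers and customer IDs below are made up for illustration, and the churn probabilities would come from an upstream model in practice.

```python
# Hypothetical numbers for illustration: a customer's annual value, the cost
# of a 5% retention discount, and how much that discount cuts churn risk.
ANNUAL_VALUE = 600.0
DISCOUNT_COST = 30.0      # 5% of annual value
CHURN_REDUCTION = 0.40    # assumed relative drop in churn risk if discounted

def best_action(churn_prob):
    """Pick the action with the higher expected customer value."""
    ev_do_nothing = (1 - churn_prob) * ANNUAL_VALUE
    ev_discount = ((1 - churn_prob * (1 - CHURN_REDUCTION)) * ANNUAL_VALUE
                   - DISCOUNT_COST)
    return "discount" if ev_discount > ev_do_nothing else "do nothing"

# Churn probabilities as a predictive model might score them (made up here).
customers = {"cust_a": 0.05, "cust_b": 0.30, "cust_c": 0.80}
for cid, p in customers.items():
    # Low-risk cust_a keeps full price; the riskier two get the discount.
    print(cid, best_action(p))
```

With these assumed numbers, the discount pays off exactly when the churn probability exceeds 12.5%, and that threshold is what lets the decision run automatically for every one of the 50 million customers rather than being made by hand.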
Yeah, and I think, given some of the assumptions you made earlier, it's early days. Things are improving. But then we saw some big ROI numbers this morning. And of course at an event like this, you want to project the best ROIs. But as you say, not all organizations are data-driven; most aren't. So on balance, you don't see those types of ROI. My question is, again, among the practitioners that we talk to, there are sort of two vectors. One is the business outcome and one is the technical feasibility. And it seems like many of the initial predictive projects have been based on the technical feasibility rather than the business outcome. Do you see that? Is that a fair assessment, and is it changing? I think it depends on the organization. Some have the business outcome in mind from the start, which I think is always a very promising beginning. Because ultimately, if it doesn't work for your use case, it just doesn't work. So you shouldn't do it just because it's cool, or just because you can. You should really start with a use case in mind, one that is really strategic for you, and define the outcome: what does success mean versus not? And then you actually start the project. Every good data scientist should do this. On the other hand, yes, it's new technology, so people are excited about trying things out and figuring out, well, could this be something for me? In fact, as I was mentioning, the early days of RapidMiner were really a non-strategic initiative at first. This is years ago, but still, the idea was just: let us explore how much we could reduce our churn rate if we did something predictive. Nobody at that point in time actually planned to deploy anything, but that's seven, eight years ago. Today, things have changed. Still, yes, I'm with you: there are a lot of technical initiatives, maybe too many.
But I also see that practically all the conversations we have with prospects and customers are now really circling around: okay, how do we measure success? How do we actually make the point that, well, this now has an ROI of X? That didn't happen two, three years ago. Two, three years ago, it was good enough to just say: well, this is something innovative. Today, you need to make the point. So we always try to make this a proof of value rather than a proof of concept. We don't need to prove the concept: neural networks have been around for 30 years, support vector machines for 15. The mathematics didn't stop working all of a sudden. But we need to prove to your organization that there is true value in using those machine learning methods for your organization. That's what's important, not the concept. We know the concept. We're out of time, but I have to ask you about some of the things you didn't want to talk about; maybe there's time for one. What makes you guys different? What makes you special? It's a crowded space. What makes RapidMiner so special? I think it's, as always, a combination of things. To make it a short answer, I think it's the fact that it's complete. It's a complete platform, from data ingestion to modeling to deployment and operationalization. Most platforms stop after the modeling. They don't even do the right validation, that is, figuring out how well the model will actually work. And I'm not feeling comfortable if I don't know for sure whether this model is going to deliver 90% accuracy in the future. So it's really this complete spectrum, what I sometimes call the analytics lifecycle, that's supported. At the same time, despite the fact that it does so much, it's extremely easy. And the last thing is, we also try to bring the knowledge, the experience, the expertise of internal and external data scientists into the platform.
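The validation Ingo insists on, knowing before deployment how well a model will work on future data, is commonly done with k-fold cross-validation. The toy learner and synthetic data below are invented for illustration and are not RapidMiner's implementation; the point is the evaluation pattern.

```python
import random

def cross_val_accuracy(fit, data, k=5):
    """Estimate future accuracy with k-fold cross-validation:
    train on k-1 folds, score on the held-out fold, average the scores."""
    random.shuffle(data)  # note: shuffles the caller's list in place
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        model = fit(train)
        scores.append(sum(model(x) == y for x, y in test) / len(test))
    return sum(scores) / k

def fit_threshold(train):
    # Toy learner: predict 1 when x exceeds the midpoint of the class means.
    ones = [x for x, y in train if y == 1]
    zeros = [x for x, y in train if y == 0]
    cut = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2
    return lambda x: 1 if x > cut else 0

random.seed(1)
# Synthetic data: class 0 centered at 0.0, class 1 at 1.0, with some overlap.
data = [(random.gauss(y, 0.3), y) for y in ([0] * 50 + [1] * 50)]
acc = cross_val_accuracy(fit_threshold, data)
print("estimated future accuracy:", round(acc, 2))
```

Because every prediction is scored on data the model never trained on, the averaged number is an honest estimate of future accuracy, which is exactly the guarantee a platform should produce before a model is operationalized.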
So you stay in this world, and you basically get support from the user community. It stays relatively easy, given the complexity of the topic. You've got a vibrant community, so congratulations. Thanks for saying that. Yes, I agree. And best of luck. I really appreciate you coming on theCUBE. Oh, thank you. Thank you. All right, keep it right there. We'll be back with our next guest right after this. This is theCUBE, live from RapidMiner Wisdom 2016. We'll be right back.