Live from New York, extracting the signal from the noise. It's theCUBE, covering Spark Summit East, brought to you by Spark Summit. Now your hosts, Jeff Frick and George Gilbert.

Hey everybody, Jeff Frick here with theCUBE. We are live in Midtown Manhattan at Spark Summit East, day two of our wall-to-wall coverage. It's been a great show. We're really excited to be here. Last year we did kind of a drive-by in San Francisco at the West show. Really excited to get the whole CUBE here in Manhattan, and the sun has come out. It's as if we're in California, George. I'm joined by George Gilbert from Wikibon, and our next guest is Grant Ingersoll, the CTO of Lucidworks. Grant, welcome.

Thank you, Jeff. Thank you, George. Great to be here.

Absolutely. So search-driven everything, that's what we see. What does that mean? Give us kind of the quick overview on Lucidworks.

Yeah, search-driven for us is, I think, this evolution of data applications that we're seeing more and more of. A lot of people, when they think search, they think Google, they think ten blue links on a webpage, but over the last five years this has really evolved. We take in all kinds of different data sources, spatial, numeric, text of course, enumerated types, all of these things, mash them all together, and provide to your application a ranked set of results that says, here, amongst your data, is what's most important. Here's what you should care about right now. So when you start to build applications that way, you start to realize these are all search-driven applications. I want to take and drive that importance into my application. I want to bring in that kind of fuzziness that a lot of developers aren't always comfortable with, and drive better answers out of my data off of that.
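To ground what a "ranked set of results" means in practice, here is a minimal sketch of a search-driven lookup: a toy in-memory inverted index in pure Python. The documents, scoring, and function names are illustrative assumptions, not Lucidworks' implementation.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Rank documents by how many query terms they match (a toy relevance score)."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    # Best matches first: "here amongst your data is what's most important"
    return sorted(scores.items(), key=lambda kv: -kv[1])

docs = {
    "a": "spark cluster health metrics",
    "b": "spark machine learning models",
    "c": "restaurant reviews times square",
}
results = search(build_index(docs), "spark machine learning")
```

Real engines like Solr replace the match count with TF-IDF or BM25 scoring, but the shape of the answer, a list ordered by relevance rather than a definitive single row, is the same.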
Yeah, it's interesting, because even though the Google analogy's not exactly one-to-one, we talk about the consumerization of IT all the time, and the expected behavior comes from my interactions with all the apps in my day-to-day life. I'm expecting the right answer to be pretty close.

Exactly.

And be served up to me with not a lot of effort on my part.

Yeah, when you think about it, search permeates every part of your life. How did I get here today? I searched for it. How did I find an airplane to get up here from where I live? I searched for it. We as consumers expect that, and so what we're trying to do is drive more and more of that feeling into the enterprise.

You touch on the word feeling. Developers have grown up now for a couple of decades expecting to get definitive answers in their applications and provide those to the end user based on a database, almost always a relational one. Consumers maybe are more comfortable with a ranked answer. How do you get application developers to wrap their heads around that?

Yeah, it's a great question. I can't tell you the number of times I've been with a developer who comes from the database world, the relational world, and you can almost put it on a calendar: X number of days into dealing with this new search technology, they have this epiphany moment and say, ah, I get it now. I get what this scoring is. I get what this fuzziness is, and they start to embrace it. The interesting thing is, I'll follow up with them a year or two later and they're like, yeah, I haven't done SQL in a year. All I do is search-driven applications, because it's so much more responsive. You get that here's-what-I-need-to-know-right-now, as opposed to here are all 500 things I need to go look through to figure out what's the best answer, right?
The machine is helping me, and in the world of big data this is ever more important. Google showed us the way; now apply that across a lot of other places.

Right, and the other thing that we keep hearing over and over again is that we're in a world without sampling, a world where we can afford to hold everything. That puts even more impetus on our trusted intermediaries, our trusted filters, to help us find the right answer, because it's no longer some subset, some sample.

Yeah, and you have to bring in things like machine learning and natural language processing. For us, search is this really great way to surface the information to the users. But in the background, in the context of Spark Summit here, we're leveraging Spark to drive building out machine learning models, building out statistical models, where you're constantly updating your understanding of the data. You're trying to create this virtuous cycle between your users and your data. The more the users interact with you, the better your analytics get.

Are you saying that, because we've heard this about Google, that Google's advantage isn't just the PageRank algorithm, it's the users interacting with it and saying, I choose this link over that link. Are you saying that search helps refine the machine learning algorithm itself?

Yeah, exactly. It's that collective feedback loop that you're looking for, right? When I think about relevance and importance, I like to say there are the three Cs. One is content, so that's the traditional keyword matching. The second is collaboration. What are other people doing with this data? The fact that you clicked on it and he didn't, that's useful information that can help inform me. The last is context. Who are you? Where are you? What activity are you doing? Are you in Times Square?
Are you searching for a restaurant? Or are you looking for bad guys in a terrorist network, right? How can we bring in all of those? By combining those three Cs together, now you've got a really nice way of ranking your information and saying, according to those parameters, here's what's important to you.

And we're here at Spark Summit. So how about Spark, then? Tell us where you are in the implementation and how it's a game changer.

Yeah, for us, Spark has been great. We ship it natively in our product. We publish open source connectors between Spark and Solr, so you can query and interact with Solr directly from Spark via the RDD framework and DataFrames and all of those kinds of things. We have data locality awareness coming in our product as well, so that you can co-locate your Spark workers with your Solr nodes and be really smart about where the data is. We use it in a lot of ways, ranging from low-level DevOps capabilities, where we're tracking all of our system metrics, and those all go into Solr. Things like garbage collection, all of those events, we then use Spark to roll up and present as dashboards to the DevOps team that say, here's the health of your entire search cluster and Spark cluster. We also make it easy for you to index and search over events that are happening, like those clicks that we talked about, or add-to-carts, or purchases, or whatever, and we use Spark to build machine learning models off of those things. We're also just about to release a bunch of integrations between Lucene and Spark, so that you can use Lucene's very rich language analysis capabilities as part of your Spark workloads. I think that will help a lot of people who are dealing with text-based content in their Spark workflows.

So over time, what type of stack do you see emerging? For years it was the LAMP stack.
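Grant's three Cs (content, collaboration, context) can be sketched as a weighted blend of per-result scores. The weights and example numbers below are hypothetical, chosen only to show the mechanic, not anything Lucidworks ships.

```python
def relevance(content, collaboration, context, weights=(0.5, 0.3, 0.2)):
    """Blend the three Cs into a single ranking score.

    content:       traditional keyword-match strength
    collaboration: what other people did with this item (e.g. click rate)
    context:       fit to who/where the user is right now
    """
    w_content, w_collab, w_context = weights
    return w_content * content + w_collab * collaboration + w_context * context

# Two candidates: a strong keyword match vs. an item other users clicked a lot
keyword_heavy = relevance(content=0.9, collaboration=0.1, context=0.5)
crowd_favorite = relevance(content=0.6, collaboration=0.9, context=0.5)
```

With these made-up weights, the heavily clicked item outranks the pure keyword match, which is exactly the "your click helps inform me" effect described above.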
We have, as an anti-Hadoop stack, the SMACK stack. Do you see yourself becoming part of a stack, or many stacks, where there are these key services that are needed, the ones we're familiar with here with Spark, which has the machine learning, the SQL query, graph processing, but that also needs the NLP and search and relevance? What would that look like, and what are the apps that you would build on that?

Yeah, that's a great question. I started my career doing parallel, distributed electromagnetic simulations. At the time it was High Performance Fortran, right? That was the stack we built on. This space is constantly evolving. Right now Spark is a really good fit. Two years ago it was MapReduce. The whole distributed space, how do we make it easier for programmers to build distributed things, is changing so rapidly. People probably have whiplash right now going from MapReduce to Spark. So to predict what's going to be in a couple of years, that's a pretty hard one. I think the overarching trend you've really seen is that we know distributed programming is a hard and challenging thing, so we want whatever tools we can put together that make it easier. For us right now, that's Spark plus Solr, and then we layer things like MLlib on top. I'm also the creator of Mahout, so of course I want to give a shout-out to the Mahout project, which now runs natively on Spark. There are lots of great NLP tools out there. For me, the stack that's evolving when you're dealing with all these different data types is tools that help me deal not only with the distributed nature and the scale, but also with that fuzziness factor that we were talking about, George. Because I think that's also one of those things that most developers don't have good core training in, right?
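The "rich language analysis" that Lucene contributes to a Spark workload, mentioned a moment ago, can be sketched very loosely as a chain of token steps. This is plain Python standing in for Lucene's Tokenizer-plus-TokenFilter pipeline, not the Lucene API, and the stopword list is an assumption.

```python
STOPWORDS = frozenset({"the", "a", "an", "of", "in", "and"})

def analyze(text):
    """A toy analysis chain: tokenize on whitespace, lowercase, drop stopwords.
    Lucene composes the same kind of steps as a Tokenizer plus TokenFilters,
    and adds stemming, synonyms, language-specific rules, and more."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

tokens = analyze("The Health of the Spark Cluster")
```

Running such a chain inside a Spark job is what turns raw text-based content into features a model can actually use.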
Those aren't things you necessarily learned in your computer science class back at university. Nowadays, maybe, but those people are still coming online.

We've been on this quest, like King Arthur and the Knights of the Round Table, going after the Holy Grail, and they all die, or almost all of them die, and they can't find it.

Just a flesh wound, George.

Yeah. Ha ha ha. The Knights Who Say Ni. That's right. So this quest that we're on, I'm not sure it's really holy, but we're trying to find in this...

Wholly with a W-H, maybe. Or holey, H-O-L-E-Y.

All of the above. We're trying to understand, out of this primordial soup, where we are going to see big, and by big I mean applications that manage a big chunk of the value chain. Not something that says, oh, say this to a telco customer who might churn. I'm talking not quite to the level of ERP, but the question for you is, tell us about a class of apps that could use this probabilistic, fuzzy-oriented, relevance-ranking type of engine.

Yeah, our primary use cases are e-commerce and consumer-facing, like self-service support portals; we support all of the big retailers, all of that kind of stuff. So e-commerce and consumer-facing, where you're constantly experimenting with the way users are interacting, that's front and center for these kinds of applications. When you go behind the firewall, it starts to get more interesting, though, when you look at knowledge management applications, especially in large organizations. There your data is more subtle, perhaps; you don't have as many user interactions, or you don't think of them in the traditional way. But you start to mine things like the fact that you and I have been emailing, right?
That's a signal, that's an event we can learn from, to figure out, hey, perhaps you're the authority within my organization that I should be talking to about Spark. And so I can build up an expert network within my organization that's built off of these kinds of things. You also see it in life sciences and financial services, where we do a lot of work, and where you're trying to piece together a lot of disparate data types. In financial services, it's often around fraud analytics and compliance, where, for instance, we're tracking that you are emailing with somebody, we're tracking your phone calls, we're tracking the trades that you made, we're tracking which rooms you entered in the buildings. Because the bad guys aren't just doing one obvious bad thing anymore; they're much more sophisticated, and so it's like a puzzle that you have to put together, and you need machines to help you figure out what that puzzle is and say, here, as a human, you need to go down this path. We think this is worth you exploring more. So all of those are great fuzzy applications. Same in the life sciences: I've got an idea that this particular molecule or drug might be applicable over here. What's all the research out there? What have we done in the past? Trying to mash all of that together is a really interesting and challenging problem.

Will that ever transition from fuzziness and likelihood to deterministic, where you hard-code it into a workflow? Or will it always be "we think," as opposed to "it is"? And I don't mean that as a bad thing.

Yeah, I know.

But now, as you said, there's value in that "we think." Before, there was only value in the transaction, right? Now it's these softer transactions that have the value.

I totally agree.
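The expert-network idea, learning who the Spark authority is from communication signals, can be sketched by counting topic-tagged interaction events per person. The event data, topic tags, and function names here are hypothetical; real systems would infer topics from the content itself.

```python
from collections import Counter, defaultdict

def experts_by_topic(events):
    """events: iterable of (person, topic) signals, e.g. an email or a
    document edit that has been tagged with a topic. Returns, per topic,
    people ranked by how often they appear in that topic's signals."""
    counts = defaultdict(Counter)
    for person, topic in events:
        counts[topic][person] += 1
    return {topic: c.most_common() for topic, c in counts.items()}

events = [
    ("ana", "spark"), ("ana", "spark"), ("bo", "spark"),
    ("bo", "compliance"), ("cy", "compliance"), ("bo", "compliance"),
]
ranking = experts_by_topic(events)
```

The same aggregation over emails, calls, trades, and badge swipes is what lets the fraud-analytics case surface a suspicious pattern that no single event would reveal.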
I'm just wondering, do they ever convert, not into something better, but into something that's just deterministic?

Yeah, if there's one thing you could say about the human condition, it's that there's a lot of fuzziness and ambiguity built into who we are. So will it ever go away? You say potato, I say potato, right? Those things are, I think, inherent in this.

Or with our old DP, and Jenny.

Depends how you want to spell it, right? But the interesting thing is, with this machine learning and AI stuff, there are going to be some really interesting practical applications where the machines can make better and better decisions. But there's always a fuzziness, I think. I don't think that ever goes away, unless somebody invents a whole new level of understanding of statistics that I'm not aware of. For now, I think it's something that fits in, but the edge cases, maybe we can try to shrink the edge cases, because that's ultimately what you want, right? There's almost always some obvious "this is good," and some obvious "this is bad." It's that gray area where we have to be involved. We know in credit transactions a whole lot of them are good, a whole lot of them are bad, and some of them are gray areas. And that's where you get a phone call from your bank saying, hey, could you validate this?

But that is pretty profound. You're saying we need a new stack for applications that don't have a definitive yes or no.

Yeah, I think that's what you're seeing evolve with the likes of Spark and these machine learning algorithms: more interest from developers in how machine learning works.
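The good/bad/gray-area split Grant describes for credit transactions can be sketched with two score thresholds. The threshold values below are made up for illustration; in practice they would be tuned against the cost of fraud versus the cost of bothering customers.

```python
def triage(fraud_score, decline_at=0.9, review_at=0.6):
    """Map a model's fraud score to a decision. Scores above decline_at are
    the obvious bad, below review_at the obvious good; the band in between
    is the gray area where a human (or your bank's phone call) gets involved."""
    if fraud_score >= decline_at:
        return "decline"
    if fraud_score >= review_at:
        return "review"
    return "approve"
```

"Shrinking the edge cases" then just means a better model pushes more transactions out of the review band into confident approve or decline.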
More interest in, I'm sure you've seen, the huge rise at universities of people taking statistics classes, taking linear algebra, kind of the core fundamental mathematics behind these systems. Obviously the rise of the data scientist is pretty critical here as well.

I guess it was a shame that I took remedial statistics.

Well, the point you made earlier was that statistics used to be a minor attached to something else. You took your econ stats to go with your econ major, you took psych stats to go with your psych major, and statistics wasn't treated as a general-purpose tool to apply to lots of different problems.

Yeah, I'll give a shout-out to my alma mater, Amherst College. Back when I took my math degree there, it was, I think the math department has grown by 10x in the years since I left. They actually have a whole statistics concentration now, whereas back then it was calculus and linear algebra and the traditional math courses; I think we had maybe one statistics class. Now they have a whole program around statistics, and I think you've seen that mirrored across a lot of universities. You see it in the Courseras and the Udacities and all of those guys.

So, Grant, unfortunately we're out of time. I want to give you the last word. As we look forward six months, nine months, 12 months, you're kind of at the cutting edge of this. The good news is the fuzziness is where the deltas are, right? So what are you excited about? What are you working on in the next six months, nine months? When we see you in a year, what are we going to be talking about?

Yeah, we're really focused on how we productize this machine learning stuff. How do we make it easier and easier for people to do this without having to know all the depth of every last statistical model, all of those kinds of things?
How do we make it so that we can transform the logs that you're already generating into models that automatically learn things like the ranking for your search results, the suggestions to give to people, the related searches, the recommendations, all of those kinds of things? So it's bigger, better, faster machine learning algorithms incorporated into a package that is easy for people to consume, and then obviously the analytics to decide whether it's working. We've also got some really cool things coming out with bandit optimizations in the context of search, so we're pretty excited about those. Those will be out in the next couple of months.

Some good stuff there. Well, Grant, thanks for stopping by.

Yeah, thanks, Jeff. Thanks, George. Great to be here.

Absolutely. It's Grant Ingersoll from Lucidworks. You're watching theCUBE. We're in Midtown Manhattan at Spark Summit East. Be sure to log on to SiliconANGLE.tv. If you don't have time to watch a video, we're doing podcasts now. We call them Cubecast, you might have guessed. Search for Cubecast at SiliconANGLE.tv. It's on SoundCloud as well as iTunes. It's a perfect way to take in a CUBE interview when you're jogging or driving to work or doing whatever. We'll be back with our next guest after this short break. You're watching theCUBE.
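As a closing technical footnote: the bandit optimizations Grant mentions can be sketched as textbook epsilon-greedy selection over result variants, fed by the click logs described above. This is a generic sketch with assumed click-through-rate bookkeeping, not Lucidworks' product code.

```python
import random

def choose(ctr, epsilon=0.1, rng=random):
    """Epsilon-greedy: usually show the variant with the best observed
    click-through rate (CTR), but explore a random variant epsilon of the time."""
    if rng.random() < epsilon:
        return rng.choice(list(ctr))
    return max(ctr, key=ctr.get)

def record(ctr, counts, variant, clicked):
    """Fold one observed impression (clicked: 0 or 1) into the running CTR."""
    counts[variant] += 1
    ctr[variant] += (clicked - ctr[variant]) / counts[variant]
```

Run in a loop over live search traffic, this is how a system can keep testing ranking or suggestion variants while mostly serving the one that is currently winning.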