Live from San Francisco, it's theCUBE. Covering Spark Summit 2017, brought to you by Databricks. Welcome back to theCUBE. We are continuing to talk with people who are not just talking about things, but doing things, and we are happy to have from Noveda the director of Predictive Analytics, Mr. Rob Lance. Rob, welcome to the show. Thank you. And also to my right, George, how are you? Good. We've introduced you before. Yes. Well, let's talk to the guest. Let's get right to it. I want to talk to you a little bit about what Noveda does, and then maybe what apps you're building using Spark. Sure. So, Noveda is an advanced analytics company. We're medium-sized, and we develop custom hardware and software solutions for our customers who are looking to get insights out of their big data. Our primary offering is a hardened entity resolution engine. We scale up to billions of records, and we've done that for about 15 years. But you're in the business end of analytics, right? Yeah. Yeah, I think so. All right. So talk to us a little bit more about entity resolution, and that's all Spark, right? This is like your main priority? Yes, yes, indeed. So entity resolution is the science of taking multiple disparate data sets, big data, traditional big data, and taking records from those and determining which of those are actually the same individual or company or address or location, and which of those should be kept separate. And so we can aggregate those things together and build profiles, and that enables a more robust picture of what's going on for an organization. Okay. And George? So what did you do? What was the solution looking like before Spark? And how did it change once you adopted Spark? Sure, so Spark enabled us to get a lot faster. Obviously those computations scaled a lot better. Before, we were having to write a lot of custom code to get those computations out across a grid.
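The matching Rob describes, deciding that "Rob Lance" and "rob lance" are one person while "Robert Lance" stays separate, can be sketched as a toy pairwise matcher. This is a minimal illustration, not Noveda's engine; the attributes, similarity threshold, and helper names are all hypothetical.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase and strip punctuation so trivially different spellings compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace()).strip()

def same_entity(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Treat two records as the same entity when their normalized names are highly
    similar AND a second attribute (here, email) agrees. Name similarity alone
    would collapse 'Rob Lance' and 'Robert Lance', so a second field gates the match.
    Attributes and threshold are illustrative, not Noveda's actual rules."""
    name_sim = SequenceMatcher(
        None, normalize(rec_a["name"]), normalize(rec_b["name"])
    ).ratio()
    return name_sim >= threshold and rec_a.get("email") == rec_b.get("email")

a = {"name": "Rob Lance", "email": "rlance@example.com"}
b = {"name": "rob  lance", "email": "rlance@example.com"}
c = {"name": "Robert Lance", "email": "robert@example.com"}
```

A real engine would add blocking keys so it never compares all pairs across billions of records, but the resolve-or-keep-separate decision at the core looks like this.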
When we moved to Hadoop and then Spark, that made us able to scale those things and get it done overnight or in hours, not weeks. So when you say you had to do a lot of custom code to distribute it across the cluster, does that include when you were working with MapReduce, or was this even before the Hadoop era? Oh, it was before the Hadoop era, and that predates my time, so I won't be able to speak expertly about it. To my understanding it was a challenge for sure. Okay, so this sounds like a service that your customers would then themselves build on. Like maybe an ETL customer would figure out master data from a repository that is not as carefully curated as a data warehouse, or similar applications. So who is your end customer, and how do they build on your solution? Sure, so the end customer typically is an enterprise that has large volumes of data that deal in particular things, right? And they collect, well, it could be customers, it could be passengers, it could be lots of different things, and they want to be able to build profiles about those people, or companies like I said, or locations; any number of things can be considered an entity. And the way they build upon it then is in how they go about quantifying those profiles, and so we can help them do that. In fact, some of the work that I manage does that, but oftentimes they do it themselves: they take the resolved data, which gets resolved nightly or even hourly, and they build those profiles themselves for their own purposes. And then, to help us think about the application or the use case holistically, once they've built those profiles and essentially harmonized the data, what does that typically feed into? Oh gosh, any number of things really. I mean, we've got deployments on AWS in the cloud, and lots of deployments on premises, obviously.
And that can go anywhere from relational databases to graph query language databases, lots of different places from there for sure. Okay, so this actually sounds like, I mean, everyone talks now about machine learning informing every category of software. So this sounds like you take the old-style ETL, where master data was a value-add layer on top, and that took a fair amount of human judgment to do. And so now you're putting that service on top of ETL and you're largely automating it, probably with, I assume, some supervised guidance, supervised training. Yeah, so we're getting into the machine learning space as far as entity extraction and resolution and recognition, because more and more data is unstructured. But machine learning isn't necessarily a baked-in part of that; actually, entity resolution is a prerequisite, I think, for quality machine learning. So if Rob Lance is a customer, I want to be able to know what Rob Lance has bought from me in the past, and maybe what Rob Lance is talking about in social media. Well, I need to know how to figure out who those people are, who's Rob Lance, and whether Robert Lance is a completely different person; I don't want to collapse those two things together. And then I would build machine learning on top of that to say, all right, what's his behavior going to be in the future? But once I have that robust profile built up, I can derive a lot more interesting features with which to apply the machine learning. Okay, so you are a Databricks customer, and there's also a burgeoning partnership. Yeah, yeah, I think that's true, yeah. So talk to us a little bit about what are some of the frustrations you had before adopting Databricks, and maybe why you chose it. Yeah, sure, so the frustrations primarily with a traditional Hadoop environment involved having to go from one customer site to another customer site with an incredibly complex technology stack.
And then do a lot of the cluster management for those customers even after they'd already set it up, because of just all the inner workings of Hadoop and that ecosystem. And so getting our Spark application installed there, we had to penetrate layers and layers of configuration in order to tune it appropriately to get the performance we needed. Okay, were you at the keynote this morning? I was not. I didn't get to see the keynote. Then I can't ask you about that. But I'm going to ask you a little bit about your wish list. You've been talking to people maybe in the hallway here. You just got here today, but what do you wish the community would do or develop? Or what would you like to learn while you're here? So, learning while I'm here: I mean, I've already picked up a lot. There's so much going on, and it's such a fast-paced environment; it's really exciting. I think if I had a wish list, I would want a more robust MLlib, the machine learning library, right? So all the things that you can get on traditional, well, traditional scientific computing stacks, moved onto Spark's MLlib for easier access on a cluster, would be great. I thought several years ago MLlib took over from Mahout as the most active open source community for adding, I thought, really scale-out machine learning algorithms. Even if it doesn't have it all now, maybe all is something you never reach, kind of like the Red Queen effect. For sure, for sure. Where else are these scale-out implementations of the machine learning algorithms showing up? In other words, what are the other platforms? If it's not Spark, then... I don't think it exists, frankly, unless you write your own, right? I think that would be the way to go about it now. So I think what organizations are having to do with machine learning in a distributed environment is just go with good enough, right?
Whereas maybe some of the ensemble methods that are, I mean, they actually aren't even really cutting-edge necessarily, but you can really do a lot of tuning on those things. Doing that tuning distributed at scale would be really powerful. I read somewhere, and I'm not going to be able to quote exactly where it was, but actually throwing more data at a problem is more valuable than tuning a perfect algorithm, frankly. And so if we could combine the two, I think that would be really powerful. That is, finding the right algorithm and throwing all the data at it would get you a really solid model that would pick up on the signal that underlies any of these phenomena. Okay. Oh, I was just going to say, I think that goes back to, I don't know if it was a Google paper or one of the Google sort of search-quality guys who's, you know, a luminary in the machine learning space, saying data always trumps algorithms. Yeah, you know. No, I believe that's true. That's true in my experience, certainly. So once you have this machine learning, and once you've perhaps simplified the sort of multi-vendor stack, then what does your solution start looking like in terms of broadening its appeal, because of the lower TCO, and then perhaps embracing more use cases? So I don't know that it necessarily embraces more use cases, because entity resolution applies so broadly already, but what I would say is it will give us more time to focus on improving the ER itself. And that's, I think, going to be a really, really powerful kind of improvement we can make to Noveda Entity Analytics as it stands right now. That's going to go into, as we alluded to before, the machine learning as part of the entity resolution: entity extraction, automated entity extraction from unstructured information, and not just unstructured text, but unstructured images and video.
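The ensemble methods Rob mentions a moment earlier (bagging, for instance) are simple to express; the hard part he points to is tuning them distributed at scale. As a single-machine sketch of the idea only, using a toy one-dimensional data set and hypothetical helper names:

```python
import random
from statistics import mean

def train_stump(points):
    """Fit a 1-D decision stump: threshold halfway between the two class means.
    points is a list of (x, label) pairs with label in {0, 1}."""
    zeros = [x for x, y in points if y == 0]
    ones = [x for x, y in points if y == 1]
    if not zeros or not ones:
        # Degenerate bootstrap sample with one class missing: fall back to overall mean.
        thresh = mean(x for x, _ in points)
    else:
        thresh = (mean(zeros) + mean(ones)) / 2
    return lambda x: int(x >= thresh)

def bagged_ensemble(points, n_models=25, seed=7):
    """Bagging: train each stump on a bootstrap resample, predict by majority vote."""
    rng = random.Random(seed)
    models = [train_stump([rng.choice(points) for _ in points])
              for _ in range(n_models)]
    return lambda x: int(sum(m(x) for m in models) > n_models / 2)

# Two well-separated clusters: class 0 near 0.5, class 1 near 1.5.
data = [(x / 10, 0) for x in range(10)] + [(1 + x / 10, 1) for x in range(10)]
predict = bagged_ensemble(data)
```

Each bootstrap resample and vote is independent, which is exactly why this family of methods distributes well: in a cluster setting each resample-and-train step could run on a separate executor.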
That could be a really powerful thing: taking in stuff that isn't tagged and pulling the entities out of it automatically, without actually having to have a human in the loop pulling every name out, every phone number out, every address out. Go ahead, sorry. This goes back to a couple of conversations we've had today where people say data trumps algorithms, even if they don't say it explicitly. And so the cloud vendors are sitting on billions of photos, many of which might have house street addresses and things like that, or faces. How do you extract better tuning for your algorithms from data sets that I assume are smaller than the cloud vendors'? So they're pretty big, and we employ data engineers that are very experienced at tagging that stuff manually. So what I would envision would happen is we would assign somebody for a week or two weeks to go in and tag the data as appropriate. In fact, we have products that go in and do concept tagging already across multiple languages. That's going to be the subject of my talk tomorrow, as a matter of fact. But we can tag things manually or with machine assistance, and then use that as a training set to go apply to the much larger data set. So I'm not so worried about the scale of the data. We already have lots and lots of data. I think it's going to be about getting that proof set that's already tagged. So what you're saying actually sounds kind of important, and it almost ties in to what we hear about Facebook training their Messenger bot, where we can't do it purely just on training data. So we're going to take some data that needs semi-supervision, and that becomes our new labeled set, our new training data. And then we can run it against this sort of broad, unwashed mass of unlabeled data. Is that sort of the strategy? Certainly we would want to get there.
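The bootstrapping Rob and George are describing, hand-tag a small seed set and then use it to label the much larger pool, can be sketched as a nearest-neighbor label propagation step. This is a minimal sketch with one-dimensional features and hypothetical names; a real pipeline would train a proper classifier on the seed set rather than use raw nearest-neighbor lookup.

```python
def nearest_label(point, labeled):
    """Label an untagged point with the label of its nearest hand-tagged example.
    labeled is a list of (feature, label) pairs; features here are just floats."""
    return min(labeled, key=lambda pair: abs(pair[0] - point))[1]

def propagate_labels(labeled, unlabeled):
    """Semi-supervised bootstrap: use a small hand-tagged seed set to label a
    larger pool, which then becomes training data for a bigger model."""
    return [(x, nearest_label(x, labeled)) for x in unlabeled]

# A week or two of manual tagging produces the seed set...
seed = [(0.1, "person"), (0.9, "address")]
# ...which labels the larger untagged pool automatically.
pool = [0.2, 0.3, 0.8]
labeled_pool = propagate_labels(seed, pool)
```

The payoff is the one Rob names: the expensive human effort is bounded to the seed set, and machine assistance carries those labels out to the much larger data set.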
And that's the beauty of what Databricks promises: the ability to save a lot of the time that we would spend doing the kind of grunt work on cluster management, so we can innovate in that way. And we're really excited about that. We've got just a minute to go here before the break, so I want to ask you maybe the wish-list question I've been asking everybody today. What do you wish you had, whether it's in entity resolution or some other area, in the next couple of years from Noveda? What's on your list? Well, I think that would be the more robust machine learning library native on Spark, so we wouldn't have to deploy that ourselves. And then, you know, I think everything else is there. Frankly, we are very excited about the platform and the stack that comes with it. Well, that's a great ending right there. George, do you have any other questions you want to ask? All right, we're just wrapping up here. So thank you so much. We appreciate you being on the show, Rob. And we'll see you out there in the expo. Appreciate it, thank you. All right, thanks so much. George, it's good to meet you. Thanks. All right, you are watching theCUBE here at Spark Summit 2017. Stay tuned, we'll be back with our next guest.