Hi everybody, this is Dave Vellante, and this is theCUBE, where we extract the signal from the noise. We bring you the smartest people we can find and share with you, our audience, the latest information on big data. We're here at the HackReduce launch with Dan Roberts, who's a co-founder of Diffeo, a new company. They're going to be based in this HackReduce office, and theCUBE is going to have a little setup here as well. So Dan, welcome, and thanks for coming on. Really appreciate it. So tell us about Diffeo. Tell us about what you guys are doing.

All right, great. So what we do is entity-centric search. Let me explain what an entity is. When you do a search on Google, you type in a bunch of keywords, and they don't really have any meaning. They're just words. They're not connected to anything in real life. But when we do a search, we're really interested in a specific thing that does have meaning in the real world, like, for instance, a person. The problem is that a lot of people have the same name, or some people have nicknames and different ways of being referenced, or you might be the king of England, you might be a specific king of England, and so on. So what we want to do is gather all the references to a certain person into one search term that you can search on. For instance, Wikipedia has disambiguation pages: a page that lists, for a given name, all the different people associated with that name. So wouldn't it be great if, instead of typing words into a search box, you could point to an article and say, I want that guy. That's the guy I want to track. That's the guy I want to set up an alert for. That's the guy I'm interested in seeing if he's in the news. My name is Dan Roberts. If you try to search for me, you won't be able to find me.
There are just too many people with my name. In fact, there's a Wikipedia disambiguation page for 20 other people who are not me. So it would be great if there were some larger knowledge base or collection of articles, which could be user-generated or otherwise gathered, and you could point to one of those articles that refers to some entity in the real world and say, I want to search on that guy. I'm interested in him because I'm in finance and I'm tracking a company and I want to generate tradable, actionable data. Or I'm in the government and I'm interested in tracking people, and they have different names. Or I just want to set up an alert about myself because I might show up in the news and I'm curious. If you set up an alert for me right now, you'll get all sorts of spurious alerts, like that I died as a 58-year-old in Wyoming last year. That's not very useful.

So what's the deep tech around entity-based search? Describe that a little bit.

Okay, no problem. So it's a pretty audacious effort. You have to take the entire stream of the internet and then do something called co-reference. First of all, you process the text for all the entities, and there's certain software that does that. For instance, you separate nouns from verbs and so on, and the nouns have the potential to be references to people by their names, so you can tag what the entities are using a named entity recognizer. An entity mention is just a couple of words that may say something like Google or, I hate using my name again, Dan Roberts, or the President of the United States, Barack Obama, and so on. And then you have to take the name and somehow derive features that describe, based on the context of the article, who that person is.
So for instance, a quick example: the sentences surrounding that name are often a very good indicator of who that person is, because they're using the name in context. Then you compare that in some sort of mathematical way with name references across the other documents, and this is a hugely computationally intensive process. The number of different ways of creating different groups of these names grows way faster than the factorial. It's called a Bell number. What we've done, or one of the technologies we're working on, is doing it in a hierarchy: essentially collecting groups together, then grouping on top of that, and so on, building these large tree-like structures that in some way represent a person. So the tree for, let's say, Barack Obama would have various sub-trees that would essentially fraction out different parts of his personality or his persona, down to the level of the mention. So we build these large trees, and then you can move the trees around together, and in a sense that allows it to scale. That can be done massively in parallel, and it allows us to solve this problem on a day-to-day basis: taking in the stream of the internet, figuring out who the people are and what the clusters are, and building these larger structures so that you can come along and run the query you want.

Sounds like it's a combination of linguistics, mathematics, and this hierarchical architecture that you're talking about.

Absolutely. So yeah, we combine tons of different fields, from statistical machine learning, which is, like you said, based on NLP, natural language processing, to database technology. We're going to be working very closely with Accumulo and the Sqrrl folks who are also associated in here. And also just general graph theory: building these large, scalable tree-like structures and coming up with different moves and manipulating them in some sort of larger space.
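
As an illustration of the ideas in the answer above, here is a minimal Python sketch. It is not the company's actual method: the function names (`bell_numbers`, `context_vector`, `cosine`, `build_trees`), the bag-of-words features, the cosine measure, the greedy merging strategy, the 0.5 threshold, and the sample mentions are all assumptions made for the example. The first function counts the ways to split n mentions into groups (the Bell number). For reference, that count outgrows any exponential c^n, although it stays below n!, and either way exhaustive enumeration is out of reach. The rest groups mentions bottom-up by the similarity of their surrounding words, producing small trees whose leaves are individual mentions.

```python
from collections import Counter
from math import sqrt


def bell_numbers(n):
    """Return [B(0), ..., B(n)], computed with the Bell triangle."""
    row = [1]
    bells = [1]
    for _ in range(n):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


def context_vector(text):
    """Bag-of-words counts for the text surrounding a mention (toy feature)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_trees(mentions, threshold=0.5):
    """Greedily merge the two most similar clusters until none reach threshold.

    Each tree is either a mention string (a leaf) or a (left, right) tuple.
    The threshold is an arbitrary value chosen for this toy data.
    """
    clusters = [(m, context_vector(m)) for m in mentions]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = cosine(clusters[i][1], clusters[j][1])
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        if score < threshold:
            break
        # The merged cluster keeps both subtrees and the summed word counts.
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [tree for tree, _ in clusters]


# Hypothetical mentions: two about a physicist, two about a different person.
MENTIONS = [
    "Dan Roberts physicist at MIT works on entity search",
    "MIT physicist Dan Roberts co-founded a search startup",
    "Dan Roberts of Wyoming died aged 58",
    "obituary Dan Roberts Wyoming rancher died",
]

if __name__ == "__main__":
    print(bell_numbers(5))  # [1, 1, 2, 5, 15, 52]
    for tree in build_trees(MENTIONS):
        print(tree)
```

On these four sample mentions the sketch ends with two trees, one per person, because the shared name alone is not enough to push the two groups over the threshold. Each pass over all cluster pairs is quadratic, so a real system would need something far more scalable, such as the parallel tree moves described above.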
Are you using Accumulo?

Right now, we will be using Accumulo. We have it set up so that we can use a couple of different back ends, but we're definitely aiming to make Accumulo one of the major ones.

So talk about the company. Where are you guys at? You're going to be based in HackReduce, correct?

Based in HackReduce, absolutely.

And where are you as far as funding? How many people? Where are you guys out of?

Excellent. Yeah, so we're bootstrapping. We have a lot of clients already signed up, and so we're hiring. There are three of us who are co-founders, and we have two other employees signed up, and we're looking to hire more.

So I love it. You've basically taken a sell, design, build model and you're funding it with client funds. You're young guys out of MIT?

Yeah, so the three co-founders are myself, Max Kleiman-Weiner, and John Frank. John and I are theoretical physicists at MIT, and Max is a computational neuroscientist. We all met through our graduate fellowship, the Hertz Foundation, which funds a lot of graduate fellowships. John is a couple of years older. He successfully built up a company called MetaCarta and sold it a couple of years ago to Nokia. We just started talking, originally about predicting Wikipedia, and then from there we came off in this direction.

Dan Roberts, Diffeo. Really, congratulations on getting the company off the ground. We'll be watching. Let us know if we can help. Thanks very much for coming on.

Thank you so much.

All right, great to meet you. All right, keep it right there. We'll be back with our next guest. This is theCUBE, Dave Vellante in Boston, in Cambridge, at HackReduce.