Live from Cambridge, Massachusetts, extracting the signal from the noise, it's theCUBE, covering the MIT Chief Data Officer and Information Quality Symposium. Now your hosts, Dave Vellante and Paul Gillin. Welcome back to MIT, everybody. This is Dave Vellante with Paul Gillin. We're here live in Cambridge, Massachusetts, and this is theCUBE, SiliconANGLE Wikibon's continuous coverage of MIT CDOIQ. This is our third year at this event. Arka Mukherjee is here; he's the founder and CEO of Global IDs out of Princeton, New Jersey. Arka, welcome to theCUBE. A pleasure, thanks for having me. So tell us about Global IDs. What does the organization do? Why did you found the company? In 2001, I was struck by the problem of growing complexity in data landscapes. I was working at IBM, dealing with the very complex problems IBM was working on, and I kept saying that the software we have is not smart enough to handle the problems associated with very large and very complex data ecosystems. So at that point I decided to work on this problem, and we've been at it for more than a decade, systematically addressing the problems related to data ecosystems. So the problems are numerous. There's obviously data quality, you know, consistency; there's governance, there's security, there's a single version of the truth, master data management, all the stuff we've been chasing. It's like tilting at windmills. So maybe summarize the problem statement, and we can talk about which ones you've attacked and where the white space is. Wonderful. So we think the way software has been created to handle data is somewhat dated. Today's software is primarily based on thinking that was done in the 1980s and 1990s about these problems. In order to handle high-volume, high-complexity environments, we need a newer, better, smarter approach to looking at data.
So Global IDs is primarily interested in creating an approach that actually works for the kind of volume, complexity, and diversity we face in current environments. So what are those? We think there's a foundational understanding required for data, and that foundation is based on discovering, profiling, and organizing these large ecosystems of data. Once you've built that foundation, you can actually solve the problems related to quality, governance, master data management, security, and privacy, because these are all ecosystem-level problems. They are not silo-based problems; they are all ecosystem problems. So we are approaching this in a scientific way, and it's leading to a much more mature approach to solving these ecosystem problems. So you're helping companies discover the data they have, understand how it's structured, what it's there for, what it does, and then you assemble this model of the organization based on its data. What do they then do with that? Right. So the organization of the data is the key problem. We think that's a missing foundation for most large organizations. If you go into any large organization and ask simple questions like, can you give me an inventory of all your data assets? Most organizations will say, no, we can't do that. We have isolated metadata repositories, but we don't really have a holistic understanding of our data assets. Or if you ask them, do you have a semantic understanding of the way this data is being used to solve business problems? It's too large, it's outside my area, it's not in my business unit, et cetera. So there's an absence of a holistic understanding of the ecosystem. Our premise here is that if you use software in an intelligent way, you can solve each of these problems: the creation of the inventory, the profiling of the data, the recognition of the data, the classification, the mapping of the data.
All these problems are actually tractable. And once you have that foundation in an automated way, you can tackle each of the problems you care about. So, for example, the issue of privacy. Most organizations should be able to tell us where all the private, sensitive data is. But that's a near impossibility these days, when you have tens of thousands of databases. How do you know where all your sensitive data is? Well, you need the foundation I was describing in order to really answer those questions. So this is a metadata challenge. It is. So Michael Stonebraker said today that MDM, master data management, is really going to be a metadata management issue. Is that the fundamental approach you're taking? We talked about that. So we do create the largest metadata repositories in the world; these are at the scale of 100 million data assets. So we do agree that without comprehensive, holistic metadata, it's very hard to solve today's data problems. Michael Stonebraker is absolutely right in saying that you need to create those foundations before you build smarter systems. So part of the challenge is, you ask a customer what data they have, and they can't tell you. There's a classification issue, which has been an age-old problem in data management. And the challenge is that you can't manually classify data: it doesn't scale, and humans don't do a good job of it. How is the industry solving the problem? That's an excellent question, because we believe nobody does this at scale. And organizing all this information coming our way is absolutely critical, right? So go to large organizations and ask, do you recognize all your data assets? Let's say, for example, you have 100 million data assets. Do you recognize each one of them? People will say, no, I give up, I couldn't possibly do that. Or, can I classify these data assets into business categories? No, I can't do that either.
Can I classify all this into master data objects? No, I can't do that either, right? So the issue of classification and organization is central to any intelligent system. And what we are trying to do is automate all of that through software. We've done it at the scale of about 70 million data assets, so we know the problem is tractable and can be solved. Can we talk about how you solve that problem for a minute? I mean, technology created the problem. Yeah, that's how we have all this data. And you're implying that technology can help us get out of this rut. Yes. Talk through it. Is it agents placed everywhere doing discovery, and then using math to categorize? Is that the approach? It is. You're absolutely right; we have to take those approaches to solving the problem. Marvin Minsky at MIT, in the '60s and '70s, outlined how these classes of problems could be solved. You may be aware that Marvin Minsky wrote a book called The Society of Mind, in which he said that in order to solve complex problems, you need to break them up, reduce them into small components, and then run millions of those agents in parallel. Which is exactly the approach we use. We essentially say that when an organization like an AT&T or a Walmart has 100 million data assets, we need intelligent agents that can process all the different varieties of that data. And these millions of agents can collectively collaborate to produce an understanding and an organization of the ecosystem that can then be processed. What do you find is most surprising to your customers when they go through this analysis process? First of all, the enormity of the challenge. Everyone thinks this is an impossible problem. But we've done it at the scale of the largest organizations in the world.
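The Minsky-style "society of agents" idea described above can be sketched in a few lines of Python. This is a hypothetical illustration, not Global IDs' actual software: each agent is a narrow recognizer for one pattern, agents run in parallel across an inventory of data assets, and their findings are combined into a classification. The asset names and rules are made up for the example.

```python
# Sketch of agent-based classification: many small, specialized agents
# each examine a data asset's sample values, and their votes combine
# into a label for the asset. Names and rules here are illustrative only.
from concurrent.futures import ThreadPoolExecutor
import re

# Each "agent" recognizes one narrow pattern.
AGENTS = {
    "email":    lambda v: bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.\w+", v)),
    "ssn":      lambda v: bool(re.fullmatch(r"\d{3}-\d{2}-\d{4}", v)),
    "zip_code": lambda v: bool(re.fullmatch(r"\d{5}", v)),
}

def classify_asset(asset):
    """Run every agent against an asset's sample values; collect matching labels."""
    name, samples = asset
    labels = {label for label, agent in AGENTS.items()
              if all(agent(v) for v in samples)}
    return name, labels or {"unclassified"}

def classify_ecosystem(assets, workers=8):
    # Agents run in parallel across the (potentially huge) asset inventory.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(classify_asset, assets))

assets = [
    ("customers.contact_email", ["a@b.com", "c@d.org"]),
    ("employees.ssn",           ["123-45-6789"]),
    ("orders.note",             ["hello world"]),
]
print(classify_ecosystem(assets))
```

At real scale the agents would profile live database columns rather than in-memory samples, but the decomposition, many independent recognizers whose results are merged, is the point of the sketch.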
So the fact that it's a solvable problem is the biggest aha. At the end of the exercise, they say: we understand all our data assets, right? We know how these data assets can be organized to improve our business. We understand what is redundant in the environment and what can be rationalized. And we know how to protect and secure the important parts of the data landscape. So do they typically find data that should be protected but isn't? Absolutely. So you identify security vulnerabilities as part of the service? Security vulnerabilities, privacy vulnerabilities. The privacy one is the most obvious. The approach to privacy these days is: I'm going to interview all my application owners, I'm going to talk to my database owners and say, give me a certificate saying that all your sensitive data is protected. And everybody does audits, and everyone passes the audits with flying colors, except when our system starts looking at it. When our system reverse engineers the environment, all the holes just pop right up, because the agents are saying, here's a sensitive data area that the software can detect. This is a data quality conference, in part. What do you find, overall, is the state of data quality in most of the companies you work with when you first come in? I think it's terrible. So we have about eight layers of functionality related to data quality. And what we ask is: do we know all the rules, the data quality rules, that should be applied to the data in order to measure its quality? Because if you're an organization that wants to understand data quality, everything should be measurable. The data quality of all the data assets should be measurable. But we don't know anyone who does that. When it is done, it is done in a very sparse way, for the silos that are of importance. Nothing is done in an organized, systematic way.
So we think there are two or three challenges related to this. The first is that you must know all the data quality rules that are applicable to the data landscape. And because it is a large number of rules, you cannot do it manually; you have to automatically generate the rule set that governs data quality. That's number one. Number two, you must apply those rules and create data quality metrics for each of your data assets, to say: this asset I trust, and this asset I do not trust, because the quality metric and the rules that have been run show that it scores poorly. So you must do that. You have to do that at the point of creation or use, correct? That is right. You can't go back and do it after there's this big bog of data there. Yeah, that would mess up all the history of the system, so you can't do it. So the issue of lineage is very critical. You have to actually trace back across the ecosystem how information originates and how it flows through the ecosystem. Then, once you identify the origin of a problem, you fix it at that source, so that when it flows downstream, everything is corrected. So the problems, to summarize them: you must know all the rules; you must have data quality metrics for all your assets; and you must have a cleansing mechanism that uses lineage and traceability to solve the problem. And we believe that ultimately this should be done on the basis of data quality controls: at the source, you have all the controls in place such that low-quality data doesn't even enter the system, or is blocked at the point of origination. So just as in the process world we have process controls, we should have data quality controls right at the origin. So you've developed this software, this technology. It sounds like the architecture is a distributed system, so data can stay in place. You're not shoving everything into a single repository.
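The rules-metrics-controls scheme Mukherjee lays out above can be made concrete with a small sketch. This is a hedged illustration, not Global IDs' implementation: rule generation here is deliberately simple (a not-null check plus a range check inferred from a sample), where a real system would derive far richer rules, and the trust threshold is an arbitrary policy choice.

```python
# Sketch of the three-part data quality scheme: (1) auto-generate rules
# from a profile of the data, (2) score every asset against the rules,
# (3) decide trust from the score. All specifics are illustrative.
def generate_rules(sample):
    """Infer a rule set from a column sample (hypothetical auto-generation)."""
    rules = [("not_null", lambda v: v is not None)]
    non_null = [v for v in sample if v is not None]
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        lo, hi = min(non_null), max(non_null)
        # Values outside the observed sample range are flagged as suspect.
        rules.append(("in_observed_range", lambda v: v is None or lo <= v <= hi))
    return rules

def quality_metric(values, rules):
    """Fraction of rule checks passed across all values: one score per asset."""
    checks = [rule(v) for v in values for _, rule in rules]
    return sum(checks) / len(checks)

ages = [34, 29, None, 41, 37]
rules = generate_rules(ages[:3])      # rules learned from a sample
score = quality_metric(ages, rules)   # then applied to the whole asset
trusted = score >= 0.9                # trust threshold is a policy choice
print(round(score, 2), trusted)
```

A source-side "data quality control" in this picture would simply run the same rules at the point of origination and reject or quarantine records that fail, rather than scoring them after the fact.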
And you've got a lot of math and algorithms that allow you to auto-classify and apply policies. It's a powerful technology, and it's unique. I mean, not to pick on, no, I'm going to pick on them: take a company like Autonomy, which claimed to have solved this problem prior to the HP acquisition. But essentially what Autonomy was doing was using search as a blunt instrument, and it doesn't scale; the data quality is not effective. So am I mistaken? Is this not unique in the business, or are there others? It is fairly unique, because these are hard problems, and you need to spend at least a decade thinking about these problems correctly. Most software companies don't have the patience or the money to attack these large problems. So we are fairly unique in the sense that we are attacking these very large problems and making sure they are tractable. Where we differ from organizations like Autonomy is that Autonomy focused primarily on unstructured data environments, using Bayesian logic. We focus primarily on structured data environments, and we use a combination of Bayesian and non-Bayesian approaches. And so you're a private company? Yes. How were you funded? So when I came out of IBM, I sold all my stock in IBM, and that created the seed funding. And we started building out the software at that point. We were self-sufficient right from the beginning. We bootstrapped. We bootstrapped. Self-funded. Self-funded. Client funded. Client funded, you know, steep exponential growth and a very strong position in the market in tackling these very large, complex problems. You've never taken outside capital? We have never taken outside capital. Not a dime. Really? Congratulations. What's your head count? We have over 100 people, hopefully growing at 50 percent, kind of thing. Fifteen? Five-zero. Fifty? Yeah. In terms of people? In terms of people, and in terms of revenues as well. Typically, how do your customers find out about you?
We primarily operate through our channel partners; we don't have a very large direct sales force. We have partnerships with the large system integrators, and they understand our value proposition and can explain it well to end customers. So they already have the relationships with the big customers? With the big customers, and they typically bring us in. So, you used to be at IBM. Do you have a partnership with IBM at all? Well, you know, they have been trying to solve this problem forever. They have been. Again, they claim they do. Yes. But they don't; they touch upon it. And they can integrate it, and they can bring in services. But you're talking about a software-led solution that scales. Yes. I've been looking for this since the Federal Rules of Civil Procedure changed in 2006 and haven't found much, a couple of small companies that do some interesting things. So, I think the implicit question here is, who has the audacity to tackle these kinds of problems? Certainly, organizations like IBM, Informatica, Oracle, HP, Microsoft have the wherewithal to tackle these problems. But when you're at the whim of a market that judges you on quarterly performance, you don't have the ability to take a 10-year problem and attack it. So what you end up doing is buying a whole bunch of different companies and hoping they collectively solve the problem. But of course, they don't, because all these acquired companies were built for different purposes. They were never meant to work together. They were never meant to solve these large-scale ecosystem problems. So that's one category of organizations that can tackle the problem. The other set is the small, innovative organizations. But typically, those organizations don't have a long runway, because they run short of money and go to a VC who wants an exit in three years. And then they can't solve the problem either.
So you have to have a long runway in order to solve these problems. And you have to have the patience to make all the mistakes before you can solve the problem. You need people with patience and a genuine drive to solve the problem. We find ourselves in a category like that. And of course, many of the companies we are here with at MIT display some of those characteristics as well. But we believe we have a good solution, and we've established that the problem is tractable, that it can be solved. Previously, many of these problems were perceived to be intractable at the scale we are working at. Are you working primarily with structured data, or with unstructured data as well? Well, if you think of our agent-based system, it can handle any kind of data: structured data, unstructured data, big data, it doesn't matter. And it's flexible, because you can create an agent for any particular incarnation of the data. So wherever the data lives, that's the beauty of it, right? Data is distributed, and you're saying you can handle it. Sorry to interrupt. No, you're absolutely right. With that in mind, we have to focus on where our revenue comes from, right? Most of the revenue has come from large enterprise companies, and their problems are usually related to structured data. So that causes us to focus on structured data, but the platform itself has the flexibility to tackle diversity in real-world environments. Do you find these companies even know where all their data is? I can't imagine they do. They don't, and so, as Michael Stonebraker was saying, it requires one of two kinds of events to happen. One is strong leadership that says: if we are an information company and we value our information, we need to understand our whole inventory and how to work with information better. If a mandate comes from the top, it works. The other option is if there's a problem, right?
So take the Target example, where some data has leaked out from somewhere in the ecosystem, right? How do you deal with that? A catastrophe often forces the realization: I need to know how all of this is working. So those are the two situations that bring us in. Excellent. All right, well, listen, thanks very much, Arka, for coming on theCUBE. Great story. Good luck, and please keep in touch; let us know the progress. A pleasure, thanks for having me on the show. All right, keep it right there, everybody; we'll be back to wrap up MIT CDOIQ. This is theCUBE. We'll be right back.