Okay, we're back. This is Dave Vellante. I'm with Wikibon. I'm here with my co-host Jeff Kelly. This is theCUBE. We're at the MIT Information Quality Symposium, a symposium really targeted toward chief data officers and the practitioners who are actually implementing data governance and data quality within their organizations. Data quality is a topic that is really not discussed much, frankly, at a lot of the big data tech shows that we go to, the Hadoop Worlds and the Stratas and the Hadoop Summits. It's a lot of technology, a lot of how you bring real time to that world. Another topic that's just recently really hit the big data world is security. Our good friend Ely Kahn is here. Ely is an executive at Sqrrl. Sqrrl is a company that was born out of the NSA. They're really popularizing and commercializing the Accumulo database, bringing application development environments to that world, doing a great job there, and really trying to move the needle in the big data world. So Ely, welcome back to theCUBE. Good to see you. Thanks for having me. Yeah. So information quality is, as I said, something that's not talked about a lot. But you're in the healthcare industry for sure, you're dealing with financial services, clearly government, and really beginning to commercialize Accumulo. So one of the reasons why we wanted you guys here is to talk a little bit about this, because I think there are real parallels between what you're doing with Sqrrl and what this crowd is doing with information quality. So I wonder if you could talk about that a little bit. Yeah. Data quality was a huge issue that we saw inside the intelligence community, which really led to the development of Accumulo. And you can think of data quality in a few different ways, at least that's the way we think about it. One is simply getting access to the data that you need and, secondarily, making sure that data is clean.
So inside the intelligence community, back in the mid-2000s, what those folks were facing was many stovepipe databases and stovepipe applications on top of those databases. So intelligence analysts sitting inside the NSA or CIA had great difficulty in conducting queries that could search across all of these databases and really get the information that they needed. Those siloed databases were created for a number of reasons. Some of it was just that the data was so big they needed to segment it across several databases. But a lot of it was also due to security reasons. They were putting different types of classified material in different databases, U.S. persons versus foreign persons data in different sets of databases, which is certainly an issue that we've all read about a lot in the news over the last month or so. So Accumulo was created to really break down those stovepipes and allow folks to securely query across all the data that was available to them based on their authorizations within the organization, and provide a scalable platform where they could consolidate up to tens of petabytes of information on a single platform. So talk a little bit about this from an organizational standpoint, because big government agencies aren't the only ones dealing with this problem. You've got to look around at banks, insurance companies, healthcare organizations. What you just described applies to all of those. What are some of the organizational considerations that you see in terms of actually going from that stovepipe world? There's a technology piece, which Accumulo is a great example of, allowing fine-grained security so you can actually have data in there and have the control so that the wrong person is not getting to data that you don't want them to see. But there are other organizational aspects around that. Sometimes they don't want to give up their data. There's obviously security concerns. I wonder if you could talk about that a little bit.
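The per-cell authorization model just described, where every piece of data carries its own security label and a query only returns what the caller is cleared to see, can be sketched in a few lines. This is an illustrative toy, not Accumulo's actual API, and all row, column, and label names here are hypothetical:

```python
# Minimal sketch of cell-level security: every cell carries a visibility
# label set, and a scan returns only cells whose labels are fully covered
# by the caller's authorizations. Illustrative only -- not Accumulo's API.

class CellStore:
    def __init__(self):
        self.cells = []  # (row, column, required_labels, value)

    def put(self, row, column, visibility, value):
        # visibility: the set of label tokens required to read this cell
        self.cells.append((row, column, frozenset(visibility), value))

    def scan(self, authorizations):
        # A cell is visible only if the caller holds every required label,
        # so one table can safely hold data at many classification levels.
        auths = set(authorizations)
        return [(r, c, v) for r, c, vis, v in self.cells if vis <= auths]

store = CellStore()
store.put("person:1", "name", {"unclassified"}, "Alice")
store.put("person:1", "origin", {"secret", "us_persons"}, "US")

# An analyst authorized only for unclassified data sees one cell,
# while a fully authorized analyst sees both:
print(store.scan({"unclassified"}))
print(store.scan({"unclassified", "secret", "us_persons"}))
```

The point of the design is that the security decision travels with the data itself rather than living in the application layer, which is what lets mixed-classification data share one platform.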
Yeah, I think inside the big data space, you often don't hear about the change management and the cultural transition that needs to happen within an organization to go from these stovepipe databases to a platform where all your data sits. This definitely took the intelligence community a long time to really get their heads around and to put in place the formal organizational policies that really enabled this change, enabled folks to put all this data on a single platform while making sure, from a security standpoint, that the logical separations of the data were strong enough that you could mix data with different security requirements on the same platform. There's a huge change management piece, making sure the stakeholders are comfortable with the security requirements and security changes, and then actually implementing the technology is in some ways an easier process than some of the change management pieces. That's precisely what we heard from Tron from Veterans Affairs this morning. He talked about how some of the cultural challenges are often harder than the technology challenges. One of the things he said really enabled the development of what he's trying to do at the VA, in terms of creating a customer-centric view, workflow process, and data view for their stakeholders internally, was strategic communication. That is, tailor your communication as you're selling your projects internally to different stakeholders. If it's more of a business person, you've got to speak in a language that they understand and speak to the priorities that they have, versus an IT person, where you've got to speak in a more technical language so that they understand the work that's going to be entailed for them.
Talk a little bit about how you help customers tackle that problem, the communication problem, as you're implementing the technology, to make sure that the right people are on board and that the stakeholders understand what the process is going to entail. So I think the most effective way of tackling that problem is through examples of excellence. Inside the intelligence community, we started small with small pilots, demonstrated that things could work on a small scale, really put that pilot out there as an example of excellence, and built momentum behind that. So start small and scale big. And we're seeing a number of other federal agencies starting to adopt this approach. We're really excited about the work that is going to begin soon inside the National Institutes of Health and their National Cancer Institute. They recently put out an RFP around the development of a cancer genomics cloud: pulling together cancer genomics data from various research institutions, from across the NIH and HHS as well, into a central cloud that brings the analytics to the data. Right now the way it works is that a lot of times research organizations will share data with each other on a one-off basis, but that's not effective from either a computing perspective or a data quality perspective. What they're looking to do is bring all this data into a single cloud, utilize things like MapReduce and custom tools to refine that data into a standardized format that can be used widely across the larger research community, and then utilize the economies of scale associated with cloud computing to really bring that computational power to the data and enable more researchers to access really big data sets. So this is a really exciting initiative that hopefully we'll be involved in. So you guys just came off of the Hadoop Summit. We saw you there. You had a good presence there. Talk about that a little bit. How was that show for you? It was amazing.
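Going back to the refinement step just described for the genomics cloud, the idea of using MapReduce-style jobs to normalize differently shaped source data into one standard schema can be sketched roughly like this. The institutions, field names, and schema are all hypothetical illustrations, not the NIH's actual formats:

```python
# Toy map/reduce-style standardization pipeline: "map" normalizes records
# from differently shaped source systems into one schema, and "reduce"
# merges the normalized records per sample. All names are hypothetical.

from collections import defaultdict

def map_record(source, record):
    # Each institution exports a different shape; normalize to one schema.
    if source == "institution_a":
        return {"sample_id": record["id"], "gene": record["gene_symbol"]}
    if source == "institution_b":
        return {"sample_id": record["sample"], "gene": record["g"].upper()}
    raise ValueError(f"unknown source: {source}")

def reduce_by_sample(normalized):
    # Group all observed genes under each sample id.
    merged = defaultdict(set)
    for rec in normalized:
        merged[rec["sample_id"]].add(rec["gene"])
    return dict(merged)

raw = [
    ("institution_a", {"id": "S1", "gene_symbol": "BRCA1"}),
    ("institution_b", {"sample": "S1", "g": "tp53"}),
]
print(reduce_by_sample([map_record(s, r) for s, r in raw]))
```

At cluster scale the same map and reduce steps would run as distributed jobs over the raw files, which is where the "bring the analytics to the data" economics come from.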
It was our first big show, and we were really excited by the buzz. A number of the Hadoop distributors were sending clients over to talk to us, because a lot of folks are now talking about the need for secure real-time analytics, and that combination of security and real-time analytics is something that not a lot of folks can provide. Yeah, it's not out of the box with the distros, is it? No, it's not out of the box. And a lot of what we do is some of the softer pieces; translating an organization's security policies into fine-grained access controls is not simply an automatic process. It takes some soft skills, too, to figure out how to do that in the best way. So it was really cool to see these folks coming to us asking about how to do this in a secure manner. I think the momentum around the concept of Hadoop plus security is really starting to grow. Yeah, it struck me, having been doing this now for a few years, how much more discussion about security there was, even from, say, February at the Strata event to the Hadoop Summit show in June. By this year, I'd say it was one of the top two or three topics that came up there. And, you know, your point about fine-grained security: we did a great video with Adam Fuchs, who's the CTO at Sqrrl. He did a little chalk talk at the Wikibon offices. You can find that video on our YouTube channel, youtube.com/SiliconAngle, and it also happens to be up right now on the Sqrrl site at sqrrl.com. It's a how-to: fine-grained security for big data. And Adam really described, from a practitioner's perspective, how you should architect that. He's actually done a number of these for us. We love talking to you guys because you had your hands deep in it within the intelligence community, and now you're sharing that knowledge and obviously commercializing your products as well. So how's that going? You guys had a product launch early this year of Sqrrl Enterprise.
How's the uptake going? What are you seeing in the ecosystem and the community? So we're in production at a good number of clients now. And a lot of these clients had been utilizing Hadoop primarily in a sandbox environment, and they realized that in order to take it into production and put the type of data into their cluster that they wanted to put in there, they needed this concept of data-centric security. So a lot of folks are starting to talk about security in Hadoop. I think a lot of that discussion is still about perimeter security and how to properly authenticate and authorize people into your Hadoop cluster. I don't think there's enough discussion yet about what we call data-centric or cell-level security, really bringing the security to the data itself, which we think, and it's been our experience, is actually the most effective way to secure data in a large Hadoop cluster. We need both, but perimeter security is not enough. Yeah, and when people talk about security, they talk about the levels of granularity, but there's a lot of nuance there in terms of how to get there, and that's really where you guys have a substantial amount of expertise. Yeah, and I think the question of that more granular security is going to increasingly be asked as things like YARN, for instance, develop and are implemented. YARN really enables Hadoop to become a multi-application platform. When Hadoop is a single-application platform and you've got one group of users, maybe that perimeter-type control is all you need, but when you've got multiple applications potentially running on Hadoop and you've got users from different departments with different levels of authorization running jobs on the same cluster, that's when you've really got to start to get to that fine-grained level of security.
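To make the cell-level idea more concrete: Accumulo-style cells carry boolean visibility expressions such as secret&(us|uk), and a user sees a cell only if their authorization set satisfies the expression. A simplified evaluator for a reduced grammar (the real ColumnVisibility syntax has additional rules around quoting and operator mixing) might look like:

```python
# Simplified evaluator for Accumulo-style visibility expressions.
# Supports labels, &, |, and parentheses, with & binding tighter than |.
# A sketch of the idea only -- not Accumulo's actual ColumnVisibility.

import re

def evaluate(expression, authorizations):
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():              # lowest precedence: a|b
        result = parse_and()
        while peek() == "|":
            take()
            result = parse_and() or result
        return result

    def parse_and():             # binds tighter than |
        result = parse_term()
        while peek() == "&":
            take()
            result = parse_term() and result
        return result

    def parse_term():            # a label, or a parenthesized expression
        if peek() == "(":
            take()
            result = parse_or()
            take()               # consume the closing ")"
            return result
        return take() in authorizations

    return parse_or()

# A UK analyst with secret clearance passes; someone with only "us" fails.
print(evaluate("secret&(us|uk)", {"secret", "uk"}))
print(evaluate("secret&(us|uk)", {"us"}))
```

Because the expression is stored on each cell, two jobs on the same multi-tenant cluster can scan the same table and see different subsets of it, which is exactly the property a perimeter control can't give you.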
I wonder, do you agree with that assessment, and how do you see YARN and the maturation of Hadoop as a multi-application platform impacting your business, what you do, and the value proposition that Sqrrl brings? We're really excited about YARN. The way that we view YARN is that, like you said, it's going to support the development of multiple applications on top of a Hadoop cluster, and that's really what we're all about. We're all about creating an app store environment on top of Hadoop. These cell-level security controls, what they really do is enable multi-tenancy. They allow you to pull lots of different data sets onto a single platform and then expose those data sets to end users and to applications that have many different types of authorization. If you want to create that app store environment, YARN is going to be critical and data-centric security is going to be critical. Here at this show, we've got practitioners from the federal government, and we've got health care and financial services well represented. Do you see some common threads among those three industries in terms of their need for security, data quality, and data governance? Are you seeing this in terms of interest in Sqrrl from some of the industries that are really focused on this question of data quality and security? Those are our key verticals too. The intersection of diverse user bases, such as research-oriented verticals like health care, plus really big data is our sweet spot, and government, financial services, and health care certainly have really big data. The financial industry certainly places a premium on security just because of the cyber security threats that they face.
Probably less of a collaborative community than, say, a health care or even an intelligence community environment, but because they place such a premium on security, and because they have such important regulations and rules about how stakeholders inside a large multinational bank can communicate and how data can flow within that bank, those are other reasons why a data-centric security approach would be really important inside financial services companies. And at Sqrrl, you're helping customers understand that workflow and actually program these models so that they can understand who has access to what and when, under what circumstances, et cetera. Because I can imagine just trying to picture that at a big organization on a whiteboard; there's a lot of lines moving in a lot of different directions. Is that something you're helping customers work through? Yeah, we have some unfortunately complex spaghetti diagrams about how this security stuff works. But really, some of our secret sauce is in what we call our policy engine. When we sit down with an organization, we will work through and look at their existing information security policies and help them translate those into machine-readable policies that can then be utilized as an automatic labeling process on the data. So as data is ingested, we are automatically labeling those pieces of data with very fine-grained security rules about who can touch individual pieces of data. And this is becoming more and more important. There are a number of rules that have come out of the Affordable Care Act most recently that have our phones ringing off the hook with health care CIOs who need to figure out how to comply with them. And one of the ways that they're going to need to comply is via these fine-grained access controls. For example, doctors' notes now have very specific requirements about who can touch them. And of course that data is unstructured data.
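The policy engine workflow just described, translating written security policy into machine-readable rules that stamp each piece of data with labels at ingest time, can be sketched as follows. The rules, field names, and labels are hypothetical examples, not Sqrrl's actual policy format:

```python
# Hedged sketch of an ingest-time labeling step: machine-readable policy
# rules decide which visibility labels each field of a record receives.
# All field names, labels, and rules below are hypothetical.

# Each rule pairs a field predicate with the labels to attach on a match.
POLICY_RULES = [
    (lambda field, value: field == "ssn", {"pii", "restricted"}),
    (lambda field, value: field == "doctor_notes", {"phi", "care_team_only"}),
    (lambda field, value: field.startswith("billing_"), {"finance"}),
]

def label_record(record):
    """Return (field, value, labels) triples for one ingested record."""
    labeled = []
    for field, value in record.items():
        labels = {"base"}  # every cell gets a default label
        for predicate, extra in POLICY_RULES:
            if predicate(field, value):
                labels |= extra
        labeled.append((field, value, labels))
    return labeled

row = {"name": "Pat", "ssn": "000-00-0000", "doctor_notes": "see chart"}
for field, value, labels in label_record(row):
    print(field, sorted(labels))
```

In a real deployment the attached labels would become the per-cell visibility expressions enforced at query time, so the policy decision is made once at ingest and enforced on every subsequent read.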
So applying those fine-grained access controls to that unstructured data is what cell-level security was invented to do. So you guys just did another raise, which is interesting. Let's see, I forget the numbers. Can you share those with us? Have you announced it yet? We haven't quite announced it yet. We're in the midst of our Series A right now. We'll probably be doing an announcement pretty shortly, but nothing formal yet. OK, so you guys are doing well. And it's hard right now; the funding climate has changed because there's a glut of companies looking for Series A. So you guys are doing well. You've got a new CEO who's really out of the public company world. So talk a little bit about your business, the momentum. What else is going on there? Yeah, like everyone else, we're hiring like crazy, and we've been fairly successful so far. Our team's up to about 20 folks, so it's starting to feel like a real company here; we're no longer just a small startup. The plans for the company are to continue to expand in our key verticals: health care, government, finance, telecommunications. We're very strong in a couple of use cases right now, cybersecurity being one of them, helping companies build out a big data security analytics program, basically being that secure data store where they can dump all of their security information and build exploratory analytical applications on top of it. One of the other use cases that we're starting to get really deep into is health care analytics: ingesting electronic health records and other administrative, research, and clinical data into a single system and looking for patterns in that data that can help predict clinical diagnoses and help improve overall health care services. That's a really interesting question around the EMRs or EHRs, depending on the lexicon you want to use.
Companies like Cerner and Epic are really doing very well now because of the Affordable Care Act and some of the requirements in there to adopt EHRs. But traditionally, my understanding is that health care systems are very proprietary and don't talk well with one another. How is that impacting the ability of companies and technologies like yours to bring all that data into a single repository so that a lot of people can use that data for analytics? Is that an extra barrier that needs to be overcome in the health care world? Yeah, I'd say most of the health care folks we're talking to so far are primarily greenfield opportunities, in that they're still at the pilot stages of their Hadoop deployments, which is great because we can come in during these early stages and help properly craft security requirements and security approaches on top of that Hadoop cluster. So I think there's tons of room for growth of Hadoop adoption in the health care space, and I think these new requirements around electronic health records are only going to increase the need for solutions like Hadoop. Yeah, and really, if you think about the end result, think about how this all translates, in the case of healthcare, to helping the patient. I think we can all relate to meeting with our doctor or nurses and them not having the right information, or asking for information that we just gave to someone else. And that happens for a variety of reasons. There are data quality issues, and there are potentially security issues that don't allow sharing of the data. So the implications for the patient, the end user customer for lack of a better term, could be significant. Yeah, let me give you one real quick example of how a big data solution in healthcare might work. Let's say you have a relative that was diagnosed with diabetes. You could utilize an Accumulo-based solution to look for the indicators that led to that diagnosis.
You know, what are the pieces of data in the electronic health record and in other surrounding data sets that could be attributed or correlated with that diabetes diagnosis? Now, once we've established that pattern, let's begin looking for it across the healthcare network and see if there are undiagnosed cases of diabetes based on that pattern recognition. So those are the types of things that we're looking at now, and I think they're going to become more and more prevalent inside the healthcare space. Yeah, it just feels like the whole discussion is a wind at your back, as they say. So I think your timing's been good. You guys have the technology chops, and now you're building out the organization. You've got adult supervision, which is key, and we're really excited for the future of Accumulo and Sqrrl. So, Ely, thanks for coming by. Always a pleasure. You guys have the greatest perspective, so we love to tap your knowledge and share it with our audience. I really appreciate it. My pleasure. Thanks. Good to see you again. Thank you. All right, take care. All right, keep it right there, everybody. We'll be right back with Tony O'Brien, who's with the Sheffield Business School. This is theCUBE. We're at MIT for the next two days. We'll be right back.