Live from Midtown Manhattan, it's the Cube's live coverage of Big Data NYC, a SiliconANGLE Wikibon production, made possible by Hortonworks, "we do Hadoop," and WANdisco, "Hadoop made invincible." And now your co-hosts, John Furrier and Dave Vellante.

Hi everybody, we're back. This is Dave Vellante of Wikibon with Jeff Kelly. We're here at Big Data NYC, which is our event that we're holding this week. It's our third day. We're down here on Sixth Ave at the Warwick Hotel, right across the street from the Hilton, where the Strata Conference and Hadoop World are going on. Adam Fuchs is here. He's the CTO of Sqrrl, a longtime Cube guest who's done a number of great spots with us. We did one at the Wikibon offices, which we're going to talk a little bit about today, because that was the subject of one of his breakout talks. Adam, welcome back to the Cube.

Thanks, Dave. Good to be here.

Yeah, so you evidently had a packed house. We were talking to some of the folks at the Cube party last night. Evidently, it was standing room only at your session on Big Data Lessons Learned at the NSA. Everybody wants to hear about that. Now, of course, you couldn't give the real inside scoop, but you still provided, I'm sure, frameworks, as you did in the video that we did with you. So, how did it go?

Yeah, so I think everybody was expecting me to divulge some secrets. Most people were disappointed. But no, it was definitely a packed house. I think we counted something like 400 people, which, I don't know, probably violates fire code. Good session. A lot of interest afterwards. So I think people are really getting into Big Data security, trying to fix all that stuff.

Well, it's certainly been the talk of this conference, and it really started to escalate maybe about a year ago, but especially around last summer, I noticed. We saw you at Hadoop Summit, and the talk around security really started to grow.
Now, maybe that's because everybody really started to have some kind of solution to bolt on to their existing infrastructure. You guys always make the case that you had to start from scratch. And that's really what Accumulo was all about. It was developed inside the NSA. Somehow you guys got it open sourced, which is...

Yeah, that was quite an effort.

Amazing, right?

Yeah, the funny thing, though, is that open sourcing software from government, from behind the other side of the wall, it's not that hard. The hard part is actually continuing to participate in the project afterwards. So we actually went through about a year of policy writing just to get the right structures in place to maintain that participation from folks who are actively working on top secret missions.

Right, because they want to put them in a closet and not let them out.

So there have been a few projects that have kind of been thrown over the fence and stuck on GitHub. There's a lot of good tech out there that nobody will ever know about.

Well, we talk a lot about that here. When we started Wikibon, I said earlier today that we were inspired; one of the many inspirations was Don Tapscott's book Wikinomics: put it out there and great things will happen. And we'll pick up on that again. Government sometimes doesn't want that to happen, but the flip side is that innovation happens when you open source things.

Yeah, and as a startup it's actually pretty tricky for us to decide where we do open source and where we do closed source. At some point you've got to figure out how you're going to make money, but also how you're going to innovate and how you're going to rally a larger development community than just your own team.

So now you're also seeing some others hop on the Accumulo bandwagon, not necessarily just through Sqrrl, but also saying, hey, Accumulo, yeah, we'll support Accumulo.

Yeah, a small company called Cloudera, you may have heard of them. They just announced their support for it.
Hortonworks is very active in the development as well. So we're getting a lot of great players out there. Of course, there's still a huge amount of funding that comes in through government contracts. If you pay attention on the mailing lists, you'll see lots of people chiming in who have various projects on Accumulo. We're trying to bring that much more into the private sector, outside of the government space, and get a lot more participation from folks in regulated industries like healthcare and banking. There's huge potential for solving the security problems that are out there.

So talk a little bit about some of the lessons learned. I mean, again, I know you can't divulge super secrets, but things around scale, security practices in general, dealing with diverse data sets, and utilizing a key-value store. What are some of the things you talked about in your session and in the video that we did together?

Yeah, so the core axiom behind all of these lessons learned is that application development is key for innovation in any kind of organization. By that I mean, if you can take an idea and encapsulate it in an application that can be used and adopted by a large organization, that application is sort of that nugget of innovation. The more you can do that, the more rapidly you can build applications, the faster you can innovate. So a lot of these lessons learned are about how we avoid barriers to innovation. And the core of the Hadoop community is really focused on that as well. You've got Hadoop pushing for scalability as a very central tenet of the overall infrastructure. You do that because you don't want to hit a point somewhere in the adoption curve where you don't have the ability to scale beyond that. So that ability to continue to scale seamlessly from start to finish, that's one of those core ideas.

What kind of questions did you get from the audience that you can share with us?
Yeah, so everybody's interested in what's the difference between Accumulo and HBase. We get a lot of questions on that.

Oh, okay, interesting. Okay, well, let's talk about that. I mean, you and I have talked about that, but it's worthwhile reviewing.

Right, so the two projects actually started around the same time, although Accumulo was a closed source project inside of the government for the first few years of its life. It's an interesting discussion for two reasons, and one of those is that the two projects actually have pretty different focuses, despite both being based on the Google Bigtable design from the beginning. Accumulo started out really trying to focus on very heavy ingest rates, and as such you can see a lot of design decisions made, both in the client libraries and the networking code and the compaction algorithms, that are backing all of the performance on the server side, versus HBase, which very much had a focus on providing low latency query response times. And certainly in 2011 it was ahead on most of those metrics. So you kind of see a co-evolution of the two projects, but now, since they're sitting side by side in the Apache Software Foundation, you actually get a lot of code sharing and idea sharing between them. So I think it's very healthy to have the two projects sitting side by side. Now, our position at Sqrrl is that Accumulo still has those core security and scalability features that distinguish it, certainly in all of the use cases where we're using it.

Well, the race is on, right? So you've got people talking about putting cell-level security into HBase. You guys have said, okay, it's cool, but only if it performs like you need it to perform.

Yeah, you know, I'd love to reach some point in the future where we can take our Sqrrl software and port it over to HBase and have that not be an issue anymore.
Yeah, you would love to do that, of course. The architecture underneath is great; the Bigtable design and the systems built off of it have a lot of strengths. The battle shouldn't be over the minute differences between the two systems. Right now, security makes for a pretty big differentiator.

Well, this is kind of a related question. We talked earlier today with Jack Norris from MapR, and he made the point that in the Hadoop space, Hadoop has pretty much emerged as the de facto standard for a big data platform, at the foundational level anyway. But the NoSQL role is much more up in the air. He put up a slide, I think in one of his keynotes, with I don't know how many logos of different NoSQL databases. So, obviously, Sqrrl is betting on Accumulo, but in your opinion, what is it going to take for one or two or three of these NoSQL databases to win the market, if you will? Or maybe, is the premise of my question right? Can they all survive? How do you see this playing out, and what are some of the keys that are going to differentiate the NoSQL databases that thrive from the ones that fade away?

Yeah, well, it's almost certain that the ultimate NoSQL database doesn't exist yet. But what we're seeing is that there seems to be some consolidation of the various systems into maybe four different categories. You've got your hash-based key-value stores. You've got your sorted key-value stores, which also add some columnar capabilities. You've got document stores like MongoDB, which have the most market share now in the NoSQL space. And then I'd also throw in a lot of the graph stores as a fourth category. But if you start looking across them, you can pick out some defining characteristics of NoSQL that sort of separate it from SQL: things like low latency, high concurrency, and denormalized models of data.
All of those elements, we see commonalities across all the different systems. One of the things that we're trying to do with Sqrrl is to minimize, from an interface perspective, the need to choose which database to use under the covers. So we're trying to bring in document features and graph features, storing those in Accumulo side by side with the sorted key-value data, which gives you a huge feature set. And from the perspective of supporting that low latency, high concurrency query interface with the denormalized models, we can do pretty well across all of those.

So in terms of what you mentioned earlier, the partnership with Cloudera: what's the impact of a partnership like that? As the Hadoop world and the NoSQL world collide a little bit more, and you're seeing more support from people like Cloudera and others, how is that impacting the NoSQL market?

Yeah, so from our perspective in particular, we have a lot of different areas where we need to focus our development. Some of those are things like security and the security ecosystem, building in policy engines and ties into identity and access management systems; there's a huge amount of work over there. Some are in analytic layers, making sure that we have things like documents and graphs and live updates and graph neighborhood searches and maybe sketching algorithms and approximations. And then there's the other part, which underlies everything: core infrastructure and enterprise readiness. I'm pretty excited about folks like Cloudera coming in because they're going to be contributing to these open source sub-projects, things like replication across multiple data centers, that really solidify that core, and we need to have that. We could do that ourselves, but I'd much rather focus on these other elements and partner on keeping that stuff, that core open source database, in the open source.
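As an aside for readers, the idea of layering document fields and graph edges side by side on one sorted key-value space, as described above, can be sketched in a few lines of Python. The key layout here is purely hypothetical for illustration, not Sqrrl's actual schema:

```python
import bisect

class SortedKV:
    """Toy sorted key-value store with range scans, in the spirit of a
    Bigtable/Accumulo-style table (conceptual illustration only)."""

    def __init__(self):
        self._keys = []   # kept in sorted order
        self._vals = {}

    def put(self, key, value):
        if key not in self._vals:
            bisect.insort(self._keys, key)  # maintain sort on insert
        self._vals[key] = value

    def scan(self, prefix):
        """Yield (key, value) pairs whose key starts with prefix,
        in sorted order; a prefix scan is one contiguous range."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._vals[self._keys[i]]
            i += 1

kv = SortedKV()
# Document fields and graph edges share one sorted keyspace.
kv.put("doc/42/name", "alice")
kv.put("doc/42/age", "30")
kv.put("edge/alice/bob", "1.0")
fields = dict(kv.scan("doc/42/"))  # all fields of document 42 in one range scan
```

Because keys are sorted, fetching a whole document or a vertex's edge list is a single sequential range scan rather than many point lookups, which is what makes co-storing these models on a sorted key-value store attractive.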
So really focus on what you do best, where your real value proposition is, and partner rather than try to do it all yourselves.

Yeah, nobody can succeed doing everything themselves.

So let's geek out a little bit. What are some of the innovations you're working on now, to the extent that you can share, that you're excited about? Things you're looking to add to Sqrrl Analytics, the platform, and Accumulo, the database?

Yeah, so one of the things that we put in our latest release is the ability to do online aggregation in a document model. So you can have a field in a document where the effect of updating that field is to add to, to contribute to, the value that was there before. This is a pretty simple concept, right? Maybe you read the previous value, you increment it, you write it back. We're doing it in a different way. Accumulo under the covers has this thing called the iterator tree, which basically supports doing that type of increment operation while giving consistent results, so you always get the same response to your query, but doing it in a way that doesn't involve read-modify-write. If I'm randomly updating a whole bunch of values throughout my database, and I have to read the previous value at each of them to do that random update, all of my operations are going to turn into random IOPS in my core storage, which tends to be spinning disk.

Right? You don't want to do that, right?

You want to batch things up in micro-batches and basically be able to do the aggregations efficiently. So we're exposing that at a pretty high level inside of the documents. Part of what we're working on is extending that so you can have weights on graph edges with those types of aggregate functions, but also extending the types of aggregate functions that we support. So a lot of people are really interested in monoids and sketching algorithms.
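The increment-without-read-modify-write pattern described above can be illustrated with a toy Python sketch. This is a conceptual analogy to Accumulo's combining iterators, not the actual implementation; the class and method names are made up:

```python
from collections import defaultdict

class CombinerStore:
    """Toy illustration of combiner-style aggregation: increments are
    appended as partial values and summed lazily at read or compaction
    time, so a write never has to read the previous value first."""

    def __init__(self):
        self._partials = defaultdict(list)

    def increment(self, key, delta):
        # Pure append: no read-modify-write, so updates stay sequential
        # instead of turning into random reads against storage.
        self._partials[key].append(delta)

    def get(self, key):
        # The combining function is applied at query time, so readers
        # always see a consistent aggregate of all partials so far.
        return sum(self._partials[key])

    def compact(self):
        # Background compaction collapses partials into one value,
        # without changing the answer any query would see.
        for key, parts in self._partials.items():
            self._partials[key] = [sum(parts)]

store = CombinerStore()
for delta in (1, 2, 3):
    store.increment("page_views", delta)
total = store.get("page_views")  # 6, without ever reading before writing
```

The same lazy-merge trick generalizes to any associative combining function, which is why the conversation turns to monoids: as long as partial values can be combined in any grouping, compaction order doesn't matter.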
Google published a paper on HyperLogLog cardinality estimation. It's a little easier to say than it is to read, so let me unpack it a little bit. Basically, with a limited memory footprint, I get a pretty good estimate for things like distinct counts across huge data sets. So if I take 10 petabytes of data and I want to figure out what's in it, I can have a series of sketches that tell me what's in it, and I can store those sketches in a much smaller footprint than the original data, as long as I can keep them up to date online. And that's what we're really building into our document and graph models.

You were talking before about some of the problems with spinning disk. Do you see the advent of flash memory changing some of the unnatural acts that you had to do as a developer? And actually maybe obviating some of the cool tricks that you used?

That's actually something we've looked at a lot over the last several years: is this database technique going to be obsolete? Generally, if you look at the data structures that are optimal for storing things on something like an SSD, they're pretty similar to the log-structured merge tree technology that backs Accumulo. You still want to do batching, right? Because a lot of that has to do with synchronization and how you efficiently use your threads and your processors. It also has to do with how you efficiently compress data: how do you reduce the relative entropy between two different data elements? So grouping things into batches and doing a log-structured merge tree is still going to be efficient in the SSD and in-memory world. And we're pretty excited, because it's going to give us 10 times better performance as well, or more.

Yeah, so the techniques you've chosen have legs.
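For readers curious about the sketching idea mentioned above, here is a minimal HyperLogLog estimator in Python. The register count and hash choice are arbitrary, and this is an illustration of the technique, not production code or anything Sqrrl ships:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog cardinality sketch: a small, fixed memory
    footprint gives an approximate distinct count over a huge stream."""

    def __init__(self, p=10):
        self.p = p                      # 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash: the low p bits pick a register, the remaining
        # bits feed the "rank" (position of the leftmost 1-bit).
        x = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = x & (self.m - 1)
        w = x >> self.p
        rank = (64 - self.p) - w.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Bias-corrected harmonic-mean estimate, with the standard
        # linear-counting correction for small cardinalities.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)
        return est

hll = HyperLogLog()
for i in range(10000):
    hll.add(f"user-{i}")
estimate = hll.count()  # close to the true count of 10,000
```

Two such sketches over different data can be merged by taking register-wise maxima, which is what makes them cheap to keep up to date online as new records stream in.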
And the gap: what gives you confidence that the gap you have now, in terms of security, scalability, and all the other benefits of Accumulo, is one you can maintain as all this other competition comes into the marketplace?

Well, I don't think those elements of the database really change. You still have to worry about how you model that security, how you keep track of the fine-grained attributes that affect who can see an individual record. And in the domains that we're talking about, we're pretty far ahead.

So because it's fundamental to what you've developed?

It is fundamental. Yeah, I mean, in-memory techniques are great, but you still have to have the right algorithms, and the data-centric security model in general is still the right one.

Yeah, great. All right, Adam, thanks again for coming on the Cube. Always a pleasure to listen to you get deep. And check out the chalk talk that Adam did. It's called Big Data Lessons Learned at the NSA. If you go to Sqrrl's website, you can see it. You've got to give your name and address, that kind of thing, or you can just google it and find it somewhere, but it's really good. So check that out. You've got a few videos up there, actually, that you did that were really helpful.

We've done four so far.

Yeah, it was awesome. We'll kick off some more. I really learned a lot doing those, and I go back every now and then and review them, and they're quite useful. So for the practitioners out there who just want some good advice on how to structure things, it's worth hitting the Sqrrl website or, like you say, the SiliconANGLE YouTube page.

Okay, Adam, thanks very much for coming on. Jeff Kelly, thank you for hanging in there with me. This is the Cube. We're live from Big Data NYC in New York City. We'll be right back.