Hi everybody, this is Dave Vellante of Wikibon.org and we're here at Wikibon headquarters. This is theCUBE, and Adam Fuchs is here. Adam spent the better part of a decade at the NSA, and if you're an application designer and a big data practitioner, he's going to share with you some of the lessons that he learned there. Adam, welcome. Thanks Dave, good to be here. Yeah, so I've broken out some of my best handwriting to share with you some of these lessons learned, and I've got four of them today. We're going to talk about starting small but designing for scalability. We're going to talk about schemas and ontology development. We're going to talk about application building blocks and discovery analytics. And we're going to talk a little bit about data-centric security. So I'll just start out with a little discussion of adoption curves from an application developer's perspective. At the National Security Agency I worked on dozens of infrastructure components and applications, and I've seen a variety of approaches taken. In some of those, we would pick a huge set of requirements and try to throw them all into one application. That results in a huge amount of time spent building the application. Maybe you're designing for scalability from the start, but you're also trying to bite off way too much. So the adoption curve that you might see from that type of approach is a huge amount of time spent: you get to market at the end of that time, and then your growth follows from that point. A lot of people have shifted more towards a prototyping type of effort, where they might do operational prototypes. One of the concepts there is to try to get the application to market very quickly.
So here we've drawn a second curve, where we'll design the application for a small scale, pick off a very small set of application requirements, build the app for that, and it'll reach a certain point; then we'll redesign for scalability, grow from there, redesign for scalability again, and grow from there. And there are these tiers, and to cross each of them there's a large amount of remodeling effort. But we did get to market quickly, and we got some early insight on how that application might be useful. A third curve, which is one of the things that we're trying to support with Sqrrl, is to start small but design from the beginning for scalability. A lot of the Hadoop ecosystem and the components in it are really designed to give you a prototyping type of capability, where you can bring an application to market very quickly but it can also scale up. All of the elements that you throw into that original application design are designed for scalability from the ground up. And that's a much nicer curve. You don't have to take your system offline for a while to scale up to the next tier. Instead, you can just keep it running, add more boxes, scale horizontally, and bring in a lot of elasticity. So that's a very nice lesson learned, and a lot of the National Security Agency's applications are shifting towards that design paradigm. Another concept I want to explore today is data modeling. There are sort of two extremes in that space that we see a lot of application developers tending towards. One extreme is to use a very flat schema. This might be exemplified by people throwing data into HDFS and running MapReduce on top of it in its raw form. For some applications, that's okay. If the application uses data that really doesn't have a very complex schema associated with it, that'll get you a long way.
There's the other extreme, though: for applications that deal with data that has a lot of join points, a lot of complexity in the data, it's nice to have a much more highly modeled form of the data. So if we consider these two extremes from an application development perspective, there's a lot of complexity that shows up in the application when we're dealing with unstructured data, and a lot less complexity when we're dealing with a more structured approach. So we actually have a complexity curve that looks something like that. There's a flip side to this as well, though. One of the things that is nice about flat schemas is that you can bring data in very quickly and have it available for application development before you go through a large modeling process. So we have a second complexity curve, which is essentially the ETL, or data modeling, curve. You might think of this as the amount of time that it takes to model the data. Certainly if we're throwing it into flat files, we can just throw it in; it's very quick. There's some small structure that we can bring in with other techniques, but if we go to the complete ontology approach, it could be months or even years before we get a model that really handles all of the data that we have. And even then, as we bring in new data sets, we may have to remodel, we may have to totally reorganize that ontology to handle the new complexities that new data brings. So what we're trying to support with Sqrrl is a middle-of-the-road approach, where it's not totally flat, it's supporting some higher-level application concepts, but we can also bring in data very quickly. And this is more of an ELT approach than an ETL approach: we bring in the data quickly, understand it using the tools that we have for doing big data analysis, and then transform it later on as we understand it more. So that schema refinement cycle is iterative.
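To make that ELT-style refinement loop concrete, here is a minimal sketch of the "schema statistics" idea: load semi-structured records raw, then profile which fields appear and which types they take, so you learn where the schema needs refinement. This is an illustrative toy, not Sqrrl's or Accumulo's actual tooling or API.

```python
# Toy sketch of ELT-style schema statistics (not a real Sqrrl/Accumulo API).
# Records are loaded raw; profiling them tells us where to refine the schema.
from collections import Counter, defaultdict

def profile(records):
    """Count how often each field appears and which types it takes."""
    presence = Counter()
    types = defaultdict(Counter)
    for rec in records:
        for field, value in rec.items():
            presence[field] += 1
            types[field][type(value).__name__] += 1
    return presence, types

# Hypothetical raw records, as they might land before any transformation.
records = [
    {"ip": "10.0.0.1", "bytes": 512},
    {"ip": "10.0.0.2", "bytes": "512"},  # inconsistent type: flag for a transform
    {"ip": "10.0.0.3", "bytes": 2048, "user": "alice"},
]

presence, types = profile(records)
print(presence["user"])      # sparse field: appears in only some records
print(dict(types["bytes"]))  # mixed types signal that a bulk transform is needed
```

Statistics like these are what make the refinement cycle iterative: each pass over the raw data tells you which transforms to apply next, rather than requiring a complete model up front.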
And inside of Sqrrl and inside of Accumulo, we're building a lot of tools to support that, whether it's flexible schemas to bring data in, schema statistics to learn about the schema that actually exists in there, or bulk transformation tools that give us high throughput for transforming data. That iterative refinement cycle is very key to bringing applications to market quickly. A third concept that I want to explore is discovery analytics as application building blocks. The space where I see a lot of innovation, in government as well as in commercial settings, is in application development. You might think of an overall application for risk analysis or fraud detection or cybersecurity, whether it's intrusion detection or forensic analysis; those are all suites of applications, right? There's no single application for any of those use cases that covers everything. So the more applications we can develop, the faster we can innovate and the faster we can evolve to support a broader set of scenarios, applications, and use cases. In order to do that with good scalability and good security, it's really necessary to have a set of building blocks on top of which to build those applications, right? Nobody goes out and builds an application end to end, from the bytes that they put on disk all the way up through the visualization layer that's human-digestible. There are always layers in between. At Sqrrl, what we think is the right layer to build applications on top of is something we call discovery analytics. And this came out of years of effort building lots of different applications at NSA. Some of the things that show up in that space are things like universal search across structured and unstructured data, using query languages that people are familiar with, like Lucene's.
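To see why search makes a good reusable building block, here is a toy stand-in for it: a Lucene-style conjunctive query ("failed AND login") boils down to intersecting posting lists from an inverted index. This is a sketch for illustration only, not Sqrrl's real indexing layer; the documents and terms are made up.

```python
# Toy inverted index illustrating the universal-search building block
# (not Sqrrl's actual index; documents here are hypothetical log lines).
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "failed login from host a",
    2: "successful login from host b",
    3: "failed password reset",
}
index = build_index(docs)

def search_and(index, *terms):
    """AND query: docs containing every term (empty result for unseen terms)."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(search_and(index, "failed", "login"))  # only the doc matching both terms
```

Because an index like this can be built once, at scale, below the application layer, every application on top of it gets search without re-solving the scalability problem.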
Also in that space: basic statistics, aggregations that are parallelized across the cluster, and document structures, right? Building models over time, online, using hierarchical document structures and graph structures. Those things all fit into this discovery analytics layer. And if we can build those in a generic way, such that they're reusable by these higher-level applications, then we can figure out and solve the scalability and security problems inside of that discovery analytics layer, so that the cost of building applications on top of it is very small, or at least significantly smaller. So this layer down here is really where Sqrrl fits. Our product, Sqrrl Enterprise, encapsulates that whole layer, up through the database engine, up through the indexing and the organization of data, up through that discovery analytics layer, exposing things in the right API that's useful for application development. Across all of these concerns, coming from NSA, security is always a big concern. At NSA, all of the applications that we developed had a multi-level security requirement of some kind, right? That generally comes out of privacy policies or legal restrictions on how data can be used. And as you get into the big data, big application space, you really run across more and more of those types of restrictions on how data can be used. Some of them come from Sarbanes-Oxley or HIPAA restrictions on data usage. Some of them come from internal privacy policies or information sharing between different organizations. We're seeing more and more of those data restrictions coming into place, not only in the government space but also in commercial spaces. The traditional approach to dealing with complex data restrictions is to put some business logic into the application space. And that's bad for a couple of reasons. One is that it complicates the application.
So bringing an application to market with all of the security concerns built into it is a longer process. It may take you months instead of hours to build that application with all those security concerns. Not only that, but it's a point of vulnerability: the more times you implement complex security requirements, the more times you're going to get them wrong. So if we can push that down into the infrastructure layer and get it right in that one place, we've increased our security, and we've increased innovation through cheaper application development. The way we do that inside of Accumulo and inside of Sqrrl is through a concept called data-centric security. There, data carries around with it some aspect of provenance, which, when coupled with a set of security policies, defines how it can be used. Sqrrl and Accumulo implement that through a concept called cell-level security. Inside of Sqrrl and inside of Accumulo, all of those higher-level concepts like structured documents, graphs, and indexes boil down to key-value pairs, and each key-value pair is tagged with a visibility label. That data-centric security concept allows us to separate the modeling of security from the modeling of the application. So exposed at that discovery analytics layer are a series of methods that are used to model the applications, and each of them has that security built in from the core, right? That drastically decreases the cost of building security and drastically increases the efficacy of that security. So those are the four basic lessons learned over my decade of experience at NSA, right? So we follow through here: start small but design for scale, iterative schema refinement, discovery analytics as big app building blocks, and data-centric security throughout. Excellent, Adam, thanks very much.
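The cell-level idea described above can be sketched in a few lines: each key-value pair carries a visibility expression, and a read returns the cell only if the reader's authorizations satisfy it. This is a deliberately simplified illustration, not Accumulo's real ColumnVisibility parser; it handles only flat expressions like "a&b" or "a&(b|c)", and the cells and authorizations shown are hypothetical.

```python
# Simplified sketch of Accumulo-style cell-level security.
# NOT the real ColumnVisibility implementation: no nested expressions.

def visible(expression, authorizations):
    """True if the reader's authorizations satisfy the visibility expression."""
    if not expression:                           # unlabeled cells: visible to all
        return True
    for clause in expression.split("&"):         # every &-clause must pass
        options = clause.strip("()").split("|")  # any |-option within it suffices
        if not any(opt in authorizations for opt in options):
            return False
    return True

# Hypothetical cells: (row, column, value, visibility label).
cells = [
    ("row1", "name",   "alice",       ""),
    ("row1", "ssn",    "123-45-6789", "pii&(hr|audit)"),
    ("row1", "salary", "100000",      "hr"),
]

reader_auths = {"pii", "audit"}
for row, col, value, vis in cells:
    if visible(vis, reader_auths):
        print(row, col, value)  # filtering happens in the data layer, not the app
```

Because the filter runs below every query, each application built on the discovery analytics layer inherits the same security model instead of re-implementing it.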
I really appreciate you sharing your deep knowledge with the Wikibon and SiliconANGLE communities. So if you want more information on this and other advice, go to Sqrrl's website, that's sqrrl.com. And also check out youtube.com/SiliconANGLE for other videos like this. Go to siliconangle.com for all the news, and check out wikibon.org for all the research. Thanks for watching, everybody. We'll see you next time. All right, thanks, Dave.