Live from Union Square, in the heart of San Francisco, it's theCUBE, covering Spark Summit 2016, brought to you by Databricks and IBM. Now here are your hosts, John Walls and George Gilbert.

Welcome back here on theCUBE. I'm John Walls along with George Gilbert as we continue our coverage of Spark Summit 2016 at the Hilton Hotel here in San Francisco, continuing with our guest, Yaron Haviv, founder and CEO of Iguazio, an Israeli-based company that's having a big day, which we'll get to in just a moment. Yaron, thank you for joining us. Welcome. Nice to have you here.

Yeah, nice to be here.

Iguazio. An unusual name, to say the least. So first off, before we get to what you do, what's the origin of the name?

Yeah, what's up with that?

So when you think big data, you think about streams, huge streams of water, high volumes. That's what the founding team thought about. So what is the biggest, nicest waterfall that exists? That's Iguazu Falls, on the border of Brazil and Argentina.

A giant waterfall in Brazil.

Yes, it's even nicer than Niagara Falls, and that was the inspiration. After we found the name, we actually learned that it means "big water" in Portuguese. So that's the "io" part: we're the big data thing, dealing with massive amounts of data and volume.

Interesting. All right, so for our viewers who are obviously not familiar with Iguazio, because you've been in stealth mode until today, tell us about your core focus.

Yeah, so first, a little about the team. We come from a rich enterprise background. One of my co-founders founded a company called XtremIO, which EMC acquired and is now sort of their main product. Others come from companies doing networking, security, and enterprise storage. So we have a very rich heritage.
I was at a company called Mellanox, selling to cloud providers, storage, and enterprise, dealing with all their data center architectures. What we're delivering is basically a new platform which unifies all those different use cases for data. And maybe we can talk about why we had to build those systems.

When you look today at applications in the new world, you see the shift from on-prem analytical applications to systems of engagement. The Ubers, the Capital Ones, Progressive: everyone is getting into systems that interact with users, with mobile, with social, so they feed a lot of data. So let's assume you need to build an Uber-like application, one that engages with mobile and social and everything; you need to feed a lot of data. What do you do? You go, and I think George referred to it as the zoo, to those 28 different projects. One of the biggest challenges in those projects is not necessarily the application; those are typically stateless. It's how you store the data. For every kind of data pattern that exists (streaming, key-value, files, objects), and even within those, different kinds for scale and performance, you have a different type of repository. And you have to use all those different tools: you store streams in Kafka, then you move the data to HDFS. Some people say, you know what, let's also store it in Cassandra or HBase. And maybe we need Elasticsearch because we may want to index it. So you end up with so many independent projects, each with its own high availability, its own security, its own version management. Some don't even work together. And let's assume you're an enterprise and now you want to build configuration management, or do an upgrade for this thing.
It's not like your traditional solution. So we're trying to address this challenge.

You know, it's funny, listening to how you describe it: the classic life cycle of a platform is to take a bunch of disjointed, partially complete sets of functionality and bring them together into a complete whole that is simple enough and comprehensive enough for others to build on. And I sort of joke that many of the open source projects in the Hadoop ecosystem are kind of like Noah's Ark, except they join not two by two but three by three, and we're in crying need of some simplification. As we've talked about before, Spark simplifies the compute area; all those gazillions of different execution engines now coalesce around Spark or get consumed by Spark. And there was a crying need to do something similar at the storage layer.

Yeah, exactly. Spark basically replaces the Mahouts and the Pigs and the Hives, all those different computation projects. And we're trying to consolidate all the data, the persistency layer: basically how you store all those files and streams and objects. What's really unique about our solution is that we don't just consolidate, we virtualize the services. That's why we call it virtualized data services. You can stream data into the system on one end, and you can add context. Take an IoT example: I have my sensor streaming data, but I also need to know the state, what's going on with this sensor. We can pull that from the state, which is more of a key-value structure, and push both into Spark as one DataFrame construct. So now you've avoided all those data copies and silos and complexity, and you get real time: within a millisecond from the moment an event arrives, you can already analyze it.
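The pattern Yaron describes, enriching each stream event with key-value state so the analytics layer sees one unified record instead of separate stream, file, and key-value silos, can be sketched in plain Python. The sensor names and fields here are illustrative, not Iguazio's actual API.

```python
# Minimal sketch: enrich streaming sensor events with key-value state
# so analysis sees one unified record per event, with no extra copies
# into separate stream/file/KV silos.

state_store = {  # key-value view: last known state per sensor
    "sensor-1": {"location": "pump-a", "status": "ok"},
    "sensor-2": {"location": "pump-b", "status": "degraded"},
}

def enrich(event):
    """Join one stream event with its sensor's current state."""
    state = state_store.get(event["sensor_id"], {})
    return {**event, **state}  # one unified "data frame row"

stream = [
    {"sensor_id": "sensor-1", "temp_c": 71.5},
    {"sensor_id": "sensor-2", "temp_c": 99.2},
]

# Each record now carries both the event and its state, ready for
# analysis the moment it arrives.
unified = [enrich(e) for e in stream]
```

In a real Spark pipeline the same idea appears as a stream-to-state join producing a single DataFrame, rather than the in-memory dictionaries used here.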
Have you followed the long-simmering controversy between the Lambda architecture and the Kappa architecture? Lambda said: we have the real-time, or near-real-time, path here and the batch path there, and forever they must stay separate, because you can't really combine them; the functionalities don't overlap enough. Whereas Kappa said: wait maybe a year or more and we can combine them. And what you're telling us now is an even richer combination. So if there was Lambda and Kappa, I can't remember the next Greek letter, but would you claim that next letter?

Yeah, I think so. I've been working a lot with the cloud providers, and I think they're also looking at those paradigms of how to consolidate more and more things. Why do we need a caching layer like Redis and a separate DynamoDB layer? Why not combine those two things? Why can't my key-value store have a cache in front of it? So I'm in total agreement. By the way, I have my own blog and I write a lot about those things, including Lambda. But this is the key challenge we're addressing. When we go to customers today, it sometimes takes them two years from the initial POC, where they start playing with those toys, until they get everything nailed down: high availability, security, versioning. And some enterprises just don't have the skills, because all the skills are serving the web companies. Where do they find the people who can master those 20-something projects to deliver this two-year project?

Have you benchmarked... well, I guess it's hard to benchmark since you're just coming out of stealth, but what kind of anecdotes have you heard in terms of the timeframe to go from POC to pilot to production? And have you worked with any design partners where you can benchmark the difference?

Yeah, so we're still not really announcing the product.
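For context on the Kappa idea George references: it treats one append-only log as the sole source of truth, so the "batch" view is just a replay of the same log the real-time path consumes, using the same code. A minimal sketch, with illustrative names:

```python
# Kappa-style sketch: one append-only log; any view (real-time or
# "batch") is produced by replaying the same log with the same code,
# instead of maintaining separate speed and batch layers (Lambda).

log = []  # the single source of truth

def append(event):
    log.append(event)

def materialize():
    """Rebuild a view by replaying the log; a full replay is the
    'batch' recompute, but it is the same code as the live path."""
    counts = {}
    for event in log:
        counts[event["user"]] = counts.get(event["user"], 0) + 1
    return counts

append({"user": "alice", "action": "click"})
append({"user": "bob", "action": "click"})
append({"user": "alice", "action": "buy"})

live_view = materialize()     # what the real-time path would serve
reprocessed = materialize()   # "batch" recompute is just a replay
```

The point of the sketch is that `live_view` and `reprocessed` are guaranteed to agree because there is only one code path, which is exactly the duplication Lambda's separate batch and speed layers forced on you.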
I don't want to get into the actual product, but the system is designed around two principles. One, it's an enterprise system, so everything is baked in. We come from the legacy of building XtremIO and XIV and enterprise products, so we know how to deliver that kind of experience: in-service upgrades, all those things. That's one end. But what's missing today in the enterprise is the notion of a service, just like Amazon. Why can't you have services? The application guy shouldn't have to go to the IT guy and send emails saying, go provision infrastructure. I want the application guy to provision his own stuff, and also to do application performance monitoring and provision quality of service and security, from more of an application point of view. This is what we're trying to do: simplify. Not only performance, and we can talk about what we do on performance; we redesigned the entire stack to deliver phenomenal performance. But we also believe usability is the biggest thing we need to address, not just complexity and performance.

Usability for an admin? In what way?

Usability in the sense that when someone wants to provision or create an application, we should streamline it. It needs to be the fewest clicks to push it, the fewest cables to plug, the fewest scripts to write. That's the notion I'm coming from. Think about iPhones and iPads: how do you create that experience? I wrote a blog post recently about Docker and DC/OS and Mesos: how do we create the apps experience for the enterprise? And a lot of it, if you think about mobile apps: when your phone breaks, you bring a new phone, put in your SIM card, and everything just works, all your apps. How did that happen?
It's because all the state is stored in those data services somewhere in the cloud. The enterprise needs a similar experience. In the future, we'll have Docker microservices delivering the apps; the apps can be Spark, Elasticsearch, other things. We need the storage to start behaving the same way.

Let me ask two questions in different directions. Engineering is about trade-offs; getting this wonderful integration has to come at a cost. What did you trade for that integration? That's question one. And two: is the purpose to make this the unified storage equivalent of Spark as the unified compute engine?

Yeah, so first, storage is changing. Storage used to be SAN and NAS. Show me one guy deploying an application in Amazon, or a software-as-a-service company, using SAN or NAS. Those are going away and need to be displaced by things that are stateless and elastic: object storage, key-value, streaming. This is how modern applications are designed. So we want to address unified storage for this space, but we don't stop there, because we push some of the applications, some of the analytics, into our platform. Now, to address this challenge we had to redesign the entire stack. Think about how storage is designed today, or how a Cassandra is designed today: Java-based code with all the Java issues, VMs, garbage collection, and then traditional file systems underneath. Those stacks were designed 20 years ago, when we had disks and systems had two or four CPUs. Now we need to totally evolve the stack to think about memory. We're going to have non-volatile memory next year with Intel 3D XPoint, and ReRAM technology. And flash is conquering storage.
Why do we need all those disk trade-offs and scheduling when we can introduce flash, at least as a caching tier? Why do we need all this serialization in the software when we have 30 or 40 CPU cores in the system? So what we had to do, bringing our experience from high-performance trading systems and deep security, is ask: how do we do that in real time? How do we produce millions of transactions per second on a single platform where current solutions do tens of thousands?

So are you saying that you used multi-core, the parallelism everyone's struggling with, and assigned the different layers to specific cores? You disaggregated the layers in the storage stack but kept them performant by assigning them to cores in a processor. That's the secret sauce?

Yeah, so we come from a rich background in real time. For example, one of the things we did at Mellanox was essentially invent Open vSwitch switching in hardware, and we did things for network function virtualization that produce 50 million transactions per second. Tell a storage guy today, can you do 50 million transactions per second? He's going to faint. Now, it's not that it's impossible; it's a different paradigm of how you write software. It's lock-free. You have to understand how the Intel cache behaves, how the network and the storage behave, and most of those zoo projects, as you call them, are still very high-level, Java and all that. So we basically had to write code that doesn't use anything from the operating system: we manage the memory, we manage the input, we manage everything, so we can produce latencies that are unimaginable. We don't announce performance and latency numbers, but believe me, when you hear what we can deliver, you'll be amazed. Because we're actually faster than block storage, the fastest all-flash arrays on the market.
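The shared-nothing, core-per-shard paradigm being discussed (each core exclusively owns one partition of the data, so no locks are ever needed) can be shown in miniature. This is a generic sketch of the technique, not Iguazio's implementation; real systems pin one thread per CPU core, while here the shards are plain dictionaries to show the routing logic.

```python
# Shared-nothing sketch: route each key to exactly one shard ("core"),
# so a shard's data is only ever touched by its owner and no locks
# are required. NUM_SHARDS stands in for the number of CPU cores.

NUM_SHARDS = 4

shards = [{} for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> int:
    """Deterministically map a key to its owning shard."""
    return hash(key) % NUM_SHARDS

def put(key: str, value):
    # Only the owning shard is touched; in a pinned-thread design
    # this write would be handled by that shard's dedicated core.
    shards[shard_for(key)][key] = value

def get(key: str):
    return shards[shard_for(key)].get(key)

put("sensor-1", 71.5)
put("sensor-2", 99.2)
```

Because ownership is exclusive, there is no cross-core contention to serialize: the trade-off is that a request must be routed to the right core, which is cheap compared to lock acquisition and cache-line bouncing.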
We're faster than that. And the challenge with some of those lower-level storage layers, like block or file, is that once you start stacking all the layers, bringing in the file system and the application layer, you get maybe 1% of your raw performance. What we wanted to do is create a stack so integrated, and so fully utilizing all those different capabilities, that we get 90% of the overall hardware capacity. That translates on one end to real real-time performance, and on the other end to huge savings in cost, because what we can do on a single x86 server is equivalent to about 20 different servers. So you save a bunch of power consumption, and all the people needed to manage it. It's a huge saving.

So, I don't want to get too far into the weeds, but if I'm understanding correctly, there's a layer of standard services you could leverage, but you'd pay a heavy tax.

Exactly.

And so what you did was write those services yourself, distributed among the cores so they run in parallel.

Exactly.

So you eliminated the tax, went parallel, and got a revolutionary performance breakthrough.

Yeah, so we combined things like layer-seven network processing, in-memory databases, and a modern storage stack into one platform. We do crazy things on security: deep packet inspection, analyzing content on the fly at 100 gigabits per second of throughput. Those sound like imaginary performance numbers. And all this requires understanding all the different layers, from the application all the way down to the infrastructure. With that experience, what we want to do eventually, and we're talking too much about the weeds and the technology, is build a platform that provides the Amazon experience to the enterprise. But we're more focused on what the enterprise really cares about.
You know, performance, security, ease of use, lifecycle management: simplifying it in a way that traditional enterprise guys can work with, not just scientists and people with extreme software experience. And we have a huge advantage on the technology, because we come from this very rich heritage of high-performance enterprise products, which gives us that differentiation and allows us to do all this abstraction and virtualization with zero penalty, actually resulting in faster performance. Over time we're trying to take more and more functionality from the applications into our platform. For example, we're working with ad exchange companies. They have very tough challenges of cross-correlation and things like that. We can do things in the caching layer internally that actually accelerate their applications. We had one use case where we cut queries that ran two days down to 15 minutes, by taking some of the computation, some of the state management, closer to the data, into a very, very optimized execution engine. So that's what we're trying to do.

You've got, I think, certainly the makings of a better mousetrap, and we wish you all the best with that. Good luck after today's announcement, and thanks for sharing that here on theCUBE.

I hope to be here again.

Very good. All right, thank you. I appreciate that.

Back with more from San Francisco in just a moment.