Live from San Jose, California, it's theCUBE, covering Big Data Silicon Valley 2017. Live Cube coverage of Big Data Silicon Valley, or Big Data SV, hashtag BigDataSV, in conjunction with Strata Hadoop. I'm John Furrier with theCUBE and my co-host George Gilbert, analyst at Wikibon. I'm excited to have our next guest, Yaron Haviv, who's the founder and CTO of Iguazio. Just wrote a post up on SiliconANGLE, check it out. Welcome to theCUBE. Thanks, John. Great to see you. You've been a guest blogger this week on SiliconANGLE and you're always great on Twitter, because Dave always likes to bring you into the contentious conversations. I like the controversial ones. And you add a lot of good color on that, so let's just get right into it. So your company is doing some really innovative things. We were just talking before we came on camera here about some of the amazing performance improvements you guys have on many different levels. But first, take a step back and let's talk about what this continuous analytics platform is, because it's unique, it's different, and it's got impact. Take a minute to explain. Sure, so first a few words on Iguazio. We're developing a data platform which is unified, so basically you can ingest data through many different APIs, and it's more like a cloud service. It is for on-prem and edge locations and co-location, but it's managed more like a cloud platform, so a very similar experience to Amazon. It's software. It's software. We do integrate a lot with hardware in order to achieve our performance, which is really about 10 to 100 times faster than what exists today. But we've talked to a lot of customers, and what we really want to focus on with customers is solving business problems, because I think a lot of the Hadoop camps started with solving IT problems. So IT goes kicking tires and eventually failing, based on your statistics and Gartner's statistics. So what we really wanted to solve is big business problems.
And what we figured out is that this notion of pipeline architecture, where you ingest data and then curate it and fix it, et cetera, was very good for the early days of Hadoop. If you think about how Hadoop started, it was page ranking at Google. There was no time sensitivity. You could take days to calculate it and recalibrate your search engine. Based on your research, everyone is now looking for real-time insights. So there is sensor data from cars, there is stock data from exchanges, there is fraud data from banks, and you need to act very quickly. So this notion of, and I can give you examples from customers, this notion of taking data, creating Parquet files and log files and storing them in S3, then taking Redshift and analyzing them, and then maybe a few hours later having an insight, this is not going to work. And what you need to fix is, you have to put some structure into the data, because if you need to update a single record, you cannot just create a huge file of 10 gigabytes and then analyze it. So what we did is basically a mechanism where you ingest data, and as you ingest the data, you can run multiple different processes on the same thing, and you can also serve the data immediately, okay? And a few examples we demonstrate here at the show. One is video surveillance, a very nice movie-style example where you basically ingest pictures through our S3 API, our object API. You analyze the picture to detect faces, to detect scenery, to extract geolocation from pictures and all that, all those through different processes, TensorFlow doing one, serverless functions that we have doing other, simpler tasks. And at the same time, you can have dashboards that just show everything, and you can have Spark that basically does a query. So where was this guy last seen, or who was he with, think about the Boston Bombing example, you could just do it in real time, because you don't need this notion of a pipeline.
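To make the "no pipeline" idea concrete, here is a minimal Python sketch of the pattern described above: several analyzers update the same record as data arrives, and queries can be served immediately, with no batch stage in between. All names and the in-memory store are illustrative stand-ins, not Iguazio's actual API.

```python
# Hypothetical sketch of continuous analytics without a pipeline:
# analyzers update different fields of the same record on ingest,
# and queries run over the same data right away.

records = {}  # in-memory stand-in for the multi-model store

def ingest(key, payload):
    # store the raw payload as soon as it arrives
    records.setdefault(key, {})["raw"] = payload

def detect_faces(key):
    # stand-in for a TensorFlow job that updates only its own field
    records[key]["faces"] = ["person_1"]

def extract_geo(key):
    # stand-in for a serverless function updating only the geo field
    records[key]["geo"] = (37.33, -121.89)

def query(predicate):
    # stand-in for an immediate Spark-style query over the same data
    return [k for k, r in records.items() if predicate(r)]

ingest("img001", b"...jpeg bytes...")
detect_faces("img001")   # these run concurrently in the real system
extract_geo("img001")

# "where was this person last seen?" can be answered immediately
print(query(lambda r: "person_1" in r.get("faces", [])))  # ['img001']
```

The point of the sketch is that ingest, analysis, and serving all touch one shared record, rather than handing files from stage to stage.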
And this solves very hard business problems for some of the customers we work with. So that's the key innovation, there's no pipelining. No pipelining. And what's the secret sauce? So first, our system does about a couple of million transactions per second, and we are a multi-model database. So basically you can ingest data as a stream, and exactly the same data could be read by Spark as a table, so you can issue a query on the same data, give me everything that has a certain pattern or something, and it could also be served immediately through RESTful APIs to a dashboard running AngularJS or something like that. So that's the secret sauce: by having this integration and this unique data model, it allows all those things to work together. There are other aspects, like we have transactional semantics. One of the challenges is, how do you make sure that a bunch of processes don't collide when they update the same data? So first, you need very low granularity, because each one may update a different field. Like the example I gave with geo data, the serverless function that does the geo data extraction only updates the geo data fields within the record. And maybe TensorFlow updates information about the image in a different location in the record, or in potentially different records. So you have to have that, along with transaction safety, along with security. We have very tight security at the field level, at the identity level. So that's rethinking the entire architecture. And I think what many of the companies you'll see at the show say is, okay, Hadoop is a given, let's build some sort of convenience tools around it, let's do some scripting, let's do automation, but sort of the underlying thing, I won't use dirty words, is not well equipped for the new challenges of real time. And we basically restructured everything. We took the notions of cloud-native architectures, we took the notions of flash and the latest flash technologies, a lot of parallelism on CPUs.
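The field-granularity update idea above can be sketched in a few lines: two concurrent writers touch different fields of the same record, and neither clobbers the other's work. This is a simplified, hypothetical illustration using a per-record lock, not a description of Iguazio's actual transaction mechanism.

```python
import threading

# Hedged sketch of field-level updates with transactional safety:
# a geo extractor and an image analyzer each update only their own
# field of a shared record, coordinated by a per-record lock.

record = {"geo": None, "labels": None}
lock = threading.Lock()

def update_field(field, value):
    # update exactly one field under the lock, so concurrent writers
    # to *different* fields never overwrite each other
    with lock:
        record[field] = value

t1 = threading.Thread(target=update_field, args=("geo", (37.33, -121.89)))
t2 = threading.Thread(target=update_field, args=("labels", ["face"]))
t1.start(); t2.start()
t1.join(); t2.join()

print(record)  # both fields are set, neither update was lost
```

A real system would use finer-grained, server-side atomic field updates rather than one coarse lock, but the observable guarantee is the same.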
We didn't take anything for granted in the underlying architecture. So when you founded the company, just take a personal story here, what was the itch you were scratching? Why did you get into this? Clearly, you have a huge tech advantage, and we'll double down on the research piece, George will have some questions. But what got you going with the company? You've got a unique approach. People would love to do away with the pipeline. It sounds great. And the performance, you said, what, 100X? So how did you get here? Tell the story. So if you know my background, I ran all the data center activities at Mellanox, and you know Mellanox, I know Kevin was here. And my role was to take Mellanox technology, which is 100-gig networking and silicon, and fit it into the different applications. So I worked with SAP HANA, I worked with Teradata, I worked on Oracle Exadata, I worked with all the cloud service providers on building their own object storage and NoSQL and other solutions. I also owned all the open source activities around Hadoop and Ceph and all those projects. And my role was to fix many of those, or to come to a customer who says, I don't need 100 gig, it's too fast for me, and convince him that, yes, I can open up all the bottlenecks all the way up your stack so you can leverage those new technologies. Through that, we basically saw the inefficiencies in those stacks. So you had a good purview of the marketplace. You had open source on one hand, and then all the storage vendors, networking, all the database players, and all the cloud service providers were my customers. So you're at a very unique point where you see the trajectory of cloud doing things totally differently. And sometimes I see the trajectory of enterprise storage and all-flash and all those legacy technologies, where the cloud providers are all about object, key-value, NoSQL, you know.
And you're trying to convince those guys that maybe they were going the wrong way, but it's pretty hard. But how are they going the wrong way? I think they are going the wrong way. Everyone, for example, is running to do NVMe over Fabrics now. That's the new fashion, okay? I did the first implementation of NVMe over Fabrics in my team at Mellanox, and I really loved it at the time, but databases cannot work on top of shared storage or networks, because there are serialization problems. Okay, if you use shared storage or a network, that means every node in the cluster has to go and serialize an operation against the shared media. And that's not how Google and Amazon work. There are a lot more databases out there too, and a lot more data sources, you've got the edge. Yeah, but all the new databases, all the modern databases, basically shard the data across the different nodes, so there are no serialization problems. That's why Oracle doesn't scale, or scales to 10 nodes at best, with a lot of RDMA as a backplane to allow that, and that's why Amazon can scale to 1,000 nodes, or Google Spanner. And that's the horizontally scalable piece that's happening. Yeah, because basically the data distribution has to move into the higher layers of the stack, and not the lower layers of the stack. And that's really the trajectory where the traditional legacy storage and system vendors are going wrong. And we sort of followed the way the cloud guys went, just with our knowledge of the infrastructure, we sort of did it better than the cloud guys did, because the cloud guys focus more on the higher levels of the implementation, the algorithms, the Paxos and all that, and their implementation is not that efficient. And we did both sides extremely efficiently. How about the edge? Because edge is now part of cloud. And you've got cloud, it's got the compute, all the benefits you were saying. And still they have their own consumption opportunities and challenges, like everyone else does.
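The sharding-versus-serialization point is easy to show in miniature: if each key deterministically maps to one owning node, every client computes the owner independently and no node ever serializes an operation against shared media. This is a generic illustration of hash sharding, not any vendor's implementation.

```python
import hashlib

# Minimal sketch of hash sharding: a key's owner is a pure function
# of the key, so there is no shared media to serialize against.

NODES = ["node-a", "node-b", "node-c"]

def owner(key: str) -> str:
    # hash the key and map it onto one of the nodes
    digest = hashlib.sha256(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# any client, anywhere, computes the same owner with no coordination
assert owner("user:42") == owner("user:42")
print(owner("user:42"))
```

Production systems refine this with consistent hashing or range partitioning so that adding a node doesn't remap most keys, but the core idea, ownership decided by the data layer rather than by a shared disk, is the same.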
But edge is now exploding. The combination of those things coming together, at the intersection of that is deep learning, machine learning, which is powering the AI hype. So how is the edge factoring into your plan and your architecture for the cloud? So I wrote a bunch of posts that are not published yet about the edge. But my analysis, along with your analysis and Peter Levine's analysis, is that cloud has to sort of start distributing more, because if you're looking at the trends, 5G, Wi-Fi and wireless networking are going to be gigabit traffic. Gigabit to the home, driven by Google, 70 bucks a month, is going to push a lot more bandwidth at the edge. At the same time, cloud providers, in order to lower costs and deal with energy problems, are going to rural areas. And the traditional way we solved cloud problems was to put up CDNs. So every time you download a picture or a video, you go to a CDN. When you go to Netflix, you don't really go to Amazon, you go to a Netflix pop, one of 250 locations. But the new workloads are different, because they're no longer pictures that need to be cached. First, there is a lot of data going up: sensor data, uploaded files, et cetera. Data is becoming a lot more structured. Sensor data is structured. All this car information will be structured. And you want to basically digest or summarize the data, so you need technologies like machine learning and all those things. So you need something which is like a CDN, just a mini version of the cloud, that sits somewhere in between the edge and the cloud. And this is our approach. And now, because we can sort of shrink-wrap a mini cloud, a mini Amazon, in a way more dense approach, this is a play that we're going to take. And we have a very good partnership with Equinix, which is distributed across 170-something locations. We have a very good relationship. So you're essentially going to disrupt the CDN.
And something that I've been writing about and tweeting about: CDNs were based on the old Yahoo days, caching images, as you mentioned. That's, you know, give me 1999 back, please. That's old school by today's standards. So it's a whole new architecture because of how things are stored. You have to be a lot more distributed. What is the architecture? So in our innovation, we actually have three main innovations. One is sort of on the lower layers of what we discussed. The other one is sort of the security layer, where we classify everything at layer seven, at 100-gig traffic rates. And the third one is all this notion of a distributed system. We can actually run multiple systems in multiple locations and manage them as one logical entity through high-level semantics, high-level policies. Okay, so when we take theCUBE global, we're going to have you guys on every pop. No, this is a legit question. No, it's going to take time for us. You know, we're not going to do everything in one day, and we're starting with local problems. Well, this is digital transformation. So stay with us for a second, stay with this scenario. So video like Netflix is pretty much one-dimensional video, and so there are certainly new CDNs, but when you start thinking in different content types, so I'm going to have a video with maybe a CGI overlay, or social graph data coming in from tweets at the same time, with Instagram pictures, I might be accessing multiple data sources everywhere to watch a movie or something. That would require beyond-a-CDN thinking. And you have to run continuous analytics, because you cannot afford batch. You cannot afford a pipeline, because you ingest picture data, you may need to add some subtext to the data and feed it directly to the consumer. So you have to move to those two elements: moving more stuff into the edge, and running continuous analytics versus batch and pipeline.
So you think, based on that scenario I just laid out, that there's going to be an opportunity for somebody to take over the media landscape for sure. Yeah, I think if you're also looking at the statistics, I've seen a nice article, I told George about it, analyzing the Intel chip distribution. And what you see is that there is 30% growth on Intel chips going into cloud, which is faster than what most analysts anticipate in terms of cloud growth. That actually means that cloud is going to cannibalize enterprise faster than most think. Enterprise is shrinking about 7%, and there is another place which is growing: telcos. It's not growing like cloud, but part of it is because of this move towards the edge and the move of telcos buying white boxes. And 5G and access over the top too. Yeah, but those are server chips. So basically there's going to be more and more computation in the different telecommunications locations. And yeah, this is an opportunity that we can capitalize on if we run fast enough. It sounds as though, because you've implemented these sort of industry-standard APIs that come largely from the open source ecosystem, that you can propagate to those two areas of the network in a way that the vendors who are behind those APIs can't necessarily do, into the telcos, towards the edge. And I assume part of that is because of the density and the simplicity, so that essentially your footprint's smaller in terms of hardware and the operational simplicity is greater. Is that a fair assessment? Yes, and also we support a lot of Amazon-compatible APIs, which are RESTful, typically HTTP-based, very convenient to work with in a cloud environment. Another thing is, because we're sort of taking all the state onto ourselves, the different forms of state, whether it's a message queue or a table or an object, et cetera, that makes the computation layer very simple.
So one of the things that we're also demonstrating is the integration we have with Kubernetes, which basically now simplifies Kubernetes, because you don't have to build all those different data services for cloud-native infrastructure. You just run Kubernetes; we're the volume driver, we're the database, we're the message queues, we're everything underneath Kubernetes, and then you just run Spark or TensorFlow or a serverless function as a Kubernetes microservice. And that allows you to elastically increase the number of Spark jobs that you need, or maybe you have another tenant, you just spawn a Spark job. YARN has some of those attributes, but YARN is very limited, very confined to the Hadoop ecosystem. TensorFlow is not a Hadoop player, and a bunch of those new tools are not Hadoop players, and everyone is now adopting a new way of doing streaming, and they just call it serverless. Serverless and streaming are very similar technologies. The advantage of serverless is all this prepackaging and all this automation of CI/CD, continuous integration and continuous delivery. So, in order to simplify the developer and operations aspects, we're trying to integrate more and more with the cloud-native approach around CI/CD and integration with Kubernetes and cloud-native technologies. Would it be fair to say that, from a developer or admin point of view, you're pushing out from the cloud towards the edge faster than the existing implementations, you know, say the Apache ecosystem or the AWS ecosystem? AWS has something on the edge, I forget whether it's Snowball or Greengrass or whatever, where they at least get the Lambda function out there. They're for real, by the way, and it's interesting to see. One of the things they did, they allowed Lambda functions in their CDNs, which is sort of going in the direction I mentioned, just with very minimal functionality.
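The claim that serverless and streaming are very similar technologies can be shown with a toy dispatcher: a function registered on a stream is invoked once per event, exactly the shape of a Lambda-style handler. This is a purely illustrative sketch; the names are invented, and a real platform would handle batching, retries, and scaling.

```python
# Hedged sketch of the serverless/streaming similarity: handlers are
# registered on a stream and invoked per event, like Lambda functions.

handlers = []

def on_event(fn):
    # decorator that registers fn as a per-event handler
    handlers.append(fn)
    return fn

def publish(event):
    # delivering a stream event == invoking every registered function
    for fn in handlers:
        fn(event)

seen = []

@on_event
def enrich(event):
    # a tiny "serverless function": enrich the event as it flows by
    seen.append({**event, "enriched": True})

publish({"id": 1})
print(seen)  # [{'id': 1, 'enriched': True}]
```

Whether you call `enrich` a stream processor or a serverless function is mostly packaging, which is the point being made: the serverless value-add is the prepackaging and CI/CD automation around the same event-driven core.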
Another thing is they have those boxes where they have a single VM, and they can run Lambda functions as well. But I think their ability to run computation is very limited, and also their focus is on shipping the boxes through the mail, and we want to be always connected. Yaron, final question for you, just to get your thoughts, great statements, by the way, very informative, and we should do a follow-up on Skype in our studio for the Silicon Valley Friday show. But, I mean, Google Next was interesting. They're serious about the enterprise, but you can see that they're not yet there. What is the enterprise readiness from your perspective? Because Google has the tech, and they try to flaunt the tech: we're great, we're Google, look at us, therefore you should buy us. It's not that easy in the enterprise. How would you size up the different players? Because they're not all like Amazon, although Amazon is winning. You've got Amazon, Azure, and Google. Your thoughts on the cloud players? Yeah, so the way we attack the enterprise, we don't attack it from an enterprise perspective or an IT perspective, we attack it from a business use case perspective, especially because we're small and we have to run fast. So you need to identify a real, critical business problem. We're working with stock exchanges, and they have a lot of issues around monitoring the daily trade activities in real time, okay? And if you compare what we do with them, on this continuous analytics notion, to how they work with Excels and sort of Hadoops, it's totally different, and now they can do things which are way different. And I think that, for those customers, if Google wants to succeed against Amazon, they have to find a way to approach those business owners and say, here's a problem, Mr. Customer, here's a business challenge, here's what I'm going to solve.
If they're just going to say, you know what, my VMs are cheaper than Amazon's, it's not going to be a huge deal. Yeah, also they're doing the whole thing they're calling lift and shift, which is code for rip and replace in the enterprise. So that's essentially, I guess, a good opportunity if you can get people to do that, but not everyone's ripping and replacing and lifting and shifting. But a lot of Google's advantages are around areas of AI and things like that. So they should try and leverage that. If you think about Amazon's approach to AI, they sort of funded a university to build the project and then sort of said, it's ours, where Google created TensorFlow and created a lot of other IP, and Dataflow and all those solutions, and contributed them to the community. So I really love Google's approach of contributing Kubernetes, contributing TensorFlow. This way, they're sort of planting the seeds. So the new generation that is going to work with Kubernetes and TensorFlow will say, you know what, why would I mess with this thing on-prem? I'll just go. Right to the cloud. Do multi-cloud. Right to the cloud. But I think a lot of the criticism about Google is that they're too sort of research-oriented, they don't know how to monetize and approach the enterprise. Yeah, enterprise is a whole different drumbeat. And I think that's my only complaint with them: they've got to get that knowledge, or buy companies. A quick final point on Spanner, any analysis of Spanner, which went pretty quickly from paper to product? So before we started Iguazio, I studied Spanner quite a bit, all the publications that were there and all the other things. Spanner has an underlying layer called Colossus, and our data layer is very similar to how Colossus works. So we're very familiar. We took a lot of concepts from Spanner. And you like Spanner, it's legit. Yes, again, I think we haven't copied, we borrowed some good practices.
Well, I think I studied about 300 research papers before we did the architecture. Wow. But we basically took the best of each one of them, because there are still a lot of issues. Most of those technologies, by the way, are designed for mechanical disks, and we can talk about that another time. And you have flash. All right, Yaron, we've gone over time here, great segment. We're here live in Silicon Valley, breaking it down, getting under the hood, looking at 10X, 100X performance advantages. Keep an eye on Iguazio, it looks like they've got some great product. Check them out. This is theCUBE, I'm John Furrier with George Gilbert. We'll be back with more after this short break. Thanks.