Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks. You are watching the Spark Summit 2017 coverage by theCUBE. I'm your host, David Goad, joined by George Gilbert. How you doing, George? Good to be here. And honored to introduce our next guest, the CTO of SnappyData. Wow, we were lucky to get this guy. Thanks for having me. Jags Ramnarayan. Jags, thanks for joining us. Thanks, thanks for having me. Hey, for people who may not be familiar, maybe tell us, what does SnappyData do? So SnappyData in a nutshell is taking Spark, which is a compute engine, and in some sense augmenting the guts of Spark so that Spark truly becomes a hybrid database. A single data store that's capable of taking Spark streams, doing transactions, providing mutable state management in Spark, but most importantly, being able to turn around and run analytical queries on that state that is continuously emerging. That's it in a nutshell. Let me just say a few things. SnappyData itself is a startup that was spun out of Pivotal. We've been out of Pivotal for roughly about a year. So the technology itself was, to a great degree, incubated within Pivotal. It's a product called GemFire within VMware and Pivotal. So we took the guts of GemFire, which is an in-memory database designed for transactional, low-latency, high-concurrency scenarios, and we are sort of fusing it, that's the key thing, fusing it into Spark, so that now Spark becomes significantly richer as not just a compute platform, but as a store. Great, and we know this is not your first Spark Summit, right? How many have you been to? Boy, let's see, three, four now. If I include Spark Summit East this year, four to five. Great, so an active part of the community. What were you expecting to learn this year, and have you been surprised by anything?
You know, it's always wonderful to see, I mean, every time I come to Spark Summit, it's just a new set of innovations, right? I mean, when I first came to Spark, it was a mix of, you know, let's talk about data frames, RDDs, you know, let's optimize MapReduce. Today you come, and there is such a wide spectrum of amazing new things happening. It's just mind-boggling. I mean, you know, right from AI techniques to structured streaming and the real-time paradigm, and sort of this confluence. That's what Databricks brings about, sort of, how can I create a confluence through a unified mechanism, right? Which is really brilliant, that's what I think. Okay, well, let's talk about how you're innovating at SnappyData. What are some of the applications or current projects you're working on? So a number of things. I mean, GE is an investor in SnappyData, so we're trying to work with GE in the industrial IoT space. We're working with large healthcare companies also in that IoT space. So the pattern with SnappyData is one that says, there are a lot of high-velocity streams of data emerging, where the streams could be, for instance, Kafka streams driving Spark streams. But streams could also be operational databases, your Postgres instance, your Cassandra database instance, and they're all generating continuous changes to data in an operational world. Can I suck that in and almost create a replica of the state that is emerging in these so-called operational environments, and still allow interactive analytics, interactive analytics at scale, for a number of concurrent users on live data? Not cube data, not pre-aggregated data, but live data itself. Being able to almost give you Google-like speed on live data. George, we've heard people talking about this quite a bit. Yeah.
So, Jags, as you said up front, Spark was conceived as a sort of general-purpose analytic compute engine, and adding a DBMS to it, sort of not bolting it on but deeply integrating it so that the core data structures now have DBMS properties, like transactionality, that must make a huge change in the scope of applicable applications. Can you describe some of those for us? Yeah, I mean, the classic pattern that we find time and again today is the so-called SMACK stack, right? I mean, there was the Lambda architecture, now there's the SMACK stack, which is really about Spark running on Mesos, but you're really using Spark Streaming as an ingestion capability. And there is continuous state emerging that I want to write and store into Cassandra, right? So what we find very quickly is that the moment that state is emerging, I want to throw a business intelligence tool on top and immediately do live dashboarding on that state that is continuously changing and emerging, right? So what we find is that the first part, the high-speed ingest, the ability to transform these data sets, cleanse the data sets, get the cleansed data into Cassandra, works really well. What is missing is this ability to say, how am I going to get insight? How can I ask interesting, insightful questions and get responses immediately on that live data, right? And so the common problem there is, the moment I have Cassandra working, let's say with Spark, every time I run an analytical query, you have only a couple of choices. One is to use the parallel connector to pull in the data sets from Cassandra, right? And unfortunately, when you do analytics, you are working with large volumes, and every time I run even a simple query, all of a sudden I could be pulling 10 gigabytes, 20 gigabytes of data into Spark to run the computation, hundreds of seconds lost. It's nothing like interactive, it's all batch querying.
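The alternative Jags is pointing toward, keeping a continuously updated in-memory replica whose aggregates are maintained incrementally, instead of re-pulling gigabytes from the operational store on every query, can be sketched in miniature. This is an illustrative toy, not SnappyData's implementation; the table layout, `region`/`amount` columns, and change-stream shape are all invented for the example:

```python
from collections import defaultdict

class LiveReplica:
    """Toy in-memory replica of an operational table, kept current by a
    change stream, so analytical queries never re-pull the full data set."""

    def __init__(self):
        self.rows = {}                    # primary key -> row dict
        self.totals = defaultdict(float)  # running aggregate per group

    def apply_change(self, op, key, row=None):
        # Incrementally maintain both the row store and the aggregate,
        # instead of re-scanning the source store per query.
        if op in ("insert", "update"):
            old = self.rows.get(key)
            if old:
                self.totals[old["region"]] -= old["amount"]
            self.rows[key] = row
            self.totals[row["region"]] += row["amount"]
        elif op == "delete":
            old = self.rows.pop(key, None)
            if old:
                self.totals[old["region"]] -= old["amount"]

    def total_by_region(self, region):
        # O(1) answer on live data -- no bulk pull, no batch scan.
        return self.totals[region]

replica = LiveReplica()
replica.apply_change("insert", 1, {"region": "west", "amount": 10.0})
replica.apply_change("insert", 2, {"region": "west", "amount": 5.0})
replica.apply_change("update", 1, {"region": "west", "amount": 12.0})
print(replica.total_by_region("west"))  # 17.0
```

The point of the toy is the shape of the trade: each change costs a constant amount of bookkeeping at ingest time, and in exchange the analytical question is answered from state that is already live, rather than by shipping the data set to the query.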
So how can I turn around and say that if stuff changes in Cassandra, I can have an immediate real-time reflection of that mutable state in Spark, on which I can run queries rapidly, right? That's a very key aspect for us. So you were telling me earlier that you didn't necessarily see the need to entirely replace Cassandra in the SMACK stack, but to complement it. That's right, yeah. Elaborate on that. So our focus, much like Spark's, is all about in-memory state management, in-memory processing, right? And Cassandra realistically is designed to say, how can I scale to a petabyte, right? For key-value operations, semi-structured data, what have you. So we think there are a number of scenarios where you still want Cassandra to be your store, because in some sense a lot of these guys have already adopted Cassandra in a fairly big way. So you want to say, hey, leave your petabyte-level volume there, and you can essentially work with the real-time state, which could still be many terabytes of state, essentially in main memory. That's what we're specializing in. And I can touch on the approximate query processing technology also, which is the other key part here, to say, hey, I can't really afford a thousand cores on a thousand machines just so that you can do your job really well. So one of the techniques we are adopting, which even the Databricks guys did with BlinkDB, is an approximate query processing engine. We have our own approximate query processing engine as an adjunct to our store.
What that essentially means is, can I take a billion records and synthesize something very, very small, using smart sampling techniques, sketching techniques, essentially statistical data structures, that can be stored along with Spark, in Spark memory itself, and fused with Spark's Catalyst query engine, so that when you run your query, we can very smartly figure out, can I use the approximate data structures to answer the question extremely quickly? Even when the data could be petabyte volume, I have these data structures that are now taking maybe gigabytes of storage only, right? So hopefully not getting too, too technical. So the Spark Catalyst query optimizer, like an Oracle query optimizer, knows about the data that it's going to query, only in your case, you're taking what Catalyst knows about Spark and extending it with what's stored in your own, also Spark-native, data structures. Exactly. So think about an optimizer that takes a query and says, here are all the possible plans you can execute, and here is a cost estimate for those plans. We essentially inject more plans into that, and hopefully our plan is even more optimized than the plans the Spark Catalyst engine came up with. And Spark is beautiful because the Catalyst engine is a very pluggable engine, so you can augment that engine very easily. So you've been out in the marketplace, whether in alpha, beta, or now production, for enough time that the community is aware of what you've done. What are some of the areas you're being pulled into that people didn't associate Spark with? Yeah, so often we land in situations where they're looking at SAP HANA, as an example, maybe MemSQL, maybe just Postgres, and all of a sudden there are these hybrid workloads, which is the Gartner term, HTAP. So there are a lot of HTAP-like use cases we get pulled into.
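One of the simplest techniques in the sampling family Jags mentions is reservoir sampling: keep a fixed-size uniform sample of a stream in a single pass, then scale sample aggregates up to estimate the exact answer. A minimal sketch, purely illustrative and not SnappyData's or BlinkDB's actual engine:

```python
import random

def reservoir_sample(stream, k, rng):
    """Algorithm R: keep a uniform random sample of k items from a
    stream of unknown length, using a single pass and O(k) memory."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            # Item i survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = x
    return sample

def approx_sum(sample, n):
    # Scale the sample total by the inverse sampling fraction n / k.
    return sum(sample) * n / len(sample)

rng = random.Random(42)
data = [rng.uniform(0, 100) for _ in range(200_000)]  # stand-in for a big table
sample = reservoir_sample(data, 5_000, rng)

exact = sum(data)
est = approx_sum(sample, len(data))
print(abs(est - exact) / exact)  # small relative error from a 2.5% sample
```

The appeal is exactly the trade described in the interview: the sample is a tiny fraction of the data, so the "query" over it is orders of magnitude cheaper, and for aggregates the statistical error shrinks with the sample size rather than the data size.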
So there are non-Spark situations we get pulled into, because we're just a hybrid database, that's how people look at us. Oh, so you pull Spark in because that's just part of your solution. Exactly, right? So think about it: Spark is not just data frames and the rich API, it also has a SQL interface, right? I can essentially execute SQL, a select. Of course, we now augment that SQL so that we can do what you would expect from a database: an insert, an update, a delete, can I create a view, can I run a transaction? So all of a sudden it's not just a Spark API, but what we provide looks like a SQL database itself, right? Okay, interesting. So tell us about the work with GE. You know, they're among the first that have sort of educated the world that, in that world, there's so much data coming off devices that we have to be intelligent about what we filter and send to the cloud. We train models potentially up there. We run them closer to the edge so that we get low-latency analytics. But you were telling us earlier that there are alternatives, especially when you have such an intelligent database working both at the edge and in the cloud. Right, so that's a great point. See, what's happening with a lot of these machine learning models is that the models are learned on historical data sets. And quite often, especially if you look at predictive maintenance, those classes of use cases in industrial IoT, the patterns could evolve very rapidly. Right, maybe because of climate changes, let's say for a windmill farm there are a few windmills breaking down so rapidly that it's affecting everything else in terms of the power generation. So being able to alter the model itself incrementally, in near real time, is becoming more and more important. Still a fairly academic research kind of area.
But for instance, we're working very closely with the University of Michigan to say, hey, can we use some of these approximate techniques to incrementally learn a model, right? Sort of incrementally augment a model, potentially on the edge or even inside the cloud, for instance. So if you're doing it at the edge, would you be updating the instance of the model associated with that locale, and then would the model in the cloud be sort of like the master? And then that gets pushed down, and so you have an instance and a master? That's right. I mean, see, most typically what happens is you have computed a model using a lot of historical data. You have typically supervised techniques to compute a model. And you take that model and inject it into the edge so that you can evaluate that model, which is the easy part, everybody does that. So you continue to do that, right? Because you really want the data scientists to be poring over those patterns, looking at and tweaking those models. But for a certain number of models, even when the model is injected at the edge, can I re-tweak that model in an unsupervised way? That's the play we're venturing into slowly, but that's sort of the future. But if you're doing it unsupervised, do you need metrics that sort of flag, you know, like a champion/challenger, to figure it out? I should say that not all these models can work in this very real-time manner. So for instance, we've been looking at saying, can we recast the naive Bayes classifier to essentially do incremental classification, incrementally learning the model? Clustering approaches can actually be done in an unsupervised way, in an incremental fashion, things like that. I mean, there's a whole spectrum of algorithms that really need to be thought through for approximate algorithms to actually apply. So it's still an area of active research. Really great discussion, guys.
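Naive Bayes lends itself to the incremental learning Jags describes because the model is nothing but sufficient statistics, counts, which can be bumped one record at a time with no retraining pass. A toy sketch over categorical features; the sensor readings and class labels here are invented for illustration, and this is not the University of Michigan work or SnappyData code:

```python
import math
from collections import defaultdict

class IncrementalNaiveBayes:
    """Naive Bayes over categorical features whose sufficient statistics
    (counts) are updated one record at a time, so the model can keep
    learning at the edge without a full retraining pass."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))

    def update(self, features, label):
        # One record is enough to refresh the model: just bump counts.
        self.class_counts[label] += 1
        for i, v in enumerate(features):
            self.feature_counts[label][(i, v)] += 1

    def predict(self, features):
        total = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for label, c in self.class_counts.items():
            lp = math.log(c / total)  # log prior
            for i, v in enumerate(features):
                # Laplace smoothing so an unseen value doesn't zero out a class.
                lp += math.log((self.feature_counts[label][(i, v)] + 1) / (c + 2))
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = IncrementalNaiveBayes()
for features, label in [(("hot", "high"), "fail"),
                        (("hot", "high"), "fail"),
                        (("cool", "low"), "ok"),
                        (("cool", "low"), "ok")]:
    nb.update(features, label)  # each arriving record updates the model
print(nb.predict(("hot", "high")))  # fail
```

Clustering admits the same treatment, for example online k-means moves a centroid a small step toward each arriving point, which is part of why those two families come up first when people discuss learning on streams.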
We've just got about a minute to go. I hate to break up really great stuff, I don't want to interrupt you, but maybe switch real quick to business drivers. Whether it's SnappyData or other peers you've talked to today, what business drivers do you think are going to affect the evolution of Spark the most? Boy, I mean, you know, for us as a small company, the single biggest challenge we have is, like what one of you guys said, what the analysts said, it's raining databases out there. And this ability to constantly educate people on how you can realize a next-generation data pipeline in a very simplified manner is the challenge we are running into, right? I mean, I think the business driver for us is primarily, you know, how many people are going to go and say, yes, batch analytics is important, but incrementally, for competitive reasons, we want to be playing the real-time analytics game a lot more than before, right? So that's going to be a big one for us. And hopefully we can play a big part there, along with Spark and Databricks. Great, well, we appreciate you coming on the show today and sharing some of the interesting work that you're doing. George, thank you so much. And Jags, thank you so much for being on theCUBE. We appreciate it. And thank you all for tuning in. Once again, we have more to come today and tomorrow here at Spark Summit 2017. Thanks for watching.