Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks.

Welcome back, we're here at theCUBE at Spark Summit 2017. I'm David Goad, here with George Gilbert. George, good to be here. Thanks for hanging out with us. Well, here's the other man of the hour. We just talked with Ali, the CEO at Databricks, and now we have the chief architect and co-founder at Databricks, Reynold Xin. Reynold, how are you?

Good, how are you doing?

Awesome. Are you enjoying yourself here at the show?

Absolutely. It's fantastic. There's a lot going on at the Summit, a lot of interesting things and a lot of interesting people to meet.

Well, I know you're a really humble guy, but I had to ask Ali what I should ask Reynold when he gets up here. Well, Reynold is one of the biggest contributors to Spark, and you've been with this for a long time, right?

Yes, I've been contributing to Spark for about five or six years, and I believe that's probably the largest number of commits to the project. Lately I've been working with other people to help design the roadmap for both Spark and Databricks Runtime.

Well, let's get started talking about some of the new developments that maybe our audience at theCUBE hasn't heard about here in the keynote this morning. What are some of the most exciting new developments?

Sure. So I think in general, for Apache Spark, there are three directions I would say we're doubling down on. The first direction is deep learning. Deep learning is extremely hot and very capable, but as we alluded to earlier in a blog post, I think deep learning has reached sort of a MapReduce point: it shows tremendous potential, but the tools are very difficult to use. We're hoping to democratize deep learning, to do for deep learning what Spark did for big data, with this new library called Deep Learning Pipelines. What it does is integrate different deep learning libraries directly into Spark, and it can actually expose models in SQL, so even business analysts are capable of leveraging them. So that's one area, deep learning.

The second area is streaming. Again, a lot of our customers have aspirations to shorten the latency and increase the throughput of streaming. We just announced that the Structured Streaming effort is going to be generally available, and last month alone on the Databricks platform I think our customers processed three trillion records using Structured Streaming. We also have a new effort to push the latency all the way down to the millisecond range, so we can do really blazingly fast streaming analytics.

And last but not least is the SQL and data warehousing area. Data warehousing is a very mature area outside of big data, but within big data it's still pretty new, and there are a lot of use cases popping up there. In Spark, with pushes like the cost-based optimizer (CBO), and in particular in the Databricks Runtime with DBIO, we're substantially improving the performance and the capabilities of the data warehousing features.
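As a rough illustration of the CBO push Reynold mentions, here is a minimal sketch of how the cost-based optimizer in Spark 2.2 gets its statistics. The `sales` and `customers` tables and their columns are hypothetical, and the sketch assumes a SparkSession with Hive catalog support:

```python
# Minimal sketch of feeding statistics to Spark 2.2's cost-based optimizer (CBO).
# Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cbo-sketch")
         .enableHiveSupport()
         .config("spark.sql.cbo.enabled", "true")   # turn on the cost-based optimizer
         .getOrCreate())

# Collect table-level and column-level statistics the CBO uses to pick
# join orders and physical strategies.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")

# Subsequent queries over `sales` are now planned with the collected stats.
spark.sql("""
    SELECT c.region, SUM(s.amount) AS total
    FROM sales s JOIN customers c ON s.customer_id = c.id
    GROUP BY c.region
""").explain(True)
```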
We're going to dig into some of that technology in a second with George, but have you heard anything here so far, from anyone, that's changed your mind about what to focus on next?

One thing I've heard from a few customers is actually visibility and debuggability of their big data jobs. Many of them are fairly technical engineers, some of them are less sophisticated engineers, and they've written jobs that sometimes run slow. The performance engineer in me would think, how do I make the job run fast? A different way to solve that problem is: how can we expose the right information so the customers can understand and figure it out themselves? This is why my job is slow, and this is how I can tweak it to make it faster. Rather than giving people the fish, you give them the tools to fish.

I'm just going to call that bugability.

Yeah, debuggability.

Debuggability. And visibility. All right, awesome. And George?

So let's go back and unpack some of those kind of juicy areas that you identified. On deep learning, you were able to distribute, if I understand things right, the predictions. You could put models out on a cluster, but the really hard part, the compute-intensive stuff, was training across a cluster. And so Deeplearning4j and I think Intel's BigDL were written for Spark to do that, but with all the excitement over some of the new frameworks, are they now at the point where they're as good citizens on Spark as they are in their native environments?

Yeah, so this is a very interesting question. Obviously a lot of the frameworks are becoming more and more popular, such as TensorFlow, MXNet, Theano, Keras, and so on. What the Deep Learning Pipelines library does is expose these single-node deep learning tools, which are highly optimized for, say, GPUs or CPUs, as an estimator, like a module in a pipeline of the machine learning pipeline library in Spark. So now users can leverage Spark's capabilities to, for example, do hyperparameter tuning. When you're building a machine learning model, it's fairly rare that you just run something once and you're good with it. Usually you have to fiddle with a lot of the parameters. For example, you might run over 100 experiments to figure out the best model you can get. This is where Spark really shines: when you combine Spark with some deep learning library, be it BigDL, be it MXNet, be it TensorFlow, you can use Spark to distribute that training and then do cross-validation on it, so you can find the best model very quickly, and Spark takes care of all the job scheduling, all the fault tolerance properties, and how you read data in from the different data sources.

And without dropping too much into the weeds, there was a version of that where Spark wouldn't take care of all the communications. It would maybe distribute the models and then do some of the averaging of what was done out on the cluster. But are you saying that all of that now can be managed by Spark?

So in that library, Spark will be able to take care of picking the best model out of it, and there are different ways you can define the best. The best could be some average of different models. The best could be just picking one out of them. The best could be maybe a tree of models that you classify with.

And that's a hyperparameter configuration choice?

That is actually built-in functionality in Spark's machine learning pipeline. And what we're doing now is letting you plug all of those deep learning libraries directly into that, as part of the pipeline, to be used.
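As a rough sketch of what Reynold describes, here is a deep-learning-backed featurizer plugged into a standard Spark ML pipeline, with Spark's built-in cross-validation picking the best model. The `sparkdl` import and `DeepImageFeaturizer` names are taken from the Deep Learning Pipelines announcement and may differ in later releases; the training DataFrame, its `image` and `label` columns, and all parameter values are hypothetical:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from sparkdl import DeepImageFeaturizer  # from the Deep Learning Pipelines library

# A pre-trained network (here InceptionV3) turns each image into a feature
# vector; a simple classifier is trained on top of those features --
# the transfer-learning pattern discussed next.
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[featurizer, lr])

# Spark's built-in tuning machinery runs the experiments and picks the best
# model, handling job scheduling and fault tolerance along the way.
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.05, 0.1])
        .addGrid(lr.maxIter, [10, 20])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(labelCol="label"),
                    numFolds=3)

# `train_df` is a hypothetical DataFrame with `image` and `label` columns.
best_model = cv.fit(train_df).bestModel
```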
Maybe just to add, another really cool piece of functionality in Deep Learning Pipelines is transfer learning. As you said, deep learning takes a very long time, it's very computationally demanding, and it takes a lot of resources and expertise to train. But with transfer learning, what we allow customers to do is take an existing deep learning model that was trained in a different domain, re-train it on a very small amount of data very quickly, and adapt it to a different domain. That's what the demo with the James Bond car was: there's a generic image classifier, we trained it on probably just a few thousand images, and now we can actually detect whether a car is James Bond's car or not.

Oh, and the implications there are huge, which is that you don't have to have huge training data sets to adapt a similar model to a similar situation. In the time we have, I want to ask: there's always been this debate about whether Spark should manage state, whether it's a database or a key-value store. Tell us how the thinking about that has evolved, and how the integration interfaces for achieving it have evolved.

Yeah, so one advantage of Spark, I would say, is that it's unbiased and works with a variety of storage systems, be it Cassandra, HBase, HDFS, or S3. There is metadata management functionality in Spark, which is the catalog of tables that customers can define, but the actual storage sits somewhere else, and I don't think that will change in the near future, because we do see that storage systems have matured significantly in the past few years. I actually wrote a blog post just last week about the advantages of S3 over HDFS, for example. The storage price is being driven down by almost a factor of 10x when you go to the cloud, and I just don't think it makes sense at this point to be building storage systems for analytics. That said, building on top of existing storage systems, there are actually a lot of opportunities for optimization in how you leverage the specific properties of the underlying storage system to get maximum performance. For example, how do you do intelligent caching? How do you start thinking about building indexes against the data that's stored, for scan workloads?

With Tungsten, where you take advantage of the latest hardware, and as we get more memory-intensive systems, and now that the Catalyst optimizer has, or will have, a cost-based optimizer and large memory, can you change how you go about knowing what data you're managing in the underlying system, and therefore achieve a tremendous acceleration in performance?

So this is actually one area we've invested in with the DBIO module as part of Databricks Runtime. A lot of this is still in progress, but for example, we're adding some formal indexing capabilities to the system so we can quickly skip and prune out all the irrelevant data when the user is doing simple point lookups, or when the user is doing a scan-heavy workload with some predicates. That has to do with how we think about the underlying data structure. The storage system is still the same storage system, like S3, but we're adding indexing functionality on top of it. That's part of the idea.
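DBIO's indexing is proprietary and, as Reynold says, still in progress, so the following is only a loose analogy of the skip-and-prune idea using stock open-source Spark: laying the data out by a commonly filtered column so that queries with a predicate on it touch only the matching files. The bucket paths and column names are hypothetical:

```python
# Illustration of the "skip irrelevant data" idea with stock Spark APIs;
# this is not DBIO itself. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skipping-sketch").getOrCreate()

events = spark.read.json("s3a://example-bucket/raw/events/")

# Partition by a commonly filtered column so queries with a predicate on
# `event_date` touch only the matching directories (partition pruning);
# Parquet row-group statistics let Spark skip non-matching row groups for
# other predicates.
(events.write
       .partitionBy("event_date")
       .parquet("s3a://example-bucket/curated/events/"))

curated = spark.read.parquet("s3a://example-bucket/curated/events/")

# Only the 2017-06-06 partition is read; the rest is pruned.
daily = (curated
         .where(F.col("event_date") == "2017-06-06")
         .groupBy("device_id")
         .agg(F.count("*").alias("events")))
daily.explain()
```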
And so what would be the application profiles? Is it just for analytic queries, or can you do the point lookups and updates in that sort of scenario too?

Yeah, so it's interesting you bring up updates. Updates are another thing we've gotten a lot of feature requests for. We're actively thinking about how we would support update workloads. Now that said, I just want to emphasize that for both use cases, point lookups and updates, we're still talking about the context of an analytic environment. So we're talking about, for example, bulk updates or low-throughput updates, rather than transactional updates in which every time you swipe a credit card some record gets updated. That probably belongs more on transactional databases like Oracle or MySQL.

What about when you think about people who started out with Spark on prem and realize they're going to put much more of their resources in the cloud, but with IoT, industrial-IoT-type applications, they're going to have Spark maybe in a gateway server on the edge? What do you think that configuration looks like?

Really interesting. So those are kind of two questions, maybe. The first is the hybrid on-prem and cloud solution. Again, one of the nice advantages of Spark is the decoupling of storage and compute. When you want to move workloads from on prem to the cloud, for example, the thing you care most about is probably actually the data, because for the compute it doesn't really matter that much where you run it, but the data is the part that's hard to move. We do have customers leveraging Databricks in the cloud but actually reading data directly from on prem, and relying on the caching solution we have to minimize the data transfer over time. That's one route, and I would say it's pretty popular. Another one is, with Amazon, you can literally give them hard drives; this is the Snowball functionality. The trucks will ship your data directly and put it in S3.

With IoT, a common pattern we see is that a lot of the edge devices push data directly into some sort of firehose like Kinesis or Kafka, and I'm sure Google and Microsoft both have their own variants of that. And then you would have Spark directly subscribe to those topics and process them in real time with Structured Streaming.

And so would Spark be down, let's say, at the site level, if not on the device itself?

It's an interesting thought, and maybe one thing we should consider more in the future is how we push Spark to the edges. Right now it's more of a centralized model, in which the devices push data into Spark, which is centralized somewhere. I've seen, for example, one use case, I don't remember it exactly, but it had to do with some scientific experiment at the North Pole. And of course there you don't have a great uplink for transferring all the data back to, say, a national lab, so rather they would do the Spark processing there and then ship the aggregated result back. That's another pattern, but it's less common.
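A minimal sketch of the centralized pattern Reynold describes: Spark subscribing to a Kafka topic via Structured Streaming and aggregating device events. The broker address, topic name, and JSON field are hypothetical, and the Kafka source needs the spark-sql-kafka package on the classpath:

```python
# Spark subscribing to a Kafka topic with Structured Streaming.
# Broker, topic, and payload schema are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iot-stream-sketch").getOrCreate()

readings = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")
            .option("subscribe", "device-readings")
            .load())

# Kafka delivers the payload as bytes; interpret it as a string and count
# events per device over one-minute windows.
counts = (readings
          .select(F.col("value").cast("string").alias("payload"),
                  F.col("timestamp"))
          .groupBy(F.window("timestamp", "1 minute"),
                   F.get_json_object("payload", "$.device_id").alias("device_id"))
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")        # in practice, a durable sink
         .option("truncate", "false")
         .start())
query.awaitTermination()
```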
All right, well, just one minute now before the break, so I'm going to give you a chance to address the Spark community. What's the next big technical challenge you hope people will work on, for the benefit of everybody?

I think in general, Spark came along with two focuses. One is performance; the other is ease of use. And I still think big data tools are too difficult to use, and deep learning tools even harder; the barrier to entry is very high for all of those tools. I would say we might have already addressed performance to a degree that is actually pretty usable, the systems are fast enough, so now we should work on making things even easier to use. That's also what we focus on a lot at Databricks.

Democratizing access, right?

Absolutely.

All right, well, Reynold, I wish we could talk to you all day, but we are out of time now. I appreciate you coming by theCUBE and sharing your insights, and good luck with the rest of the show.

Thank you very much. Appreciate it.

Thank you all for watching. We're at theCUBE at Spark Summit 2017. Stay tuned, lots of other great guests coming up today. We'll see you in a few minutes.