Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks.

Hi, we're back here where theCUBE is live, and I didn't even know it. Welcome, we're at Spark Summit 2017. I have so much fun talking to our guests, I didn't know the camera was on. We are going to talk with Cloudera, a couple of experts that we have here. The first is Mark Grover, who's a software engineer and an author. He wrote the book Hadoop Application Architectures. Mark, welcome to the show.

Thank you very much, glad to be here.

And just to his left, we also have Jennifer Wu. Jennifer's the Director of Product Management at Cloudera. Did I get that right?

That's right, and I'm happy to be here too.

All right, great to have you. Why don't we get started? Talk a little bit more about what Cloudera is introducing new at the show. I saw you had a booth over here. Mark, do you want to get started?

Hi, yeah, there are two exciting things that we've launched recently. There's Cloudera Altus, which is for transient workloads, being able to do ETL-like workloads. Jennifer will be happy to talk more about that. And then there's Cloudera Data Science Workbench, which is a tool that allows folks to do data science at scale: get away from doing data science in silos on your personal laptops, and do it in a secure environment, on-prem and in the cloud.

All right, well, let's jump into Data Science Workbench first. Tell me a little bit more about that. You mentioned it's for exploratory data science, so give us a little more detail on what it does.

Yeah, absolutely. There was a private beta for Cloudera Data Science Workbench earlier in the year, and it was made GA a few months ago. It's, like you said, an exploratory data science tool that brings data science to the masses within an enterprise. Previously, there was this dichotomy, right? As a data scientist, I want to have the latest and greatest tools.
I want to use the latest version of Python and the latest notebook kernel, and I want to be able to use R and Python to crunch this data and run my machine learning models. However, on the other side of this dichotomy is the IT side of the organization, where they want to make sure that all tools are compliant, that your clusters are secure, and that your data is not going into places that are not secured by state-of-the-art security solutions like Kerberos, for example, right? And of course, if the data scientists are putting the data on their laptops and taking the laptops around wherever they go, that's not really a solution. So that was one problem. The other one was, if you were to bring them all together in the same solution, data scientists have different requirements: one may want to use Python 2.6 and another may want to use 3.2, right?

And so Cloudera Data Science Workbench is a new product that allows data scientists to visualize and do machine learning through a very nice notebook-like interface, share their work with the rest of their colleagues in the organization, but also allows you to keep your cluster secure. So it allows you to run against a Kerberized cluster, allows single sign-on to the Data Science Workbench web interface, and provides a really nice developer experience, in the sense that my workflow, my tools, and my version of Python do not conflict with Jennifer's version of Python. We each have our own Docker- and Kubernetes-based infrastructure that makes sure we use the packages we need and they don't interfere with each other.

We're going to go to Jennifer and Altus in just a few minutes, but George, first I'll give you a chance to dig in on Data Science Workbench.

Two questions on the data science side.
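As an editorial aside: the per-user dependency isolation Mark describes is implemented in the product with Docker containers managed by Kubernetes. As a loose stand-in, Python's standard-library `venv` module illustrates the same idea of giving each data scientist an isolated environment; this is a toy sketch of the concept, not the Workbench implementation, and the user names are hypothetical.

```python
# Toy sketch: per-user isolated Python environments, illustrating the
# dependency-isolation idea. The real product uses Docker + Kubernetes,
# not virtualenvs; user names here are hypothetical.
import tempfile
import venv
from pathlib import Path


def create_user_env(base_dir: str, user: str) -> Path:
    """Create an isolated environment for one data scientist."""
    env_path = Path(base_dir) / f"{user}-env"
    # with_pip=False keeps the sketch fast; a real environment would
    # install each user's own package versions here.
    venv.EnvBuilder(with_pip=False).create(env_path)
    return env_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        mark = create_user_env(base, "mark")
        jennifer = create_user_env(base, "jennifer")
        # Each environment has its own configuration; packages installed
        # in one are invisible to the other.
        print((mark / "pyvenv.cfg").exists(), (jennifer / "pyvenv.cfg").exists())
```

The point of the sketch is only that two users' environments never share state, which is the property the Docker/Kubernetes setup provides at cluster scale.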
Some of the toughest nuts to crack have been a common environment for the collaborators, but also the ability to operationalize the models once you've agreed on them, and to manage the life cycle across teams, you know, like challenger versus champion, promoting something, or even before that, doing the A/B testing. And then what's in production is typically in a different language from what it was designed in, and there's integrating it with the apps. Where is that on the roadmap? Has no one really had a good answer for that?

Yeah, that's an excellent question. In general, I think that is the problem to crack these days: how do you productionalize something that was written by a data scientist in a notebook-like system onto the production cluster, right? For the part where the data scientist works in a different language than the language that's in production, the best I can say right now is to actually have someone rewrite it in the language you're going to run in production, right? But I don't see that as the more common part. I think the widespread problem is, even when the language is the production language, how do you take the part that a data scientist wrote, the model or whatever that would be, onto a production cluster? And so Data Science Workbench in particular runs on the same cluster that is being managed by Cloudera Manager, right? It's a tool that you install, but then it's available to you through a web interface. And so that allows you to move the machine learning algorithms you developed from your Data Science Workbench to production much more easily, because it's all running on the same hardware and the same systems. There's no separate Cloudera Manager instance you have to use to manage the Workbench compared to your actual cluster.
Okay, a tangential question, but one of the difficulties of doing machine learning is finding all the training data, and the data science expertise to sit with the domain expert to figure out the proper model, features, things like that. One of the things we've seen so far from the cloud vendors is that they take their huge data sets, in terms of voice and images, and they do natural language understanding, speech, or rather text-to-speech, facial recognition, because they have such huge data sets they can train on. We're hearing noises that they're going to take that down to the more mundane, statistical kind of machine learning algorithms. So it wouldn't be like, here's an algorithm to do churn, go to town, but they might have something that's already kind of pre-populated that you would just customize. Is that something that you guys would tackle too?

I can't speak to the roadmap in that sense, but I think some of that problem needs to be tackled by projects like Spark, for example. So I think as the stack matures, it's going to raise the level of abstraction as time goes on, and whatever benefits the Spark ecosystem gains will come directly to distributions like Cloudera's.

That's interesting, okay. All right, well, let's go to Jennifer now and talk about Altus a little bit. Now, you've been on theCUBE show before, right?

I have not.

No? Okay, well, we're familiar with your work. Tell us, you're the product manager for Altus. What does it do, and what was the motivation to build it?

Yeah, we're really excited about Cloudera Altus. We released Cloudera Altus in its first GA form in April, and we launched Cloudera Altus in a public environment at Strata London about two weeks ago. So we're really excited about this, and we are very excited to now open this up to the entire customer base.
And what it is is a platform-as-a-service offering designed to leverage the agility and scale of cloud, and to make a very easy-to-use experience that exposes Cloudera capacity, in particular for data engineering workloads. So the end user will be able, very easily and in an agile manner, to get data engineering capacity on Cloudera in the cloud, and they'll be able to do things like ETL, large-scale data processing, and productionized machine learning workflows in the cloud with this new data-engineering-as-a-service experience. And we wanted to abstract away the cloud and cluster operations and make the end-user experience very easy, with jobs and workloads as first-class objects. You can do things like submit jobs, clone jobs, terminate jobs, and troubleshoot jobs. We wanted to make this very, very easy for the data engineering end user.

It does sound like you've abstracted away a lot of the infrastructure that you would associate with on-prem, and almost made it programmable and invisible. But one of my questions is about putting it in a cloud environment. When you're on-prem, you have a certain set of competitors, which is kind of restricted, because you are the standalone platform. But when you go into the cloud, someone might say, I want to use Redshift on Amazon, or Snowflake, as the MPP SQL database at the end of the pipeline. And I'm just using those as examples; there are dozens, hundreds, thousands of other services to choose from. What happens to the integrity of that platform if someone carves off one piece?

Right, so interoperability and a unified data pipeline are very important to us. We want to make sure that we can still service the entire data pipeline, all the way from ingest and data processing to analytics. So our team delivers 24 different open source components in the CDH distribution, and we have committers across the entire stack.
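An editorial note on the "jobs as first-class objects" idea Jennifer mentions: treating submit, clone, terminate, and troubleshoot as operations on a job object is easy to picture with a small in-memory model. The following is a hypothetical sketch of that abstraction only; it is not the Altus API, and the job names and script paths are made up.

```python
# Toy sketch of "jobs as first-class objects": submit, clone, terminate,
# troubleshoot. Hypothetical in-memory model, NOT the Cloudera Altus API.
import itertools
from dataclasses import dataclass, field

_ids = itertools.count(1)


@dataclass
class Job:
    name: str
    script: str  # e.g. the location of an ETL script (hypothetical path)
    status: str = "SUBMITTED"
    job_id: int = field(default_factory=lambda: next(_ids))


class JobService:
    def __init__(self) -> None:
        self.jobs: dict[int, Job] = {}

    def submit(self, name: str, script: str) -> Job:
        job = Job(name=name, script=script)
        self.jobs[job.job_id] = job
        return job

    def clone(self, job_id: int) -> Job:
        # Re-submit an existing job definition under a new id.
        src = self.jobs[job_id]
        return self.submit(src.name + "-clone", src.script)

    def terminate(self, job_id: int) -> None:
        self.jobs[job_id].status = "TERMINATED"

    def troubleshoot(self, job_id: int) -> str:
        # A real service would surface logs; here we just report status.
        job = self.jobs[job_id]
        return f"job {job.job_id} ({job.name}): status={job.status}"
```

The design point being illustrated: once the job, rather than the cluster, is the object the user manipulates, cluster provisioning can be hidden behind these verbs.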
We know the application, and we want to make sure that everything is interoperable no matter how you deploy the clusters. So if you've deployed data engineering clusters through Cloudera Altus, but you've deployed Impala clusters for a data mart in the cloud through Cloudera Director or in any other way, we want all these clusters to be interoperable, and we've taken great pains to make everything work together well.

Okay. So how do Altus and Data Science Workbench interoperate with Spark? Sorry, would you want to go first, for Altus?

Sure. In terms of interoperability, we focus on things like making sure there are no data silos, so that the data in your entire data lake can be consumed by the different components in our system, the different compute engines and different tools. So if you're processing data, you can also look at that data and visualize it through Data Science Workbench. After you do data ingestion and data processing, you can use any of the other analytic tools, and that includes Data Science Workbench.

Right. Data Science Workbench runs, for example, with the latest version of Spark. You could pick the currently latest released version, Spark 2.1; Spark 2.2 is being voted on, of course, and will be integrated soon after it's released. So you could use Data Science Workbench with your flavor of the Spark 2 line, run PySpark or Scala jobs in this notebook-like interface, and share your work. Because you're using Spark underneath the hood, it uses YARN for resource management. The Data Science Workbench itself uses Docker for configuration management and Kubernetes for resource-managing these Docker containers.

What would be, if you had to describe them, the edge conditions and the sweet spot of the applications? I mean, you talked about data engineering.
One thing we were talking to Matei Zaharia and Reynold Xin about, and Ali Ghodsi as well, was that if you put Spark on a database, or at least a sophisticated storage manager like Kudu, all of a sudden a whole new class of jobs or applications opens up. Have you guys thought about what that might look like in the future, and what new applications you would tackle?

I think a lot of that benefit, for example, could come from the underlying storage engine. So let's take Spark on Kudu, for example. The inherent characteristics of Kudu today allow you to do updates without having to either deal with the complexity of something like HBase or the poor performance of dealing with HDFS compactions. So the sweet spot comes from Kudu's capabilities. Of course, it doesn't support transactions or anything like that today, but imagine now putting something like Spark on top and being able to use the machine learning libraries. We have been limited so far in the machine learning algorithms we have implemented in Spark, sometimes by the storage system, and new machine learning algorithms, or the existing ones, could be rewritten to make use of the update features in Kudu, for example.

So it sounds like the machine learning pipeline might get richer. But I'm not hearing, and maybe this isn't in the near-term roadmap, the idea that you would build operational apps that have these sophisticated analytics built in, where you've done the training, but at runtime the inferencing influences a transaction, influences a decision. Is that something that you would foresee?

I think that's totally possible. Again, at the core of it is the fact that you now have one storage system that can do scans really well, and it can also do random reads and writes in place.
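An editorial aside to make the storage contrast concrete: the difference Mark is pointing at is between an append-only store, where every update is a new record that compaction must later reconcile, and a store supporting in-place upserts. In this toy sketch, a plain list and a plain dictionary stand in for the two storage styles; it illustrates the access pattern only and is not Kudu's or HDFS's actual API.

```python
# Toy contrast: append-only storage vs. update-in-place storage.
# A list of immutable records stands in for an append-only engine
# (HDFS-style); a dict stands in for an update-in-place engine
# (Kudu-style). Neither is a real storage API.

def append_only_update(log: list[tuple[str, int]], key: str, value: int) -> None:
    # Append-only: every "update" is a new record; readers must later
    # reconcile (compact) multiple versions of the same key.
    log.append((key, value))


def latest_value(log: list[tuple[str, int]], key: str) -> int:
    # Reconciling versions at read time is the cost append-only tables pay.
    return [v for k, v in log if k == key][-1]


def upsert(store: dict[str, int], key: str, value: int) -> None:
    # Update-in-place: one row per key, no compaction step needed.
    store[key] = value
```

The point is that with update-in-place semantics, an algorithm can overwrite a row directly instead of appending versions and compacting, which is the property Mark suggests machine learning algorithms could exploit.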
So that allows applications which were previously siloed, because you had one application that ran off of HDFS and another application that ran off of HBase and you had to correlate them, to become one single application that can train and then also use the trained model to make decisions on the new transactions that come in.

So that's very much within the scope of imagination, part of the ultimate plan.

I think it's definitely conceivable now, yeah.

Okay. We're up against a hard break coming up in just a minute, so you each get a 30-second answer here, but it's the same question. You've been here for a day and a half now. What's the most surprising thing you've learned that you think should be shared more broadly with the Spark community? Let's start with you.

I think one of the great things happening in Spark today is that people have been complaining about latency for a long time. If you saw the keynote yesterday, you would see that Spark is making forays into reducing that latency. And if you are interested in using Spark, that's very exciting news. You should keep tabs on it, and we hope to deliver lower latency as a community soon.

How low, one millisecond?

Yeah. I'm largely focused on cloud infrastructure, and I've found here at the conference that many, many people are very much prepared to start taking on more POCs and more interesting cloud workloads, and the response in terms of Altus has been very encouraging.

Great. Well, Jennifer, Mark, thank you so much for spending some time here on theCUBE with us today. We're going to come by your booth and chat a little bit more later, because it's some interesting stuff. And thank you all for watching theCUBE today here at Spark Summit 2017. Thanks to Cloudera for bringing us these two experts, and thank you for watching. We'll see you again in just a few minutes for the next interview.