From San Jose, California, it's theCUBE, covering Big Data Silicon Valley 2017.

Big Data SV 2017, day two of our wall-to-wall coverage of the Strata Hadoop conference and Big Data SV, really what we call Big Data Week, because this is where all the action is going on down in San Jose. We're at the historic Pagoda Lounge in the back of the Fairmont, so come on by and say hello. We've got a really cool space, and we're excited, we've never been in this space before. We've got George Gilbert here from Wikibon, and we're really excited to have our next guest. He's Fred Reiss, the chief architect at the IBM Spark Technology Center in San Francisco. Fred, great to see you.

Thank you, Jeff.

I remember when we went up and met with Rob Thomas in San Francisco, when you guys first opened the Spark Technology Center, a couple of years ago now. Give us an update on what's going on there. I know IBM's putting a lot of investment into the Spark Technology Center, and the San Francisco office specifically.

That's right, Jeff. We're in the new Watson West building in San Francisco, at 505 Howard Street. Co-located with us, we have about a 50-person development organization right next door, and about 25 designers. And on the same floor there are a lot of developers from Watson doing data science, and people from Weather Underground doing weather data analysis. So it's a really exciting place to be, with lots of interesting work in data science going on.

And it's really great to see how IBM is taking the core Watson, obviously enabled by Spark and other core open source technologies, and now applying it. We're seeing Watson for health, Watson for autonomous vehicles, Watson for marketing, and really bringing that type of machine learning power to all the various verticals in which you guys play.

Absolutely, that's been what Watson has been about from the very beginning.
Bringing the power of machine learning, the power of artificial intelligence, to real-world applications.

Excellent. So let's tie it back to the Spark community. Most folks understand how Databricks does most of the core work for the SQL workload, the streaming, and machine learning, and I guess graph is still immature. We were talking earlier about IBM's contributions in helping to build out the machine learning side. Help us understand what the Databricks core technology for machine learning is and how IBM is building beyond that.

So the core machine learning technology in Apache Spark comes out of the AMPLab at UC Berkeley, as well as from a lot of different members of the community, some of whom also work for Databricks. We at the IBM Spark Technology Center have actually made a number of contributions to the core Apache Spark ML libraries, for example, recent contributions in neural nets. In addition to that, we also work on a project called Apache SystemML, which used to be proprietary IBM technology; the IBM Spark Technology Center has turned it into Apache SystemML. It's now an open Apache incubating project that's been moving forward out in the open, and you can download the latest release online. And that provides a piece that we saw was missing from Spark and a lot of other similar environments: an optimizer for machine learning algorithms. In Spark, you have the Catalyst optimizer for data analysis, data frames, SQL. You write your queries in terms of those high-level APIs, and Catalyst figures out how to make them go fast. In SystemML, we have an optimizer for high-level languages like R and Python, where you can write algorithms in terms of linear algebra, in terms of high-level operations on matrices and vectors, and have the optimizer take care of making those algorithms run in parallel, run at scale, taking into account the data characteristics.
Does the data fit in memory? If so, keep it in memory. Does the data not fit in memory? Stream it from disk.

Okay, so there was a ton of stuff in there, and if I were to refer to it as so densely packed that it has to be a black hole, that might come across wrong, so I won't refer to it as a black hole. But let's unpack that. And I meant that in a good way, like high bandwidth, you know.

Thanks, George.

So with the traditional machine learning that comes with Spark's MLlib, one of its distinguishing characteristics is that the models, the algorithms that are in there, have been built to run on a cluster.

That's right.

And very few others have built machine learning algorithms to run on a cluster. But as you were saying, you don't really have an optimizer for finding which of those algorithms would optimally fit a given problem. Help us understand, then, how SystemML solves a more general problem, for, say, ensemble models and for scale-out. How does SystemML fit relative to Spark's MLlib, and what more general problems can it solve?

So MLlib and a lot of related packages, such as Sparkling Water from H2O, for example, provide you with a toolbox of algorithms, and each of those algorithms has been hand-tuned for a particular range of problem sizes and problem characteristics. This works great as long as the particular problem you're facing as a data scientist is a good match to the implementations in your toolbox. What SystemML provides is less like having a toolbox and more like having a machine shop. You have a lot more flexibility, you have a lot more power.
You can write down an algorithm as you would write it if you were implementing it just to run on your laptop, and then let the SystemML optimizer take care of producing a parallel version of that algorithm, customized to the characteristics of your cluster and to the characteristics of your data.

So let me stop you right there, because I want to use an analogy that others might find easy to relate to, for all the people who understand SQL and scale-out SQL. The way you were describing it, it sounds like, oh, if I were a SQL developer and I wanted to get at some data on my laptop, I would find it pretty easy to write the SQL to do that. Now let's say I had a bunch of servers, each with its own database, and I wanted to get data from each database. If I didn't have a scale-out database, I would have to figure out physically how to go to each server in the cluster to get it. What I'm hearing for SystemML is that it'll take what I might have written on my one server and transparently figure out how to scale that out, although in this case not queries but machine learning algorithms.

Well, the database analogy is very apt. Just as SQL and query optimization, by allowing you to separate the logical description of what you're looking for from the physical description of how to get at it, let you have a parallel database with the exact same language as a single-machine database, SystemML, because we have an optimizer that separates the logical description of the machine learning algorithm from the physical implementation, can target a lot of different parallel systems. We can also target a large server, and the code that implements the algorithm stays the same.

Okay, now let's take that a step further, because you referred to matrix math and, I think, linear algebra and a whole bunch of other things that I never quite made it to, since I was a humanities major.
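The logical-versus-physical separation Fred describes can be sketched in plain NumPy (an illustrative sketch only, not SystemML code; the function and variable names here are my own). The idea: the algorithm is stated purely as matrix operations, and an optimizer like SystemML's could, in principle, compile those same expressions for a cluster rather than a single machine.

```python
import numpy as np

def linear_regression_gd(X, y, lr=0.1, steps=2000):
    """Batch gradient descent written purely in linear algebra.

    The 'logical' algorithm is just: w := w - lr * X^T (X w - y) / n.
    On a laptop this runs via NumPy; a linear algebra optimizer could
    instead lower the same expressions into a distributed plan.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n   # one matrix-vector pipeline
        w -= lr * grad
    return w

# Tiny demo: recover w ~ [2, -3] from noise-free synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0])
w = linear_regression_gd(X, y)
```

Nothing in the function body mentions where the matrices live, which is exactly the property that lets an optimizer choose the physical execution strategy.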
But when we're talking about those things, my understanding is those are primitives that Spark doesn't really implement, so that if you wanted to do neural nets, which rely on some of those constructs for high performance, that's not built into Spark. Can you get to that capability using SystemML?

Yes. SystemML at its core provides you, as a user, with a library of linear algebra primitives, just like a language like R or a library like NumPy gives you matrices and vectors and all of the operations you can do on top of those primitives. And just to be clear, linear algebra really is the language of machine learning. If you pick up a paper about an advanced machine learning algorithm, chances are the specification of what that algorithm does and how it works is written in the paper literally in linear algebra, and the implementation used in that paper is probably written in a language where linear algebra is built in, like R, or with a library like NumPy.

So it sounds to me like Spark has done the work of getting sort of the blocking and tackling of machine learning to run in parallel. And that's, to be clear, since we haven't really talked about it, important when you're handling data at scale and you want to train models on very, very large data sets. But it sounds like when we want to go to some of the more advanced machine learning capabilities, the ones that today are making all the noise with speech to text, text to speech, natural language understanding, those neural network-based capabilities are not built into the core Spark MLlib. Would it be fair to say you could start getting at them through SystemML?

Yes, SystemML is a much better way to do scalable linear algebra on top of Spark than the very limited linear algebra that's built into Spark.

So, all right, let's take the next step.
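To make "linear algebra is the language of machine learning" concrete, here is a textbook algorithm written almost exactly as a paper would state it (a hedged NumPy sketch; the `ridge` helper is my own name, not an API from Spark or SystemML): ridge regression, whose specification is the single equation w = (XᵀX + λI)⁻¹Xᵀy.

```python
import numpy as np

def ridge(X, y, lam=1.0):
    """Closed-form ridge regression: solve (X^T X + lam*I) w = X^T y.

    The code is nearly a transliteration of the paper-style formula,
    which is the directness R and NumPy give you on one machine and
    that a system like SystemML aims for at cluster scale.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Noise-free demo: with a negligible penalty we recover the true weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, 0.5, -2.0])
y = X @ true_w
w = ridge(X, y, lam=1e-6)
```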
Can SystemML be grafted onto Spark in some way? Or would it have to be an entirely new API that doesn't integrate with all the other Spark APIs, in the way that has differentiated Spark, where each API is sort of accessible from every other? Can you tie SystemML in, or do the Spark guys have to build more primitives into their own engine first?

A lot of the work that we've done at the Spark Technology Center, as part of bringing SystemML into the Apache ecosystem, has been to build a nice, tight integration with Apache Spark. You can pass Spark data frames directly into SystemML, and you can get data frames back. Your SystemML algorithm, once you've written it in one of SystemML's domain-specific languages, just plugs into Spark like all the algorithms that are built into Spark.

Okay, so that would keep Spark competitive with more advanced machine learning frameworks for a longer period of time. In other words, Spark wouldn't hit the wall the way it might if it encountered TensorFlow, Google's way of doing deep learning, as long as it had SystemML so deeply integrated the way you're doing it.

Right. With a system like SystemML, you can quickly move into new domains of machine learning. For example, this afternoon I'm going to be giving a talk with one of our machine learning developers, Mike Dusenberry, about our recent efforts to implement deep learning in SystemML, like full-scale convolutional neural nets running on a cluster in parallel, processing many gigabytes of images. And we implemented that with very little effort, because we have this optimizer underneath that takes care of a lot of the details: how you get that data into the processing, how you get the data spread across the cluster, how you get the processing moved to the data or vice versa. All those decisions are taken care of by the optimizer.
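The claim that convolutional nets reduce to linear algebra an optimizer can handle is worth seeing once. Below is a minimal single-channel sketch (plain NumPy, not SystemML code; the `conv2d_via_matmul` name is mine) of the standard im2col lowering: every k-by-k patch becomes a row of a matrix, and the whole convolution collapses into one matrix multiply.

```python
import numpy as np

def conv2d_via_matmul(image, kernel):
    """'Valid' cross-correlation (convolution in the deep learning
    sense) lowered to a single matrix multiply via im2col.

    Gathering patches into rows turns the convolution into plain
    linear algebra, the kind of operation a linear algebra
    optimizer knows how to partition across a cluster.
    """
    H, W = image.shape
    k = kernel.shape[0]
    out_h, out_w = H - k + 1, W - k + 1
    patches = np.array([
        image[i:i + k, j:j + k].ravel()          # one k*k patch per row
        for i in range(out_h) for j in range(out_w)
    ])                                           # shape: (out_h*out_w, k*k)
    return (patches @ kernel.ravel()).reshape(out_h, out_w)

img = np.arange(16.0).reshape(4, 4)
ker = np.ones((2, 2))                            # 2x2 box filter
out = conv2d_via_matmul(img, ker)                # each entry: sum of a 2x2 patch
```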
You just write down the linear algebra parts and let the system take care of the rest. That let us implement deep learning much more quickly than we would have if we had done it from scratch.

So it's just this ongoing cadence of removing the infrastructure management from the data scientist and enabling them to concentrate on where their value really is: the algorithms themselves. They don't have to worry about how many nodes it's running on and that kind of configuration, the typical DevOps that we see on the regular development side, but now you're really bringing that into the machine learning space.

That's right, Jeff. Personally, I find all the minutiae of making a parallel algorithm work really fascinating, but a lot of people working in data science see parallelism as a tool. They want to solve the data science problem, and SystemML lets you focus on solving the data science problem, because the system takes care of the parallelism.

You guys could go into the weeds for probably three hours, but we don't have enough coffee, so we're going to set up a follow-up, since you're both in San Francisco. But before we let you go, Fred, as you look forward into 2017 and the advances you guys have made at the IBM Spark Technology Center, what are the next couple of great hurdles you're looking to cross? The new challenges that are getting you up every morning, so that if we come back a year from now, we'll be able to look back and say, wow, these are the one or two things they were able to take down in 2017.

We're moving forward on several different fronts this year. On one front, we're helping to get the notebook experience with Spark notebooks consistent across the entire IBM product portfolio. We helped a lot with the rollout of notebooks in the Data Science Experience on z, for example.
And we're working actively with the Data Science Experience and with the Watson Data Platform. On another front, we're contributing to Spark 2.2. There are some exciting features, particularly in SQL, that we're hoping to get into that release, as well as some new improvements to MLlib. We're moving forward with Apache SystemML: we just cut version 0.13, and we're talking right now on the mailing list about getting SystemML out of incubation and making it a full top-level project. And we're also continuing to help with the adoption of Apache Spark technology in the enterprise. Our latest focus has been on deep learning on Spark.

Well, I think we found him, the smartest guy in the room. All right, Fred, thanks for stopping by, and good luck on your talk this afternoon.

Thank you, Jeff.

Absolutely. He's Fred Reiss, he's George Gilbert, I'm Jeff Frick. You're watching theCUBE from Big Data SV, part of Big Data Week in San Jose, California.