Live from Boston, Massachusetts, it's theCUBE, covering Spark Summit East 2017, brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. Welcome back to Boston, everybody. Nick Pentreath is here, he's a principal engineer at the IBM Spark Technology Center in South Africa. Welcome to theCUBE. Thank you. Great to see you. So it's a little different time of year here than you're used to, but... I've flown from, I don't know the Fahrenheit equivalent, but 30 degrees Celsius, heat and sunshine, to snow and sleet. It's a lot chillier here. We'll wait until tomorrow, but as I was joking, you probably get the t-shirt for the longest flight here, so welcome. I actually need the parka, or like a beanie. A little better than the long sleeve. So Nick, tell us about the Spark Technology Center, STC as the acronym goes, and your role there? Sure, yeah, thank you. So the Spark Technology Center was formed by IBM a little over a year ago, and its mission is to focus on the open source world, particularly Apache Spark and the ecosystem around it, and to really drive forward the community and make contributions to both the core project and the ecosystem. The overarching goal is to help drive adoption, particularly in enterprise customers, the kind of customers that IBM typically serves, and to harden Spark and make it really enterprise-ready. So why Spark? I mean, we've watched IBM do this now for several years. The famous example that I like to use is Linux, when IBM put a billion dollars into Linux and really went all in on open source, and it drove a lot of IBM value, both internally and externally, for customers. So what was it about Spark? I mean, you could have made a similar bet on Hadoop, which you decided not to; you sort of waited to see the market evolve. What was the catalyst for having you guys go all in on Spark? Yeah, good question.
I mean, I don't know all the details, certainly, of what the internal drivers were, because I joined STC a little under a year ago, so I'm fairly new. Translate the hallway talk, maybe. Essentially, I think you raise very good parallels to Linux and also Java. So IBM made these investments in open source technologies that it sees as transformational and game changing. And I think most people within IBM will probably admit that they maybe missed the boat on Hadoop, and saw Spark as the successor, and actually saw a chance to really dive into that and almost leapfrog, and say, we're going to back this as the next generation analytics platform and operating system for analytics and big data in the enterprise. Well, I don't know if you happened to watch the Super Bowl, but there's a saying that it's sometimes better to be lucky than good, and that sort of applies. And so, in some respects, maybe missing the window on Hadoop was not a bad thing for IBM, because not a lot of people made a ton of dough on Hadoop, and they're still sort of struggling to figure it out. And now along comes Spark, and you've got this more real-time nature. IBM talks a lot about bringing analytics and transactions together; they've made some announcements about that and about affecting business outcomes in near real time. I mean, that's really what it's all about, and one of your areas of expertise is machine learning, so talk about that relationship and what it means for organizations, your mission. Yeah, machine learning is a key part of the mission. And you've seen the big data in the enterprise story starting with Hadoop and data lakes, and that's evolved. Before, we just dumped all of this data into these data lakes and these silos, and maybe we had some Hadoop jobs and so on, but now that we've got all this data that we can store, what are we actually going to do with it?
So part of that is the traditional data warehousing and business intelligence and analytics, but more and more we're seeing there's rich value in this data, and to unlock it, you really need intelligent systems. You need machine learning, you need AI, you need real-time decision-making that starts transcending the boundaries of old rule-based systems and human-based systems. So we see machine learning as one of the key tools and one of the key unlockers of value in these enterprise data stores. So Nick, perhaps paint us a picture of someone who's advanced enough to be working with machine learning with IBM. We know that the tool chain is kind of immature, although IBM with DataWorks or DataFirst has a fairly broad end-to-end suite of tools, but what are the early use cases, and then what needs to mature to go into higher volume or higher value production apps? I think the early use cases for machine learning in general, and certainly at scale, are numerous and growing, but a classic example is, let's say, recommendation engines. That's an area that's close to my heart. In my previous life before IBM, I built a startup that had a recommendation engine service targeting online stores, e-commerce players, social networks and so on. So this is a great example use case. We've got all this data about, let's say, customer behavior in your retail store or your video sharing site, and in order to serve those customers better and make more money, if you can make good recommendations to them about what they should buy, watch or listen to, that's a classic use case for machine learning and unlocking the data that is there. So that is one of the drivers of some of these systems, and players like Amazon are good examples of the recommendation use case.
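The recommendation use case described above can be sketched in miniature with item-based collaborative filtering: score the items a user hasn't seen by their similarity to items the user already rated. This is only a toy illustration in plain Python, not the approach Nick's startup or Spark MLlib actually uses; the users, items, and ratings are invented for the example.

```python
from math import sqrt

# Toy user -> {item: rating} data (invented for illustration).
ratings = {
    "alice": {"book": 5.0, "album": 3.0},
    "bob":   {"book": 4.0, "album": 2.0, "film": 5.0},
    "carol": {"film": 4.0, "album": 5.0},
}

def item_vector(item):
    """All ratings for one item, keyed by user (a sparse column)."""
    return {u: r[item] for u, r in ratings.items() if item in r}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    common = set(a) & set(b)
    dot = sum(a[u] * b[u] for u in common)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(user, top_n=1):
    """Score each unseen item by similarity to the items the user rated."""
    seen = ratings[user]
    items = {i for r in ratings.values() for i in r}
    scores = {}
    for cand in items - set(seen):
        score = 0.0
        for liked, rating in seen.items():
            score += cosine(item_vector(cand), item_vector(liked)) * rating
        scores[cand] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("alice"))  # alice hasn't rated "film" yet
```

Production systems like the ones discussed here work on vastly larger, sparser data, which is exactly why distributed implementations such as Spark MLlib's ALS exist; the toy version just shows the shape of the computation.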
Another is fraud detection, and that is a classic example in financial services enterprises, which are a kind of staple of IBM's customer base. So these are a couple of examples of the use cases, but the tool sets traditionally have been kind of cumbersome. Amazon built everything from scratch themselves using customized systems, and they've got teams and teams of people. Nowadays, you've got this built into Apache Spark: you've got it in Spark's machine learning library, you've got good models to do that kind of thing. So I think from an algorithmic perspective, there's been a lot of advancement and there's a lot of standardization, almost commoditization, on the model side. So what is missing? Yeah, what else? And what are the shortfalls currently? So there's a big difference between the current view, and I guess the hype, of machine learning, that you've got data, you apply some machine learning and then you get profit, right? But really there's a huge, complex workflow that involves this end-to-end story. You've got data coming from various data sources; you have to feed it into one centralized system, transform and process it, extract your features and do your hardcore data science, which is the core piece that everyone thinks about as the only piece. But that's in the middle, and it makes up a relatively small proportion of the overall chain. And once you've got that, you do model training and selection and testing, and you now have to take that model, that machine learning algorithm, and you need to deploy it into a real system to make real decisions. And that's not even the end of it, because once you've got that, you need to close the loop, what we call the feedback loop, and you need to monitor the performance of that model in the real world. You need to make sure that it's not deteriorating, that it's adding business value, all of these kinds of things.
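Closing the feedback loop described above amounts to continuously checking a deployed model's decisions against eventual ground truth and flagging deterioration. A minimal sketch of that idea follows; the window size and alert threshold are invented for illustration, and real systems track far richer metrics than rolling accuracy.

```python
from collections import deque

class ModelMonitor:
    """Tracks rolling accuracy of a deployed model over its last N decisions."""

    def __init__(self, window=100, alert_below=0.8):
        self.window = deque(maxlen=window)   # recent hit/miss outcomes
        self.alert_below = alert_below       # illustrative threshold

    def record(self, predicted, actual):
        """Log one prediction once the true outcome becomes known."""
        self.window.append(predicted == actual)

    def accuracy(self):
        """Rolling accuracy over the window, or None if nothing logged."""
        return sum(self.window) / len(self.window) if self.window else None

    def deteriorating(self):
        """True when rolling accuracy has dropped below the threshold."""
        acc = self.accuracy()
        return acc is not None and acc < self.alert_below

# Two correct calls followed by two misses: accuracy drops to 0.5.
monitor = ModelMonitor(window=4, alert_below=0.75)
for pred, actual in [(1, 1), (0, 0), (1, 0), (1, 0)]:
    monitor.record(pred, actual)
print(monitor.accuracy(), monitor.deteriorating())
```

The deque with `maxlen` keeps only the most recent outcomes, which is the point: a model that was fine at deployment can drift as the world changes, and only a windowed view surfaces that.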
So I think the piece of the puzzle that's missing at the moment is delivering this end-to-end story and doing it at scale, securely, enterprise grade. And the business impact of that, presumably, will be a better quality experience. I mean, recommendation engines and fraud detection have been around for a while; they're just not that good. Retargeting systems are too little, too late, and fraud detection is cumbersome, still with a lot of false positives. It's getting much better, certainly compressing the time. It used to be six months to detect fraud; now it's minutes or seconds, but still a lot of false positives. So are you suggesting that by closing that gap we'll start to see, from a consumer standpoint, much better experiences? Well, I think that's imperative, because if you don't see that from a consumer standpoint, then the mission is failing, because ultimately it's not magic where you simply throw machine learning at something and you unlock business value and everyone's happy. There's a human in the loop here: you have to fulfill the customer's needs, you have to fulfill consumer needs, and the better you do that, the more successful your business is. So you mentioned the time scale, and I think that's a key piece here. What makes better decisions? What makes a machine learning system better? Well, it's better data, and more data, and faster decisions. So I think all three of those are coming into play with Apache Spark and end-to-end streaming systems, and the models are getting better and better because they're getting more data and better data. So I think the industry has pretty much attacked the time problem, certainly for fraud detection and recommendations. Now it's the quality issue. Are we close? I mean, are we talking about six to 12 months before we'll really start to see a major impact for the consumer, and ultimately for the company providing those services, or is it further away than that?
You know, it's always difficult to make predictions about time frames, but I think there's a long way to go. As you mentioned, where we are, the algorithms and the models are pretty much commoditized, and the time gap to make predictions is down to this real-time nature. So what is missing? I think it's actually less about the traditional machine learning algorithms and more about making the systems better and getting better feedback and better monitoring, so improving the end user experience of these systems, and there's actually a lot of work to be done there. I don't think it's a six to 12 month thing necessarily. I don't think that in 12 months, suddenly everything's going to be perfectly recommended. There are areas of active research in the academic fields about how to improve these things, but I think it's a big engineering challenge to bring in more disparate data sources, to improve data quality, to improve the feedback loops, to try to get systems that serve customer needs better, so improving recommendations, improving the quality of fraud detection systems, everything from that to medical imaging and cancer detection. I think we've got a long way to go. Would it be fair to say that we've done a pretty good job with the traditional application lifecycle in terms of DevOps, but we now need the DevOps for the data scientists and their collaborators? Yeah, I think that's... And where is IBM along that? Yeah, that's a good question, and I think you kind of hit the nail on the head that the enterprise applied machine learning problem has moved from the academic to the software engineering, and actually DevOps; internally someone mentioned the term "TrainOps," so it's almost like DevOps for the machine learning workflow, actually productionizing and operationalizing it.
So, recently IBM for one has announced the Watson Data Platform and now Watson Machine Learning, and that really tries to address that problem. So really the aim is to simplify and productionize these end-to-end machine learning workflows. That is the product push that IBM has at the moment. Okay, that's helpful. Yeah, I was at the Watson Data Platform announcement. You called it DataWorks; I think they changed the branding. But it looked like there were numerous components that IBM had in its portfolio that it has now strung together to create that end-to-end system that you're describing. Is that a fair characterization, or is it sort of underplaying the work that went into it? I'm sure it is. But maybe help us understand that better. Yeah, I mean, I should caveat that by saying I'm fairly focused, very focused, at STC on the open source side of things. So my work is predominantly within the Apache Spark project, and I'm less involved in the day-to-day. So you didn't contribute specifically to Watson Data Platform? Not to the product line. So I wouldn't want to talk about it, simply because I haven't been involved. Yeah, I don't want to push you on that, because it's not your wheelhouse. But then help me understand how you will commercialize the activities that you do, or is that not necessarily the intent? So the intent with STC in particular is that we focus on open source. And a core part of that is that, being within IBM, we have the opportunity to interface with other product groups and customer groups. So while we're not directly focused on, let's say, the commercial aspect, we want to effectively leverage the ability to talk to real-world customers, find their use cases, and talk to all the product groups that are building this Watson Data Platform and all the product lines and features; the Data Science Experience is all built on top of Apache Spark and the platform. So your role is really to innovate? Exactly, yeah. Leverage open source and innovate.
Both innovate and improve. So improve performance, improve efficiency. When you're operating at the scale of a company such as IBM and other large players, your customers, and your product teams and builders of products, will come into contact with all the kinds of little issues and bugs and performance problems. Make it better. Yeah, and that is the feedback that we take on board, and we try to make it better, not just for IBM and their customers but, because it's an Apache project, for everyone's benefit. So that's really the idea: take all the feedback and learnings from enterprise customers and product groups and channel that into the open source contributions that we make. Great. So would it be fair to say you're focusing on making the core Spark, Spark ML and Spark MLlib capabilities, the machine learning libraries, and then the pipeline more robust? And if that's the case, we know there need to be improvements in its ability to serve predictions in real time, at high speed. We know there's a need to take the pipeline and share it with other tools, perhaps, or collaborate with other tool chains. What are some of the things that enterprise customers are looking for along those lines? Yeah, that's a great question and very topical at the moment. So both from an open source community perspective and from an enterprise customer perspective, this is one of the, if not the, key missing pieces within the Spark machine learning community at the moment. And it's one of the things that comes up most often. So it is a missing piece, and we as a community need to work together and decide: is this something where we build it within Spark and provide that functionality? Is it something where we try to adopt open standards that will benefit everybody and that provide one standardized format or way of serving models?
Or is it something where there are a few open source projects out there that might serve this purpose, and do we get behind those? So I don't have the answer, because this is ongoing work, but it's definitely one of the most critical blockers, or let's say areas that need work, at the moment. One quick question then along those lines. The first thing IBM contributed to the Spark community was Spark ML, which as I understand it was an ability to create an ensemble sort of set of models to do a better job, to create a more accurate model. Are you referring to SystemML? SystemML, that's it. Yeah, so where does that fit? SystemML started out as an IBM research project, and perhaps the simplest way to describe it is this: just as a SQL optimizer takes SQL queries and decides how to execute them in the most efficient way, SystemML takes a high-level mathematical language and compiles it down to an execution plan that runs on a distributed system. So in much the same way that SQL allows this very flexible, high-level language where you don't have to worry about how things are done, you just tell the system what you want done, SystemML aims to do that for mathematical and machine learning problems. It's now an Apache project; it's been donated to open source, and it's an incubating project under very active development. There are a couple of different aspects to it, but that's the high-level goal. The underlying execution engine is Spark. It can run on Hadoop and it can run locally, but really the main focus is to execute on Spark and then expose higher-level APIs, familiar to users of languages like R and Python, for example, so they can write their algorithms and not necessarily worry about how to do large scale matrix operations on a cluster; SystemML will compile that down and execute it for them.
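To make the idea just described concrete: the kind of algorithm a SystemML user writes at the "just the math" level is something like least squares regression, a few lines of matrix and vector arithmetic with no mention of partitions or clusters. The sketch below is plain Python, not actual SystemML DML, and the data is invented; it only illustrates the level of abstraction being discussed, with the system left to decide how such math would execute at scale.

```python
def least_squares(xs, ys):
    """Closed-form simple linear regression: the sort of math-level
    code a user writes without thinking about distributed execution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance are just sums: easy to express, and the
    # kind of operation an engine like SystemML can parallelize.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# The points lie exactly on y = 2x + 1, so the fit recovers those coefficients.
slope, intercept = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # 2.0 1.0
```

The design point is that the sums above are embarrassingly parallel; a compiler that sees the whole expression, rather than hand-written cluster code, can choose local or distributed execution per operation.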
So really quickly, to follow up, what that means is it's a higher-level way for people who aren't cluster-aware to write machine learning algorithms that are cluster-aware. Precisely, yeah. That's very, very valuable when it works. When it works. So again, with the caveat that I'm mostly focused on Spark and not so much the SystemML side of things, I'm definitely not an expert, and I don't claim to be an expert in it. But it works at the moment. It works for a large class of machine learning problems. It's very powerful. But again, it's a young project and there's always work to be done. So the areas that I know they're focusing on are usability, hardening up the APIs and making them easier to use and easier to access for users coming from the R and Python communities who, again, as you said, are not necessarily experts in distributed systems and cluster awareness, but who know how to write a very complex machine learning model in R, for example. And it's really trying to enable them with a set of APIs and tools. In terms of the underlying engine, I don't know how many hundreds of thousands or millions of lines of code and years and years of research have gone into that. So it's an extremely powerful set of tools. But yes, a lot of work is still to be done there, and it's ongoing, to make it user-ready and enterprise-ready in the sense of making it easier for people to use it, adopt it, and put it into their systems and production. So I wonder if we could close, Nick, with just a few questions on STC. The Spark Technology Center in Cape Town, is that a global expertise center? Is STC a virtual sort of IBM community? I'm the only member who's in Cape Town, so I'm fairly lucky from that perspective to be able to live at home. The rest of the team is mostly in San Francisco.
So there's an office there, co-located with the Watson West office, so with the Watson teams that are based there, on Howard Street, I think it is. How often do you get there? I'll be there next week. Typically two or three times a year, you know, I try to get across there and interface with the team. But IBM is obviously a global company, and I've been pleasantly surprised that there are team members pretty much everywhere. Our team has a few scattered around, including me, but in general, when you interface with various teams, they pop up in all kinds of geographic locations, and I think it's great, a huge diversity of people and locations. Anything, I mean, it's early days here, early day one, but anything you saw in the morning keynotes, or things you hope to learn here, anything that's excited you so far? I caught a couple of the morning keynotes but had to dash out to prepare, yeah, I'm doing a talk later, actually, on feature hashing for scalable machine learning. So that's at 12:20; please come and see it, everybody. A breakout session, at what, 12:20? 20 past 12, yeah, in room 302, I think. So I'll be talking about that, and so I needed to prepare. But yeah, I think some of the key exciting things that I've seen, that I'd like to go and take a look at, are related to deep learning on Spark. You know, I think that's been a hot topic recently and one of the areas where, again, Spark perhaps hasn't been the strongest contender, let's say, but there's some really interesting work coming out of Intel, it looks like. They're going to be talking here on theCUBE in a couple of hours. Yeah, I'd really like to see their work, and that sounds very exciting.
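Feature hashing, the subject of Nick's talk, maps arbitrarily many feature names into a fixed-size vector by hashing each name to an index, so no feature dictionary has to be built or shipped. A minimal sketch of the trick follows; the vector size and the use of MD5 are arbitrary choices for illustration, and real implementations use faster non-cryptographic hashes.

```python
import hashlib

def hashed_features(tokens, dims=16):
    """The hashing trick: bucket each feature name by a stable hash,
    with a sign bit so colliding features partly cancel rather than
    always inflating the same bucket."""
    vec = [0.0] * dims
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "little") % dims
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[idx] += sign
    return vec

# Categorical features become tokens; the vector size is fixed up front,
# no matter how many distinct feature values ever appear.
v = hashed_features(["user=alice", "item=film", "hour=12"])
print(len(v))
```

The trade-off is collisions: unrelated features can share a bucket, but in practice, with a large enough vector, the scalability win (constant memory, no vocabulary pass over the data) usually outweighs the small accuracy cost.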
So yeah, I mean, every time I come to a Spark Summit there are all these new projects from the community, from various companies, some of them big, some of them startups, that are pushing the envelope, whether it's research projects in machine learning, whether it's adding deep learning libraries, whether it's improving performance for commodity clusters or for very powerful single nodes. There's always people pushing the envelope, and that's what's great about being involved in an open source community project and being part of this community. So yeah, that's one of the talks that I'd like to go and see, and I unfortunately had to miss some of the Netflix talks on their recommendation pipeline; that's always interesting to see, so I'll have to catch them on the video. Well, there's always another project in open source land. Nick, thanks very much for coming on theCUBE, and good luck. Cool, thanks very much. Have a good trip, stay warm, all right, man. All right, keep right there, everybody. George and I will be back with our next guest. We're live, this is theCUBE from Spark Summit East. Hashtag SparkSummit. Right back.