Hi again, Red Hat Developers. This is Jason from the Red Hat Developers program here at Summit 2017, inside the Dev Zone. We're here with Will Benton from Red Hat, who's going to talk to us about analytics and data science on OpenShift with Apache Spark. Thanks, Will.

So I'm Will Benton, and again, I'm going to talk about analytics and data science on OpenShift with Spark. It's a quick talk; we're just going to cover a few points. The first point I want to address is that contemporary analytics is a great fit for cloud-native application platforms like OpenShift. OpenShift is a great platform for the kinds of applications you want to be building with analytics, just like it's a great platform for applications in general. The surprising thing in this talk is that OpenShift is also a great platform for data scientists: there are reasons why data scientists might want to work in OpenShift even if they're not immediately deploying an application. And then, finally, I'll show you where to go so that you can learn how to build the next great data-driven, intelligent app on OpenShift.

We'll start with a little bit of background, though, by looking at some older models for parallel and distributed computing that people might have used to do machine learning or build intelligent applications in the past. Maybe you had a program with a lot of loops in it, and you wanted to split up those loops and run each one in a different thread; you could use something like OpenMP for that. If you had a fixed number of relatively coarse-grained tasks in your program that could all be communicating at once, you could use something like MPI. And if you had a ton of data in some scale-out storage on commodity hardware, and you wanted to write programs that operated on that data by migrating compute jobs to where the storage was located, you could run something like Apache Hadoop.
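Just to make that first model concrete, here is a minimal sketch, not from the talk itself, of the loop-splitting idea, using Python's standard multiprocessing module standing in for OpenMP; the shape is the same, carving a loop's iterations up among parallel workers:

    import multiprocessing

    def expensive(x):
        # Stand-in for one loop iteration's worth of work.
        return x * x

    if __name__ == "__main__":
        # OpenMP-style loop parallelism, sketched with the standard
        # library: split the iterations of a loop across a fixed
        # pool of worker processes and combine the results.
        with multiprocessing.Pool(processes=4) as pool:
            results = pool.map(expensive, range(1000))
        print(sum(results))

Note how static this is: you pick a worker count up front, and the pool neither grows with the problem nor survives a failure, which is exactly the limitation discussed next.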
Now, older-style parallel and distributed execution models like these may be familiar to you, but they have some shortcomings for developing the applications of tomorrow. These kinds of older-style models are usually inelastic: you can't scale them up, and you can't scale them out. They're not resilient to failures. They may not be general enough to handle all the kinds of problems you want to solve. And furthermore, some of them are really just not that fun to program for. So we want something that addresses all of these concerns.

Now, if we talk about cloud-native applications, and we take the definition that the Cloud Native Computing Foundation has, we're talking about applications that are containerized so that they're reproducible: they're stateless, and you can always fire one up with a known-good configuration. We're talking about applications that are dynamically orchestrated: you don't need to statically set up a topology for your application; it should adjust to meet your needs and the resources you have available. And finally, these applications should be structured as microservices. You get a lot of engineering benefits from having applications as microservices, but it also makes it really easy to scale up the teams that work on these applications, because each team can work on a different component with different responsibilities.

Now, if we think about the kinds of analytics and compute frameworks we do want to use today, these actually already address some of these issues. We want to be elastic. We want to scale out. We want to be able to respond to more data when we get it. So these contemporary frameworks are already dynamically orchestrated. Furthermore, they get resiliency essentially for free, and a lot of these frameworks are essentially microservices that compute on chunks of data anyway. So we may not be able to say that contemporary analytics and compute frameworks like Apache Spark are cloud native, but they might be cloud naturalized: they're most of the way there to being cloud native.

Apache Spark in particular, for those of you who aren't familiar with it, provides a programmer-friendly abstraction over distributed collections of data, and a way to schedule jobs of tasks operating on those collections. And on top of that, it provides a bunch of libraries for common tasks that you might want to do in your intelligent applications: graph processing, structured query processing like SQL queries, training machine learning models, and processing streaming data. Spark can manage itself, and it is dynamically orchestrated, but we've found it also runs really well on OpenShift, thanks to some work we've done inside Red Hat to make it work well in containers.
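To give a flavor of that programming model, here is a minimal PySpark sketch, not from the talk itself; the master URL is a hypothetical placeholder for wherever your cluster lives (on OpenShift, it would point at the cluster's service). It shows a distributed collection, a job of tasks over it, and the SQL library:

    from pyspark.sql import SparkSession

    # Connect to a Spark cluster; "spark://my-cluster:7077" is a
    # hypothetical standalone master URL.
    spark = SparkSession.builder \
        .master("spark://my-cluster:7077") \
        .appName("example") \
        .getOrCreate()

    # A distributed collection, partitioned across the cluster...
    numbers = spark.sparkContext.parallelize(range(1000000))

    # ...and a job of tasks that operates on it in parallel.
    total = numbers.map(lambda x: x * x).sum()

    # The SQL library gives structured query processing over the
    # same kind of distributed data.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    df.createOrReplaceTempView("records")
    spark.sql("SELECT count(*) FROM records").show()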
So we know OpenShift is great for regular applications. I want to show, in particular, how OpenShift is great for the kinds of applications you'd build that are going to learn from data and become better the more you use them. If we look back at those old parallel and distributed compute models, the thing they all have in common is that they treat your analytics jobs as a separate workload. Analytics is something you run off to the side. It's something your BI people do; it's something your data scientists do. It's not really part of your enterprise. It's a report that happens once a month. It's an offline thing. In the early days of using contemporary frameworks like Spark, people deployed Spark in the same way they would deploy one of those legacy frameworks. They said, well, I have a compute cluster and I want to run some applications, so I'm either going to figure out how to run my applications on this compute cluster, or I'm going to have a separate resource manager for my applications and have this competing process of orchestrating things across two frameworks. And that didn't work out very well. The reason it didn't work out very well is that analytics is not a workload we run on the side anymore. Analytics is the thing that makes Amazon suggest something to you that you want to buy, right? It's the thing that makes all the products and services and apps you use better. It's really an essential part of the contemporary applications we're running today. So the thing we've discovered is that it's actually much better to have compute clusters that are dedicated to each application, so that we can schedule them on the same control plane as our applications and run everything at once under OpenShift.

So I hope it's not too controversial to suggest that OpenShift is a great platform for these kinds of applications, just as it is for other applications. But OpenShift is also great for data scientists, and to talk about why, let's talk about how data scientists like to work. How many people here have worked with data scientists before, or done some data science exploration? Then you know that data scientists love these notebook interfaces, right? It's a sort of literate programming interface: a browser-based way to provide some explanatory text alongside an executable example. You can go through the notebook, run it, and generate output. It's a really great way to explore new ideas. Those of you who've done literate programming before know how nice it is to have the documentation and the code intertwined with the output of the code.

The problem is that, a lot of the time, these things are like pet applications that someone has developed in a notebook and wants to deploy. Sharing these can be really brittle. You may have a lot of dependencies; just to run a basic Jupyter notebook, there's a long list of Python packages you have to install. The results of the notebook depend on the order you evaluate the cells in, so maybe the order the cells appear in the notebook is not the order someone ran them in. And a lot of the time, there are dependencies on someone's data or environment that you may not be able to reproduce. So you get this situation where someone has a notebook with some great ideas in it, but you can't really turn it into something you can share, or into an application.

Well, let's take a look at the science part of data science and talk about doing reproducible experiments. What do we have that's like a reproducible experiment? Continuous deployment is a lot like having a repeatable experiment, right? We want a way to go from Git, to an image, to our pretty graphs. And that's exactly what OpenShift provides for data scientists. When I do work internally, I actually prefer to run a notebook under OpenShift, because I know that I'll be starting from a known-good configuration, and I can turn the notebook into an S2I image to share with other people later.

So I hope you're excited about building these kinds of apps on OpenShift. Here are some pointers for where to go next. The first and most important one: visit radanalytics.io, a community site where we've collected some of our open source work around this. There are OpenShift templates, images you can use to run Spark on OpenShift, and a bunch of tutorial applications that you can deploy, play with, and set up yourself from scratch. My colleague Mike McCune and I are leading a hands-on lab on Thursday morning, where we'll take a notebook and turn it into an application, all on OpenShift. If you can attend that and it's not full, I'd recommend it. Otherwise, there are a lot of great regular sessions on the Summit schedule this week. We have two tomorrow: "OpenShift and the insightful application development lifecycle," and "Converging insightful, data-led applications with traditional applications." And then on Thursday at 11:30 AM, we have a bigger-picture demo of running various Red Hat data technologies on OpenShift with Spark to produce next-generation applications. Thank you so much for your time.