Good morning. My name is Will Benton; I'm a software engineer at Red Hat. We've seen this morning how OpenShift makes it possible to develop and deploy applications. I'm going to be talking about a particular kind of application: data-driven applications. I'll talk about my team's experience developing analytic applications and doing data science at scale, about how our infrastructure requirements changed as we went from doing analytics as a workload to developing, deploying, and maintaining analytic applications, and about how OpenShift makes that possible.

For a little background, for a little more than two years now I've been leading a team focused on data science in Red Hat's emerging technology group. We really wanted to figure out what the data-driven applications of tomorrow are going to look like on open source infrastructure. We wanted to look for problems inside Red Hat and help people take things they'd maybe prototyped on a single machine and bring them to production scale. We wanted to go from models that were black boxes for predictions to user-interpretable models that told you something about the real world and didn't just give you a yes or no. And ultimately we wanted to do all of this while having at least as good predictive performance as the existing solutions we were looking at.

When we started off in this effort we had a lot of experience with Apache Spark, which is a framework for data processing that I'll talk more about in a minute. We had a bunch of Spark machines in a dedicated compute cluster. We had Gluster, a networked POSIX file system, in essentially the same rack as our compute nodes, and we were orchestrating everything with Apache Mesos. This worked pretty well while we were just a small development team. We could coordinate with one another and say, hey, I need this many machines for this model I'm going to train, and informally allocate things as we needed to. Where it broke down is when we wanted to take the things we'd built, put them into production, and share our applications and our data with our collaborators. Ultimately what we ran into is that this was a great way to run analytics workloads, but analytics isn't really a separate workload anymore. Today we put analytics into production as part of contemporary data-driven applications.

So what we wanted going forward was a way to use a shared Spark cluster both for development work and for production work, both interactively and at scale. We wanted an easy way to share our cluster and our work with collaborators, and we wanted a really nice continuous deployment workflow so that we didn't have to do a lot of extra work to update our applications. And we wanted to do all of this without becoming expert system administrators or scheduler policy ninjas.

In the rest of the talk I'm going to introduce Apache Spark for those of you who are just becoming familiar with it and talk about why it's a great fit for contemporary microservice architectures. I'll talk about some architectures for contemporary data-driven applications, and about two of the particular issues we had to solve to bring Spark to OpenShift: scheduling tasks and dealing with persistent storage. And then I'll show you how you can get involved and use this stuff yourself. So we'll start with some background.
How many people in here have heard of Apache Spark? Great. How many people have used it before? Great. So this will be a review for some of you, but we'll go through it so everyone has the same context.

The tagline for Spark is that it's a fast and general framework for distributed data processing. But that's really only part of the story. Fast and general, that's great, but I think the really compelling thing about Spark is that it's actually easy to use. A lot of other frameworks for distributed data processing, think of MPI or MapReduce, start from an execution model that's easy to run in parallel and say, hey, programmers, use this thing that's easy for us to execute efficiently. Spark goes the other way around: it takes a familiar programming abstraction and comes up with a way to execute it efficiently.

That fundamental abstraction in Spark is the resilient distributed dataset, or RDD. It's just like a sequential collection in a conventional programming language, except it can be partitioned across machines. These collections are immutable, and operations on them are lazy. The consequence of this combination of laziness and immutability is that we get resilience for free. So let's take a look and see what that looks like.

Imagine a conventional sequential collection in a programming language. It contains some values, probably all of the same type. If we want to distribute it, we can divide it up into partitions and put each partition on a separate machine. We can divide it into partitions in a bunch of different ways, but some of the most common are taking ranges of contiguous elements and putting each range in its own partition, or hashing each element and putting everything that lands in the same hash bucket in its own partition.

Now, the first D word we have on this slide is distributed. And once we have a distributed computation, we're going to have what? Failures. We're inevitably going to have failures. So these partitions can go away when the machines that are holding them, or maybe some other machine, go away. But we have a way to reconstruct them, and this is where immutability and laziness really start to pay off. When you have one of these RDDs, its basis is some set of values, either in an immutable collection on the heap in your application program or on stable storage somewhere. You build RDDs up by doing operations on them, and there are two kinds of operations.

First, you have transformations. This is where immutability comes in: transformations don't actually do anything to the underlying collection, they create a new collection. And because we're lazy, they don't actually create a new materialized collection; they create a sort of recipe for a new collection. So in this case, we have a very small collection, and we want to apply an operation to it, say filtering out all of the even numbers. We haven't actually created a new collection here; we've just created a way to say, given this underlying collection, how would we get to a new one? We can keep applying operations on this RDD, say multiplying every element by three, or expanding the collection so that we have twice as many elements, each element and its successor. And we still haven't computed anything here.
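As a very rough sketch, assuming a PySpark environment, the lazy transformation chain just described might look something like this; the collection and the specific operations are only illustrative:

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-transformations-sketch")

# A small collection, parallelized into partitions across the cluster.
numbers = sc.parallelize(range(1, 11))

# Transformations return new lazy RDDs; the underlying data is never modified.
odds     = numbers.filter(lambda x: x % 2 != 0)    # filter out the even numbers
tripled  = odds.map(lambda x: x * 3)               # multiply every element by three
expanded = tripled.flatMap(lambda x: [x, x + 1])   # each element and its successor

# Nothing has been computed yet: `expanded` is only a recipe for a collection.
```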
We just have a recipe for how to get the collection we ultimately want. If we do a different kind of RDD operation, called an action, we'll actually get a value out of this. In this case, we'll do an action called collect, which schedules these computations on our cluster and materializes the collection as an array in our application program. Now, because we're lazy, if we want to do a different action on that same collection, we'll have to compute everything again. So we have to filter out those even numbers, multiply everything by three, and expand everything into itself and its successor again if we want to do a different action, like save as a text file. Obviously, if we want to use a collection more than once, this is wasteful. So Spark gives us a way to suggest that something is going to be used again and cache these intermediate results (there's a quick code sketch of this below).

We can see what this looks like operationally by looking at the high-level architecture of a Spark application. The main application is called the driver, and it interacts with a cluster manager, which schedules executors, which are basically just microservices that compute partitions of these RDDs. The driver will serialize a function out to these executors, they will actually compute those values when they have to, and if we decide that we want to cache an intermediate result, those cached partitions will live on the executors.

This RDD is pretty pleasant to use, but it's really a lot like an assembly language or an intermediate representation for distributed computation. It's general, it's usable, it provides all the primitives you need, but a lot of programmers might want to work at a higher level. And the Spark ecosystem, which is built on this RDD and this scheduler concept, includes a lot of higher-level libraries for special-purpose tasks. Since being able to cache these intermediate results is a big benefit for things we want to iterate over, you can imagine that a lot of these libraries are for tasks that involve iteration. Think of graph traversals. Think of processing structured queries, like with a database query planner. Think about optimizing machine learning models. Spark even provides a way to treat a stream of values as many small RDDs, one for each window over the stream, and thus use the same abstraction to program streaming and batch workloads.

Out of the box, Spark provides a few different ways to deploy this. You can use the self-managed standalone cluster manager, you can use Apache Mesos, or you can use Hadoop YARN if you have an existing Hadoop installation. What I want to tell you today is that we can actually run these standalone Spark clusters on top of OpenShift, and there's work going on in the Kubernetes community and with some people on my team to have native support for scheduling Spark tasks on Kubernetes as well.

So, microservices. I assume a lot of people here have an opinion about microservices. Is anyone allergic to microservices? Anyone delighted by microservices? Everyone else in between? Is that fair to say? I assume most people would be in between. Microservices aren't a panacea, right? We do get some extra complexity in having to define these interfaces and orchestrate these components, but we get a lot of compelling benefits in return. Another really nice thing is that Spark is a natural fit for these kinds of architectures. Just to review why this makes sense: if we have stateless microservices, our services are commodities.
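Before going on, to close the loop on the RDD discussion: continuing the sketch from above (it reuses the SparkContext `sc` and the `expanded` RDD defined there), the actions and caching described a moment ago might look roughly like this:

```python
# An action forces the recipe to be evaluated on the cluster.
print(expanded.collect())            # materializes the collection as a local list

# Because evaluation is lazy, a second action recomputes the whole chain...
expanded.saveAsTextFile("/tmp/expanded-output")   # illustrative output path

# ...unless we hint that the intermediate result will be reused.
expanded.cache()
print(expanded.count())              # computed once and cached on the executors
print(expanded.sum())                # served from the cached partitions
```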
Back to stateless microservices: if one of them goes away, we can replace it with another one that meets the same responsibilities. We have a lot of flexibility in how we deploy these things: we can scale up, we can scale out, or we can load balance. It's also a lot easier to debug a stateless service. No environment can make debugging into a trivial problem, but it's a lot easier to say I want to examine the surface area of this well-defined API than it is to say I want to explore some hidden state inside a stateful service and produce a test fixture that will reproduce a bug. This holds for performance problems, too: those of you who've done large-scale performance work know it's a lot easier to look at an individual component that's not meeting its SLA and say something needs to change here than it is to look at a big system and figure out where things are going wrong. There are a lot of social benefits of microservices, too, but I'm not going to be the fourth person to talk about those this morning.

Another nice thing about Spark, as I mentioned, is that it's really a natural fit for these architectures: the executors are really microservices that compute partitions of these collections. They are essentially stateless if we ignore caching, and caching is just an optimization. They do what they're told, and when one of them goes away, we know how to replace it, because we have the underlying collection and we have the operations to apply to it to get back to the value. So the good news is that if we want to make a contemporary data-driven application on OpenShift, Spark is a great place to start. But if we want to take advantage of contemporary application architectures, we need to think about analytics as a component of an application and not just as a separate workload.

So let's start by taking a high-level look at what a data-driven application actually has to do. A data-driven application is really a lot like any other application, except that typically you're transforming and aggregating data from different sources, using that data to train predictive models, transforming and making predictions from your data, saving predictions and raw data to archival storage, and supporting a few different kinds of user interface. Maybe you're letting developers or data scientists install new models or modify how models are trained. You have, of course, a typical end-user web or mobile UI. You have a reporting UI for the business side, and you also have a management interface so that you can tune how your application is deployed and make sure it's performing well.

We'll start our review of architectures by looking quickly at a couple that are good analytics-as-a-workload architectures. The first one is probably going to be familiar to a lot of people: the conventional data warehouse architecture. We think about a stream of events arriving. We're going to transform those somehow, apply some business logic rules, and put them in a database that we've optimized for fast concurrent updates. We're going to call this the transaction processing layer. Stop me if you've heard this before. Now, this is great because these fast databases can deal with a high volume of events, but they don't have any analytic capabilities. To extend this with analytic capabilities, we're going to periodically send the contents of our transaction processing database to another database that's optimized for a lot of concurrent reads and complex queries.
We can then do some analysis on that, typically folding multidimensional data into a spreadsheet-style view so that it's comprehensible and you can make a report out of it. We can also feed those analysis results back to the business logic for use in the transaction processing half. Finally, we may want to allow analysts to engage in interactive exploration of our data as well. We call this the analytic processing layer. This architecture has worked really well for a long time for a lot of applications. The downside is that it's really hard to do this with stateless services. It's really hard to scale out the transaction processing side; historically, you haven't been able to scale out relational databases very well. The analytic processing side is a little easier to scale out, but it's always going to lag behind the current state of the world, which is kind of a problem, and there's a lot of additional complexity there as well.

One approach to actually scaling out compute and storage is the data lake approach that was popularized by the Apache Hadoop project. The idea here is that you have a uniform abstraction for federating all the data you care about: a distributed file system. If you run out of space, you can add more space, and those nodes are going to stick around for a while. When you get application events that you care about, you append them to files in the distributed file system. Those events aren't going to be in the format you care about for training a predictive model or running your data-driven application, so you're going to schedule some jobs to run on them. The way the scheduling works is that the jobs that operate on particular parts of the data migrate to the machines that store that data, so you get scale-out compute on top of your scale-out storage. This was a great approach in that it let people really exploit commodity hardware and get scale-out, and the really compelling benefit of being able to store a lot of data was huge for a lot of organizations. The downside is that the programming model is pretty low-level, and it's not really a great fit for the kind of elastic architecture we want in a cloud-native application: you can't scale your storage and compute independently, because the compute is deliberately tied to wherever the data lives.

The legacy architectures we've just seen solved a lot of problems, but they're not the best way to have these kinds of applications in production now in 2016. So we're going to look at some architectures that are more suitable for contemporary, containerized, data-driven applications.

The Lambda architecture is an interesting approach to modernizing that classic data warehouse. Instead of having two databases, you have two analysis layers. You get a stream of events, and you multiplex those both to a stream processing layer and to a distributed file system. In the stream processing layer, you're repeatedly performing imprecise analyses on your latest data, whereas on the distributed file system, you're regularly performing precise analyses on all your data. Then you can present your user with a view of their data that federates both the latest results from the stream processing layer and the precise results from the batch layer, striking a balance between imprecise but current information and precise but possibly stale information. The advantage here is that you get a lot lower latency than you would in the conventional data warehouse, and that you can also schedule this.
Really, a lot of this can run as stateless containers. The downside is that you have to implement your analyses twice. Sometimes it's hard enough to get them right once, but here you have to implement them both as batch analyses and as streaming analyses, which is a lot of extra engineering effort.

The Kappa architecture, which is sort of a response to the Lambda architecture, reflects the fact that the way we design streaming algorithms and stream processing systems has changed a lot, and the idea is that everything is on the queue. So we have events, and we put them on a message queue with a raw data topic. We transform those and write them to a pre-processed data topic. Our analysis jobs simply read from the pre-processed data topic and write to an analysis results topic. And then our UI components just subscribe to the analysis results topic and present those results to users in different ways. This is a really nice way to do analysis, where everything is a stream, but it does assume that you have a sophisticated stream processing system and that the analyses you want to do are actually expressible as streaming algorithms. Now, we can do a lot of things as streaming algorithms, but we can't necessarily do everything, so those are good assumptions to keep in mind.

In my opinion, the most flexible way to do this involves moving the problem of data federation from a storage layer, like a database, a message queue, or a distributed file system, to a compute layer, and then explicitly transforming data on ingest into the system. Spark, with the RDD and the abstractions built on top of it, gives you a flexible way to interact with a lot of different data sources, transforming them on demand in the compute layer and then operating on them there. Since we've seen that Spark executors are essentially microservices, we know that we can run this in a contemporary containerized environment. But Spark is also general enough to implement the analytic processing side of a conventional data warehouse, the stream processing or batch processing sides of the Lambda architecture, or the stream processing side of the Kappa architecture. So Spark is really general and gets us a lot of benefits there.

Now, is it enough to just have Spark as the basis for our contemporary data-driven applications? Well, you can still go wrong. You can still say, I have Spark and I'm going to deploy it like something that's just analytics as a workload. If you have one resource manager for your applications and one resource manager for your compute cluster, you get into a situation where you have to make scheduling decisions in two different places, and you're probably not using your resources as efficiently as you can. A better way to do this is to have one Spark cluster, either logical or actual, per application, so that your unit of scheduling granularity is actually the application. Then your resource manager is able to make the right decisions for each application, making sure that application components are co-scheduled with the compute they depend on.

The compute requirements of our applications can change over time, though. In this example, we have an application with a huge task backlog but only a single Spark executor. In a conventional standalone Spark deployment, we can use Spark's support for dynamically allocating resources to temporarily give it more executors, and we can do something similar here.
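As a sketch of that conventional approach, dynamic allocation in a standalone Spark deployment is driven by configuration along these lines; the application name and executor limits here are hypothetical, and the external shuffle service has to be enabled on the workers:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("elastic-analytics-app")                  # hypothetical name
        .set("spark.dynamicAllocation.enabled", "true")       # request executors as the backlog grows
        .set("spark.shuffle.service.enabled", "true")         # needed so executors can be released safely
        .set("spark.dynamicAllocation.minExecutors", "1")
        .set("spark.dynamicAllocation.maxExecutors", "10"))

sc = SparkContext(conf=conf)
```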
But since we're wired in to OpenShift, we actually have a service that reads the metrics from our running Spark application and says, give me some more nodes to run on. So we can extend our elasticity from below the Spark cluster up into the control plane.

Now, in my team's original deployment, we had a cluster file system living in the same rack as some of our compute nodes. POSIX file systems are really great, right? You can run grep on them. You can do all of these kinds of things you're familiar with. But this was a problem for us from a management perspective, because when we wanted to share our data with other users, managing access control became a pain. It also meant that we were depending on a path rather than a service. Another option people use for storage is to have HDFS as a peer to their application scheduler. This can work pretty well, but you don't get the advantage of co-locating your storage and compute, which is something HDFS depends on to some extent for performance.

So I want to talk about a different solution for storage, which is just using object stores. There are a lot of different implementations of the S3 API: Amazon's, obviously, the Swift project from OpenStack, and Ceph. You get fine-grained access control with these, and you can access them from a range of libraries (there's a short code sketch of this below). The downside is that if you have an application written against a POSIX file system, you have to change your expectations for consistency and performance.

Now, the idea that you need to co-locate your storage and compute was largely inspired by the popularity of HDFS, and it's an interesting assumption that I want to take a minute to address. You can scan these QR codes and read the whole papers, and I'd encourage you to do that. But one of the things you see in contemporary workloads is that a lot of analytic processing workloads actually do fit in memory once you've pre-processed your data. We also see that if you have a fast network and maybe data center locality or rack locality, that can be about as good as reading from local disks. Finally, frameworks that are designed to depend on local storage may make some assumptions about how expensive storage is. We know that writing to physical disks isn't actually cheap, but people assume that it is, and so you get cases like this one, where a distributed file system is used to glue jobs together and you're spending almost a third of your time just moving temporary files around. This is from a great blog post about using Spark at scale, and if you read the whole thing, you'll see that they were able to eliminate that overhead by replacing those glued-together jobs with a single job that didn't have to use the distributed file system for that.

The other thing to consider about storage is that you're reading that big raw data from storage once, pre-processing it, and then operating on medium-sized data that's going to be cached as you train your model. We're spending most of our time operating on data that fits in memory in this case. We could optimize the part where we're reading the big raw data off of storage, but that's going to get us diminishing returns compared to making the in-memory part fast. So you may not need to optimize by co-locating compute and storage; there are both practical and philosophical reasons why you might not need to do this.
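Coming back to the object-store option for a moment, a minimal sketch of reading from an S3-compatible store in a Spark application might look like this, assuming the Hadoop s3a connector is on the classpath; the endpoint, bucket, and credentials are hypothetical placeholders:

```python
from pyspark import SparkConf, SparkContext

# "spark.hadoop.*" settings are passed through to the Hadoop filesystem layer.
conf = (SparkConf()
        .setAppName("object-store-sketch")
        .set("spark.hadoop.fs.s3a.endpoint", "https://objects.example.com")  # e.g. a Ceph or Swift gateway
        .set("spark.hadoop.fs.s3a.access.key", "MY_ACCESS_KEY")
        .set("spark.hadoop.fs.s3a.secret.key", "MY_SECRET_KEY"))

sc = SparkContext(conf=conf)

# Read raw event data directly from a bucket instead of a POSIX mount.
events = sc.textFile("s3a://example-bucket/raw-events/*.json")
print(events.count())
```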
But if you do need that storage and compute co-location, it's possible to get rack or data center locality pretty easily without sacrificing a lot of flexibility. And there are smart people working on making it possible to get that same-machine co-location without sacrificing a lot of flexibility either.

So we're really excited about Spark on OpenShift, and I hope you are too. I want to show you how you can get started and try it out yourself. We're doing all of our work upstream in the radanalytics.io GitHub organization. My colleague Michael McCune has a great video demo of the developer workflow for deploying a data-driven app on OpenShift. And my team, and a bunch of other teams who are working on this, are going to be at the Big Data meeting at lunch, and we'll be happy to answer your questions then, as well as now, I guess.