The PDFs don't have the animations, that's the problem. So, let's try this with the VGA one more time. Give me just a second here, we'll see. No, it disconnects right away when I start the presentation. Does it disconnect if you don't start the presentation, even if we see the outlines of the other slides? I don't think so, but then I can't show my animations. Do you want me to just do this talk without slides? I could. Or is there a way to adjust the resolution on the projector, or the aspect ratio? I can try this one. Can we turn off power saving mode on this? I guess let's see if it goes away. Does it only go away when I play? Now it seems it works. Let's see if it goes away. Alright? Okay. Start again.

This is the next session. Just a reminder that we will have a final session at 4:30, after the end of the whole session, in the main room, with some prizes. But now let's start with a new presentation: Will Benton, from the Emerging Technologies team at Red Hat, will talk about big data in production.

Thank you so much. I spent many years making Luigi miserable when we were on a similar project, so I'm glad I can make him miserable when he's leading a session, too, with technical difficulties. Today I'm going to talk about some things we've learned using Apache Spark in production. How many people in here have used Spark before? How many people sort of know what Spark is? So I'm targeting the second group of people. If you've used Spark before, you will still learn something. If you haven't used Spark before, you'll get enough context to learn something for when you start using Spark.

We basically have a story in three acts today, plus a prologue where we introduce Spark, to give context to everyone, so that when you start using Spark later, you'll remember what I said and it will make sense. In the first act, we'll talk about how my team has used Spark over the last few years. I think the first thing I did with Spark in public was a blog post about Spark on Fedora in April of 2013, so I've been using Spark for a long time, relative to the age of Spark. In the second act, we'll talk about some lessons we've learned along the way that are applicable to general distributed data processing and contemporary applications. And in the final act, we'll talk about how we're doing all this on OpenShift now, and how we've gone from having analytics as a job, as a workload, as something that lives on the side, to something that's a part of applications we care about and is integrated with the application lifecycle. Since it's only an hour, and since we spent the first four minutes getting the projector to cooperate with my iPad, we're not going to cover everything. You're not going to leave here as an expert Spark user or an expert data scientist.
I'm not going to cover a lot of things like system tuning or JVM tuning, although I will say that if you have a large heap, you need to worry about it. And I'm not going to talk about implementing custom algorithms or anything like that, either. So if you were expecting those things, I apologize, but I think you might still enjoy the talk.

So, the prologue: we're going to talk about Spark first. The interesting thing about Spark, from my perspective, is this: think about other systems you might use for distributed computing. What's another system you could use for distributed computing? You could use Condor. You could use MPI, or PVM, or OpenMP, or just threads. All of those things we just mentioned are things where someone says, "I have an execution model that's easy to run in parallel, and I'm going to make programmers fit their programs to it." With Spark, there is still an execution model that's easy to execute in parallel and easy to distribute, but they really started with an abstraction that looks like something programmers already know how to use, which is a stark contrast to something like MapReduce, which no one would use by choice.

With Spark, you have this fundamental abstraction: a collection. It's just like a collection you would see in a conventional serial programming language, except it's partitioned, so these collections are spread across multiple machines. It's immutable: when you update the collection, you don't update it in place, you make a copy that has the changes you wanted to make. But you don't actually update it by making a copy, because these things are also lazy: you don't do anything to these collections until you have to. You build up a recipe of changes on a collection, and then, when you need to get a value out of it, the computation actually happens and gets scheduled.

The interesting thing is that because we're partitioned across multiple machines, we're absolutely going to have failures, right? Once you have more than one computer involved, you can't avoid failures. But this combination of immutability and laziness means we can always recover from failures; we essentially get resiliency for free. This thing is called the resilient distributed dataset, or RDD, and Spark treats it as the assembly language for distributed computing.

Here's what it looks like in practice. I have a partitioned collection. Maybe I want to run a filter operation, so I filter out some values from my collection. Maybe I want to map over these values, changing them into something else. Now, let's say one of the machines that's running my Spark program goes away, and I lose a partition. Fortunately, Spark knows how to reconstruct the partition and can schedule jobs to do just that: starting with the values I had in that partition to start with, re-executing the filter, and then re-executing that map operation as well.

Here's what it looks like in code. I do most of my Spark programming in Scala, but I assume that most people can read Python, or at least that this is close enough to executable pseudocode that it makes sense even if you don't like Python. This is a program that counts the number of words in a file. It starts by creating one of these distributed datasets, backed by the lines of a text file. It then converts the dataset of lines to a collection of words, which is what we're doing on the line where we split things by whitespace.
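For reference, here is a minimal sketch of the word-count program being walked through (the remaining steps are narrated next), assuming PySpark's classic RDD API and an existing SparkContext named sc; the file names are hypothetical:

```python
lines = sc.textFile("hamlet.txt")                # lazy RDD backed by the file's lines
words = lines.flatMap(lambda line: line.split()) # split each line on whitespace
pairs = words.map(lambda word: (word, 1))        # one (word, 1) occurrence per word
counts = pairs.reduceByKey(lambda a, b: a + b)   # combine the counts for each word
counts.saveAsTextFile("counts")                  # an action: computation runs here
```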
And then we convert each word to a tuple of that word and the number one. Call that a word occurrence: it says you've seen that word once. Then we combine all of those together, so instead of having a bunch of things saying you saw a word once, you have one thing for each word saying how many times you saw it. Now, because we're lazy, we haven't actually done anything until we get to the last line, where we say we want to save the result to a text file, and that's when the computation actually gets scheduled. So that's just a flavor of what Spark looks like.

Here's how it looks when you actually execute a Spark program. You have a driver, which is your main application. You have something that manages resources on your cluster and tells the driver about them. And then, when you want to execute functions on individual partitions, the driver ships the functions you want to execute off to each executor, which acts sort of like a microservice that calculates the value for a partition. Optionally, you can cache these intermediate results if you're going to use them more than once, because the problem with being lazy is that sometimes being lazy means you have to do more work than if you had just done the work in the first place.

I was talking about RDDs there; as I said, they're sort of an assembly language for distributed data processing. The interesting thing about Spark is that you could program like that, and I don't mind doing it, but people usually want a higher-level thing. So on top of that foundation, the distributed collections with a sensible programming model and a scheduler to run graphs of jobs on top of that programming model, there are a bunch of high-level libraries for other tasks, like graph algorithms, structured query processing, and machine learning, and there's even a way to handle streaming data by discretizing your stream into a bunch of these little collections. And you can deploy this in a bunch of different ways. Your cluster manager can be the standalone cluster manager provided by Spark itself, you can run on Apache Mesos, or you can run on Hadoop YARN if you have one of those environments. We've actually had a lot of success running these standalone clusters on Kubernetes and OpenShift, and there's work ongoing in the Spark and Kubernetes communities to get first-class support for scheduling Spark on Kubernetes directly.

So that's the really high-level overview of Spark; I hope that's enough context so that we all get where we're coming from. If there are questions about Spark, I can take them at the end as well. Here are some examples of what we've used it for, just to give you a flavor of what we've done and where we started. When we started off, we were all working with Spark on individual machines. One nice thing about this compute model is that you can use it to either scale up or scale out: you can write a parallel program on a machine that has a lot of cores, or you can write a parallel program that runs on a cluster, and running those two things is not as different as it is in some environments. But as our work with Spark matured, we went from prototyping on single machines to running on clusters, and we wanted to help people do the same thing. My team was looking to help groups within Red Hat who had data problems, who had an analysis they had prototyped on a single machine and wanted to scale out.
We also wanted to make improvements on the data science side: to help people go from models that were black boxes to models that were human-interpretable, models that tell you something about the world. So instead of just saying I have a machine that gives me predictions, I want to look at a model and ask, what does this tell me about my data? And of course we also wanted models with improved statistical performance, wherever we could get that. We were really looking at doing all of these things with open source tools, helping other people as much as possible to solve data problems, collaborating with people, and bringing fresh eyes to their problems.

We were able to do a lot of cool projects during this time. Just some of them: one project was actually a prototype of using Spark to prototype new data science techniques, using Spark the way you would use R or the Python data ecosystem for prototyping, to analyze bike data and ask, where's the best place to do workouts near where you live, based on your historical bike rides? We've modeled the infrastructure costs for cloud services based on metrics; I'll talk a little more about that example later. How many people in here know what fedmsg is? fedmsg is the unified message bus that all of the Fedora infrastructure runs on. It's very cool, and there's a lot of data: anything that happens in Fedora has a message published about it. And the great thing is that all of this data is open, and you can download it and analyze it. So we did some work to look at the health of the Fedora project: how many people are contributing, what kinds of contributions are they making, is there anyone who can't go on vacation without crippling the project? (No, fortunately, there's not.) And then also some cool work on making sense of performance data with Spark, and on machine provisioning data: this machine has this set of packages installed, so what are we using it for, and can we figure that out automatically?

So that's just some of the things we've done with Spark over the past few years. As you might imagine, we had a lot of opportunity to make mistakes, so in this next act I'm going to let you learn from our mistakes. I'm going to structure this act as a series of lessons, and I'm going to start with a meta-lesson, which is how to master any declarative programming environment. I'm going to give three rules. The first two rules are rules I learned when I was programming Prolog in production. How many people have used Prolog before? How many people have used Prolog in production before? Okay. The cool thing about these rules is that I really believe they apply to any declarative programming environment, whether you're using something really weird like Prolog, a logic language, or a functional language, or a declarative, immutable, lazy environment like Spark, which exists in a bunch of different host languages. You can think of these as beginner, intermediate, and advanced steps, and some of the other lessons are going to tie into them. The first step is to understand the programming model: the primitives that are available to you and the idiomatic ways to do things. I think of this as the "what does this mean?" question.
The second step, the intermediate one, which a lot of people never even get to, is to say: well, I know how to express what I want to do, but I also need to know, operationally, how it gets executed. That's the "what does this do?" question. And the third step, which is very advanced (which is sort of unfortunate, because it shouldn't be), is to give up your ego as a programmer, get out of the way, and let the environment or the language or the runtime do as much for you as possible. That's really the expert level: getting out of the language's way.

To give you a sense of how much Spark has grown in the last few years, this is the final slide from a talk I gave at ApacheCon in 2014 on optimizing programs that use RDDs. If you were writing for the assembly language of distributed data processing, I had a bunch of suggestions about things you should do to make sure things worked well. Some of those things depended on the "what does this mean?" lesson, but a lot of them depended on understanding how your programs were actually scheduled and executed, looking at the operational details. You can still look at that talk if you want to, there's a blog post, and I think some of those lessons hold up. But the really cool thing about Spark today is that if you start with the higher-level interfaces, you get right to the "how can I get out of the language's way?" question. You don't even have to worry about how things are executing at first to get good at Spark.

What you do still have to start with is "what does this mean?". Thinking about these immutable, lazy collections, and the things that are built on top of them, is a paradigm shift for a lot of people. But even once you've gotten past that, there's still an enormous API to use in Spark, and there can be huge performance implications between things that seem similar. So I'm going to give one example of how using the right API method can make a big difference. This is something that I actually did and had to learn the hard way.

Say you're writing something to train a machine learning model. You have a whole bunch of workers constructing intermediate models, and you're going to aggregate them all together and combine them to make one big model, one model that incorporates results from every partition of your data. Now, Spark has an API method called aggregate, so I assumed that if I want to aggregate something, I use the aggregate API method. Well, if we dip into "how does this work?" a little, we'll see that what that does is transfer all of those intermediate models to our driver application, which combines them together, a pair at a time, until eventually only one is left, and that's our completed model. So this works, right? But what's not happening here? Yes: nothing is happening at the workers. We have what is essentially a serial program running on the driver. The other thing that's happening is that, depending on how many partitions you have and how big your model is, you're sending a ton of data to the driver. So you have two big downsides: increased memory pressure and decreased parallelism, and those are not the directions I want those two things to go. Right?
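To make the contrast concrete, here is a sketch of the two calls; the fix described next is just swapping one method for the other. The zero value and the two combiner functions are invented placeholders for whatever your model-merging logic is:

```python
# aggregate: every partition's result is shipped to the driver, which
# combines them serially -- memory pressure, and no parallelism.
model = partial_models.aggregate(empty_model, merge_into, combine_models)

# treeAggregate: the same arguments, but partial results are combined
# in rounds on the executors before the driver ever sees anything.
model = partial_models.treeAggregate(empty_model, merge_into, combine_models,
                                     depth=2)
```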
A better approach would be to aggregate at the workers and take advantage of parallel execution on the cluster. That would look something like this: we do basically the same aggregation, but it happens across multiple machines, and the driver program doesn't even have to get involved until there are maybe two or one things left, and those just get shipped over from the workers where they were assembled. Now, you might think this involves re-architecting your application, but actually it just involves replacing the aggregate API method with the treeAggregate API method. It's a very simple change, but it speaks to the advantage of reading the documentation and looking for things you don't need immediately but might need later, and then remembering them. If I hadn't remembered that I'd seen this treeAggregate method, my code would have been a lot slower and worse. A lot of times you develop these programs, and it's difficult to get them working or difficult to get good results, and you don't always think about how you can make them better. So this is an example of doing that.

Now I'm going to talk about how to get out of Spark's way, with two of my favorite features in Spark. Both of these give you something for free, just in exchange for using Spark in a certain way. The first is query planning, which is what you get if you run structured queries using Spark's DataFrame API. It basically makes that whole talk I gave about optimizing RDD programs in 2014 obsolete, because all of those low-level rearrangements are things the query planner does for you, in your programs, for free. What characterizes query planning is that it makes dumb code run faster. The second: if you use Spark from a typed language and you use the typed interfaces, like if you use Spark from Scala and use the interfaces that preserve type information, you can prevent really dumb code from running at all. That's a big advantage if you're training a machine learning model that takes three hours to run, and you'd otherwise only find out two hours and 45 minutes into the run that you forgot to handle a null. It's impolite to talk about politics or religion, so I'm not going to try to convince you of the utility of type systems, but I will talk about query planning.

If you're not convinced that database-style query planning is interesting, we can look at a simple example, and I think the kinds of things it can do will be clear. We select everything from two tables, where we have a join condition, a condition on the values we're considering from one table, and a condition on the values we're considering from the other, and furthermore we're assuming that not very many rows are going to satisfy those extra conditions. Well, what happens if, as a database or as Spark, we just execute this in order? We join those two tables together, producing a relation that maybe has a whole bunch of tuples in it, taking everything from A and B, and then we filter it down: we run the filter for these uncommon conditions that we care about in our result.
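To make the shape of this example concrete, here is a hypothetical version in the DataFrame API; the table and column names are invented, and Spark's Catalyst planner is free to rewrite the plan regardless of the order you wrote it in:

```python
# Naively read: join everything, then filter. The optimizer can instead
# push both filters below the join, so each table is pruned first.
result = (a.join(b, a.k == b.k)
           .where(a.x == "extremely rare value")
           .where(b.y == "uncommon value"))
result.explain()  # shows the optimized plan, with the filters pushed down
```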
Well, maybe we only have a few things left after we run that filter. So that works; it gives us the right answer. But it's not the best we can do, right? There's a very simple trick, one that Spark will actually apply for us if we use the query planner: do these filters as early as possible in the plan. Then we only have to consider rows that satisfy the extremely rare condition for A and the uncommon condition for B, and we join those together, producing a much smaller result that maybe winds up being only the things we care about, depending on how selective our conditions are.

As long as we're talking about query processing, the next lesson is about archival storage formats and about data locality. I want to motivate why I'm bringing up data locality, because this is something a lot of people worry about, something a lot of people lose sleep over before they have to. Data locality is something you may eventually have to lose sleep over, but you shouldn't lose sleep over it before you have to. Now, why do people lose sleep over data locality? Let's go back to the Hadoop era. If you've done anything with Hadoop MapReduce, you know that Hadoop gave you a scale-out file system, and you could scale out compute on top of your scale-out storage by migrating compute jobs to the machines that housed the data they were operating on. It enabled people to use commodity hardware to process truly ridiculous volumes of data, but it motivated a lot of assumptions that have stuck around for a long time: your disk is never going to be fast enough, your memory is always going to be too small, your network is absolutely holding you back, and if you don't have data locality, just give up on doing distributed data processing.

Now, data locality is a really important optimization. When we're doing systems work, we want to be as high up the memory hierarchy as possible at all times. But do we need it? And why might we not want to need it? Achieving data locality isn't a problem if you have a Hadoop-like architecture, where you have more or less permanent storage resources and enough of them to handle any compute you care about. But what if you want to scale your storage and compute independently? What if you want to run in stateless containers in the cloud, where you can't guarantee that your storage and compute are going to be co-located? Do you just give up on data processing? Maybe not, and we'll see why.

The first thing I want to raise is this idea that you actually need enormous disks and that you don't have enough memory. A lot of studies in the last five or six years have shown that for the most realistic analytic workloads, the working-set sizes actually fit in cluster memory, even if, in the 2012 case, you just had 32 gigabytes of memory per server for caching. I think you probably knew someone who had more than 32 gigabytes of memory in their workstation in 2012, right? It's not an outrageous number.
The second point about worrying about data locality before you have to is that networks are actually pretty good. Even with Hadoop, you see that you don't take that much of a hit for reading data over a fast network instead of from your local disk. And on my team, we've seen that running real-world Spark jobs against Gluster or Ceph running in the same rack as our compute nodes is almost indistinguishable, at the application level, from running against data on local disks, because the working sets fit in memory, the data you read is going to get cached locally anyway, and networks are pretty fast.

The last thing is about compute frameworks that assume they can always write to a local disk, that they have something that's a boundless but fast cache. Is a disk a random access device? No one wants to answer this question. A disk is a sequential access device that has a random access interface. Your spinning disk is a rotating platter; if you treat it like a random access device, you're going to get terrible performance. You have that interface, but you shouldn't use it that way. I think something similar happens with Hadoop. Are local disks fast? Not really. How much slower is a local disk than memory? A lot slower. But if you assume you always have this scratch space and that it's fast enough, you're going to use it a lot. There's a great blog post that I recommend people read about a big Spark production use case: they were migrating a Hadoop workflow, and the Hadoop job spent 30% of its time just moving temporary files around, because it assumed it had fast local disk to use. So, in summary: before you worry about node locality, make sure you need it; there are a bunch of reasons why you might not. And if you do actually need it, you're still not stuck; you can still run in containers. There's great work in the Kubernetes community and on OpenShift to support storage affinity, running storage in containers, and so on.

Now I want to talk about something you can do with your storage that will impact your performance more than having locality, and that's the kind of archival format you use. Think about how you might store a table in a row-oriented format: maybe you have every tuple in order, and maybe you have indices for some of the columns, and if you want to find certain rows, you scan through the file and get them. Pretty straightforward. So let's say we want to take this thing stored in a row-oriented format, just a bunch of rows next to each other in a file, select a few rows, and then project out just a few columns that we care about from each of those rows. In the row-oriented format, we scan through the whole table, we keep the rows we care about, and then we project away the columns we aren't interested in. Pretty straightforward.

There's a better way to store this data. If we store it in a columnar format, where we have a separate file or an index for each column, we can rearrange values to exploit redundancy. Say you have a column that has four distinct values: you don't need to store a copy of those values for every row, you just need a way to look up the rows that have each value. Then you can answer queries about these things really efficiently, even without explicit indices, and assemble rows based on the descriptions of the values in each column.
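As a concrete sketch of what this looks like in Spark, assuming a DataFrame df already loaded from the row-oriented source (the path and column names here are invented), Parquet is the columnar format Spark supports out of the box:

```python
df.write.parquet("metrics.parquet")       # write an archival, columnar copy

cols = (spark.read.parquet("metrics.parquet")
             .where("itemid = 42")        # row selection
             .select("clock", "value"))   # only these columns are ever read
```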
A really nice thing about the columnar format is that you don't even need to look at columns you don't care about, whereas with the row-oriented format you're basically polluting caches at every level of the memory hierarchy just to get a bunch of information you're going to throw away. So in the columnar format, our query looks like this: we find the values we care about in each column, and then we reassemble rows from them. Now, this isn't a panacea for data processing. If you were doing transaction processing, you would never use a columnar format, because writes are very expensive. But if you have read-only data, like in an analytics workload, we've found that typically this uses less than 10% of the space, and you get between one and two orders of magnitude of speedup for queries, just by using this format.

I want to show you a concrete example of how this worked out when we were modeling infrastructure costs. We had a big set of metric and telemetry data stored in a relational database, and we also had billing data from a cloud provider. We were looking at a big service and we wanted to predict how much it would cost to run; we had the idea that if we looked at what the system was doing, maybe we could figure out a function from what it was doing to what it cost to run. As part of exploring this data, we wanted to run a really basic summary-statistics query. This is a table from Zabbix where we store a whole bunch of metric data, about 120 gigabytes of it, and we evaluated the query in a couple of different ways. We're just getting summary statistics on each kind of item and joining to find out what each kind of item is: items.key is a human-readable name, we want to get that from another table, and we want to aggregate the values we see. This was a pretty substantial machine we were running the query on, but if we ran it in the database itself, it took between 15 and 18 hours to run over this much data, for a couple of reasons. One is that we're only using one thread. Another is that we don't have indices for all of these things, and when you have a row-oriented format and you don't have indices, the database query planner can't do anything for you. When we ran it in Spark, it was 10 to 15 minutes. We were able to take advantage of the parallelism, we got that nice property of the columnar format of having the implicit indices everywhere, and we got a really big speedup. So even if you're not interested in learning Spark for its own sake, if you have to do a lot of these kinds of queries on relational data, it might be worth checking it out.

A quick question: was Spark running on that machine as a single cluster, just a single machine for Spark? Yes, and that's the other nice thing about Spark: you can scale up as well as out. It wasn't a Spark cluster; it was one physical node with a lot of cores, so it's like a cluster inside a single computer.

Okay, continuing to make Luigi miserable: I was going to talk about streaming, but since we're short on time, I just want to say that there are a few things to think about with streaming. There are a lot of really cool things happening in streaming these days, and a lot of people are looking at different systems.
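As a minimal sketch of the discretized-stream model mentioned earlier, where Spark chops a stream into a sequence of little collections (the host and port are invented, and sc is an existing SparkContext):

```python
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=1)        # one-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)    # any text stream will do
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))   # word counts per batch
counts.pprint()                                    # print each batch's counts
ssc.start(); ssc.awaitTermination()
```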
But there are a lot of considerations for streaming, and you probably already have some streaming data source in your application. The main takeaway I want you to get here is that it's really easy to adapt Spark to work with a new streaming engine, and if you don't need the durability and endless replayability that some sources get you, you can probably use what you already have; you can stream almost anything. And you may be doing model training in something other than Spark, so think about how you can interoperate between a model you train in Spark and something you have to evaluate somewhere else.

Now I'm going to talk briefly about predicting the future. It's very difficult; it's harder than predicting the past, usually. Here are some lessons we've learned about the machine learning side of things. Really, the most important part of most machine learning is feature engineering. Those of you who were at the workshop yesterday got a crash course in this, but basically the idea is: how do I take objects in the real world, or in my program, and turn them into things that a machine learning algorithm can train a model from? You're going from something that is recognizable to you, either as a programmer or as a human, to a vector of numbers. There are a bunch of techniques for this. The thing I want you to take away from this talk is that once you've decided on one of those techniques, whenever possible, have a way to go back from those vectors of numbers to things in the real world. It's not always possible to do that, but it's really great if you can have a way to say: how do I go back from this thing and figure out what it actually means? Why have I characterized these types of bicycles as these kinds of numbers? And the big reason you want to do that is that, ultimately, it's very nice to have human-interpretable models.

People have heard of deep learning, maybe. It's a very hot technique now, and it's very good for a lot of traditionally very hard problems: image recognition, natural language processing, and so on. You may have seen some really impressive demos where you upload a photograph and get some suggested tags, and you can see that this is a bird and this is a kiwi fruit. Maybe we're not so specific as to identify individual athletes yet, but we're getting there. (I'm sorry I didn't have a Zdeněk Štybar or a Katerina Nash.)

Really, the question about models is: do we care about having accurate predictions? Yes, accurate predictions are really important. But the value of a predictive model, I think, should be about more than just making accurate predictions; it should be about telling us something about our data and about the world. A model that's just a bundle of floating-point numbers isn't going to tell you a lot about the world, but a model like a decision tree, where we have a bunch of yes-or-no questions that can tell us whether or not we need to worry about something, can both predict and explain.
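As a hedged sketch of that kind of interpretable model in Spark ML, where the transactions DataFrame and its column names are invented for illustration:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# Feature engineering: turn recognizable attributes into a vector of numbers.
assembler = VectorAssembler(
    inputCols=["amount", "hour_of_day", "distance_from_home"],
    outputCol="features")
train = assembler.transform(transactions)

model = DecisionTreeClassifier(labelCol="is_fraud",
                               featuresCol="features").fit(train)

# Unlike a bundle of floating-point weights, the tree can be read back
# as yes-or-no questions that explain its predictions.
print(model.toDebugString)
```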
As one example: I was traveling for work a couple of years ago, I was at a conference, and I got a phone call from my bank saying my credit card had been compromised. Now, usually when you're traveling and you get this call, it's a false positive; your card hasn't really been compromised. But this time it actually had been, and the interesting thing is the bank employee who was talking to me said: here are the reasons why this transaction was suspicious. It was a small dollar amount, it was late at night relative to where the business was and where you live, it was an unusual business name, and so on, listing all of these factors. And I thought, that's really great: you have a model that can explain why a transaction is suspicious and not just tell me that it is. And the really great part is that I was at a data science conference when this happened, so it was an instant story there.

I just want to close by showing how we put this stuff in containers. I've sort of hinted at it, but I want to show you how we did it. As our work grew from doing jobs on individual notebooks to having a big compute resource that we shared among the people on my team and coordinated with each other, that was great for running big jobs, and it worked really well for us. Where it started to break down is that we wanted to share our work with other people, we wanted to keep things running in production, we wanted to have a lifecycle for these things and support them, and a shared analytics resource doesn't work for that when you care about maintaining applications. In fact, all of the architectures we've seen for data processing really think of analytics as a separate workload. But we don't care about analytics as a workload; we care about an application, one that deals with the data sources in green and blue here, has user interfaces in green, and does some interesting data processing in orange. The solution we arrived at is just to bundle up Spark clusters with our applications in OpenShift, so that we can schedule the Spark components at the same time as the rest of the application components and access storage through OpenShift's interfaces. I'm not going to explain everything about how we did this, but if you want to try it out, please go to radanalytics.io. We have images you can get started with right away, and we have tutorials and example applications. I'm really excited about running this stuff on OpenShift. These are our takeaways; you can get them when I put the slides up. But if we have any time for questions, I'm happy to take them. Yes?

So the question is whether we've been using local data on SSDs or a network file system. When we were running a shared compute resource, we had a bunch of machines in a Spark cluster, and all of those machines were connected to Gluster in the same rack. Now, you get a lot of benefit out of doing that, because ultimately things get cached in memory as you pull them from Gluster, but we didn't have data locally on each machine.

So the question is whether models should be interpretable or not, and how that applies to image data. That's sort of why I brought in the credit card example, because that's a case where the interpretable model is a clear win. Now, for image data, even with these deep learning models there are ways to train them so that you can say, well, I thought this part of the image was this kind of thing, which is not exactly human-interpretable, but it's a step in the right direction. And in satellite imaging, you can build models that say these are just levels of reflectance in these areas on the ground. So, yes: I think model understanding always matters. Steve says model understanding always matters, and I agree. And I think the interesting thing about the deep learning stuff, as opposed to traditional image processing, is that you're not doing a lot of those manual filtering and pre-processing steps with the deep learning.
Yes? The question: how much of the Spark community is trying newer orchestrators outside of Mesos, like the Kubernetes scheduling work? Sometimes I wonder if it's just us pushing that. Okay, so the question is how many people who aren't in the Kubernetes community, or who aren't OpenShift users or Red Hat employees, are using Spark on Kubernetes and OpenShift. I don't know how much I can disclose, but I know a lot of people are doing this on Kubernetes in production, and I know a lot of people who already use OpenShift are very excited about using it on OpenShift. We have an open issue and a fork of Spark with this Kubernetes scheduler work, which has been a joint effort between someone on my team, someone at Google, someone at Palantir, and some other people in the community, and it's gotten a tremendous amount of interest from both the Spark upstream developer community and the user community. I do think the developer experience on Kubernetes for applications is so much better than the developer experience on Mesos. If what you care about is having a compute resource that you can share between a bunch of applications that you're scheduling somewhere else, something like Mesos is great, something like Condor is great. If you care about saying "I have an application and I want a way to package it up and orchestrate it," you can't beat the developer experience on OpenShift and Kubernetes.

Next question: in the picture you showed, with the applications running on OpenShift, is Spark actually running as a part of my application, or is there a pod running in OpenShift that runs my application and connects to some remote Spark? No; the idea we have, which actually works out pretty well, is that each application has its own private Spark cluster. If you think about scheduling an application, you care about the UI, you care about an ephemeral database it's using, but you also care about these data-processing Spark cluster components, all together, as parts of the same application. Just for management and understandability and scheduling, it works out really well.

So I would basically schedule my application with OpenShift, or some scheduler or manager, and then scale out with Spark? Yes. We actually have a service that runs within OpenShift that will provision a Spark cluster in your application, in your project, for you, and you can use that to scale the cluster up or down.

Last question. That's a good one: Spark can scale out elastically in a sort of natural way. If it sees that you have a bunch of things waiting to run, if your run queue gets very long, it will create more workers where it can. And a colleague of mine has actually developed an approach that listens to Spark metrics, interacts with the service we have for managing Spark clusters, and requests more workers when necessary. So there are a few approaches to that, and I think that's a pretty cool way to do it. Thank you for the question.

Yeah, well, we would like to be as portable as possible; all of the services that we run should be running on OpenShift, that's the goal at least. So moving that workload to OpenShift would probably make a lot of sense, because we need to switch from AWS to Google Cloud to OpenShift. Right now it's still at the data science phase, so I don't have a deployable application here.
I've been looking at CI/CD data with Anton, and I've got Steve Nules, so I think maybe we can turn that into an application and consider it a blueprint for future stuff. If you want us to find something out, or if you want to put some testing users on it or something, please send me an email after you get a chance to do that, and let me know if you have any questions or any feedback. Good to meet you, Chris. Hi, nice to meet you. Just a quick question: will the slides be available? Yeah, I have a PDF; I'll give it to the organizers so they'll be on the website or whatever, or I'll tweet them if I can.

How did I get started with it? Well, I had a sort of unusual path. I was a programming languages guy, and at the time the easiest way to use Spark was from Scala. We were looking at all this emerging data processing stuff, and I was the only person on my team who was willing to deal with Scala when we started looking at it. So that was sort of an unusual path into data science. It's fun stuff. I mean, this is the thing I like about computing: you can always find some new problem that's interesting and has a bunch of new techniques. Thank you. Thank you. [Inaudible.]