First I wanted to start by saying thank you for having me here today. I'm honored to be invited to Madrid to come talk to you. It's always a great pleasure to come to Spain, and I appreciate the opportunity to be here. I wanted to tell you a little bit about my background, give you a sense of where I'm coming from, and then I'll share some things I see happening in the big data space. I started working on the Hadoop project at Yahoo in the summer of 2007, so I've been part of big data now for about five and a half years, and I've seen a lot of growth and change in the Hadoop part of the big data space in those years. So I wanted to spend a little time talking about that and go over some of the evolution I see happening there. As we've heard multiple times already this morning, and I think it's very true, when you say big data, this is the first thing you think about, right? Lots and lots of storage, lots and lots of data stored somewhere. And there's no question that's a part of big data, but it's not all of it. It's not just about storing terabytes or petabytes or even exabytes of data, more and more data. That is certainly important, but it's not the sum total of what we're talking about. As other speakers have already mentioned, part of it is applying more complex algorithms and new techniques to the data you store. Here's one particular example I stole from a talk Jimmy Lin of Twitter gave last summer in the U.S., on some algorithms they apply to their tweets to learn about their users. This is an algorithm called stochastic gradient descent, and it's a way to do machine learning on their tweet inputs inside MapReduce.
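As a rough illustration of the idea only (this is not Twitter's code; the model, data, and learning rate below are all invented for the example), stochastic gradient descent for a simple logistic-regression learner looks like this:

```python
import math
import random

def sgd_logistic(examples, lr=0.1, epochs=50):
    """Train logistic regression by stochastic gradient descent.

    examples: list of (feature_vector, label) pairs, label in {0, 1}.
    Returns the learned weights, with the bias term appended at the end.
    """
    dim = len(examples[0][0]) + 1          # +1 for the bias term
    w = [0.0] * dim
    for _ in range(epochs):
        random.shuffle(examples)           # "stochastic": visit examples in random order
        for x, y in examples:
            xb = list(x) + [1.0]           # append constant bias input
            z = sum(wi * xi for wi, xi in zip(w, xb))
            p = 1.0 / (1.0 + math.exp(-z)) # predicted probability of label 1
            for i in range(dim):           # gradient step: w -= lr * (p - y) * x
                w[i] -= lr * (p - y) * xb[i]
    return w

# Tiny made-up data set: the label is 1 exactly when the single feature is positive.
random.seed(0)
data = [([x], 1 if x > 0 else 0) for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]
w = sgd_logistic(data)   # the learned weight on the feature comes out positive
```

In a MapReduce setting the same update runs inside user-defined functions over partitions of the input, which is what makes it a natural fit for Pig UDFs.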
And they're actually using Pig to do this; they write user-defined functions inside Pig to compute this very complex equation. And I'm desperately hoping nobody asks me to explain that equation, because even though I have a degree in mathematics, I actually don't remember what it means. All right, so that's another part of it. But one of the corollaries of these complex new algorithms is that we see a lot of new tools appearing in the big data space. On this slide I just threw together a bunch of different tools. This is not in any way an exhaustive list; you could think of all kinds of other things to put on there. I was just trying to give a flavor of the number of different tools that are coming up and some of the categories they fall into. So we see tools like HBase and Cassandra that focus on high-volume reads and writes of smaller amounts of data, individual records in a large data set. A tool like Apache Hive focuses on bringing SQL to big data, because SQL is still a very useful tool; it's great for answering queries. Tools like Pig and Cascading focus on data flow programming: ETL, data modeling, all those kinds of things. Apache Giraph is a newer project that does graph processing on top of Hadoop. Obviously MapReduce is kind of the grandfather of all these, or not quite the grandfather, since it's not related to all of them, but it was at least the first out there. And then we see stream processing frameworks like Storm and S4. So there are a lot of tools in this space. Now let's think a little bit about how this worked before the big data revolution. It's not like people didn't do data processing before big data, and it's not like all they did was SQL. There were tools to do these kinds of things. We're applying algorithms that existed to new sizes of data sets; we're not necessarily inventing entirely new algorithms.
But the world used to look a little bit different. It used to be that a particular type of processing was constrained to a particular machine or set of machines, usually a single machine. So if you looked inside the data center of a lot of companies or government agencies or whoever was processing this data in the past, and even today in many of them, you would see distinct systems for different things. Your data is all stored in a data warehouse. You have an OLTP database that focuses on doing all the upfront processing and loading that warehouse. If you want to do statistical analysis, you probably pull the data out of the warehouse and run it in some system that focuses just on that. If you want to do cubing or roll-ups, MOLAP kinds of things, you again pull the data out of the warehouse into some other system that does that. That's the paradigm for processing in the pre-big-data world. One of the things that has happened at the same time as big data, and I think has really enabled it, is cloud computing and grid computing. I don't think we should confuse cloud computing with big data, or cloud computing with Hadoop. Cloud computing, like big data, is a very popular term right now, taken to mean anything you want it to mean. Everybody I talk to, their application does cloud and it does big data. It's like, well, I guess. So I want to be careful how I use that term, but at the same time I want to recognize that cloud computing has changed things and has enabled some of this proliferation of tools. So here, in my nice artistic work, I've drawn the cloud, and inside of it just a random selection of these tools.
One of the big changes cloud computing has brought along that plays very well in the big data world, and particularly the Hadoop world, is that users don't want to have to think anymore about "my data is on this machine and now I need to move it over to that machine," or "I have to get data out of the data warehouse and load it into the statistical analysis package." They want to know: the data is in the cloud, the tool is in the cloud, I can apply the tools I want. This is one of the wonderful promises of cloud plus big data. But on that slide I left the other systems there to show that it's not like the cloud is taking over and all these other systems are dying, because I think that's sometimes what people think: oh, Hadoop will take over the world and there won't be any more databases. Well, I don't think so. There's a lot of data out there that is still going to be out there, and there are still a lot of tools that are going to work with it, but all these things have to work together. So having all these tools is, on one side, a wonderful thing. This is a picture of a toolkit, and you'll notice there are a number of different tools in there. There's a hammer; there's a whole set of wrenches, or depending on where you learned English, spanners; there are screwdrivers; there are all kinds of tools. If one day at my house I need to hang a picture on the wall, and so I need to drive a nail into the wall to hang the picture on, I'm not going to grab the wrench and start banging on that nail. I'm going to pick the hammer. But if the next day my faucet starts leaking and I have to go in and tighten up the nut that holds the pipes together, I'm not going to get the hammer and start banging on that nut, until I get really angry anyway. At least at first I'm going to grab the wrench and try to tighten it. That's what wrenches are for. This is the upside of the proliferation of tools in the big data world.
You can pick the right tool for the job. You're not stuck trying to write your data modeling in SQL, where SQL is really good at expressing a query but not necessarily good at expressing a complex series of transformations on your data; you end up with these nasty subqueries that bend your mind. There are other tools that are much better at that data flow programming. But the downside is that these tools don't always play well together. In the picture here we have, on the right side, what an outlet looks like here in Europe, and on the left side, what an electrical plug looks like in the United States. Those don't look like they're going to play well together, do they? It gets worse. It's not just that physically I come over here and the plug doesn't fit. Well, my computer is a bad example because its power supply adapts; say my wife's hair dryer, she brought that along. It's not just that physically you can't plug it into the wall. It's that if you do, the voltage is wrong. Here in Europe the voltage is 220; at home in the United States it's 110. So if my wife plugs her hair dryer into the wall socket, even if she somehow adapts it physically, it'll just blow the hair dryer up, because it's not built for that voltage. That's the downside of this proliferation of tools: things don't play well together across those tools. We see this in a number of ways. Let me back up first and say the thing to notice is that in almost any organization working with big data, you are going to find that you have multiple tools. It will often start out as one. It might start out with people saying, we can't fit the data we want in our relational database anymore, so we're going to bring in Hadoop and use that to store the extra data. And since everybody here knows SQL, we'll use Hive, because that's a natural place to start.
But it doesn't take long before you discover: wow, we have all this data, and now we could start doing graph processing on top of it, so let's use Giraph with it. Or: while we have all this data, our data scientists want to build this complex new data model that's pretty hard to describe in SQL, so let's use Pig to build that data modeling process. And we've seen that over and over. I've seen it in companies: when I worked at Yahoo, we started out with MapReduce and Pig, and over time started adopting a bunch of other tools to do different things. I've seen it in other companies that started at other points and evolved into more and more of the tool set. But the problem is that you end up with silos of users. You have the Pig users over here who don't know how to share their data with the Hive users over there, because the storage formats are different, the data models are different, and the user-defined functions, the code those users write to work with that data, don't play across those systems. So you end up with these silos, and you're basically destroying the promise, or at least a good part of the promise, of your cloud computing. Part of the idea of cloud computing was: all my data is in one place, all my processing is in one place, it's a big happy family, we can all share. And now you're starting to break that down. Another downside is that you end up with wasted developer time and effort. Just to pick an example, both Pig and Hive have roughly the following parts inside them: they both have parsers, optimizers, physical planners, and executors. Now, some of these are obviously different; they offer different features and do slightly different things. But the bottom part of the stack looks a lot alike. There are only so many things you can do to data at the bottom of the stack, right?
When you're doing batch processing on data, you can group it, you can apply filters to it, you can select out particular columns, you can join it. Both of these systems have lots of machinery to do that on the underneath side, and there are a lot of things they could potentially share there, yet they don't, because they're entirely different tools written by different groups. So what you end up with is a situation where developers end up replicating each other's work. As soon as Hive adds a new feature, the Pig community goes, ooh, that's cool, and they run off and develop the same thing, or vice versa. It's not the best use of people's time. So the core assertion I'm making here is: we need to think about what kind of shared services should be available in these big data platforms. I'll be talking today about Hadoop, because that's the one I'm familiar with and the one where I've seen things change. What is the goal of these services? They should give users a consistent experience across the tools. Now, obviously tools differ; I'm not saying the syntax should be the same or anything like that. Tools present different strengths and weaknesses to users, and that's what we want. But if I write code to run in one tool, ideally it should run in another. I shouldn't have to adapt myself, or my data model, or all those things, as I switch between tools. And these services should also allow developers to share their effort where it makes sense: let's find the things that are common across tool sets and distill them into services. Another way to put this is that I'm effectively asserting that Hadoop is a distributed operating system, and we need to think about what services Hadoop should offer, what services should be offered in these kinds of operating systems.
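To make that list concrete, here is a toy Python sketch of those shared batch operators — filter, project, group, join — over in-memory lists of records. The data and function names are invented for illustration; real engines like Pig and Hive implement the same operators over distributed files, not lists:

```python
def filter_rows(rows, pred):
    # FILTER: keep records matching a predicate
    return [r for r in rows if pred(r)]

def project(rows, cols):
    # PROJECT: select out particular columns
    return [{c: r[c] for c in cols} for r in rows]

def group_by(rows, key):
    # GROUP: collect records sharing a key
    groups = {}
    for r in rows:
        groups.setdefault(r[key], []).append(r)
    return groups

def join(left, right, key):
    # JOIN: classic hash join -- build an index on one side, probe with the other
    index = group_by(right, key)
    return [dict(l, **r) for l in left for r in index.get(l[key], [])]

users  = [{"id": 1, "name": "ana"}, {"id": 2, "name": "bo"}]
clicks = [{"id": 1, "page": "/home"}, {"id": 1, "page": "/faq"}]
joined = join(users, clicks, "id")       # two rows, both for user "ana"
active = project(filter_rows(joined, lambda r: r["page"] != "/faq"),
                 ["name", "page"])
```

The point of the sketch is that there is nothing Pig-specific or Hive-specific about any of these four functions, which is exactly why duplicating them in every tool is wasted effort.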
And I put up a table here of some of those services. Again, this is not exhaustive; these are just some of the things I'm aware of that have been done and are being done. In the middle column I put some of the tools that provide those services. Some of them provide the services pretty directly; table management, for example, is provided by HCatalog, and I'll go into a little bit about what that means. Some of these services exist in some of the tools but not in others. For example, user access control: Hadoop does provide some ability to control who can access your files, and authentication to check that you are who you say you are, so that I can't just assert that I'm you and go use your files. And in the final bottom row there I have the data virtual machine, because that's a new service I want to propose, and I'll talk more toward the end of the talk about what that is. In the interest of time I'm not going to cover all of these, but I want to talk about three that I've been somewhat involved in, or at least know a little bit about, and talk about some of the ways they're developing and what I see as their future. The rest, if you're interested, you can ask me about later, offline. So let's start with table management. The observation here is that Hadoop gives you a file system, and file systems have some advantages, but it turns out that often people don't want to think about data as if it were in a file. They like to think about it as if it were in a table, for a number of reasons. Tables abstract away concerns about where your data is stored, what format it's stored in, whether it's compressed, all those kinds of things.
And putting that data in a table model is nice not just for the users, but also for system administrators, who now get a little more control over who shares that data and can separate concerns about storage and location from the data itself, so that if they need to adjust those things, it doesn't impact the users. One of the things we went through at Yahoo: we decided to switch storage formats, because we found a format that was vastly more efficient than the one we were using, and we thought it would save us a lot in disk space and processor time. So we announced to our user base: hey, we're going to take all the data you have and transform it into this new file format. What that translated into was that everybody had to go through a full cycle of retesting all the applications they had on top of Hadoop, making sure they all still worked and still produced the same results. And this was years ago, so I'm sure it's different now, but at the time Yahoo was running tens of thousands of Hadoop jobs a day on, I think, ten or more different clusters. So the entire process of rewriting everybody's applications and retesting them against this new file format turned into a year-long effort, and in the end it failed. It simply died under its own weight. Too many teams just said, you know what, we don't have time for this, the business is telling us other things are more important, and eventually it ground to a halt and became impossible. So presenting a table abstraction to users instead of a file abstraction has a lot of power. And it turns out there's this tool called Hive that already has tables inside Hadoop, because it presents SQL to users, and SQL always thinks in tables.
So what HCatalog does is take Hive's table model and open it up to other tools in the system, and actually to tools outside the Hadoop cluster as well. So Pig users can access these tables, both read and write, as can MapReduce users, and if you connect a third-party database tool to the system, it can also connect directly to these tables. Here's a graphical representation of how this looks for Pig, Hive, and MapReduce. Without HCatalog, you'll notice that MapReduce, Pig, and Hive all have different routes to get to the data in HDFS, Hadoop's file system, and only Hive interacts with the metastore. The metastore in this case tracks things like what data types are in the columns of a table, what format it's stored in, what the schema of your table is, all the basic information you would call a catalog in a traditional database. HCatalog changes this picture by putting a lot more lines on the diagram, but also by basically saying everything will now go through one input path, Hive's input path, and we'll share that across these various tools. So now you see that Pig and MapReduce can also get access to this, which means they can access the metadata and they can use Hive's SerDes, which are the way Hive translates data in and out of HDFS. And via the REST interface on the far side there, external systems can also get access to this data. Now let's look at the same idea in a table format. Without HCatalog, MapReduce, Pig, and Hive all have different data models, different record formats, and for at least MapReduce and Pig there really is no schema or any notion of data location or data format. That's all encoded inside the user's script.
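A toy way to see what moving schema and format out of user scripts buys you. Everything here — the paths, the table name, the metastore layout — is made up for illustration; HCatalog's real interfaces are Java and REST APIs, not a Python dictionary:

```python
# Without a table service, every script bakes in path, format, and schema:
def read_events_hardcoded():
    rows = []
    with open("/data/events/part-00000.tsv") as f:          # location hardcoded
        for line in f:
            user, page, ts = line.rstrip("\n").split("\t")  # format + schema hardcoded
            rows.append({"user": user, "page": page, "ts": int(ts)})
    return rows

# With a (toy) metastore, scripts ask for a table by name; storage details
# live in one shared place and can change without touching user code:
METASTORE = {
    "events": {
        "path": "/data/events/part-00000.tsv",
        "delimiter": "\t",
        "schema": [("user", str), ("page", str), ("ts", int)],
    }
}

def read_table(name):
    meta = METASTORE[name]
    with open(meta["path"]) as f:
        return [
            {col: typ(v)
             for (col, typ), v in zip(meta["schema"],
                                      line.rstrip("\n").split(meta["delimiter"]))}
            for line in f
        ]
```

If an administrator changes the path or the delimiter in the metastore entry, `read_table("events")` keeps working unchanged, which is precisely the property the Yahoo format migration lacked.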
This is what I referenced when I said it took Yahoo a year for people to change data formats: this is why, because they had to actually go rewrite their applications. Once you add a table service like this, we have a shared data model across these tools and a shared notion of where the data comes from. If we had had HCatalog at Yahoo when we tried to do that storage format change, users would not have had to rewrite their applications at all. They would have had no notion that the storage changed. Now, they might have insisted on retesting their applications just to make sure, but at least they would not have had to write any code, because that would all have been abstracted away from them. So that's what a table service gets you, and that's where we feel a project like HCatalog is really valuable: it takes one of these services that lived in one particular tool, Hive, and opens it up to the rest of the system. Now I want to move on to the next of those services, which is the resource manager in YARN. YARN stands for Yet Another Resource Negotiator; I believe it started out as "resource manager," but "yarn" is a nice word in American English, so they changed the M to an N. YARN in a nutshell is this: in Hadoop 1.0 there were two components to Hadoop. There was HDFS, the distributed file system, and there was MapReduce. So there was this one paradigm that you used to do anything you wanted to do in Hadoop, a one-size-fits-all paradigm for processing data. And don't get me wrong, MapReduce is a great paradigm, but it's not necessarily the size that fits best for everyone. In Hadoop 2.0, which I believe is still in an alpha state in the Hadoop community but is out there and usable, instead of HDFS plus MapReduce you get HDFS plus YARN, this resource manager. So what does this resource manager give you?
It gives you an interface for writing parallel applications on the Hadoop cluster. MapReduce certainly still exists, but it is now just an application running on YARN. You can write other applications on top of YARN that will also run in the Hadoop cluster and need not follow the MapReduce paradigm at all. As an example, Spark, a cluster computing research project from the University of California at Berkeley that focuses on in-memory computation, has been rewritten to work on top of YARN. Spark is the type of application that does not perform well in the MapReduce world, because it's about getting a lot of data into memory, holding it there, and iterating over it again and again. That's not something MapReduce does well at all, but it fits very nicely in this system. So the resource manager is a very traditional piece of scheduling software. It gives applications a way to say, here are the resources I need, and then it handles the scheduling of those machines in the cluster between the different applications. The resource manager itself is no longer concerned with particulars like job failure; all of that is abstracted away into the applications themselves. It makes for a much simpler piece of code. Graphically, I don't know how well you can see this, but hopefully it's big enough: on the left-hand side we have Hadoop 1.0, where there's a JobTracker running all the jobs in the system, doing both scheduling and job management, and on each node there's a TaskTracker responsible for keeping track of the work being done on that particular node, which interacts with the JobTracker. On the right-hand side, Hadoop 2.0: you see instead that the ResourceManager is now running the entire cluster. There is a NodeManager on each node, but its only job is to communicate with the ResourceManager and deal with scheduling and that kind of thing.
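A toy sketch of that division of labor. The class, slot counts, and method names are all invented for the example; the real YARN protocol is far richer, with memory and CPU requests, queues, heartbeats, and container launch contexts:

```python
class ToyResourceManager:
    """Sketch of the YARN split: the resource manager only tracks and hands out
    cluster resources; each application manages its own tasks and failures."""

    def __init__(self, nodes):
        self.free = dict(nodes)                  # node name -> free "slots"

    def allocate(self, app_id, slots):
        # Grant up to `slots` containers; the application decides what runs
        # in them -- map tasks, reduce tasks, or something else entirely.
        granted = []
        for node in self.free:
            while self.free[node] > 0 and len(granted) < slots:
                self.free[node] -= 1
                granted.append(node)
        return granted

    def release(self, containers):
        for node in containers:
            self.free[node] += 1

rm = ToyResourceManager({"node1": 2, "node2": 2})
containers = rm.allocate("mapreduce-job-1", 3)   # MapReduce is just one application
# an in-memory engine could ask the same manager for containers the same way
```

Notice that nothing in the manager knows or cares whether the application is MapReduce; that is the whole point of pulling scheduling out into a separate service.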
Individual applications are now in those boxes, I think it's the purple boxes there, where there's an application master running the application itself; all the job management is handled there. So it's nicely separated out, and that application is generic: it need not be MapReduce, it can be anything else. All right, now let me transition to the last one. The other two are work that has been done; it's out there, it exists. Some of it is still ongoing and growing, but it's at least there. What I'm going to talk about now is purely planned, speculative kind of stuff, so I want to be clear that this isn't code you can go download just yet. Let me return to my previous diagram, where I talked about how in Pig and Hive, just to pick two examples I happen to be familiar with, there's a lot of shared processing going on in the back end. I mentioned that there are only so many things you can do in batch processing of data, and both of these systems are doing them. So what I am proposing is that we need the equivalent of the JVM here. One of the wonderful things Java has done is open up its virtual machine so that other languages can be written on top of it. There's a Python version that runs on Java's VM; there's a Ruby version.
Groovy and Scala were written targeting the VM directly from the beginning. By having this shared virtual machine, you can now write lots of different languages on top of it, each focused a little differently. It's not "write your app in Java and then it'll work well on the virtual machine," because there are some things for which Java is not necessarily the best language choice. So my assertion is that we need a similar service in the Hadoop world that focuses on batch processing of data. This service would provide the standard operators I've listed here; there are a few more, but these are the big ones. It could provide a shared optimizer that does physical planning and physical optimization of these batch jobs. This would include the ability to pick the right implementation of those operators as a batch job is being planned, but also, since in the Hadoop world you often don't know ahead of time what the best plan is going to be, I assert that you're going to need the ability to re-optimize on the fly as these jobs run, and I'll talk more about that in a little bit. This virtual machine could also provide a shared execution layer and start to take advantage of some of the changes that YARN brings to Hadoop, and it could provide a shared framework for user-defined functions, so that if you write a user-defined function for one of these tools, it can run in any of the tools that use this framework. So let me flesh out a couple of those claims; I want to talk first about ways you can take advantage of YARN in a virtual machine like this. In this beautiful slide that I've drawn here, you have a picture of two MapReduce jobs in a row. Just a simple hint, a way to tell which slides I made and which I took from other people: if they have color in them, they probably
came from somebody else; I'm not known for my artistic skills. For those of you who aren't super familiar with MapReduce: here you see two MapReduce jobs where the output of the first job is used as input to the second. All MapReduce jobs have two phases, map and reduce, hence the name, and you must always have at least the map phase: you can have map-only jobs, but you cannot have reduce-only jobs. A pipeline like this is what any tool doing batch processing on top of Hadoop ends up looking like, because it lines these jobs up into a pipeline. It won't always be a linear pipeline; it will be a DAG, because some jobs have multiple inputs and some have multiple outputs, but for simplicity this is at least a good start. But this isn't really optimal. One question that those of us in both the Pig community and the Hive community asked within a few months of starting our projects is: why do you make us have these second map tasks? If I have a chain of MapReduce jobs, I never, ever need a map task beyond the first one; anything I can do in that second map can always be pushed into the preceding reduce. So this isn't super efficient. What you would actually like is a picture that looks like this, where you can go from a reduce to another reduce immediately, instead of being forced back through a map phase. If I do that, I get rid of a whole read and write to HDFS, which is fairly painful and slow. Now, the obvious downside, and this is always the Hadoop community's response when we bring this up, is that you lose checkpointing. If you have a failure in that second set of reduce tasks, you have farther to rewind back to restart your processing. So this is not to say you'll never see a chain of MapReduce jobs again; you still want to select points to
checkpoint along the way, to handle the case where something explodes and you need to back up a ways and start over, because if you have a job chain that's a hundred steps deep, the odds of failure just get overwhelming. So part of what this layer would have to provide is an optimizer that makes decisions about when is the right time to checkpoint, trading off performance against tolerance in the face of failure, because when you're running thousands of jobs across these machines, the odds that one machine goes down in the middle of a job eventually become overwhelming. Okay, I'm going to walk quickly through some of the others here so I save time for questions. Oh, sorry, let me say one other thing about that. The huge thing here is that I no longer have to convince the Hadoop community that this is a good idea. I spent five years walking over to the MapReduce people, a few aisles over from me at Yahoo, and saying it would be a really good idea if you did this, and they kept saying no. Their reason was always what I just said about failure recovery: it's very simple to recover from failure when you've only got two steps. And it's like, yeah, sure, it's simple to recover, but I can't go fast. It's easier to not fall down when I walk instead of run, but I don't see people at the Olympics suggesting only walking in the races; you still want to go fast. But now I don't have to convince them anymore: they gave us YARN, and now I can go write this myself and put it inside Pig or Hive or whatever. So this is another reason having a service is a good thing. Another place we can take advantage of this today is between the map and reduce phases. In traditional MapReduce, the map task writes everything to local disk, and the reduce side pulls it from that local disk. So again we've got a disk write and read in here. What if we did these transfers in memory? What if, instead
of having to write this stuff to disk and pull it off again, you could set up sockets between the two and do the transfer directly? That's going to perform better. You obviously still have to write to disk at the same time, splitting the stream, because you have to handle the case where the reducer explodes and needs to be rerun, and you don't want to have to restart the maps as well. And you're going to have to do some work to guarantee simultaneous execution of the map and reduce tasks, which today Hadoop does not do. So this is not something that will work every time, but there are situations where it will really help, and again, this is something we can do with YARN. Okay, now I promised to talk a little bit about on-the-fly optimization. Traditionally, databases do all their optimization for a query up front: they keep statistics, they make some estimates, they say here's what we think is the best plan. That works well for databases; it doesn't work so well in the Hadoop world, for a couple of reasons. One, in the Hadoop world you often don't have the appropriate statistics to make those choices. Data is coming in much more quickly; you don't always have time to curate it and do all those things you would do in a database. So you may not have those statistics. And even if you do, a lot of these operations in the Hadoop environment are very long pipelines; it's not at all uncommon for users to have a Pig Latin script that does a hundred transformations on the data all at once. No matter how good your statistical estimates are at the beginning of that script, they'll be total garbage after a hundred transformations on the data; the error bars will just go wild on you. So what if, instead, the system gathered basic statistics as it operates and made decisions on the fly? I have a very quick example here
of two MapReduce jobs that are coming together in a hash join. If, in the process of doing the first MapReduce job, we discover that its output would fit into memory, I can switch plans: load that output into the distributed cache and perform a map-side join, where that data is loaded into every map task, and I've gotten rid of a reduce task. So having an optimizer that can do these kinds of things would be very useful to these batch processing systems.

And now I'm done; I've been told I'm running out of time, so I want to stop and ask if there are any questions.

So, no, I don't have a planned release date. I can tell you that I and others at my company are starting to work on this now, so we will probably have something to show for it by early next year sometime, but I can't promise that it will all be ready to go, or done, or anything like that, by that time.

And do you plan to compile the virtual machine operations into MapReduce operations, or do you plan to remove MapReduce completely and write something new?

So that's a fair question: should you keep MapReduce, or should you just throw away MapReduce and do something else? I think it's easier to take an evolutionary path and bend the rules of MapReduce a little, as I showed, rather than throw it out and go for a whole new system. You could certainly do that, but I think it would take a lot longer, and frankly, a lot of work has gone into making sure MapReduce works correctly, and I'm not anxious to redo all that work.

Cloudera's Impala takes kind of that second approach, of let's start over with a new system. Just based on your intuition, how much of the benefits of Impala come from just not having to do the checkpoints to disk and the redundant work that you're talking about getting rid of here?

So I don't know; it would just be intuition, I don't have hard data yet. I suspect that's quite a bit of it. My suspicion is that
what I'm proposing probably would, well, I'm sure you could find queries for which Impala would beat whatever I'm proposing, right? But I think the other advantage of what I'm proposing is that you stay inside one tool, or set of tools. You don't have to add yet another tool to the mix, a tool that does things completely differently and is going to fall off a cliff when it gets out of memory, as you showed in your slides, right? So I think they're going to have different strengths and weaknesses, but I suspect this can go a long way toward that kind of thing.

Thank you. Hello, I'm not sure if this is working. Since MapReduce was introduced, and I think it was in some ways a copy of what Google had been doing, Google has moved to a completely different model, where in many ways they abandoned MapReduce for Percolator and some other services. So on the question of whether you stick with MapReduce or move to something else, especially given that MapReduce was born at Google and Google has now moved to a different thing, do you see that same trend coming to Hadoop?
Let me clarify that a little bit. I don't think it's fair to say that Google has abandoned MapReduce; I think they have moved on from only MapReduce. It's the same evolution, only accelerated, that happened to SQL: when people came along, they originally said NoSQL meant we won't do SQL anymore, and they realized really quickly that wasn't going to work, so they changed NoSQL to mean "not only SQL." I think we're seeing the same thing here. Google hasn't abandoned MapReduce; they still run thousands or tens of thousands of MapReduce jobs a day. It's that they don't only do MapReduce. They realized that there are lots of other types of operations, and I think some of my observations about how with YARN you can expand into other things come from that. I think Hadoop is following that same path: it won't be only MapReduce, it'll be MapReduce plus lots of other things.

Okay, thanks for the presentation. My question is, do you see Pig evolving towards being more like a service rather than a client library?

Do I see Pig evolving to be more like a service? On the table I gave earlier, one of my examples of services that the whole Hadoop grid needs is connection points, right? I think it is not optimal right now that users of Pig or Hive or a lot of these tools have to load a lot of code onto their own machines to make things run in the cluster. That's not really scalable, because as soon as you get ten people doing it, somebody doesn't have the right version anymore, and it doesn't play well in enterprise worlds, where they don't trust users to put code on their own machines. So the Hadoop system definitely needs to grow connection points where users have very thin clients: they come to it, submit their jobs, and get results back. And there's actually been work going on to do that in HCatalog and Hive and some other projects.
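The on-the-fly join switch described earlier in the Q&A can be sketched in miniature. This is a hypothetical illustration, not actual Hadoop, Pig, or Impala code: the function names, the in-memory stand-in for the distributed cache, and the record-count memory budget are all assumptions made for the sketch. The point is only the runtime decision: gather a cheap statistic (here, the size of one input) and pick the cheaper plan with it.

```python
from collections import defaultdict

# Hypothetical per-task memory budget, measured in records for simplicity.
MEMORY_BUDGET = 1024

def shuffle_join(left, right):
    """Reduce-side (shuffle) join: group both inputs by key, then combine.
    This is the general plan that always works but needs a reduce phase."""
    groups = defaultdict(lambda: ([], []))
    for k, v in left:
        groups[k][0].append(v)
    for k, v in right:
        groups[k][1].append(v)
    return [(k, lv, rv)
            for k, (lvs, rvs) in sorted(groups.items())
            for lv in lvs for rv in rvs]

def map_side_join(left, small_right):
    """Map-side join: the small input is loaded into memory (standing in
    for the distributed cache) and probed from the map side; no reduce."""
    lookup = {}
    for k, v in small_right:
        lookup.setdefault(k, []).append(v)
    return [(k, lv, rv) for k, lv in left for rv in lookup.get(k, [])]

def join(left, right):
    """The on-the-fly decision: if a runtime statistic says the right
    side fits in memory, switch plans and drop the reduce task."""
    if len(right) <= MEMORY_BUDGET:
        return map_side_join(left, right)
    return shuffle_join(left, right)
```

Both plans produce the same join result; only the execution strategy differs, which is what lets an optimizer swap one for the other mid-pipeline once the statistic is known.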