Okay, so first of all, Ruben, thank you for inviting me. It's not that often that I get to be in a new city at my age, a new major city, and I was really excited about being here. One question that I have for you is: how do you keep in shape with all those tapas bars here? It's an amazing place, and as I was walking around yesterday I couldn't stop myself from stopping at every tapas bar and trying it out. So that's my challenge for today and probably for the weekend here. Thanks very much. One thing that I wanted to use as a segue into this discussion is to open it up a little bit around what I think is one of the most interesting intersections in the industry right now, which is big data and cloud. Even though I've been dealing with big data for many years, since around 2000, as Ruben said, before big data was a term and when cloud was basically an item in the weather forecast, I think that what's coming along these days, the combination of those two things, is the really interesting thing here. And that's pretty much what I'm going to talk about right now. But as I came to Madrid and to Spain, I thought, you know, what would be the equivalent of that? How can I find a segue that is relevant to the place? I like history a little bit, so I read a little bit about Columbus and his charter. So I wanted to ask you an interesting question. Is there any correlation, anything that you could come up with, that has any similarity between what Columbus did when he discovered America and what we're seeing today? Anyone have an idea why I put that slide in place? No? No idea? Any hands here? Hint? Discover America? New World? Okay. Okay, so let me help you a little bit.
So one of the greatest things, I think, about that new journey, the discovery of America, is that we can really look at the map of history before that event and after that event. I think we all agree that the discovery of America was a singular event that changed world history in many ways. Now, the reason why I thought it would be relevant for this discussion, because we're not here to talk about Columbus, obviously, we're here to talk about big data, is that there are certain things that happened to actually enable Columbus to discover America. And if you really look at a lot of those things, you'll find that they're really small things, because people had ships before, people had made journeys before, people were traveling the world before. What made it such that at that particular time Columbus was able to discover America? And by the way, he wasn't the first to discover America, but what made his discovery more interesting than the others? And then I'll connect that back, because obviously we're here in a technology discussion, and there is a technology angle here: the technology of that time, not big data in the cloud. And I'll point you to another hint here: these ships, these vessels here. What do you think was unique in Columbus's ships? By the way, anyone know the names of those ships? Yeah. Okay. I get the reaction here. Anyone want to suggest? Santa Maria first, right? Yes. You wanted to suggest one? Yeah. Pinta, right. And Niña, right. And what is the meaning of Pinta and Niña in Spanish? Anyone want to suggest, for those who don't speak Spanish here? Never mind now. So I'll keep that aside right now. This is not really the important piece for this discussion, but the interesting thing to think about is that before Columbus, people had discovered America, but it wasn't really something that could be reproduced in a massive way.
Because those journeys were done in a very courageous, very special way that couldn't be replicated. What Columbus did from a technology perspective with his ships was something very interesting. Anyone here look at that ship and see the thing that made the difference? Or know the difference? Look at the sail at the rear of that ship. You see that it's diagonal. Now, if you look at historical ships, you'll see that most ships, especially in Europe, weren't built with diagonal sails. And that diagonal shape was actually borrowed, it wasn't really a new thing; it was borrowed from the Muslims, who were already building ships that way. The reason that that thing made the difference is that if you need to cross the ocean, there are a lot of strong winds, and you can't really go upwind with a regular sail. The diagonal sail is such that you could actually go up against the wind. And only with that change, that small technology change, small quote unquote obviously, crossing the ocean was made possible. But not just made possible for courageous people; it was made possible for the whole nation. And then you could repeat that journey over and over again. That was the important thing. Now, how is that related to our discussion on cloud? Anyone have a hint here? Okay. So the cloud in many ways is that new promised land. And if you really look at a lot of the cloud discussion, especially in the context of big data, we usually look for that one big thing that's different. It looks very similar to what we've done before. But it's usually those small changes that make a big difference. And those small changes are small technology changes, for example internet speed. Because cloud in itself is not a new idea.
People wanted to do hosting before, people wanted to do grid computing and distributed computing before. And I myself started a lot of that journey around 2000 and know a lot about those earlier ideas. But there was something in recent years that made that possible not just for niche players but for the vast majority, that brought it to the point where people can swipe a credit card and get a huge data center in their hands. And that's what cloud computing is all about, really: not just to make things different technology-wise, but to make the same things that we've done before possible at massive scale. The same goes for big data, because we've been wandering around the definition of what big data is: is it a terabyte, is it a petabyte, is it whatever. Don't look for the difference in a certain feature or technology that needs to be completely different from the others. It's really the scale of the problem that we're dealing with that is the difference. And scale doesn't necessarily mean capacity. Scale means that now we're approaching problems where we're looking at the entire set of data, and not just a small portion of the data. We're at a point where we can record everything, and not just the subset that we're working on. And that's the big thing, and that's what I thought would be a good segue, from Madrid and Spain specifically, into the discussion right now. I'm not going to spend more time on that section, unfortunately, even though I would like to. But we can talk about that later on over at the tapas bar. I won't avoid that. So, I'm going to structure the discussion in three parts, just to make it a little bit more interesting. In the first part, I'm going to break a little bit of the myth about cloud portability and cloud in general. Then I'm going to talk about the intersection between big data and cloud: how the two marry together, why it's so important for them to be bonded together, and what it actually means.
And during the third part of the discussion, I'm going to show some demos of a specific implementation of that: how we took that idea with an open source project named Cloudify and some other partners and implemented some of those ideas. Please look at that as a pattern, as a lesson learned, and not necessarily as a product or a specific solution. Obviously, to move things from a theoretical model to a practical model, I need to go through those steps. So I apologize if some people see that as a bias towards a certain direction. So, the first thing that I think makes cloud portability interesting is that if I had given this same presentation two years ago, most of the names that you're seeing here wouldn't be on that list. The cloud is pretty much dominated by Amazon, and that brings me to another question. How many people here are using cloud today? How many people are using Amazon? Most people. How many are using a non-Amazon cloud? Rackspace? Wow, that's a brave man. OpenStack? Anyone here using OpenStack? Which clouds are you using? Which one? Yeah, you were saying something. Don't be shy. Come on. What was it? Your own cloud. Never mind. I didn't know that Spaniards were shy. Come on, man. You usually know how to say things bluntly. So, back to the discussion. You can see a lot of names here, and actually there are new names coming in. Very small ones, named Google, that are going to reshape the entire industry. Microsoft has done a very bad job in that regard, but they're making their way after going in a lot of circles. And in my view, they're going to be successful at some point. And that point is probably going to come pretty soon, or sooner than most people think. It's just that they tend to make a lot of mistakes before they do the right things, but that's Microsoft; eventually they look at cloud.
They look at cloud as a bet that they can't afford to lose, and eventually they will get to the point where it actually works. And if you look at what they're doing with Azure, and what they did with desktops and with servers and other things, after a lot of iterations and a lot of foolish mistakes, they got it right. So, the landscape is changing quite rapidly, and we have more choices. And that's one incentive to look at cloud not as equal to Amazon, but as something that has a lot of choices, and I want to use those choices because they're coming along, and actually pretty soon. The landscape is changing pretty fast, so I can't really think of cloud as equal to Amazon in that regard. And there are actually more names that I didn't mention there. So, let's talk about the first myth of cloud. Most people, when I talk about cloud portability, would think: well, when I'm choosing a cloud, I don't think about portability, because moving from one cloud to another is a huge effort. So maybe I'll start with Amazon, and when I hit the wall, if I hit the wall, then I'll think about it. Practically, that's a reasonable way to think about things, because there is a cost of switching; there is a cost for portability; it doesn't come free of cost. But the first point that I wanted to make is that you'd be surprised how soon you may hit that point. And second, the cost of building something that is more portable is not as high as you may think. There are certain things that, if you follow them, keep the cost fairly low. And therefore, those two barriers are worth examining. So, let's start with one example: Zynga. You're familiar with the name? A big player on the web; obviously most people know them. They happen to be one of the biggest Amazon customers. It happens that when they grew big, they found that the cost of running the infrastructure was fairly high.
As high as 100 million a year paying for machines on Amazon. At that point, someone looked at the balance sheet and said, well, holy shit, sorry for that word, are we paying that amount of money to Amazon just to run our data center? Can we do something better? And when we look at our usage, and we do have a track record of several years, we see that the spikes are not always that different. There is a certain sustainable workload that we're carrying, and on top of that sustainable workload there are spikes here and there. So we're basically paying a fairly high cost for on-demand capacity that we don't need as much as we thought we did. And when they analyzed that, they also got to the point where they said: well, we have certain types of workload, not generic workload. We have games that do a lot of streaming. And if we run on Amazon, which needs to serve a lot of kinds of applications, not just streaming applications, obviously there is a limit to how much the machines, the hardware, can be optimized for that purpose. So if we optimize the environment for that specific use case, we can actually reduce the number of machines per workload quite a bit. And that was the incentive for them to actually move to their own data center and to build their own cloud, in many ways. And to a certain degree, it also became a competitive advantage for them, because now they can offer their ecosystem players a hosting environment where they can run in their environment, pretty much like Salesforce is doing, and Salesforce is not running on Amazon right now. So they realized that running their own cloud could in many ways be a business advantage as well as a cost advantage. And that's why they switched some of the workload away from Amazon and got to the point where they minimized the footprint so that only the peak loads would be running on Amazon and the rest would be running in their own data center. So they didn't really abandon Amazon.
They actually created some sort of hybrid model where they could handle most of the sustainable load in their own environment. That first example was, you know, a company that started small, grew big fairly fast, and then cost was a factor for them to move. Now, some of you are probably startups at the early stages, and you would say: well, I would be more than happy to have that problem, right? When I have that problem, then I'll worry about it. Mixpanel is another example, of a startup company that actually started small, as small as one machine. They were running on Linode, which provided a hosting environment at very low cost compared to the other ones. They grew, not as fast as Zynga, but they still grew past the point where they needed more than one machine, and at that point they looked at cloud. They were a Y Combinator company, and Y Combinator had a deal with Rackspace that gave them a certain percentage of discount, so they switched from Linode to Rackspace. That was one switch. And then Rackspace was not as agile as Amazon, at least not at the level that they wanted, and they realized that they needed something better, so they switched to Amazon. So during a life cycle of three or four years, they switched their entire environment three times. That's an example of a startup that is not at Zynga's scale, but on a positive trend. So just bear in mind that those things happen faster than you would think. They're not that public, therefore you don't hear about them that much, but people do go through those stages over and over again, more rapidly than most people think. So that's something that we do need to plan for. That's my main point in that regard. And those things happen when we least expect them, not necessarily when we plan for them to happen. That's switching from one cloud to another.
The other thing about cloud portability is that I think we all agree it's a good thing to have, but usually when we think about cloud portability, we think about it as cloud API portability, which is a big thing. And just to begin with, that's not what I'm talking about. Because if we try to create a least common denominator of all cloud APIs and say, this is how you do cloud portability, we'll have a common API for all clouds, that's not going to fly, because the APIs move too fast and evolve too fast to actually settle into some sort of least common denominator. Pretty soon we'll get to the point where we can't really use it, or we would lose a lot of functionality outside of that common API. So the first thing is that when we look at that, there are three ways to deal with portability. I mentioned one of them, which is standardization; we kind of gave that away. There is another option, which is the open source framework, like OpenStack and others, that is creating the same thing that Linux created, and that's still evolving; I'll talk more about that later on. And there is some sort of abstraction framework that basically doesn't try to hide the API, but provides some common facades, very much like what Spring did for Java and J2EE, but doing that for the cloud API. So we're not trying to hide the APIs of the cloud, but we're trying to create some sort of common facade so that the switch between one framework and the other would be smaller. But we're not limiting ourselves to that common API; we always have a way to go down to the underlying implementation. And we stress that for the basic compute and the basic storage, we don't necessarily need to be bound to a specific API. In that case, we minimize the change so that when we want to do a switch, we don't need to make changes across the entire environment. And that's kind of a summary of the status of that.
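To make the facade idea concrete, here is a minimal sketch of that abstraction pattern. All class and method names are hypothetical, invented for illustration (real frameworks such as jclouds follow a similar shape): application code targets a thin common interface for basic compute, while each provider driver keeps its own extra capabilities reachable when you deliberately "go down" to the underlying implementation.

```python
# Sketch of the "common facade" approach to cloud portability.
# CloudDriver is the least-common-denominator interface; provider
# drivers may expose extra, provider-specific methods on the side.
from abc import ABC, abstractmethod


class CloudDriver(ABC):
    """Minimal common facade: only operations every cloud supports."""

    @abstractmethod
    def start_machine(self, image: str) -> str: ...

    @abstractmethod
    def stop_machine(self, machine_id: str) -> None: ...


class EC2Driver(CloudDriver):
    def start_machine(self, image: str) -> str:
        return f"ec2-instance-of-{image}"

    def stop_machine(self, machine_id: str) -> None:
        pass

    # Provider-specific feature, reachable by dropping down to the driver.
    def attach_elastic_ip(self, machine_id: str) -> str:
        return f"eip-for-{machine_id}"


class RackspaceDriver(CloudDriver):
    def start_machine(self, image: str) -> str:
        return f"rs-server-of-{image}"

    def stop_machine(self, machine_id: str) -> None:
        pass


def deploy(driver: CloudDriver, image: str) -> str:
    # Deployment logic depends only on the facade, so switching clouds
    # means swapping the driver, not rewriting the deployment code.
    return driver.start_machine(image)


print(deploy(EC2Driver(), "ubuntu-12.04"))
print(deploy(RackspaceDriver(), "ubuntu-12.04"))
```

The point of the design is that `deploy` never mentions a cloud by name, yet nothing stops you from calling `attach_elastic_ip` directly when you genuinely need an Amazon-only feature.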
So practically, if you do want to approach cloud portability, you have to think of it mostly from an abstraction perspective rather than from a standardization perspective. And by all means, we're not trying to create a common cloud API; that's not something that's going to happen anytime soon. The other realization, and again, I've been dealing with this question for a couple of years, so these realizations took a couple of years to actually evolve, is really: what do we mean by portability? Do we really need all those cloud portability APIs that I mentioned earlier? The first thing I came across is that when we're talking about portability, we need the ability to run an application in one environment and then switch it to another environment. That's really the important question. Not how do I run cloud operations in the same way, but how do I move the application from one environment to the other? And the important piece here is that application portability is not the same as cloud portability. Actually, if we really look at application portability, the problem is much smaller in scope, because on Amazon, on Rackspace, and on my private cloud, the operating system looks pretty much the same and does the exact same thing, so the same binaries that run in my data center or on my desktop will run on Amazon pretty much the same way. That's not different. The thing that is different is how I actually move the application, and if I think about moving VMs, then there's going to be some compatibility issue around that, and I'm going to talk about a different approach. Instead of moving VMs, I can think about moving the binaries of the application, so that if I move the binaries of the application, they're going to install and run in that cloud environment in the same way that they would run on different VMs.
So if I think of VMs as generic containers of an operating system, not as containers of my application, then it's not too much of an effort to actually port them. And the third point is that if we think about elasticity and scalability and all those other things and how we apply them, there is nothing really specific in a particular cloud environment that solves that problem much better than the others, because the implications for the application are going to be pretty much the same. The tools that I'm going to have in each environment are going to be different, but the things that I need to do in my application in order to make it elastic and scalable and redundant are going to be pretty much the same, and they're going to be my responsibility to a large degree. These are not my kids, as you can see by my skin color. But anyway, I thought they were pretty nice. So the last point in that regard is that not all clouds were born the same. I think we all understand that, but I still wanted to touch on a few points in that regard. If we really look at the SLAs that most cloud providers provide, we'll see that they provide different levels of SLA. I mentioned that in the previous slides. An interesting thing to see here is that Rackspace will give you better support and will pay you for crashes, where Amazon will give you a different level of SLA for that same failure. And Joyent claims to have better performance on their cloud. If we look at what Netflix can do on Amazon from a performance perspective, because Amazon introduced their flash disks and SSDs, they can run better on Amazon today than they could on other clouds. The reason why I'm mentioning it is that that's another reason why you would want that flexibility. There will be certain things that run well on one cloud and certain things that run better on another cloud. And those clouds are evolving, but they're not the same.
And that's something that we want to keep in mind, so that we can actually move the workload to the right cloud for the job. Now, we're back to the core of our discussion, which is big data. We talked about the fact that cloud and cloud portability are important. The reason why I brought that up is that when I talk about big data in the cloud, that's going to be an important item, especially for big data. So all the things that I mentioned are not that specific to big data, but why is this more important for big data? Because big data tends to be heavier, more data-intensive, something that we can't really move. Let's say that I have terabytes or petabytes of data in my data center; just moving it into the cloud is not going to work, because I'm going to need days to actually move the data to the point where I'm able to access it. So one consideration I need to take into account is to run the big data where the data is. If my application runs on the cloud, usually I would run the MapReduce on the cloud and the Hadoop on the cloud, but if my data happens to be somewhere else, I would most likely run it where the data is, because shipping the data is not that trivial. And in big data, the ability, or the need, to have that cloud portability is critical to actually run big data in the cloud. Without it, we can't really run big data in the cloud, because if big data in the cloud equals Amazon, that basically minimizes the scope and the potential of what we can do with big data in the cloud. That's why I started with that introduction. Now, the other thing, and this is a good segue to Paul's discussion from the previous session, is that when we look at big data, most people think big data equals Hadoop, or to a smaller degree other NoSQL kinds of implementations. But big data tends to be much bigger than that.
It's all the systems that live off of that data, because the fact that we can produce a lot of those MapReduce jobs and a lot of those insights and a lot of those analytics means that there is a whole bunch of systems that run on that data, pretty much like the way databases are used today: there is a database, but there is a whole set of applications that sits around it. So if we think about big data in the cloud, we have to think also about the services and applications that run with it. Because if the data lives here in the cloud and the application lives somewhere else, that's not going to work well. If I handle failure in one way in my, let's say, Hadoop implementation, and in a different way in my database implementation, whether it's Cassandra or something else, that's going to be very complex for me to manage, because I need to manage two different clusters. If the configuration for setting up the environment is going to be very different for my Tomcat cluster, for my Hadoop cluster, and for my Cassandra cluster, and if the monitoring and management of all that is going to be very different, then that's complexity that I need to carry. And in a cloud, that complexity multiplies, because it's an environment that is more dynamic and is more sensitive to failure and to network issues and other things. So we can't really go and build clusters of clusters of clusters without a consistent way to manage them, and we can't really think about the system as just Cassandra or just Hadoop. That's not a big data system. A big data system tends to be all the things surrounding it. So the first challenge that we need to deal with to actually run big data systems well in the cloud is the ability to spin up VMs and automate the entire process of deployment, configuration, scaling, and upgrades, continuous deployment across the whole stack of my application. And that's a big challenge.
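The "consistent management" idea above can be sketched in a few lines. This is an illustrative toy, not any real framework's API: every service, whether Hadoop, Cassandra, or Tomcat, is driven through the same lifecycle interface, so one management layer installs and starts the whole stack the same way, and only the per-service commands differ. All commands and names here are made up for the example.

```python
# Toy model of consistent management: one lifecycle, many services.
class Service:
    def __init__(self, name, install_cmd, start_cmd):
        self.name = name
        self.install_cmd = install_cmd
        self.start_cmd = start_cmd

    def lifecycle(self):
        # The same phase sequence for every service; only commands differ.
        return [("install", self.install_cmd), ("start", self.start_cmd)]


def deploy_all(services):
    """One management loop drives the entire heterogeneous stack."""
    log = []
    for svc in services:
        for phase, cmd in svc.lifecycle():
            log.append(f"{svc.name}:{phase}:{cmd}")
    return log


stack = [
    Service("hadoop", "apt-get install hadoop", "start-dfs.sh"),
    Service("cassandra", "apt-get install cassandra", "cassandra -f"),
    Service("tomcat", "apt-get install tomcat7", "catalina.sh run"),
]
for line in deploy_all(stack):
    print(line)
```

The payoff is exactly the one described in the talk: adding a new kind of cluster means writing one more `Service` definition, not building a new management system.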
Now, the first challenge, if we really look at it, is the thing that I mentioned: how do I actually get to the point where I can have consistent management? And by consistent management, I mean that I can install and run my Hadoop cluster in the same way that I run my Cassandra cluster or my Tomcat cluster. I know it sounds too good to be true, but that's the goal I want to reach. If I can get to that point, obviously my entire deployment will look much simpler. So that's the first challenge: how do I get to that point, if I can get to that point? The second challenge is portability. I think we talked about that quite a bit. And in the context of big data, portability means not just how I run things on Google's new cloud and on Amazon's cloud; it could mean running on my private cloud and on the Amazon cloud. It could mean running on bare metal versus a virtualized environment. That's the type of portability that is more common in the context of big data. I would want to run on bare metal for IO-intensive workloads, and I may want to run on Amazon for sporadic workloads, because I don't need that amount of machines per MapReduce operation, if you like; I only run that analysis, let's say, once every month. And obviously the other criterion by which I would choose one cloud versus another would be the locality or the affinity of the data, where the data sits. I think we talked about that earlier. And that's looking at portability from a different angle, a more practical angle. And the chances that you would need that level of portability in the context of big data are much higher. It's not that you hit a wall with Rackspace or hit a wall with Amazon and now you need to switch your entire operation; it's actually something that you would do almost on a daily basis. You would run your workloads and choose bare metal for this thing and something else for that thing.
During the same operation, the same application, the same organization, without doing a complete big bang switch. That's the other thing that makes the intersection between the two things interesting. Now, let's talk about Hadoop in the cloud, because that's the title of the discussion. How do we actually run Hadoop in a cloud? So there are various distributions of Hadoop. Hadoop itself is an Apache open source project, but in most cases people use one of those distributions when they run Hadoop. They use either the Cloudera distribution, the new IBM BigInsights distribution, the MapR distribution, or Hortonworks. Each one of them has a different value add, and they compete to a large degree with one another. IBM tends to be more enterprise-ready, or at least claims to be. Cloudera is one of the originators of that distribution model for Hadoop and has taken a lot of the IP from Yahoo and built it out. MapR tends to be more performance-sensitive, so they build a Hadoop distribution that is more performance-geared; they also have more algorithms for the availability of the name node. And Hortonworks are the guys from Yahoo, which I think we've seen before, who basically try to take a lot of the experience from Yahoo and build, if you like, a more specialized entity out of that. So they're taking a lot of production use cases, Yahoo still being by a large degree the largest deployment of Hadoop, and making it more production-ready. So when we're talking about Hadoop in the cloud, we need to think about those different distributions, and the other thing we want to think about is the ability to choose one of those distributions and not necessarily bind ourselves to a particular one. We don't want to have to run Cloudera in the cloud in a certain way and then run MapR in the cloud in a completely different way. We want to be able to choose the right big data distribution that fits our needs.
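One way to picture "not binding yourself to a distribution" is to parameterize the same deployment flow by the distribution you pick. This is a hypothetical sketch: the package names and the capability flags below are invented for illustration, not the real contents of any of these distributions.

```python
# Hypothetical: one install plan, parameterized by Hadoop distribution.
# Package names and the name_node_ha flags are made up for the sketch.
DISTRIBUTIONS = {
    "cloudera":    {"package": "cdh-hadoop", "name_node_ha": False},
    "mapr":        {"package": "mapr-core",  "name_node_ha": True},
    "hortonworks": {"package": "hdp-hadoop", "name_node_ha": False},
}


def install_plan(distro, data_nodes):
    """Build the same plan shape regardless of which distribution is chosen."""
    spec = DISTRIBUTIONS[distro]
    steps = [f"install {spec['package']}", "configure hdfs", "start name-node"]
    if spec["name_node_ha"]:
        # Distribution-specific detail handled inside the common flow.
        steps.append("start standby name-node")
    steps += [f"start data-node #{i}" for i in range(1, data_nodes + 1)]
    return steps


for step in install_plan("mapr", 3):
    print(step)
```

Switching distribution then means changing one key, not rewriting the deployment, which is exactly the flexibility the talk argues for.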
And that's the other thing that is interesting about it. Now, all that is kind of the promise of what we've been wanting to accomplish, and these are the challenges that we want to solve. Now I'm going to switch to a particular solution. Again, the solution itself should be thought of as a pattern. I'm going to refer to a specific framework called Cloudify, which is an open source framework that we built, and it takes a lot of the experience that we've seen in the market, specifically from DevOps. What we found is that if we really want to address a lot of those challenges, specifically the consistent management and the ability to plug into different frameworks and provide ways to run them on different clouds, we can use an abstraction model. I mentioned that earlier, but in this case not abstraction at the API level, but abstraction in the way we deploy and orchestrate the application. If you look at a lot of the cloud deployments right now, a lot of them are actually using the same ideas; that's why we picked that DevOps model. A lot of them are using automation tools like RightScale and DevOps tools like Chef and Puppet; that's how they run their operations. When they create all that automation, they basically create an abstraction: they can package their application in some sort of a cookbook, which is a kind of scripting language, a DSL, a domain-specific language, to describe what the application is doing, and then run it in the right environment. So if we can describe the application in a way that is agnostic to the underlying infrastructure, we can be more portable. And that's the idea of Cloudify: we can take a lot of those ideas from the DevOps world, a lot of those ideas from the cloud world, and create an abstraction layer that enables us to describe, in our case the Hadoop distribution, in such a way that we can run it on different clouds. Let's see how things are done, and the result of that once it's
been done, and I'll get to the technical bits in a second, is that if we can do it that way, we get all these nice benefits: we can run it on any cloud, we can have consistent management across the stack, we can automatically recover processes, and we can add more nodes to our Hadoop cluster, both in our private cloud and in the Amazon cloud or in other clouds. So there is a set of benefits that we gain just by doing that, just by plugging portable automation into our existing Hadoop distribution. How can that be done? There are basically three main elements to this solution, or to the pattern that I'm talking about right now. The first one, on the left-hand side, this piece over here, is called a recipe. This is basically a DSL, a description language. Don't try to read it; I see some people squinting and trying to read the details there, so I'll read it for you. It basically describes a specific deployment, in this case a Mongo one, forgive me John for mentioning that, even though we work with Cassandra even better. So there is Mongo here, Tomcat, and some other things. Changing that image before the presentation wasn't something I could do, anyway. So we have a description of an application here. Note that it's a description of an application, not a particular service. It's not just the Mongo or the Cassandra; it's the Tomcat, it's the load balancer, it's all the other things that I can describe here. I describe the tiers of that application in that model. I can say I have three instances of Mongo, three instances of Tomcat, one load balancer, and now: create that environment. Now, how do I tell someone to do that? Let's assume we do it manually. What do we need to know, from an information perspective, to actually take that request and fulfill it? We need to know where the binaries for each one of them live. We need to know how to configure each one of them: what's the right setup for Hadoop, what's the right setup for Mongo,
what's the right setup for the patch load balancer we need to know the order in which they need to be orchestrated so we need to run the database before we run the Tomcat server and then run the patch load balancer and then on a post deployment we need to know how to do all the maintenance work for example what does it mean to add node in that system especially if we do it for a dupe and then we need to be able to deal with how do we add capacity to the Tomcat server because when we add another capacity to the Tomcat servers their patch load balancer need to know about that and add that to its list so we need also to automate the post deployment aspect the maintenance piece if you'd like the scaling piece and the failover piece we'll see that in a second and the last bit that is important here is how do I abstract all that from the underlying infrastructure how do we tell all this information about the application and basically look at the infrastructure as just an endpoint for all that so we could take all the description and say oh this thing now run it on Amazon take that same thing clone it on my local data center and if we could abstract the description from the underlying infrastructure we could actually do that let's see how we do that so we'd clarify the approach that we've taken to do that is that description language is obviously by definition from the underlying infrastructure I'll give you some examples of how that is being done it includes all that information that I need to run if I would do that manually the difference is that instead of keeping it in my head and in papers I have it in some sort of a binary format the other description that it plugs into a lot of orchestration system that exists out there configuration management like Chef and Puppet so if there is a description of the application whether it's in shell script whether it's in Chef or Puppet or any of those tools we could import that and use that and not reinvent the wheel so the 
recipe is not yet another recipe format where I need to go and describe everything from scratch; I can plug in the things that I already have and describe only the missing pieces on top, rather than recreating things. So that's the first element.

The other element is that all applications are made out of processes, just binary processes. So if we plug into the application at the process level, we don't really need to care what those processes are made of, because whether they're written in Java, .NET or any of those languages, including Ruby and others, or C++, at the end of the day they're binaries that run on an operating system. So we have that least common denominator: if we can plug in at that level, we can run almost any workload. But databases behave very differently than web containers, so we can't really stay at that level. The question is how we add intimate knowledge about those processes, and how they need to behave from a scaling and a failover perspective, without touching the actual code, without being dependent on the actual application. There are a lot of ways in which you can plug into those binaries from the outside. One of those is called custom metrics: for example, we can call APIs to grab the information that is most relevant for telling whether those processes are behaving well. And there are different ways in which we can describe how to do the install and the configuration for each of those processes, for that specific workload.

Let's see a few examples of that, and I'll show you a live demo that will give you, I think, the full idea. You can see here two snippets; obviously I didn't show the full recipes. One snippet shows how I abstract the actual recipe from the cloud environment. The description in the recipe says this workload needs to run on "big linux" and this workload needs to run on Windows. Behind the scenes I have a way to map that to a particular cloud and define what "big linux" means in Amazon, what "big linux" means in Rackspace, or in my bare-metal environment, but it's outside of my application workload script. Therefore the only thing that I need to do to maintain portability is to have a cloud driver per environment, and the cloud driver includes the abstraction of how to start a machine and stop a machine, and how to map to a specific VM, in this case not a particular VM but a type of VM. Then I can take that same recipe and run it elsewhere with that same, if you like, information; there's nothing in the recipe that is specific to a cloud.

The other snippet shows how I can plug custom metrics into the processes. I can plug into Cassandra, I can plug into Hadoop, and grab almost every piece of intimate information about the latency, about the I/O workload and the performance, and use it. I can use it to make decisions like scaling; I can use it to spot whether something is wrong in my system. Let's see how that works. It would work pretty much the same way with other distributions, but I'm going to show you the BigInsights distribution, just for the sake of time.

So this is, for example, the environment that I use here; hopefully that will show up. The first thing that I have here is something called the interactive shell. This is how I interact with the environment itself. When I download the framework, which is called Cloudify, I run a command called bootstrap-cloud. The bootstrap creates an entity in the cloud that provides me a REST gateway that I can call, and therefore in each cloud I can do a bootstrap, run that gateway, and it becomes my access point to that cloud. In this case I'm doing a bootstrap for EC2. To do a bootstrap on my private cloud, which could be just a bunch of IP addresses on the network, I would run the same command, but the endpoint would be BYON, bring your own node; that's pretty much the only difference. The details of what those environments look like are going to be
provided in a cloud driver. So I'm not trying to create a lowest common denominator; I'm basically abstracting the details of the specific environment into one place, so that they aren't scattered across my entire environment. That's the only difference here. I can be as specific as I want with that driver and with that cloud, and if that cloud happens to offer certain features, I can still use them.

Let's see how the operation looks. The first thing that I do is bootstrap on EC2; that creates the environment. Once I've created the environment, the next thing I want to do is run my Hadoop workload, so I run another operation called install-application. Install-application takes the entities, in this case of BigInsights, which is the Hadoop environment, reads the recipe that tells it where to take the binaries from, how to install them and how to configure them, and uses the cloud API to create the machines to fulfill that request. And now we can see the management console that comes with Cloudify and watch how that process works.

By the way, the monitoring aspect that we created in Cloudify actually covers an area that I think is a void in the market right now, which is what I refer to as deployment monitoring. Most of the monitoring tools that you would use today show things post deployment, how things behave post deployment. This covers the process of deployment itself: the process of scaling, the process of installation, the dependencies between the entities. That is the area we're trying to cover with this specific monitoring, and it will actually plug into your deployment monitoring, sorry, into your application monitoring, and won't try to replace it. So that's another thing you get out of this for free.

What we can see when we launch it is that the request was to create a NameNode and DataNodes for Hadoop, and then we see the process of the deployment of those services, and we can watch the monitoring of this process as we go along. The gauges below show me the custom metrics, the metrics that are specific to my Hadoop distribution; I can also see the logs and other things. You can see here the progress and also the dependencies between those entities. Each box here, by the way, represents a cluster, not a node, so it's a logical box, not a physical box. So there are a lot of things that I can do here: take that entity, deploy it, monitor the deployment, and then at the end of the day we'll see how we can plug into the specific Hadoop management layer that comes with BigInsights, because BigInsights comes with its own management layer. Are we trying to replace it? No, we're actually trying to enhance it. So we'll see what that means and what you can see in the BigInsights management console from the Cloudify console. When I deploy the application, I can tell it where the specific workload's deployment manager is, and that is what will monitor the more fine-grained, Hadoop-specific things. That way we get consistent deployment, because I can run the same thing that I've just done for Hadoop for my Cassandra deployment, for my Mongo deployment, for anything else; it's not going to be that different. And I can still plug into their custom management and their custom monitoring; I'm not trying to create a layer that hides all that information from you.

We're running out of time, okay. Do we have time for questions or anything? We don't, okay. So we'll end with that, and I'll just finish with one last thing that I prepared specifically for this event in Spain. That will take one minute, is that okay? I see you're almost standing here with a shotgun, so bear with me for a second. The last thing that I wanted to show is how simple this could be, and the thing that you can see here is something that
would remind you of something that is not related to big data or cloud. What does it look like? A player on YouTube, right? It resembles a YouTube player. Now how is that related to the big data discussion? Think about your big data distribution or deployment as a binary, just like a video. Before, if you wanted to serve a video, you needed to get a streaming server, install it, set it up, do all those other things, and then use it. Today, with YouTube, you just use YouTube as a service. Now imagine that you could do the same thing for your big data: you could have a player, and if I press play on the player here, and unfortunately I can't really see it, there is a problem with the resolution, what it does is simply call the cloud API, install that service, and give me the URI for that service. You can see it popping up right now; it's actually on YouTube, so you can use that player, play it as you wish. I'll share it on Twitter for those who want to actually look at it. So the point is that we can make things very, very simple with this approach, and actually run almost any big data workload in a much simpler way, as a service, without being tied to a specific tool or a specific cloud. And that's the point of my discussion. Thank you all for listening.
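[Editor's note: to make the recipe idea from the talk concrete, here is a minimal sketch in Python. This is not Cloudify's actual DSL, which is Groovy-based; every name, template and instance type below is invented for illustration. It shows the three ingredients the talk describes: a cloud-neutral application description with tiers, a dependency order for orchestration, and a per-cloud driver that is the only place that knows what an abstract machine template like "big_linux" means.]

```python
# Cloud-neutral application description: tiers, instance counts, and an
# abstract machine template per tier. Nothing here names a real cloud.
APP = {
    "name": "demo-app",
    "tiers": [
        # service, instances, template, dependencies (DB before app server
        # before load balancer, as described in the talk)
        {"service": "mongod",    "instances": 3, "template": "big_linux", "depends_on": []},
        {"service": "tomcat",    "instances": 3, "template": "big_linux", "depends_on": ["mongod"]},
        {"service": "apache_lb", "instances": 1, "template": "big_linux", "depends_on": ["tomcat"]},
    ],
}

# One driver per cloud: the single place that maps the abstract template
# to a concrete machine type. Type names here are made up.
CLOUD_DRIVERS = {
    "ec2":       {"big_linux": "m1.large",   "windows": "m1.large-win"},
    "rackspace": {"big_linux": "4gb-flavor", "windows": "4gb-win"},
}

def deploy_order(tiers):
    """Order tiers so that dependencies start first."""
    ordered, seen = [], set()
    def visit(tier):
        for dep in tier["depends_on"]:
            visit(next(t for t in tiers if t["service"] == dep))
        if tier["service"] not in seen:
            seen.add(tier["service"])
            ordered.append(tier)
    for tier in tiers:
        visit(tier)
    return ordered

def deploy(app, cloud):
    """Pretend-deploy: resolve every instance against the chosen driver.

    Returns the provisioning plan as (service, machine_type) pairs; a real
    system would call the cloud API here instead.
    """
    driver = CLOUD_DRIVERS[cloud]
    return [
        (tier["service"], driver[tier["template"]])
        for tier in deploy_order(app["tiers"])
        for _ in range(tier["instances"])
    ]

# The same recipe runs unchanged on either cloud; only the driver differs.
ec2_plan = deploy(APP, "ec2")
rackspace_plan = deploy(APP, "rackspace")
```

The point of the sketch is the shape, not the details: the recipe stays portable because the driver table is the only cloud-specific artifact, which is the "cloud driver per environment" idea from the talk.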
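[Editor's note: the "custom metrics" idea, watching a process from the outside and driving scaling and failover decisions from service-specific measurements, can be sketched the same way. The metric names, thresholds and heartbeat logic below are invented for illustration; a real deployment would read such values from something like JMX, a REST endpoint, or the service's own stats API.]

```python
def scale_decision(metrics, max_latency_ms=200, min_latency_ms=50):
    """Return +1 (scale out), -1 (scale in) or 0 (hold) from an observed
    service-specific latency metric. Thresholds are illustrative only."""
    latency = metrics["avg_request_latency_ms"]
    if latency > max_latency_ms:
        return +1
    if latency < min_latency_ms and metrics["instances"] > 1:
        return -1
    return 0

def needs_restart(metrics, heartbeat_timeout_s=30):
    """Failover check: a process that stopped reporting is presumed dead
    and should be restarted by the automation layer."""
    return metrics["seconds_since_heartbeat"] > heartbeat_timeout_s
```

For example, a tier reporting `{"avg_request_latency_ms": 350, "instances": 3}` would trigger a scale-out, while one that has not sent a heartbeat for two minutes would be flagged for restart, without either decision touching the monitored application's code.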