Right, so those who use big data on OpenStack, raise your hand. Okay, those of you who raised your hand, please come forward if you can, because we're actually going to ask you questions during the panel, so you're going to be somehow part of the panel.

So the purpose of this discussion is really to make it interactive. I'm going to be the moderator. I'll start with introductions soon, but as a moderator what I really want to do is stimulate a discussion, not more than that. We have a list of questions prepared in case you don't ask questions, but if I need to run through the entire list, that means it's a failure. So please don't be shy, and ask questions.

I'll start with an introduction. First of all, my name is Nati Shalom, CTO and founder of GigaSpaces. We've been in the big data space for almost 13 years, dealing with the computing problems, and some people already know the name, so I won't add to that. We'll start with the right-hand side.

My name is Craig Tracy. I'm formerly the cloud automation architect at HubSpot. I recently took a new position at Blue Box Group, building private cloud OpenStack. I'm happy to speak about my time at HubSpot and our experiences, although I can't speak for what they'll be doing in the future.

Hello, my name is Ilya Elterman. I'm director of platform engineering at Mirantis. I'm working on projects like Savanna and so on, and my team actually founded the Savanna project in OpenStack.

Hello, my name is Brent Holden, with Red Hat. I'm the chief field architect for our Eastern region. I work primarily with financial services and technology customers, helping them to build solutions around big data and OpenStack.

Hi, I'm Bruce Matthews. I'm with HP Cloud Services, a solutions architect in the Western region. I've been working in data management since the early '70s, so I go back a rather long way. Recently I've been working with Trove and Savanna to bring those things to HP Cloud, which is the largest OpenStack implementation in the US.

Cool. So I think I'll start with the first question: does big data on the cloud make sense, given the regulation and the potential performance issues? This is a question for all the panelists, and then we'll start to route questions. This time we'll start with you, Bruce.

Well, I think it depends. That's a good answer to start from. I think there are a few specific scenarios where it doesn't make sense, where industry regulations and things of that nature stand in the way. However, I think there's a huge cross-section of capabilities of big data that go all the way down to the fundamental modern mom-and-pop shop. That's really where I think big data can level the playing field, from the large corporate entities all the way down to the small one- and two-user needs, and provide them the same data back. That's where I think the economics will make it make sense.

So would there be specific cases in which you would say this is not good for the cloud, and this would be a good case for the cloud? Is it size?

Yeah — instead of size, I think it has more to do with the regulations requiring the data to be accessible only to a given few. So, you know, PCI-compliant application services, HIPAA-compliant application services.
I think those are not really what I would use big data in the cloud to access.

Okay. So we're not going to go in order.

I do believe that in most of the cases big data is actually a good fit for the cloud, and the cloud actually helps to resolve some of the security restrictions, because it separates the concerns: the distribution, on the technical level, of the different workflow executions into different hypervisors, different virtual machines. So it separates them almost on the physical level. With big data, HIPAA compliance and so on needs to be resolved in some way, and most of the compliance is around logging and around the separation between different tenants — multi-tenancy is, of course, the separation of access between different tenants. I do believe that clouds can actually help. The problem is that it's a new field, so the tooling is lagging, but it will catch up.

Good. So we got a slightly different perspective on that. And Brent?

So what I think is that big data in the cloud makes sense for companies if you understand who's able to access the data. Now, everyone has a different definition of cloud: some companies just consider an on-premise private cloud as their cloud, some companies consider Amazon the cloud, some people consider OpenStack and Amazon the cloud. So it really depends on who's able to access it. There are situations — Red Hat has customers in the United States government — where they are able to put their data onto an availability zone on Amazon. Now, you think of Amazon as a public cloud, but in this instance there's a very strict contract between the government and Amazon: the US government has very strict definitions of who has access to that data, and very strict control over who's able to access the equipment that the data resides on. So in that case you're thinking about top-secret data on a public cloud, which typically makes no sense, but it works because there are very explicitly defined parameters for who has access to that data. I think the same thing goes for private cloud. If you have PCI or healthcare compliant data, that's something where, if you are on a private cloud and you're able to guarantee who has access to it, that's where it makes sense. If you're not able to guarantee who has access to that data, the regulatory compliance will absolutely kill you.

So basically, I think you're extending the definition of the cloud to private and public, and in this case there might be more granularity in how you deal with regulations. For example, you could still run in a cloud, but it would be private in case you have a locality issue — and, given all the news in the past weeks about the NSA and such, not to mention other cases, I think that is something that has to be considered. Now, Craig, you told me that you already had petabytes of data somewhere else, even though it was in the cloud. That is also a consideration: how much data you already have.

It wasn't petabytes, it was large numbers of terabytes. Okay — but I'm probably not going to disagree too much here. I think it kind of depends. For us, though, having grown up as a cloud company from the very beginning — literally the only hardware that we had was half a rack, and it served our wiki.
So it made sense for us to be in the cloud, and the benefits that we enjoyed from the cloud were the elasticity, and the fact that we could start new projects without thinking about the capital expenditure we would have to take on to actually start them. That said — extending the point made previously — if you have compliance issues you might want to think about running in a private cloud, and that's precisely what we had done. So whether we were using public or private cloud, we were always a cloud entity, and it made sense for us to be.

But how did you deal with it when you actually switched, let's say from Amazon to OpenStack? How did you actually move the data? Did you actually move it, or what was the process?

Yeah, so for us that amounted to laying fiber between the two data centers we were in — US East as well as Chicago — and we would routinely move the data back and forth as we spun up new features inside the private cloud. The intention is to eventually, one day, cut that cord and have everything collocated in the private cloud in Chicago.

So basically what you're saying is that you created a dedicated network pipeline to reduce the latency, or at least increase the bandwidth, so you could push more data. And then you continuously streamed it, so that the copies, or the data, would be continuously updated — in that case you're talking about incremental updates, not pushing the entire data set?

Well, no, in our case we actually did push the entire data set at one point in time. Once we were ready to ramp up, we pushed the entire data set. But as we migrated apps, we moved the data but left the apps actually living in the public cloud, and slowly migrated the apps. You have to consider that we have many, many apps that are talking to many different sources of data. So we used that pipe as kind of a bridge between our public cloud implementation and our private cloud implementation, to manage the traffic as we were doing the migration.

Okay, so just to be clear here: the migration happens once, but after that one-time migration, did you still have to keep the two copies alive?

Maybe I misspoke there — we did shift completely to one location. Maybe the confusion here is that we actually had multiple Hadoop and HBase clusters, so all the data kind of interacts with each other. What we would do is selectively say this cluster is moving to the private cloud, but at the same time the workloads that are happening in the private cloud actually have to talk back to the public cloud, until we're ready to move those workloads to the private cloud as well. So we left that low-latency pipe between the two data centers while we were making that transition. And for someone with our kind of footprint in the public cloud — close to 2,000 instances — that's a big move.

Yep. Cool. So maybe, Ilya, let's talk about distributions. How much is the distribution an important thing? Can I take, you know, a regular distribution from Cloudera or Hortonworks or whoever and put it on OpenStack without any modification, or do I need specific optimization? I know that you've done some work around that; maybe you could share that. And you also have a session, I think.
Yeah. I'll try not to spend too much of the time talking about Savanna's features — I have a session today, at 2:40 in breakout room one, where I'll go into more detail and show a demo. But in general, about the different distributions: Hadoop is like OpenStack — it's a huge big data platform with lots of different components. There are some core components and some integrated components, and different vendors package them in somewhat different fashions. The core is similar, but the peripheral services may differ, and the deployment topology may differ. So what we're working on right now is to create a unified way to work with the core functionality, through one API, via a plug-in mechanism — a unified way of integrating the different management systems from different vendors, while still enabling those systems to embed their specific features, their differentiation, into the tooling, so that end users can take advantage of that. So I would say that in general they are quite similar, but there are some different services and some different parameters that provide different benefits — Intel supports some hardware optimizations and so forth — which will depend on the details of how they operate.

Now over to Bruce — your experience in that regard?

Yes. I think there are actually lines of delineation, lines of demarcation, between each one of the distributions as they stand today. Starting out with just the Apache HDFS and MapReduce components: a lot of people have this conception that "I don't need to specialize my hardware to do anything for Hadoop; I can use my laptops and old HP computers and whatever I have available, stick them together with a couple of switches and voila, I've got a Hadoop environment." If you do that, you may actually be able to run Hadoop, but you won't be satisfied with the end results. So you really have to look at it as an application service and host its components in different ways, on differently configured platforms. There are certain formulas — and I think this goes back to the point that Ilya was making — roughly one virtual core, 4 GB of RAM, and one terabyte of storage per data node. That will work for almost all of the distributions. However, if you're dealing with MapR, for example, which has MapR-FS and writes in small blocks across those nodes as opposed to the 64 MB blocks in Cloudera or some other distribution, you actually have to go through and design the sizing, the striping, and the raw-device layout depending on which distribution you're working with. It's founded on those sorts of mathematical principles.

So I think the general sentiment is that you could run a regular distribution, but you wouldn't get the most out of it — the performance aspect. There is also the aspect of storage. Brent, did you want to touch on that?
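[Editor's note: as a rough illustration of the one-core / 4 GB / 1 TB-per-data-node rule of thumb Bruce mentions, here is a back-of-the-envelope sizing sketch. The data set size and headroom factor below are made-up assumptions, not figures from the panel.]

```python
import math

# Rough sizing sketch for the "1 vcore / 4 GB RAM / 1 TB per data node" rule
# of thumb mentioned above. All input numbers are illustrative assumptions.
raw_data_tb = 100        # size of the raw data set (assumption)
replication = 3          # typical HDFS replication factor
headroom = 1.25          # spare room for temp/intermediate output (assumption)
tb_per_node = 1.0        # storage per data node, per the rule of thumb

nodes = math.ceil(raw_data_tb * replication * headroom / tb_per_node)
print(f"data nodes needed:   {nodes}")
print(f"vcores (1 per node): {nodes}")
print(f"RAM in GB (4/node):  {nodes * 4}")
```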
And I know that, Craig, you had designed things specifically differently for that, so we'll start with Brent.

So, I did want to address the point that the distributions are aggressively trying to differentiate from each other, and some of that involves proprietary extensions — the SQL interfaces on top of Hadoop — and some of that involves the storage integration layers below Hadoop. Some distributions will support not just HDFS but the Hadoop Compatible File Systems, the HCFS standard, and some won't, and that will really limit your choice as to what type of storage model you can deploy. The more upstream-friendly distributions — I'll call out Hortonworks and Intel — I think are working more closely with upstream, not only to define that HCFS standard and give customers alternatives for their storage, but also to build a story around anti vendor lock-in, whereas some of the other distributions are trying to lock customers into their HDFS platform and also their proprietary extensions. So it's up to you as a customer to figure out what's right for you and what model works: do you want to use their HDFS, or do you want to use another, HCFS-compatible, scale-out file system?

But is there a distribution that you would say is a better fit for cloud, and OpenStack specifically, and distributions that are less optimized for that, in your view?

Well, I think the distributions that are embracing cloud are probably the ones you're going to see more success with. Hortonworks embracing the Savanna project that Ilya is working on, with Matt Farrellee at Red Hat — that type of embrace of cloud and that type of distribution model makes more sense, and I think those are the types of distributions that customers are going to lean towards when they use it on cloud.

And Craig, you were using Cloudera, right? What was your experience?

Yeah, we were using Cloudera. But as I mentioned before, we were early cloud adopters and things like Savanna did not even exist, so we were using the Cloudera distribution. And I think there was a little bit of a push towards the end to move more to upstream, straight Apache, so that we could actually add the modifications we wanted into the product itself. But yeah, for us it was strictly Cloudera.

And the reason why — if I remember the discussion that we had — was that you were running on regular disk, right? Not on cloud storage?

Yeah. When we were in the public cloud, we were running on strictly ephemeral disks, and in the migration towards the private cloud we were actually running on top of the bare metal driver for Nova. In that case we were literally just sitting on top of 16 terabytes of raw disk.

Okay, so in that case you didn't have a potential conflict with the way storage is managed as a file system — it's Hadoop that is managing its own file system. Ilya, maybe you could comment with your view: is there any conflict between having two storage systems sitting together in the cloud, or is the approach that Craig was taking the right one — to avoid cloud storage in general?

First of all, on the previous conversation, let me just make a quick point.
So I do believe that, in terms of the optimization of Hadoop, it's not the distribution that is at the center; it's the type of workloads you're going to run that is at the center, and you optimize for that. Based on the set of services you pick, you may choose a different distribution, and then you optimize the actual parameters — that's where most of the value is, in the actual optimization parameters. That's why in Savanna we embed the full list of parameters for each of the vendors: because for some users it might be extremely important to optimize the configuration of the Hadoop cluster for the actual workloads. That's the key for big data at this point, because the whole big data story is a set of compromises — you can't do everything in a cost-effective fashion.

So, on the storage side: I do believe that big data is also cloud in a sense, but these have been specialized clouds, and now there is a move towards general-purpose clouds that provide infrastructure as a service in general. When that happens, there is one central area which is common, which is storage, and at least for big archival storage purposes, general-purpose clouds like OpenStack and big data will start to converge and influence each other more and more. Big data workloads can take advantage of the cloud storage and read the data directly from it, because there's already a bunch of data stored over there — logs, tweets and so forth — and at the same time the necessity of driving big data workloads — and big data is all about bringing computation to the data — will affect how the cloud storage engines are architected and how compute and storage nodes are collocated.

Yes, I think we're touching a very big point here. I'll come back with a few questions, but I wanted to first see if there are questions from the audience before I continue. Anyone have a question for the panelists? Yes — there's a mic over there.

Basically, I have a separate Cloudera project, because I'm a Cloudera partner, and I have OpenStack in my office, and my clients are mostly government, on CloudStack. So we start with small projects with Hadoop and Java and everything on OpenStack, because we can make a virtual machine in my lab for every project; after that we move to CloudStack. Now we are working to make Cloudera run independently, rather than on top of CloudStack, to make it faster and have it manage its own storage. That's the architecture in our projects — I think we have four or five, something like that — and now we are integrating everything with Cloudera, because we use a lot of Cloudera.

And your question is — one project and one question — is this a good approach or not? Because I can see we have a problem with how to make it faster, and if Cloudera has a problem, we have the problem.

So the question is whether running Cloudera on OpenStack is good enough.

They are running CloudStack, so the first recommendation: switch over to OpenStack. Switch over to OpenStack.
Great start there. And then the actual question — tell me if I'm summarizing correctly — would be: does running Cloudera on OpenStack have any performance issues or needs?

Yeah — excuse me — I was going to say that I think this really goes back to something that Ilya was talking about earlier, in terms of the workload driving the use of a distribution, not the other way around. So it isn't that Cloudera may not be configured as well as you'd like and things like that; it might actually be that using Cloudera to satisfy your workload isn't the fastest way to run it. The only way you can figure out whether that's the case is to go gauge the number of slots being used for specific tasks, whether you're getting contention for the available resources to do different tasks, whether you're timing out on data nodes, whether tasks have to be repeated to get done — things of that nature. If you are, that will point to the portion of the configuration that's constraining it, and as a result you'll be able to analyze what you would need to fix — which may point to a different distribution that runs faster, if that makes sense.

So if I summarize what you're saying: one option is the choice of distribution, and a second is the parameterization within each distribution — how you can optimize it — and there is no one optimization for all workloads; you have to optimize for a specific workload. So it looks like the answer is that you should be able to get the performance you're looking for either through the choice of distribution or through the configuration, and the cloud is not necessarily the bottleneck in that space. I think that's kind of the summary.

Yeah — I mean, I'd even disagree with that a bit. I disagree with the idea that the workload really drives the choice of the distribution that much. I believe all of these distributions are trying to solve similar problems in different ways — some do better in one area, some in another — but almost any problem that is at all solvable with Hadoop per se can be solved with any of the distributions; it's just a matter of a different type of optimization, and maybe in some cases the cost will be slightly higher. But in general they are not that different in terms of what the end result is, and it's absolutely possible to solve all of the applicable problems. I'm sure the distribution vendors here, if there are any, would disagree with that, but we'll leave it there.

So — just Bruce and Mike, right? You raised your hands when I asked the question at the start, so maybe you could elaborate on that.

So we're using — yeah, we're using Cloudera. We just have a Chef script. The hardest part actually was identifying —

And when you're saying "we", who is that?

Data Tactics, a data engineering company. We do analytics on the data — doesn't matter, you know, we could do it with anything from Excel to Hadoop and everything in between. So we're looking at the analytics. The question for you guys is: have you done comparisons between, okay, here's my deployment with my sample data set on OpenStack, theoretically optimized, versus an equivalent on, you know, hard iron?
Yeah, I actually have a really great position within Hewlett-Packard, in that I'm able to take what they call AppSystems, for example — the Hadoop infrastructure reference architectures that we produce — and deploy, using similar data sets, against OpenStack. I use other tools for the deployment, like Savanna and those kinds of things, but the end result is basically a similar configuration. And then I actually run quite a few of the usual benchmarks — WordCount, Pi, TeraSort, TeraGen, those kinds of things, CloudBurst. There's a newer one out there called YCSB that I've started using; it's more of an OLTP kind of thing — yes, the Yahoo! benchmark. And I compare results. Quite frankly, depending on the number of nodes, I may have to generate more nodes, because the individual data nodes are smaller in my cloud than on the big iron.

So Bruce, is it faster on the cloud or slower?

It is faster on AppSystem, which is bare metal.

Okay, but by how much?

It's more like 15 percent.

More than 50 percent?

No — 15. 15 to 20 percent is probably a good figure to go with for the difference between the two.

And hold on a second here — Craig, you mentioned something different when you moved from Amazon to your bare metal on OpenStack, right?

Yeah. So again, it's not apples to apples — we're not comparing cloud to cloud, or exactly like-for-like — but moving from hypervisors in a public cloud setting back to bare metal, we were seeing close to 10x efficiencies, both in terms of cluster size and the workloads. And that may be specific to our workload.

Right, so here we actually got 10x. And if I may add to that: about the benchmarks and the reasons why there are differences, from what I could see there are different ways in which you could measure the difference. One is that you do apples to apples; in that case you might see the range of 20 percent. But if you optimize for the specific workload — for example, in a cloud you have only a few choices of hardware and disk and so forth, while on bare metal you could choose completely different things — you don't necessarily have to compare apples to apples. You could actually move to something different, and in that case the difference, just because you have more choices, might be significant — up to the point of 10x, as was mentioned, and I've seen more in some cases.
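[Editor's note: a minimal sketch of driving the TeraGen/TeraSort runs Bruce mentions, assuming a working `hadoop` CLI and the standard MapReduce examples jar; the jar path, row count, and HDFS paths below are assumptions and will differ by distribution.]

```python
import subprocess

# Assumed location of the examples jar; Cloudera, Hortonworks, etc. ship it
# under different paths, so adjust for your distribution.
EXAMPLES_JAR = "/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"
ROWS = 10_000_000  # 10M rows of 100 bytes ~= 1 GB of generated input

# Generate the input data, sort it, then validate the sorted output.
subprocess.check_call(["hadoop", "jar", EXAMPLES_JAR, "teragen",
                       str(ROWS), "/bench/teragen"])
subprocess.check_call(["hadoop", "jar", EXAMPLES_JAR, "terasort",
                       "/bench/teragen", "/bench/terasort"])
subprocess.check_call(["hadoop", "jar", EXAMPLES_JAR, "teravalidate",
                       "/bench/terasort", "/bench/teravalidate"])
```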
Right. Yes?

Yeah, I think that's the general sentiment in the community: virtualization is not particularly designed for IO-heavy workloads. I think with Hadoop, though, it's unique in that you can have the IO sit over the network, so you can access it using a Hadoop Compatible File System. That allows you to keep your data on fast-IOPS machines and your compute on that abstraction layer for virtualization. There are situations where you'd want to co-locate your data with those CPU-intensive workloads, and in that case you will see an improvement over virtualization just by the nature of it: if you have high-IOPS jobs, you'll always see an improvement on bare metal. I don't know if it's 10x, but it's certainly much better performance.

Yes?

We are starting off with bare metal, obviously, and we have been talking to the vendors, where we know, based on our workload, that we are going to push the limits of the different big data vendors themselves. The question is: given a company like Symantec, which, because of the security-driven way in which we work, obviously owns its own data centers and wants to stay that way — the public cloud is not a luxury we can afford — given that, what are the benefits of actually putting the data on OpenStack, other than being able to maybe migrate from one particular piece of hardware to another? Am I missing something? I just want to understand.

So the question, just to repeat: if the recommendation is to use big data on bare metal, what's the benefit of running on OpenStack versus, if I may, running without OpenStack? If I'm on a private cloud and I'm using, let's say, Hadoop, which already kind of manages itself, what benefit do I get by running it on OpenStack versus running it without OpenStack?

Yeah, I can touch on that a little bit. For us, having grown up in the cloud, we had all of our tooling, and our provisioning is based on custom tools we built in-house that do all of our provisioning. The nice thing about OpenStack, using the bare metal driver, was that we could use that same control plane. We didn't have to rely on tools like Razor or Cobbler or any of these PXE-boot types of things. We treat bare metal precisely the same way we would treat a normal instance in a public cloud, which is completely disposable. So, you know, if we don't like it, we kill it and we spin up a new one. That was the driver for us: we didn't want to manage all of these disparate types of tooling, and we wanted the same consistent management.

The aspect in which you don't look at big data as a completely separate island — it's part of the system, and within the context of a system it makes sense to have a consistent way to manage the resources. I think that's a good summary.
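[Editor's note: a hypothetical sketch of the point Craig makes about using one control plane — requesting a bare-metal node through the same Nova API as any VM. It assumes a Havana-era python-novaclient; the credentials, endpoint, flavor, and image names are placeholders, not anything cited on the panel.]

```python
from novaclient import client

# Connect to Nova exactly as you would for virtual instances.
# All credentials and the auth URL below are made-up placeholders.
nova = client.Client("2", "admin", "secret", "demo",
                     auth_url="http://keystone.example.com:5000/v2.0")

# A flavor mapped to physical nodes and a prebuilt Hadoop image are assumed
# to exist under these (invented) names.
flavor = nova.flavors.find(name="baremetal")
image = nova.images.find(name="hadoop-datanode")

# A physical box is requested like any other instance, and can be deleted
# ("killed and respun") the same way: nova.servers.delete(server)
server = nova.servers.create(name="datanode-01", image=image, flavor=flavor)
print(server.id)
```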
Brent, did you want to add to that? Actually, okay — Ilya, you wanted to make a comment on this.

So this actually also cross-correlates with the previous question, going back to the benchmarking. When we're talking about performance, that's just one aspect. We're working for enterprises at the end of the day, and what drives things is how much you pay versus how much you get. Performance in pure FLOPS is not necessarily the end result you're striving for; it's how much performance you get per dollar spent, and you might want to account for all of the electricity, data center power and so forth.

So, for example, if I extend what you're saying: if you have a sporadic workload, even if you need, let's say, ten times the machines, but you can kill them after an hour, that's going to be more effective than having a machine that takes ten times as long but that you keep running 24/7.

Right, just as an example. And it's true for the private cloud case too: for the bigger companies, you can mix and match different types of workloads — not just different types of Hadoop workloads but all types of workloads — in one cloud, and use the available bare metal in a much more efficient fashion, even if there's a 15, 20, or 50 percent degradation. Which, by the way, I don't personally agree with — I believe right now it should be along the lines of 10 to 20 percent, and it's getting better: hypervisors are getting better, there's PCI passthrough and such, the cloud management systems are getting better, so it will reasonably quickly go down to very little, if anything. I've even seen cloud workload optimization do better in the right cases.

So what do you recommend — using bare metal on OpenStack or not?

It depends on the workloads.

Okay, so it still depends. What would be your recommendation?

I mean, bare metal provisioning is part of the cloud, part of OpenStack, and it's just another flavor of on-demand resource — funnily enough, a more expensive type of flavor, like in Amazon, where you have more expensive instances. You can spin up a small or large virtual instance at a certain cost — at the end it's all about cost, it costs your company whatever, say 10 cents per minute — or you can spin up a bare metal machine from the bare metal provisioning pool and it will cost, say, two dollars. So it's a cost versus performance ratio.

So if you're less worried about cost, you go with performance, and in that case bare metal makes sense; if you're more worried about cost and willing to sacrifice performance, then virtualized instances might be a better choice. Good answer. Yes — there are more questions here.

One of the problems that I was thinking of — please say your name — I'm Jeff Applewhite, I work at NetApp, I work on our Hadoop implementation. One of the problems I was thinking of with Hadoop on the cloud is that the HDFS file system is optimized for physical location, as far as storage and as far as network: a copy gets made, then it gets copied onto a distant rack, and then that gets copied to a third node, normally in a three-replica scheme. So how do you work around that in the cloud?

I mean, there are a bunch of options. Just use ephemeral or local storage — that's a local drive, and most of the hypervisors nowadays handle that well. In that case you bypass the cloud storage and use the local disks.
And then HDFS basically becomes the storage. There are different use cases. If you want, you can run a permanent Hadoop cluster on top of OpenStack per se, and even there take the benefit of being able to elastically change the size of the cluster — just a matter of clicks, on demand and in no time — and have the data stored in the actual cluster, with HDFS located on the local drives. And there's a bunch of work happening on the Nova side — I believe in the Havana time frame it will already be possible to map a spindle to a tenant, so there's no competition between different tenants for write access to one hard-drive spindle. Or you can use, I don't know, SSD drives. It's not a silver bullet — it's a platform, and you can configure it to work best in your case.

So I think part of the answer to your question is that you have choices, either in the cloud or outside of it, and there are probably solutions for most of the options. It's mostly about the option that you choose; there isn't one single answer to that question. So, Bruce — and Brent — I think you had a different perspective?

Well, I think it comes down to data locality. I think that's the general question, right? How do you guarantee data locality? Things like software-based storage — like Red Hat Storage — have features like NUFA, non-uniform file access; it's essentially like NUMA for storage. That will allow you to get data locality, but that also means the layers above have to work with it — things like Nova have to be aware that these things exist. So, you know, OpenStack becoming rack-aware is something that is going on within TripleO right now, and Nova becoming more aware of the storage is an effort that's going on right now. So to the point: right now, how do you guarantee data locality? I don't think there's a great way to do that today. A lot of the things we're talking about are problems that we're trying to address in the community.

If I could just make one comment on that: we had this problem with the bare metal, and one nice thing that we have access to with OpenStack, especially when running with the bare metal driver, is that we could change it. We had this precise problem where we needed rack awareness in order to make sure that we had redundancy across racks, so we just changed it — we added rack awareness, and we actually built an entire GUI on top of it, so that any developer could go to the GUI and say, "I want a new node in this rack," because I know that I have to stripe across these racks. In the public cloud we did the quick and dirty, where an AZ is a rack, right?

And in Savanna we've already moved a step forward on that. There is no way to learn about the rack distribution of the compute nodes from OpenStack itself, but if you give Savanna a simple configuration file, it will analyze it, combine it with the information about where the actual virtual machines are located — this is taken from OpenStack — combine that with information about where the Swift data is located, and pass it over to Hadoop, so the Hadoop scheduler can take advantage of this information and enable data locality for the Hadoop workloads.
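[Editor's note: a minimal sketch of the rack-topology mechanism being described. Hadoop can be pointed at a script via the net.topology.script.file.name property (topology.script.file.name on Hadoop 1.x); it passes host addresses on the command line and expects one rack path back per host. The mapping below is invented; a tool like Savanna would derive it from what it learns about VM and Swift placement.]

```python
#!/usr/bin/env python
# Minimal Hadoop rack-topology script: print one rack path per host argument.
import sys

# Invented host-to-rack mapping for illustration only.
RACKS = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}
DEFAULT = "/default-rack"

for host in sys.argv[1:]:
    print(RACKS.get(host, DEFAULT))
```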
So basically you're saying that in the pipeline, at each step, there is some level of data locality, and each tier takes care of data locality in its own part of the pipeline? So, for example, there is no data locality between Swift and Hadoop, but the routing of the data is somehow related — is that what you're saying, or are you saying something different that I didn't get?

I'm trying to say that even right now, with the help of Savanna, you can provide the information about where the actual data is located on the different levels — virtual host, physical host, hypervisor, rack. Tools like Savanna can already collect this data and pass it over to the Hadoop scheduler. Now, if the actual data locality doesn't exist — if you put all of your data in Swift and it's stored in a separate rack — then whatever information we collect, there is no way to get data locality; it's simply separate. But if you design the cloud in such a way that Swift nodes, for instance, are at least in the same racks, then there is already tooling that collects this information and passes it over to Hadoop, and the Hadoop scheduler, in native Hadoop fashion, can take advantage of it and schedule the workloads close to the data.

Got it. Did that answer your question, by the way? You got an answer — good. Other questions? Yes — just wait for the mic, because I can't hear you.

I would just like to ask: you say there is 15 to 20 percent overhead with virtualization compared to bare metal. Do you see any big difference between different hypervisors — KVM, or Xen, or ESXi — on data nodes with a lot of disk IO?

Yeah, so I was just going to say that that's only really for some workloads, because I think Craig had mentioned earlier that it's really workload-dependent whether it's 20 percent or whether it's 10x, which it could be. From the hypervisors that I've had an opportunity to deal with — and believe me, I've looked at Hyper-V and VMware and KVM from our standpoint — I've found that KVM, on the Ubuntu platform that we're hosted on, is actually the fastest out of those platforms. But that's personal and empirical experience; I haven't studied it.

Just to qualify the 10x: we weren't strictly talking about hypervisor to bare metal; we were also talking public cloud to private cloud, with noisy neighbors and noisy networks. So that's a huge difference.

Does anyone else have anything to add about the choice of hypervisor?

Actually, on this 10x: probably part of the reason is that in an Amazon environment you have much less control over the actual Hadoop cluster configuration — you don't even really know, unless you're running your own cloud. In reference to EMR —

We took a pretty strict stance that we don't use any of those kinds of higher-level services — no EMR, no RDS. We built it all in-house, and that's what allowed us to actually move to OpenStack in the first place, right? Yeah. And, you know, that doesn't discount the value of something like Savanna or Trove — it was just a design decision that we made early on. In 2007 we had no idea that these things would actually exist; we were even predating EMR and RDS. So it was just a design decision for us.

So, anyone else have any experience with hypervisors? Mike?

Sorry — yeah, so we actually have similar results.
We use KVM on Ubuntu. A question for you guys — it came up earlier, but: why do I put my Hadoop onto OpenStack? I'll tell you my reasons, and tell me if they're valid or not. One: we work in a sort of, you know, DevOps-driven shop. The local developers are using Vagrant — poof, locally they can test out their algorithm; they can run it in test — poof, it instantiates in OpenStack. And this would be a small cluster with a small amount of data, so that we can review and test the algorithms. In a production environment, right now, on the hard-iron systems, you can get as low as 4 percent utilization of a server, and we just can't afford that many servers. So one of the things we were thinking was that OpenStack may allow us to make use of the, you know, nothing-nothing-nothing-MapReduce, nothing-nothing-nothing-MapReduce behavior — both networking, IO, and CPU. So a question there is: is that a valid assertion? It also gets a bit into the scheduler and the scheduling of all of this — is that something that would need to be expanded? And have you guys looked at how YARN and/or Mesos fit in?

Yes — I'll have to interrupt here, because we're actually running out of time. I know there's a lot more — five more minutes? Okay, good. So we've got time for an answer to that.

Well, I guess I'll answer that. I think the questions you're asking are, in general, why virtualize: consolidation and elasticity. I think OpenStack provides the ultimate elasticity — it allows you to rapidly build up workloads, automate, and then rapidly tear down — and I think that's what customers were asking for. When you're running a large cluster of machines and you're looking at that utilization, saying "well, how do I get better use out of this?", you want to be flexible with your workloads. And I think that's the strongest benefit OpenStack can give you: it provides that elasticity. The whole model is built around elasticity — having an API, integration based on Savanna — it makes it easy to do that. And so I think that's why you do it.

With overallocation on top of that, do you actually get the performance that you really want by overallocating, by really diving into it?

Well, if you're talking about machines that have very low utilization, then overallocation is not going to be a huge problem. But if you're talking about machines that have high utilization — you know, 50 to 70 percent — and you're allocating 16 to 1, that would be a much bigger problem. So that's a problem with an obvious answer; it's something that people have to be aware of, and I think that Ceilometer and the monitoring that's coming with OpenStack will help to alleviate that type of problem.

Other questions from the audience?

I'll make a last comment on this. The bigger the cloud is, the more of the cost-effectiveness benefits you get — it's kind of a fitness-club problem.
They subscribe, I don't know, 10,000 people on yearly, fairly cheap payments to attend a facility with maybe 500 people of maximum capacity, and the idea is that they are never all coming at the same time. So if you're building the system and all the workloads come at the same time, you're probably in trouble — it's ineffective either way, on bare metal or in the cloud. But if you have a bunch of different workloads, then statistically you can combine them in a way that, all of the time, they consume most of the capacity. That's the key, and that's why you actually need this cloud system in the first place. So it's not just a matter of distributing different Hadoop tenants; it's all the types of workloads.

Good. So again, are there any other questions — a last question from the audience? No? Then I'd like to conclude. First of all, thank you to the panelists for joining me for this panel, and thank you for all the questions. There were a lot of questions, and I'm sure we haven't answered everything, so, you know, feel free to follow up with the panelists after we finish.

If I may conclude, I would say that big data represents a relatively complex workload, and that type of workload requires a lot of choices in how you run it. If we look at the industry right now, and at OpenStack specifically, we now have more choices that can actually accommodate the type of workload that big data requires — from the choice of private versus public, to the choice of bare metal, which I think is a big difference in the case of OpenStack. And once we have that flexibility in the system, there are almost no restrictions on why we wouldn't choose to run big data on OpenStack. I think that's something we can all agree on.

Cool, and thank you very much. I think we'll come back.