Hey, we're on. Hey, everybody. We're here today to talk about OpenStack in production. Specifically, we have a panel. The panel format is super simple to manage, because everybody gets to ask questions at any time. So you could ask a question now. Anybody? No questions? Okay, I'm just warming you up. This is a panel of five OpenStack experts, all part of Hewlett Packard Enterprise, and we'll talk about running OpenStack at scale in real-life production. When we started to think about this topic, we were brainstorming, actually with Gavin, who is up here in front and who is our panelist 5b, or 6 if needed, about what the big core topics are in running OpenStack at scale. Of course, scale, however you define it, is one such topic, and we have an expert who will talk about that. Compliance and security are absolutely important topics for us to take into consideration. What does it take to run OpenStack with DR, disaster recovery, and business continuity in mind? That's yet another topic we will discuss today. And finally, networking and monitoring as well. So I will introduce... well, actually, I'll let the panelists introduce themselves and give them one minute to summarize what their area of expertise is. At this point, just let our panel talk, but keep track of the questions you want to ask, and what we'll do then is kick off with a couple of questions to the panel. We'll try to make it a really interactive format.

So let me ask a question to the audience first, because you're yet another kind of panel: esteemed colleagues and partners and customers. Who here runs OpenStack in production? Raise your hands. Just a few. Okay. Who here is interested in running OpenStack in production? Okay, a few more. What is everybody else doing here, then?
Just clarifying: installing and fixing. Excellent. So, just so our audience and our panelists know what type of folks are here: who here is a CxO, or would consider themselves an exec or manager, somebody that gives directions but doesn't necessarily roll up their sleeves and code, or plug cables in? Anybody here? No? That's kind of putting everybody on the spot. Excellent, thank you. That's very brave of you in the middle, appreciate that. Okay, now, who is an architect and actually thinks about deploying different cloud clusters? Thank you very much. One architect, two, three, four more architects, five. You can also raise hands at the same time; it doesn't have to be done one by one. It actually goes faster that way. How about developers that contribute to OpenStack projects? Thank you, thank you, thank you. That's awesome. See, they're raising hands at the same time; they actually know how to build things concurrently and when to collaborate. That's really important. Yeah, thanks, Brad. Okay, how about developers that do not contribute to OpenStack but are developers who build applications: ISVs, cloud service providers, system integrators, and so on? Okay, awesome. Now, a category that I may have forgotten or don't know how to categorize: maybe an IT professional, or somebody who is just learning, or just interested and just hanging out to listen to what HPE has to say. Okay, we got one. Thank you. Anybody else want to volunteer? Press? Nope. Okay, thank you. Fantastic. All right, well, thank you for that; it actually gives us a nice understanding of the mix of folks in the audience. So we're going to start off. Again, one minute each. Brad, you're going to kick it off. Tell us what you do and why you're here.

Hi, I'm Brad Saunders. I'm a director of engineering in Helion OpenStack.
I manage the metering, monitoring, logging, and backup/recovery teams. I've been with OpenStack for over four years; I started in Diablo and ran OpenStack at scale in HP's public cloud. We had about four thousand servers in that cloud, running in two different regions, and we started building monitoring tools from scratch in that time frame. We've come a long way since then. So monitoring is one thing that's pretty important to me, and being able to have the data necessary to run the cloud.

Thanks, Brad. All right, Fabrizio, you're up next.

Hey, Fabrizio Fresco here. I'm on the professional services team, let's say as a solution architect. We support our customers in deploying and customizing OpenStack. I am a core reviewer of the Freezer project; that's backup as a service. We started that a couple of years ago in HP, looking at the requirements from our public cloud, and it grew out of that work.

Great, thank you. Joy?

Yeah, hi. Hi, everyone. My name is Joy Dorairaj. I'm a senior product manager for security at Helion OpenStack. I've been with HP Enterprise for a little over, almost, three years now, and the last year and a half has been with OpenStack. I've been in the security industry for many, many years. So why am I here? People talk a lot about deploying OpenStack and taking it into production, and about the installation pains and so on, but what is the biggest barrier to adoption? Security. Security and compliance. Survey after survey, take any survey, whether it's the survey that we run at HP or the CIO survey that's done every year,
security is still the biggest barrier to adoption. So as customers begin to make the journey toward deploying OpenStack clouds, the biggest question remains: how do I remain compliant, and what security controls should I be thinking about? These are the questions that I think a lot about. I talk to a lot of my customers, I talk to a lot of my field sales people, and I interact with my engineering team and so on. So I'm here to share my views on what I've been finding so far.

Thank you, Joy. Swami?

Hey, my name is Swaminathan Vasudevan, and I'm working on HP cloud networking. I've been an active contributor to Neutron since 2013, and I have delivered some of the key features in Neutron, such as VPN as a service as well as DVR, the distributed virtual router. I'm here to discuss networking solutions and how we can actually scale networking in production with OpenStack. If you have any questions, please let me know. And I'll pass it on to Paul.

Okay, hi. My name is Paul Murray.
I'm a technical lead for Nova within HP, and I'm a regular contributor to the Nova project. As Brad said at the beginning, we used to run the HP public cloud. That was a pretty large system, and I'm one of the people that had to get up in the middle of the night if something was broken, log on to the machines, and fix things. So we've got some experience of things that go wrong with scaled systems. You might find that other companies have larger-scale systems, say 10,000 nodes; ours was only a couple of thousand. But one thing that was different about ours was that we used to run it as a single large system, rather than splitting it up into cells or anything like that. So, at least a couple of years ago, we were actually one of the biggest single systems out there. At the other end of the scale, I run things on DevStack on my desktop. We're now producing the Helion releases, and those are going out to other people that are going to deploy them in all sorts of different ways. So we're interested in the different dimensions things are going to have to scale out in. It could be geographically distributed with lots of small nodes; it could be all in one big data center; it could be different aspects of a system that need to scale, like how fast you start and stop VMs versus how many compute nodes there are. All these dimensions come with different kinds of problems, and I'm just interested in looking around at those.

Awesome. All right, Paul. So does anybody have any burning questions right now that you must get off your chest? Feel free to raise your hand. Don't be shy, members of the press. No? Not yet? Okay, good. Well, I do have a burning question, and since Joy has the mic... I was going to say, since Paul had the mic previously, and scale is in the description of the session, not in the title but in the abstract, let's talk about scale for a second. Is this an absolute concept? Is it like: you must scale, and here's how to scale?
Is it dependent on workloads? Does it depend on other factors? And how do all the different projects we'll discuss here, from Neutron to Freezer, for example, factor into that? I'm putting you right on the spot.

Okay. Well, one thing that's difficult about scale is: what do people mean when they say scale? Anyone who comes up and talks about it for themselves will give you something different. So scale may be how widespread the system is. If you've got 10 nodes at 50 different sites, can you manage that as a single system? That's something they may need to do. Alternatively, they may have to run a really small system. I'm saying things about small because people always think big; things have to scale down, too. I think the speed of deployment of VMs is another one. People often want to, say, create a thousand or 5,000 VMs as fast as they possibly can. That's a completely different problem from having a thousand compute nodes. Different parts of the system become bottlenecks, and weeding out those bottlenecks is what it's all about for us.

Makes sense. So footprint is an interesting topic. Does anybody here want to share their experience or questions specific to footprints? Specifically, what I'm interested in is: do you run two sites with a thousand nodes, or 200 nodes at each site? Do you run 50 sites with 10 nodes in each one? Different scenarios, different types of problems. Anybody want to share? Anybody have specific use cases? Because if not, we're going to go back to talking about the public cloud that we used to have, and Brad can share how we ran that as one system as well. Come on, don't be shy.
There's a question.

I'm interested in understanding more about deploying to multiple zones, multiple geographies, maybe even having different profiles in those geographies to deal with data regulations and in-country regulations.

All right, so it sounds like possibly a compliance and data-regulation type of topic, and also deploying to different zones. Who wants to take it?

I'll take the deployment part, and Joy can talk about compliance. It's pretty important that you have configurability in the way you deploy your cloud, and we've been working at HP, or Hewlett Packard Enterprise, on configurability of deployments. I'm sure there are some talks you can pick up to see that. But it's very important that your cloud be defined, region by region, in how it's going to be deployed, with flexibility, and with that flexibility you can use the tools in OpenStack very easily to deploy the cloud in different configurations. Specifically around monitoring, which is my key piece: we have automated tooling so that when you deploy your cloud to spin up a region, it will automatically configure the metrics and meters that you need and send them to the right place. So it's important to deal with those kinds of situations with the appropriate tooling, for both deployment and for setting up your metering and monitoring, such that it picks up the different configurations and mirrors them appropriately. And Joy can talk about compliance.

So, when you look at compliance in a multi-zone kind of deployment: first of all, when you think about security, there are two things that come to mind. One is, how do you mitigate your risk?
The big thing about security is risk mitigation. The second aspect of security is how to remain compliant. Obviously, when customers move from their virtualized or traditional IT environments into, say, an OpenStack cloud, they still have to remain compliant. So when you really look at it from a risk-mitigation standpoint, you have to put yourself in the shoes of an attacker. You have to look at different attack scenarios and then ask: across my entire stack, what are the possibilities? Where are the back doors? What are the risk vectors associated with it? When you really look at that, a couple of things come to mind: best practices in terms of hardening your stack end to end. Ultimately, you'll have to look at how you are hardening your distribution, how you are hardening at every layer. How are you hardening at the hardware layer, maybe doing attestation based on a hardware root of trust? When you get to the hypervisor, how do you reduce the attack surface? How do you ensure hypervisor integrity? Doing things like AppArmor or SELinux for your compute nodes, those are all really important from a security standpoint. And then when you go up to your OpenStack services layer, you have to think about how your users are accessing it. So you've got to have a good role-based access control (RBAC) strategy in place. You need to understand who the users are that are going to be accessing it, what the roles are going to be, and how your projects and tenants are going to be doled out between different zones.
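The role-based access control described here is usually expressed through each OpenStack service's oslo.policy rules. As a sketch only, here is what a few Nova entries might look like; the `cloud-auditor` and `operator` role names are illustrative, and the exact rule keys vary by release:

```json
{
    "compute:get_all": "role:admin or role:cloud-auditor",
    "compute:start": "role:admin or role:operator",
    "compute:stop": "role:admin or role:operator",
    "compute:delete": "role:admin"
}
```

The roles themselves are created and assigned per project in Keystone, which is also where the project-and-tenant separation between zones would be enforced.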
Then the other aspect is data protection. You have to think about what kind of data you're storing, how you're encrypting that data, and where the keys are stored. The key thing to remember from a data-protection standpoint is that you've got to be able to separate the keys from the data; never put them both together. So you have to look at how you configure your OpenStack cloud to do that. Barbican is a great example: you can configure Barbican to say, hey, go store my keys in an external device. You've got to think about that. And then your transmissions: you've got to encrypt your channels, you've got to encrypt your data in transit. And then you also have to have a solid audit trail, because you want to keep track of all your system-level activity, and you want to keep track of any type of privilege escalation. So in a nutshell, I would say that you really have to think about security from a defense-in-depth perspective. You have to think about it from different attack scenarios: how do you reduce the attack surface, where are all the possible threats, and how do you mitigate those with controls?

With respect to networking, I would recommend that you clearly segregate the L2 and L3 stuff. Basically, don't interoperate between the zones, and if you want to have some kind of high availability or tenancy, or if you want to create some kind of load-balancing service, just have it within the zones, so that you provide enough capability within each zone. If one zone goes away, then the other zone can come into play, but within the zone you provide that kind of high availability and isolation, so that you can clearly articulate that you have high availability in networking for the services you are providing.

All right. Well, any other questions? A follow-on question?
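The point about keeping keys separate from the data they protect maps to Barbican's pluggable secret stores. A minimal `barbican.conf` sketch, assuming a PKCS#11-capable HSM; the library path, login, and labels are placeholders:

```ini
[secretstore]
enabled_secretstore_plugins = store_crypto

[crypto]
enabled_crypto_plugins = p11_crypto

[p11_crypto_plugin]
# Vendor-supplied PKCS#11 library for the external HSM (placeholder path)
library_path = /usr/lib/libCryptoki2_64.so
login = CHANGE_ME
mkek_label = barbican_mkek_label
hmac_label = barbican_hmac_label
```

With a setup along these lines, the key material lives in the external device, so compromising the OpenStack database alone does not expose the keys.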
Somebody looks like they're raising their hand, but I'm not sure. Okay, please go ahead.

Wow, thank you. All right, security is maybe the number one roadblock to adoption, but I've seen some surveys that go into some depth here, where folks, especially developers, have been asked: how long does it take a project to go through the consideration phase within your organization? It's about a year. How long does the deliberation take once it's reached the executive desk? Another year. Once the project's been decided upon, how long does it take the initiation to get started, so that people start working on it? Another year. Then, what's the number one barrier to adoption? Oh, it's security. In the keynotes yesterday, we saw the need to sometimes change the corporate culture. I'm wondering whether you've considered that perhaps when people answer "security" to that question, it's because they're a little embarrassed to say what the real answer is, and the real answer is, well, procrastination.

Actually, that's a very good point. One of the things that we've been seeing, even within Hewlett Packard Enterprise, as our customers are moving to OpenStack: developers are actually the kings of the new world, right? They have the keys to the kingdom, because a lot of these clouds are completely REST-API driven, and a lot of enterprises are still absorbing that new paradigm, the API paradigm. They're trying to understand what it takes, from a security standpoint, to secure your code. Previously,
they were all on proprietary software; who really worried about... I mean, obviously patching is important, but there wasn't enough attention paid to doing a vulnerability scan of your release every release, or doing threat analysis every release. But I think what has happened in recent times, with OpenStack becoming prime time, coming of age, is that people have started to realize that secure code development practices are key. They're critical from a security standpoint, especially when you look at open-source software. So from that perspective, I think awareness and training are essential, and I think you hit the nail right on the head there. One of the things that we are doing at HPE is that we are actually conducting a lot of developer security-awareness trainings. We have a security team, a dedicated team of experts with many years of experience running a public cloud, coming from top-notch companies like AWS and Microsoft and so on. These guys literally go out to our development teams and walk these developers through secure coding practices, so that developers don't fall into bad security practices: capturing passwords in clear text and putting them in config files, leaving buffer overflows, SQL injection vulnerabilities, and things like that. So I think that more and more awareness is critical for breaking that barrier to adoption. And in fact, Gartner also did a recent talk on security.
They said very clearly that your data cannot be more secure than it is in a public cloud. They said the kind of controls that public cloud providers provide today is actually phenomenal, and they've said that you really have to change your perception from that perspective. I don't know if that covers it, but I think the key thing here is that training and awareness are critical, and you have to understand that OpenStack is a new paradigm.

Well, realistically, barrier to adoption also means that you need to define what you are running and how you are running OpenStack. Is it at production scale, or are we running at dev/test scale? So, talking about production, and switching gears a little bit: what are the approaches to ensure that your production environment doesn't go down? We're talking about DR a little bit, and I'm going to put Fabrizio on the spot a little bit. I know we're in the Textile Building across the street, where HPE has a kind of off-site lounge, and I had some customers stopping by, and at least two or three customers brought up this theme: okay, so how are you guys approaching your backup and DR solutions with Helion OpenStack, and what are the ways that you architect it? You are in the field and working with a lot of our customers: what are the different requirements for the architectures, and how does Freezer, your project specifically, help with that?

So, in OpenStack, Freezer came about, out of the box, for backing up the infrastructure. There are a few critical points inside an OpenStack deployment. The MySQL database, for example, is a critical thing; it could happen that mistakes happen.
I mean, it's redundant, because it's a cluster; you can lose one node and nothing will happen. But take a human mistake: you delete one table, and your cloud is not working anymore. So we're taking care of that out of the box. We even implemented new functionality: the default backend for Freezer was to back up to object storage, but that's not good from a disaster point of view, because if you lose your Keystone, you will not be able to access the object storage anymore, and you cannot restore your data. So we added new backends that are mainly useful in these use cases, like an NFS share attached to a SAN, or an SSH node attached to a SAN, where you store your critical data. That's how we mean to, let's say, protect our infrastructure. Then Freezer, obviously, is backup as a service, so you can deploy the agents inside VMs, and the users of the cloud can do the same thing that we do for the cloud.

So you're talking about deploying agents; there are also, I guess, agentless approaches. And building on what you talked about, which is running dedicated control planes: Helion OpenStack at a minimum runs across three nodes, right? At the same time, we were running public clouds, so I'm going to ask Brad: what are your findings about running really large-scale production, so thousands of nodes? Specifically tooling, how we are evolving our monitoring approach, and lessons learned from being able to see what's happening, from the underlying cloud fabric to the actual hypervisors to the actual VMs. Talk to us a little bit about the holistic approach.

Sure.
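A Freezer job using the SSH storage backend described above, so that the control-plane MySQL backup stays restorable even if Keystone and the object store are lost, might look like this sketch; the host name, paths, credentials, and schedule are all placeholders:

```json
{
    "description": "Nightly control-plane MySQL backup to external SSH storage",
    "job_actions": [
        {
            "freezer_action": {
                "action": "backup",
                "mode": "mysql",
                "backup_name": "controlplane-mysql",
                "path_to_backup": "/var/lib/mysql",
                "storage": "ssh",
                "ssh_host": "backup.example.com",
                "ssh_username": "freezer",
                "ssh_key": "/etc/freezer/ssh_key",
                "container": "/backups/controlplane"
            },
            "max_retries": 3
        }
    ],
    "job_schedule": {
        "schedule_interval": "1 days"
    }
}
```

Because the destination is reached over plain SSH rather than through Swift, a restore does not depend on any OpenStack service being up.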
As I mentioned, I've been with OpenStack for four years and started doing monitoring at that point. We actually built internal systems to do large-scale monitoring, and as a result of the architecture we created, for both large-scale monitoring of our undercloud and control-plane workloads and also monitoring as a service, we combined that into a single technology that's now in the OpenStack big tent, called Monasca. Monasca stands for monitoring at scale. It's designed to have the capability of handling up to a hundred thousand alarms a second, or a hundred thousand metrics a second, into the alarming database. So it's well positioned to scale to clouds of very large sizes. The key thing about Monasca and monitoring at scale, though, is the ability not only to watch the control plane and the undercloud, but also to watch the VMs internal to that system. It combines performance monitoring as well. One of the things we learned as we were running monitoring at scale is that there are many different pieces of data that work together to provide the picture of what's going on in your system: performance monitoring is a piece of that, event monitoring is a piece of that, logs are a piece of that, and metrics themselves are a piece of that as well. So we designed a tool that can combine all those pieces together into a single place. That gives you, first off, a single pane of glass for all those metrics and meters, plus the ability to do alarming, both simple and compound. Compound alarming is critical.
It allows you to do complex triaging with the alarms themselves, rather than having to do that triaging when you see multiple alarms coming at you at the same time. So you can actually look at what scenarios might happen in your cloud, predict those scenarios, and create alarms and compound alarms to tell you when a scenario occurs, so you can react to that scenario rather than react to many alarms. That's a key learning we had as we were running our cloud at scale. The other thing I mentioned earlier is the importance of automating. When you configure your cloud, grow the cloud, shrink the cloud, or change the configuration in any way, you want to automate the ability to set the alarms appropriately, so that people aren't configuring your cloud, or adding nodes, in ways that aren't alarmed or monitored. That's also a key piece of learning that we came to as we were running it. And the other key thing, which is simple but a lot of people forget, is that every alarm should have an action. We shouldn't be putting alarms in the system that have no actions, because our operators will see those and not know what to do. And it's very easy to do that; it's very easy, as you start to build and scale out, to get a lot of alarms in the system and really not have actions associated with them, and people get overwhelmed when they see many alarms going off. So it's critical, even though it's simple: have an action associated with every alarm. That's a good reason also to have compound alarming, because you can reduce the number of actions you actually have to have.

Awesome. So, just a time check: we only have about nine minutes left, so this is a great opportunity to ask questions. And thank you, there's a question here.

I have a question for Fabrizio on the Freezer project, on the backup and DR.
Is there any specific reason why you chose to make Freezer a separate project under the big tent, rather than going the route of extending Cinder, which would have been a natural fit, I would think?

So, the backup part in Cinder is kind of a perfect example of a bottleneck. That came out of our experience in the public cloud again, because with Cinder you can only back up an entire volume. That can be two terabytes, when your critical data could be only a few megabytes. And we had a lot of performance problems inside the public cloud, because usually the backups all start at the same time, triggered during the night, and you have a lot of data that's going to be processed inside your control plane, because it's the Cinder backup component, which runs in the control plane, that is doing all the compression and all the work to create the backup. Exactly for this reason, one of the main goals that we set ourselves in Freezer was to be able to scale. For that reason, all this kind of processing is done by the agent that runs on the node, physical or virtual, where you are taking the backup. So if you scale out to thousands of nodes, it's those nodes doing the work, not your control plane.

Thank you. Any others at this point? Time is ticking; only five or six more minutes. If not, I'm actually going to...
If not, I'm gonna actually Yeah Panelist number six So one of the other important things to think about opens back in general is that when you go to cloud You almost think of the control plane is a traditional app in that if the control plane goes down You're completely hose in the traditional app world, you know ten years ago the control plane mattered last It's all about you know, I have a number of machines that are very reliable I might use things like the emotion to make sure that those things never go down That's not the case in the cloud and so one of the important goals that we had with freezer And in general with our you know holistic backup dr approach in healing open stack But again, we open source it so as part of the community is you have to make sure that control plane is protected So it could be things like passwords. So if I don't you know, you don't lose access to your cloud You know, I mean God forbid, you know what the whole control plane goes down your toast And so we felt it was important to architect something that in many ways was above and separate from something like Cinder because in many ways Cinder kind of runs in the cloud and of course part of the control plane but we needed something that You know could be so somewhat independent Thanks, Gavin. Thanks for joining us All right, so I actually did want to pick up on something which I think to talk about and and I heard some themes which is and this is just my layman's question for you is Scale and is running open-stack at production. Is that workload specific like can you Tell us about that and also maybe swami also talk about that from a networking standpoint It is and I'll give you a good example of that. We didn't practice this ahead of time. 
No. So CI systems are a good example. We had a whole bunch of equipment that had been set up as a cloud before and had been running proper workloads. It got decommissioned, and we were going to take it over for a CI workload. It was using a storage area network, and it had been working perfectly well as a cloud. But when we put it up as a CI system, we found that you get an awful lot of data being transferred around, and VMs not doing an awful lot afterwards. They run a few tests and then stop. You spawn something up, copy images all over the place, run some tests, collect all the reports and the logs and everything that were produced as part of that test, and ship them off somewhere else, so that you can look at them later if there was a failure in the test. The consequence of that was that the storage area network wasn't specced for that quantity of data being shifted around, and it got saturated. The result of that is that everything seems to be working fine, but within the VMs, they start to behave as if their disks are full. They're getting read errors and write errors, and all sorts of things start to fall apart. So the point of that is: how you spec out your equipment to match the amount of I/O that's going on, the amount of memory, the amount of disk that's going to be required, the network bandwidth.
It's all very critical, and you can get things skewed the wrong way and your systems out of balance, and then you're going to hit something that's not going to work. You build a VM that's nice and big, but it can't talk down the network, because everyone else is trying to do the same thing, and as a result you've then invested in a whole lot of memory that's not being used. So: understand your workload, give appropriate quotas and limits in your system, make sure that requests aren't coming in too fast, and get the balance right, so you're not spending money on resources that don't get used.

Thanks, Paul. So, one last thought from each of the panelists, starting with Swami.

So, one of the things I wanted to add on to what Paul mentioned, from a networking perspective: you should also consider how many networks you are going to define, and how many routers, if you're running L3 networks within your cloud. How many routers are you going to provision on a node that is going to service a number of VMs? You place them in such a way that you evenly distribute the VMs based upon the resources in the compute nodes, so that you can bring up the VMs and they will be serviced by the routers and networks. And if you want to scale at a higher level, like probably 8,000 or 5,000 VMs at a stretch, there are different options to go through, but you can actually go with distributed virtual routers in order to avoid the control-plane issues. The routers are created on demand.
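Enabling the on-demand distributed routers described here comes down to Neutron's agent configuration. As a sketch of a Mitaka-era layout, with illustrative file paths and values:

```ini
# /etc/neutron/neutron.conf on the controller:
# new tenant routers default to distributed, with HA
[DEFAULT]
router_distributed = true
l3_ha = true

# /etc/neutron/l3_agent.ini on each compute node:
# host the router namespaces for locally running VMs
[DEFAULT]
agent_mode = dvr

# /etc/neutron/l3_agent.ini on the network node:
# centralized SNAT for traffic from VMs without floating IPs
[DEFAULT]
agent_mode = dvr_snat
```

With floating IPs assigned, north-south traffic leaves directly from the compute node, and only SNAT traffic crosses the network node, which matches the separation recommended here.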
They're not going to be created on all the nodes at the same time. Whenever a VM pops up, that's when the router will be created on that compute node, and then your traffic will be routed. Also make sure that you are provisioning the nodes with the right network connectivity. If you want north-south connectivity to the external network, provision them with north-south connectivity, so that you can actually assign floating IPs to your VMs and pass traffic out from the compute nodes, rather than forwarding everything to the network node with SNAT. Unless you really need SNAT, you can just use floating IPs on the VMs and configure the network that way. Also, if you want high availability, make sure that you configure it in such a way that if you are just using regular routers, you go for L3 HA, and if you are using DVRs with single-node SNAT, you can still use the DVR SNAT HA that is available in Mitaka. And on router scaling, as I mentioned: with DVRs, the performance within a compute node will be high enough, so if you are deploying a highly available solution with higher throughput, just make sure you place all your VMs that require higher performance within a single node, so that you can actually achieve that higher performance.

Great. So, any last thoughts from anybody else, or last questions? All right. So I actually thought this was really useful, at least to me. What I learned here is that, from networking to scale, monitoring, DR and BCDR, and of course security, all of those are big considerations, and you can't really take one topic single-handedly and say that's the most important one. They're all interwoven and interconnected, as the experience here has shown. I appreciate everybody's time. Up next,
I believe we have a really cool format, which is an Ignite session, where in five minutes each, four different speakers will be delivering a topic really fast, at a rapid pace. But for now, I just want to thank our panelists for taking their time, and thank you all for coming over. Let's give them a round of applause.