All right, thanks everybody for coming. My name is Russ Lindsay, I'm director of solutions architecture for StackVelocity, and we're going to have a quick conversation today about getting your hardware ready for OpenStack.

And yeah, I'm Sean O'Connor, solutions architect for Mirantis.

So, a couple of quick things. What we want to do is go over hardware planning, hardware management, and the tools associated with getting ready for OpenStack, as well as monitoring once you're deployed. It's a high-level approach to the discussion: we're going to cover a lot of things, but not in enough detail to really do any one of them the justice it deserves. What we're not going to do is get into a deep dive on bare-metal provisioning, recommend particular tool chains, or recommend OEM versus ODM hardware. We're trying to keep this as strictly informative as we can, without injecting personal opinions.

A couple of questions to think about when you're planning your deployment. First, what is your workload going to be? The biggest factor in what your hardware setup looks like is what you're going to do with it, so really make sure you've spent adequate time planning what resources your hardware will need to achieve your goals. Budget is a big factor for most people; you can spend as much or as little as you want on any particular hardware deployment, especially with OpenStack, so be cognizant of that. And then efficiency, if you care about it: some people have green initiatives, or power limitations based on their data center location, so those are things to think about too. We'll also cover open-source tools, and some of the pros and cons between cloud-optimized, ODM-style hardware and OEM hardware.

The big thing about processors (again, we're covering these at a real high level) is to think about what you're going to get for your budget. You've got a wide variety of processors to choose from, with a large number of cores available out there. The high core counts from AMD are an option if you don't care about power consumption, but most people in the OpenStack world these days are settling on Intel processors. The key trade-off is clock speed versus core count. To me, the biggest thing with OpenStack is how many cores you've got to work with, so my typical recommendation is to pick a processor family and choose the lowest clock speed within that core range.

On memory, we're just going to quickly touch on the fact that you really want to look at optimizing it. Memory is a key factor in the performance of OpenStack, maybe even more so than in a lot of other applications.
So take a look at the memory count your VMs are going to need. If you calculate out the cores and memory for each instance and figure out the total usage across your machines, you want that to land at about 60% of your total memory capacity. You want that extra headroom to help your system operate efficiently, because the last thing you want to do is starve your operating system. If you end up needing more memory for the operating system and the underlying infrastructure than you've left room for, you're going to starve one or the other and start caching things out to disk, and that's going to hurt your performance a lot. (I'll show that math as a quick sketch in a second.)

Go ahead. [Audience question, inaudible] Personally? Yes, that's a common practice for some people, but it's dangerous: if you end up in a situation where you've completely disabled it, things can get unpredictable if you didn't plan accordingly and you're over-utilizing.

And don't plan for oversubscription. I know a lot of people recommend it, and for smaller deployments of OpenStack it's fine, if you're trying it out and want to see what you can do with it. But if you're really looking at using it for a production environment, oversubscription can really cripple you; it's just too unpredictable.

Network card selection: I'm not going to spend much time on this, but make sure that when you're looking at network cards, their drivers are compatible with OpenStack. Things have improved a lot with OpenStack over the last few years, and a lot of vendors are making drivers now, but you'll still occasionally run into things that aren't supported, so the biggest thing is just to check.

Hard drive controllers are kind of a Pandora's box of questions, right? Do you use RAID? Do you use HBAs? How do you manage your SSD caching? There's a ton that could be covered here, but from my perspective the key takeaway is this: the only time I would ever use RAID in OpenStack (and I'll make that blanket statement) is when you're mirroring your boot drives. Other than that, I would use HBAs across the board for all my storage, whether it's in the same server or attached through a JBOD. Direct-attach, one-to-one ratios give you by far the best performance.

Now let's look at hard drives for a second. The key takeaway is that in most cases with OpenStack, you're looking for volume of storage.
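Since that roughly-60% number is the crux of the sizing exercise, here it is as a minimal sketch before we get into drive types. The flavor mix and host size below are made-up examples, not recommendations:

```python
# Minimal capacity check based on the ~60%-of-total-RAM rule above.
# Flavor numbers and host size are hypothetical examples.

def plan_node(flavors, host_ram_gb, target=0.60):
    """flavors: list of (ram_gb_per_instance, instance_count) pairs."""
    planned = sum(ram * count for ram, count in flavors)
    budget = host_ram_gb * target
    verdict = "fits" if planned <= budget else "over budget"
    print(f"planned {planned} GB vs budget {budget:.0f} GB "
          f"of {host_ram_gb} GB total -> {verdict}")

# e.g. ten 8 GB instances plus sixteen 4 GB instances on a 256 GB host
plan_node([(8, 10), (4, 16)], host_ram_gb=256)
# -> planned 144 GB vs budget 154 GB of 256 GB total -> fits
```

The same bookkeeping works for cores: total the vCPUs across your planned instances and compare against physical cores, without counting on oversubscription, per the advice above.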
So for the mass-storage portion of your deployments, I would stick with SATA. SAS has a little better performance, but that's really not as important as the storage capacity, the bang for your buck, that you get out of SATA drives. For increasing performance, you're better off using more SSDs for caching: you can do your journaling on SSDs and then write out to your spinning media from there. As a rule of thumb, the recommendation we generally make is somewhere between one journaling SSD for every four to seven spinning disks. You'll get a little better performance if you stay at the lower end of that range, but the sweet spot on price is generally around six to one. (There's a quick sketch of that math a little further down.)

The last thing I've got here is consumer grade versus enterprise grade. There's a lot of discussion around this going on in the community, on and off. Everybody thinks you can go down to Best Buy, buy a bunch of drives, stick them in your machine, they'll work fine, and you'll save a bunch of money that way. Well, it's true, you can, but there's a big "but" associated with that, and it's reliability. If you've got a couple of machines in your back room that you're trying stuff out on, it's something you can probably get away with. If you're deploying at scale in a data center, even at the rack level, the last thing you want is to be out there replacing drives on a constant basis. One exception, maybe: if you're doing a lot of reads and not a lot of writes, I have seen some studies showing decent performance from consumer-grade SSDs. You can get decent life out of them that way, as long as you're not writing much, and the performance is good. As a cost-benefit trade-off, it's an option.

Now we're going to get into the nitty-gritty of what we really wanted to talk about today: what does it actually look like, process-wise, to get your hardware deployed? Some of these things everybody takes for granted; rack-and-stack and cabling is a prime example. You throw your stuff in the rack, plug some cables in, end up with a pile of spaghetti, and sure, it works, as long as you never have to maintain it. The big thing to think about, whether you do it yourself or go through a service (there are a lot of companies out there who will do rack integration for you), is making sure that when you set up your racks, everything is clearly labeled, easy to follow, and neatly organized. Especially label your cables. I can tell you from personal experience: when you go in to look at somebody's rack, there are no labels on any of the cables, and you're trying to trace one end of one black cable in a bundle this big to where it goes on the switch at the other end, it is not a fun thing to try to do.
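Here's that journal-ratio sketch. The 4:1 to 7:1 range and the roughly 6:1 price sweet spot are the numbers from above; the 24-spindle chassis is a made-up example:

```python
# How many journal SSDs a chassis needs at a given journal:spindle ratio.
import math

def journal_ssds_needed(spinning_disks, ratio=6):
    """One journal SSD per `ratio` spinning disks, rounded up."""
    return math.ceil(spinning_disks / ratio)

for ratio in (4, 6, 7):
    print(f"{ratio}:1 -> {journal_ssds_needed(24, ratio)} SSDs for 24 spinners")
# 4:1 -> 6 SSDs; 6:1 -> 4 SSDs; 7:1 -> 4 SSDs
```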
Anyway, back to cabling: it's something you just want to pay attention to and plan. It's not hard to sit down, write it out, make a spreadsheet, plan where everything's going, and put a label on each end of every cable. Even if it's just a "1" on this end and a "1" on that end, a "2" on this end and a "2" on that end, it makes your life a lot easier when it comes down to maintaining it.

I'm not going to talk a lot about airflow, but I think it really comes into play if you're looking at adopting white-box technology versus OEM. The OEMs have had a lot of time to plan, test, and build machines to suit a wide variety of use cases, and they've spent a lot of time optimizing and making sure everything works properly. When you go to the other end of the spectrum, you kind of get all ends of the spectrum, right? You've got people who have spent a lot of time planning and do a really good job managing things like airflow, and you've got people who haven't touched it at all: you open the box, you see a bare motherboard, and there are no air ducts, nothing. So essentially what I'm saying is, take a look. Open up at least the first golden sample you get from somebody and make sure there's an air duct, something that looks like it's managing your airflow, to help your performance and your cooling. Otherwise you can leave a lot of money on the table, especially at the data-center operations level.

And then, really from a deployment standpoint, firmware and BIOS management is a big thing, and it's something a lot of people get hung up on.
You're setting it up on your laptop and a bunch of VMs It's much different than when you're trying to do something at scale or you know even with just a few machines At you know at the rack level you've got to make sure that everything is configured the same every time You've got to make sure that you've Managed all of your drivers correctly all of your All of your firmware correctly all of your bios settings correctly everything always has to match if not you can end up with a lot of Unpredictable performance out of your machines and potentially chasing failures around in a circle trying to figure out what's different on this box versus that box so There there are some Well now there really aren't there's not a lot of good stuff out there for for this process There are some there's some tools that are available to to do provisioning and that sort of thing But really what it comes down to is when you're when you're originally laying out your layouts your designs You want to make sure that you use you stabilize on versions So if you pick version XYZ of your raid controllers firmware You need to make sure that every one of them has the same firmware on it and it's not always cut and dry you could order you could order raid controllers or knit cards from a vendor and You could get a shipment of 2,500 of them in and it's all in the same huge crate and 10 of them are different than the knees and 20 of them are different than that And so what you end up with is really a kind of a conundrum of trying to figure out how to make sure at all That's managed In my experience best way to do that is as you're going through your process of provisioning and burn in Just flash them you you can do one of two things you can do you can do a quick check to make sure that the version is right Or you can just do a blanket right It doesn't really matter as long as as long as you make sure that you are Verifying that those are the same on every machine And you know keep even if it's just a spreadsheet keep track of it Make sure that it's always the same and if you're buying machines from a particular vendor or through a particular systems integrator Make sure that they have that list that they're following it You you know every time you need to be checking when stuff shows up at your location You're verifying, you know at least spot checking that whoever integrated your machines for you Adjusted those firmware versions and things so that they're right Bios is another big one for that, you know There's a lot of people out there are still going through and manually changing bios on machines to try to make sure that Everything matches up. It's not the right way to go You'll you'll end up human errors is the the cause of a lot of people's woes so You know the best the best thing you can really do is just Get your golden sample get it set up You know dump a bin file off and flash that to every machine as it comes through when you're doing your process of testing and Burnin if it'll just save you a bunch of headache. 
You'll know every one of them is always the same and then That kind of brings me to the topic of burnin Which is also an interesting subject that a lot of people struggle with and they're not quite sure What the right answer is and the reason is is because there's as many options out there as there are people in the room There's people do this stuff all different ways all different methodologies My personal choice if you're gonna do it yourself is to utilize open source tools You know much like we're doing with the rest of our community support those guys You know Google released a great tool called Google stress. It's now called stress app test Does a really good job of stressing your your system and it dumps out a decent amount of log files tells you what went wrong You know at least at least gives you an insight into what's happening during that burning process while it's stressing all your equipment and then from From a hard drive perspective. There's there's a number of things out there FiO is my personal favorite probably it allows you to establish workloads of rewrite to the machines Again, it's open source tool you can use to to You know configure each drive to write and read a particular amount for a prescribed period of time There's a little ones out there thrash and some others But you know do do a little bit of research dig around see which one you like best But there are there are a few things out there that are key Biggest thing is the overall arcing test which you know, which the Google stress or stress app test covers most of and then You know key component testing So you want to also do things with your network test to make sure that you've got good connectivity across between your machines so Let's see Lin pack I don't really think of some others Anyway, there's there's a lot of stuff out there and if you guys have questions feel free to reach out to me later And I can get you a list and then The last thing probably most important thing to think about is managing your systems from from remote location it's it's It's one thing if you've got your machine sitting there in front of you and you can walk up your console And you got a keyboard and a mouse hooked up and you're able to sit down and work on some things That's generally not the case for most of us who are working things from a professional level We've got a one rack or ten racks or a hundred racks or a thousand racks in a data center somewhere and those things are They're usually a long way from wherever you're sitting, right? 
So you've got to make sure, ahead of time, that you have ways to deal with those machines from that remote location, whether that's setting up a DHCP server with PXE so you can load images and bootstrap things remotely, or static IPs that you're going to manage manually. There are a lot of ways to go about it, but the biggest thing is to make sure you have addressed accessing those machines from a distance. You need to be able to get to the networks where they're located, and then you need to be able to communicate with them.

That's where you can use tools like IPMI. ipmitool is a primary tool used in most of the large data centers in the world today, at least by anybody who's using a non-proprietary system. IPMI is a standard, established by Intel, that can be used for system monitoring, power management, fan-speed control; there are a lot of different things you can do with it. Essentially, it's your out-of-band window into what's happening with your systems. If you haven't started using IPMI, it's probably a good time to start, especially if you're thinking about moving away from anything proprietary. It's a universally accepted protocol that's used on everything from Dell to HP to white-box hardware. You can use the same interface for everything, and it really helps you solidify a platform across your entire organization, no matter what brand of hardware you choose to deploy. (I'll drop a small ipmitool sketch in shortly.)

The last thing on this list is becoming more important as things like Open Compute hardware become available. There's a trend toward optimizing hardware to remove operational cost, and in doing that, some things get removed, like video cards, or even base-level video off the motherboard. So you want to make sure you understand how to communicate with those machines at the local server level. If you walk up to a machine that doesn't have any video capability at all, it's pretty hard to work on even when you're standing in front of it. So think about setting up your toolkit for either Serial over LAN, where you can do remote management of the machine from another machine, a laptop, what have you, or for serial cable connections, just so you've got a process in place to work on that kind of hardware if you choose to adopt it.

And at that point, I'm going to hand it over to you. Yeah, thanks everybody. So again, I'm Sean O'Connor, and I'm going to take a look at two tools. As Russ was saying, there's really not a magic bullet out there for managing those four things: firmware, BIOS, RAID... and there's really not software-defined cabling, so we don't have to worry about that one. So I'm looking at Crowbar, and we're going to look at the Ironic Python Agent.

A little bit about Crowbar. Version one was started by Dell as an OpenStack installer. It was open-sourced in 2011, and it's actually still maintained by SUSE. Version 2 was completely re-architected, with a broader focus.
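Before going further into Crowbar, here's the ipmitool sketch promised above: a minimal out-of-band poll of power state and sensors. The subcommands (`power status`, `sdr list`) are standard ipmitool; the BMC addresses and credentials are placeholders:

```python
# Poll power state and sensor readings over the out-of-band network.
import subprocess

BMCS = ["10.0.0.11", "10.0.0.12"]        # hypothetical BMC IPs
USER, PASSWORD = "admin", "changeme"     # don't hard-code these for real

def ipmi(host, *args):
    cmd = ["ipmitool", "-I", "lanplus", "-H", host,
           "-U", USER, "-P", PASSWORD, *args]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

for bmc in BMCS:
    print(bmc, ipmi(bmc, "power", "status").strip())  # "Chassis Power is on"
    print(ipmi(bmc, "sdr", "list"))                   # fans, temps, voltages
```

The same interface gets you a console on the video-less, OCP-style boxes mentioned above: `ipmitool -I lanplus -H <bmc> -U <user> -P <pass> sol activate`.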
The idea was that you could deploy heterogeneous operating systems and install onto any hardware, with the notion that you'd be able to update firmware, configure RAID, provision, and do all those kinds of tasks. Dell stopped active development in April of 2014, but it's still community-maintained. If you're familiar with Rob Hirschfeld and his blog: he was for a long time Mr. Crowbar at Dell, and now he's kind of Mr. Crowbar on his own, and if you want to dig deeper, I would absolutely suggest his blog. I kind of look at Crowbar as an outside-in solution. It's not really focused on OpenStack; it wants to be able to deploy OpenStack, but it could do lots of scale applications, and it wants to work with a variety of hardware. If you look at what it's doing today, it's a somewhat limited matrix of hardware, and it's still very Dell-centric; it takes advantage of a lot of the Dell tools, WS-Man scripts, etc.

So here's kind of a logical architecture diagram. A lot of the original tooling was built in Chef. You've got this concept of "jigs," where you can take other tools and shim them in, so you can actually integrate with Puppet, and there's also the script jig, which SSHes in to do a lot of the hardware management and operating-system deployment.

The counterpoint to that is the Ironic Python Agent. This was developed by Rackspace; they did a really great session on it in Atlanta, and again, they've done a lot of great blog posts to dig deeper into it. It's the core of their OnMetal project. What it does is replace the PXE/TFTP deploy image with a Python agent that has an API; we'll see that in a second. It also has an Ironic Neutron plugin, and the goal of that is to actually be able to go out and configure the physical switches, so a lot of those virtual concepts, plugging in ports, setting VLANs, trunking VLANs, etc., are covered by that. This is more of an inside-out solution: it's born and bred as an OpenStack integrated project. They're focused on a specific OCP platform, so they have a predictable hardware footprint they're deploying to. It's somewhat more of a focused solution, but it's very targeted.

So here you can look at the diagram; it's actually easier to read over here than on the screen. You've got the Ironic PXE deploy driver on the left, and then you can see what the Python agent is doing. On the left, you're doing DHCP requests and TFTP across, deploying that image, then connecting back over iSCSI and doing a dd copy to configure the bare-metal server. With the agent, you're actually beginning provisioning at the very top: they take that first TFTP request and push a boot-agent image, and from there they have HTTP and an iPXE interface. So they've got a lot less primitive tools; they've raised it up a level and can do more in the agent. And from there, you can reduce some of those reboots. Instead of having a TFTP image, waiting for it to boot, pushing the image, and then rebooting again, you can do a lot more of that in one reboot and then hand it off to your end user. I don't know about you guys, but I've never said my servers boot too quickly, so being able to take that out is definitely a benefit.
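For a feel of the classic flow on the left of that diagram, here's a hand-wavy sketch of the conductor-side iSCSI-plus-dd step. This is not Ironic's actual code, just the shape of the sequence described above; the portal address, target IQN, and paths are invented:

```python
# The deploy ramdisk exports the node's disk over iSCSI; the deploy
# host logs in and dd's the image onto it block-for-block.
import subprocess

PORTAL = "192.0.2.10:3260"                       # ramdisk's iSCSI portal
TARGET = "iqn.2008-10.org.openstack:node-0001"   # hypothetical target IQN
IMAGE = "/var/lib/images/tenant-image.raw"

subprocess.run(["iscsiadm", "-m", "discovery", "-t", "sendtargets",
                "-p", PORTAL], check=True)
subprocess.run(["iscsiadm", "-m", "node", "-T", TARGET, "-p", PORTAL,
                "--login"], check=True)

# The remote disk now appears as a local block device; copy the image.
disk = f"/dev/disk/by-path/ip-{PORTAL}-iscsi-{TARGET}-lun-1"
subprocess.run(["dd", f"if={IMAGE}", f"of={disk}", "bs=1M"], check=True)

subprocess.run(["iscsiadm", "-m", "node", "-T", TARGET, "-p", PORTAL,
                "--logout"], check=True)
```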
So being able to take that out It's definitely a benefit So moving off to compute NONDA networking So I talked to a lot of customers who are coming into open stack and somewhat new to it and There's a lot of confusion about this like there's a lot of networks. So, you know, we've got you got cluster management You've got storage, you know private networking public networking Pixie networks, there's IPMI networks. So, you know, how do we take care of all of that? How do we make sure everything's plumbed correctly and how do we verify that? So we're not going to get real deep in the weeds on network design here But a couple of options you want to look at are you doing layer 2 or layer 3 at the top of rack? Layer 3 Definitely a great network design But you need to consider that you may have a lot of services that are layer 2 adjacent or that are expecting layer 2 adjacency And that you're wanting to spread those across racks. So you need to be able to consider how you're going to Be able to communicate with services between racks So M lag multi-tracy lag, you know as far as getting away from spanning tree at top of rack you know pretty much all of the Switches in some form or fashion have a multi-chassis lag there, so you can bond ports between two upstream switches Equal cost multi-pathing in L3 You know, you've got a few different options whether you're doing L2 or L3 OSPF or or trill depending on your your switches And then VLANs are you going to be doing GRE or you're going to be tagging all of these interfaces? And and what are those VLANs going to look like? How are you going to map those from the virtual networks to the physical at which network? So you can all combine which ones are tagged which ones are not tagged Hint pixies typically untagged And then if you are doing bonding, how are you going to deal with that with with pixie traffic? Are you going to bond it or are you going to separate out the pixie interface or do your switches support? Some type of force-up feature to be able to pick a primary interface to pixie boot off of And then top of rack aggregation, so there's a leaf and spine architecture So instead of having two large core switches actually, let's go ahead and come over to here So here's you know kind of three pretty common network designs on the left. You have a fully Layer two network with M lags You've got your services distributed across all of your racks. You've got VLANs And then you have two large core switches at the top It's great for about a thousand, you know in the thousands of nodes. You definitely will run into scaling issues So you move to the to the right and you've got a full layer three equal cost multi pathing with leaf spine switches So you've got your top of racks and you have all of those distributed out to multiple distributed aggregation switches at the top Again, your services are isolated by racks So you can scale to 10,000 nodes, but again, you've got to consider how those services are and then you can look at VXLan So now you you have the ability to run layer two over layer three. So now you can have your services sitting anywhere And scale Basically you get larger scaling and then you also can get away from your service isolation by rack So once once your nodes are up, how do you monitor your hardware nodes? Nagios is kind of the grand daddy in the open-source space for this Version one was released in 1999. So at this point, it's fairly mature. 
It can run agentless or with an agent; most of the functionality is in the agent. You've got a plugin architecture, so it's very extensible. There are plugins for database monitoring and application monitoring, you can configure and monitor network equipment with SNMP, you've got IPMI polling, and you can feed all of that data into display or graphing tools.

Kind of the next generation, and we're definitely not going to get into the religious debate about Zabbix versus Nagios, but I would say Zabbix is fairly mature too. It's newer; version one was 2004. It handles auto-discovery and auto-registration of your nodes, and it builds in aggregate graphing capability. It has a SQL database backend, and it's a little more resource-hungry than Nagios.

Next is log management. All these hardware nodes are generating lots and lots of logs, so how do we aggregate that, and how do we act on it? Splunk kind of invented this space, so you have to mention them; in their own marketing they called it "Google for your logs." If you're not familiar: it grabs all those logs and indexes them so you can search through them, and they have an entire app infrastructure, so you can add application-aware search and functionality. They've released a SaaS version, Splunk Storm. But it's closed-source, and there's a pretty large cost that comes along with it. What a lot of our customers are looking at now, as an open-source alternative to Splunk, is the ELK stack. That's a stack of three projects: Elasticsearch does indexing and search, Logstash collects and manages all the logs, and the visualization is done with Kibana. For any kind of real production deployment at scale, you want to separate each of those projects onto dedicated hardware, because they can be a little resource-intensive.

That's what I've got on the software tools for management. I think we're about out of time as well, so: any questions?

[Audience question, inaudible] So, we looked at Foreman. I see that most all of the tools, Foreman included, have RAID and a lot of those functions on the roadmap, but I didn't see where they have the tools actually developed at this point, and that's why I left them off. But yeah, our Fuel product too; a lot of them are looking at different things in terms of RAID, network verification, network configuration. It just tends to be something that's looking forward more than actually delivering today, at least on a limited set of hardware.

[Audience question] Sorry, I didn't hear that; what kind of platform? ... Yes. So I've seen the same thing: if you're managing Dell systems, you're going to use the Dell BIOS tools; for HP, you're going to use the HP tools; and then the BIOS manufacturers as well, between the different versions. Yeah, so I believe they take the actual vendor packages and script around them. You have to build an update tool chain for each hardware type, and then they orchestrate around it.
Yeah, that's something we were actually having a conversation about earlier today. There still needs to be support from each individual vendor at this point, because the problem you run into (and it's not necessarily a problem, but the proprietary nature of it) comes from the fact that with BIOS, for example, you might all start with an AMI BIOS, but then depending on what you're doing with your hardware, you may make customizations. In fact, you pretty much always make customizations to that BIOS to support whatever you're specifically trying to tune it to do, and that can break the generic tool for managing it. So what we were discussing is how the Crowbar initiative is kind of working toward having those tools all incorporated in one place. They're going to work with the vendors on each individual piece and include it, so that it's available to you seamlessly in the process.

Yeah, and in my opinion, I think having the tool set integrated with Ironic, so it gets community focus on it, could potentially build a back-end or plugin architecture like we see with Neutron and Cinder and things like that, and put the onus on the vendors to actually get contributing and build those tools into a common platform.

[Audience question] Not that I've seen. Anything else? All right, great. Well, thank you everybody. I appreciate you listening, and let us know if you have any questions.