Hello, everyone. My name is Mike Kedera, and welcome to our presentation on building a flexible OpenStack cloud from the ground up. I've been with Intel for almost 19 years; it'll be 19 years next week. I got my start in our Data Center Group as a software engineer, but a lot of my time was spent in Intel's IT environment, where I deployed a number of enterprise applications, and in our engineering group working to deploy OpenStack as well as platform-as-a-service solutions like Cloud Foundry. Now I'm in Intel's Open Source Technology Center, on a customer engineering team where we work closely with customers to enable OpenStack, as well as anything in their data center environment around mission-critical Linux, networking, and storage. I'm here with John and Leong, and we're going to walk through an environment that we built for our developers.

My name is John Geyer. I've also been with Intel for nearly 20 years, off and on, and I've taken a few different jobs. I started out in BIOS validation in the server realm, with a lot of early platform alpha work, and moved into lab and data center management and design. Recently I joined Intel's Open Source Technology Center with Mike, working as a technical marketing engineer.

I'm Leong. I'm actually quite new to Intel; I joined last August. I have a mixed background in application development and infrastructure provisioning. I spent the past 16 years developing peer-to-peer and distributed applications, as well as doing infrastructure development. I obtained my PhD in 2013 on multi-cloud orchestration. Before I joined Intel, I worked in IT for an insurance company on their next-generation cloud platforms. I'm also an active member of the Enterprise Working Group and the Product Working Group, trying to help people understand how to jumpstart their OpenStack journey.

Now, all of us had different roles on this project. I started at the beginning, working on the scope of the project and some of the initial architecture. John was also a big part of that architecture discussion and design as we went through all of our requirements and constraints in the environment, dealing with what we could and couldn't do and how we could come up with something that is very flexible for our users. And Leong is basically our end user, but very much an architect as well, and he's going to show a little more about what he's using the environment for.

As you entered the room, we were showing you an architecture diagram. That's actually our single scalable rack design. We're not going to go through every little bit of it today; there's just not enough time. What we want to do is talk about the project itself. We'll start by talking about all the options that are available as you design your cloud, and from there we'll look at the solution options we had available and some of the use cases and constraints from our users.
Then John will talk about how those solution options came to light, how the rack design came together, and how we can actually scale with it. And at the end, Leong will show you what we're doing in the cloud today.

So, a little bit about the cloud: what do you want it to do? There's an endless number of options for how you configure it, but the best designs always start not by looking at the hardware first, but by looking at what your users need to do in that environment. How are they connecting to your cloud? Are they going to be using laptops or their phones? I know when I first got into Austin, Texas, the first thing I did was, of course, find my hotel room. But next it was: where's the best place to get some barbecue? I knew my Yelp app would come up right away, and I expected it to. Are your customers going to have that same demand on your environment? Beyond that, there's the growth of everything coming with IoT devices and scalability: is your cloud going to need to support that? These are all pieces we have to consider as we build out our cloud.

Then you have to start looking at how you're going to monitor those services. Are you charging back? How are you going to bill for them, or is it more for your own company's use? If you're running financial applications, that can really change things with SOX compliance, and perhaps other compliance you need to manage around entitlement and how people will access those systems. And underneath all of those applications, you start to look at how it's all brought together with runtime environments, identity management, and big data, which could very much be part of it as well. One of the things I remember even from my time in IT is how much work we spent on asset management and knowledge base management: just knowing what's going on, not only how the physical components are operating, but also the virtual environments. You need to keep those patched, and if a customer calls you, you need to be able to trace down every piece of that tenant. All of these pieces come together in building and structuring what the compute, storage, and networking can be.

Now, why build a flexible environment? There are so many options available for your customers' applications, and the cloud and your users are not rigid. You need to look at everything your customers are doing today, as well as growth for tomorrow, to build something that's flexible for their business. Our objective within this environment was really to help promote and further OpenStack, looking at its growth and stability and its ability to take on the workloads of tomorrow. To support that, our environment had to serve our global Intel OpenStack developers and architects who are working in the community to improve OpenStack for the enterprise. Now, I don't want to confuse this with Intel's IT production environment; they have a different customer base and different needs. But Intel IT provided great input, requirements, and guidance as an enterprise user, both for the project we have here and in the community as well.
Our IT environment is actually a downstream user of all the work we do in the community within this lab, as well as the work that many of you do. What we're doing in the environment, of course, is developing, testing, and pushing code to the OpenStack community. But beyond that, we also need to enable new features in Intel's chipsets. There are things we're doing with the new Xeon chips, as well as things we'll be doing with storage; there's a lot of great work coming there. We want to make sure we can test those use cases and that we're enabling them. There's a lot of work we do in the Linux kernel to enable these features, but it keeps going up the stack: we're working on the hypervisors as well as orchestration within OpenStack. We really want to make sure that when our customers come to use this, it's easier for them to implement. If it's there and you don't have to think about it, it adds value to the platforms you're using.

Our environment had to be scalable, and because the environment is global, we don't want somebody in the lab swapping out cables and moving systems around for various configurations. We wanted it all to be done on demand. So things like KVM over IP are very important, as well as all the software-defined infrastructure pieces that come into play here. All of this feeds into how we manage our environment. It doesn't stop with what our users are doing; it's also how we manage it once it's set up. That's the day-two operations of keeping this environment in service and running, and then extending it a little further into our solution options, what kind of facilities we have, and who's going to have physical access. All of these things weigh into our solution options.

Now, when you look at solution options, everyone always thinks first about the budget. But as we were just discussing, it's much more than that. You really need to look at the capabilities of your data center itself. If you're on a raised floor, how much weight is going to be on each of these racks? Do you need to bring in a battery backup? If so, those things, as you know, are pretty heavy; I don't know if you've ever carried one, but you definitely need a lot of support while bringing it in. All of these considerations help tailor what you'll be looking at within the environment.

What we were looking for was a scalable rack design that would let us implement the initial features and grow into an enterprise-class environment. As we started off, I mentioned looking at it from a budgeted model: what we could do within one single rack. For our development teams, that meant the initial provisioning of OpenStack, getting things set up and running for general functionality, and making sure all the services run as expected. But we could also start doing high-availability testing by making sure failover works, which is one of the use cases our customers are looking at as well. Of course, with a single rack we were limited to an individual availability zone. But as we looked at growing a little further and what we could take on for our users, with multiple racks we could start truly scaling and testing some of that out in the environment.
Rolling upgrades are a very important piece we're working on for the enterprise, and we could also simulate dual availability zones, introduce some network lag, and run tests to see how things operate if part of the environment were impacted. Then we can do operational high availability: if you need to patch systems and take nodes offline, say for a BIOS upgrade, you need to move your VMs off. How fast does that happen? If you can imagine rolling out a BIOS update to a hundred or more nodes while bouncing your virtual machines all over, it can take a long time. So part of our goal is to see how fast we can do this and how we can improve those features for enterprise users. Extending it a little further, we look at the enterprise class and go into multiple racks; that's where we're growing to now. We have multiple availability zones and cells that we can use and test in the environment. Rolling upgrades can be a big part of what we're doing, making sure the data plane and the control plane stay available. And scaling is very important: making sure we can push the limits of the OpenStack services and identify bottlenecks that we can work with the community to improve. As you'll see when John goes through the design, there are more opportunities for flexible implementations of software-defined networking, as well as storage options.

So I'm going to get into the nuts and bolts. I'm not going to talk about software; this part is all hardware. When this started out and we had come up with our different solution options, we hadn't identified a facility. That tends to throw a wrench into the works, because you don't really know what you're planning for. So we decided to just aim for the ideal, and what we came to was a rack with 30 servers and an in-rack battery backup. That was the ideal; we really wanted to hit that mark. But then, as things came along, we had to start looking at what we were actually going to have available, so that when we did identify a facility, we would know what we were aiming at.

There are a few things you need to look at when you start to pick out your hardware. One that a lot of people miss is exactly how much power it is really going to draw. You can look at it from a couple of directions. You can go to the manufacturer and actually do the math on all these components: okay, we're going to draw this much, we've got this many hard drives, it's going to be this many amps, and what voltage is it going to run on, typically 208 volts? Or you can build a model and do some real-world tests. We chose to do something in the middle: we took the manufacturer specs and did the math, and we also set up models and ran them to see what we were actually drawing. In the process, we found that we needed to pad that figure a little, because if we want to upgrade down the road or make some changes, whatever our power figure became, we needed to add a little cushion. The other thing to consider is your startup load versus your maximum workload. Sometimes those will be very similar; sometimes they won't, and it really varies depending on the hardware you decide to choose.
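To make that power math concrete, here is a minimal sketch of the kind of back-of-the-envelope estimate John describes: manufacturer wattage figures per server class, a fixed allowance for networking and management gear, and a cushion for future upgrades, all checked against the available feed. The wattage numbers, the 15% cushion, and the management allowance are assumed values for illustration, not the actual figures from this lab.

```python
# Rough rack power budget check -- illustrative numbers only,
# not the actual figures from the lab described in the talk.

VOLTAGE = 208            # typical data center circuit voltage (V)
HEADROOM = 0.15          # assumed cushion for future upgrades

servers = [
    # (count, assumed watts per unit at maximum load)
    (5, 350),            # 1U light storage/compute nodes
    (10, 500),           # 2U heavier storage nodes
]
network_and_mgmt_watts = 600   # switches, KVM, PDUs (assumed)

total_watts = sum(count * watts for count, watts in servers)
total_watts += network_and_mgmt_watts
total_watts *= (1 + HEADROOM)

amps_needed = total_watts / VOLTAGE
print(f"Estimated draw: {total_watts:.0f} W -> {amps_needed:.1f} A at {VOLTAGE} V")
print("Fits in a 60 A feed" if amps_needed <= 60 else "Exceeds the 60 A feed")
```

The same structure works for comparing startup load against steady-state maximums: swap in the startup wattages and re-run the check.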
Given that, your worst-case scenario could be that all of your servers, all of your switches, everything is running flat out and drawing maximum power: can you handle that? Moving on to density, this is where you need to involve a facilities manager who has information on how much cooling is available, because if you don't have the cooling, you don't really have the power, regardless of whether you're wired for it. You're going to run into problems at some point if you exceed any of those maximums. Then, according to your budget and the space you have in the rack, you need to start planning your networking, because you're not going to want to throw just one switch in there; we're talking OpenStack, so we're going to need multiple NICs per box. How far do you want to go with that? And then there's your footprint. As Mike mentioned, if you're on a raised floor, you may not have a whole lot of weight to work with, so you have to start weighing all these components. And when you look at battery backup, as he said, those can be very expensive weight-wise; we ran into solutions that were upwards of two tons.

Once we did find a facility, it was an older one that had recently been rewired for additional power, giving us some capacity, but it had some other issues, which we'll talk about as we move through the actual hardware that we chose. When we were able to identify how much power and cooling we had, we found we had enough capacity for 15 servers per rack, which is half of what we wanted. We also found that the battery backup was not going to work out as planned, and we'll touch on that in a second. So we ended up going with five 1Us and ten 2Us: the 1U lets us do some light storage and compute, and the 2U, having a number of extra drive bays, can handle much heavier storage. However, keep in mind that sometimes in a 1U you will not be able to run a top-bin processor; you'll run into heat problems in some of them, so be aware of that when you choose them. Our solution didn't have that limitation.

For your management infrastructure, the KVM, as Mike mentioned, is pretty critical. Since we had 15 units in the rack, we chose a model with 16 ports, giving us a little room for growth, which we ended up using at one point. We also found that our power was going to be limited to only about 60 amps for the servers. In this solution, we went with a PDU that was network-controllable, so we could power-switch each node, each switch, whatever we want, remotely from anywhere in the world. That also provided some management data: we could track how much power we were using under certain circumstances. For our network infrastructure, we chose a single 48-port 10-gig switch. With 15 nodes, that gives you quite a bit of flexibility for VLANs; you can bounce things around quite a bit, and since it's all done on the fly, if you're pre-wired, you can actually change this from anywhere you can connect to the rack. We then used a pair of one-gig switches: one to give the KVM and the PDUs their own private network, and the second for the BMC and out-of-band connections. As this all worked out, we found we had enough space left in the rack to stay flexible and move to a 4U or an E7-based system at some point down the road, should we decide to change things.
And with our power budget, we left enough of a gap that we have some room to grow as well. One of the things we ended up adding was a bastion server we weren't planning on. We also found that the battery backup solution we chose was big enough to handle three racks, so as we buy the next two racks, they won't need their own battery backup; we can just share. At that point, we had pretty much reached the hardware we wanted, so we decided it was time to bring in Leong and transition over to the software end of things.

Okay. So this is the lab that we have today. It's an overview, and it may look very complex to some of you, so I'll try to explain it piece by piece. If you look at the bottom here, we have all the rack-mounted servers. Every rack-mounted server has the KVM and the PDU attached to it; we have 15 servers per rack, and all of them are connected to the KVM and to the PDU. For remote users, that means they can control the servers remotely, change BIOS settings, and power-cycle servers, which is very convenient. On every server we're currently using three of the NICs, and every server here actually has the capability to support six NICs. But because of some of the constraints — right now we have one rack and one 48-port network switch — with 15 servers you can go to a maximum of three NICs per server, because that adds up to the 45 ports we have available. If we want to use an additional three NICs per server, we can do that, but we'd have to add another switch at the top of the rack.

Every server also has a BMC, which lets users do things like IPMI operations. We have a separate switch dedicated to the KVM network; that KVM network connects to the PDUs and the KVM, and because the KVM-over-IP and the PDUs have a web interface, we provision that subnet so people can come in and access them. There are two KVMs, across two racks, that are actually connected to each other on that KVM network. And as I mentioned, because some users want to use the IPMI features, we also have a separate subnet dedicated to the BMCs.

At the top of the rack is the high-end switch, and that top-of-rack switch has the capability to support multiple VLANs. Right now we basically use three VLANs, and depending on the use cases we want to support, we can change them or add more. All of these VLANs connect to the NICs we're using today. One thing about our use cases, as Mike mentioned earlier: what do you want to build this lab for? Generally, the lab we're building is to support our OpenStack developers and their upstream contributions. Some of that work requires physical node access, which is why we wanted to build out all of this PDU and KVM capability. And some use cases also need to connect to our internal networks, which I'll come back to in a moment.
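As an example of what the dedicated BMC subnet enables, here is a minimal sketch of remotely checking and power-cycling a node over IPMI using the standard ipmitool CLI driven from Python. The BMC address and credentials are placeholders, not values from this lab, and ipmitool must be installed on the admin host.

```python
# Minimal sketch: remote power control of a node over its BMC using
# ipmitool. The BMC address, user, and password below are placeholders.
import subprocess

def ipmi(bmc_host, user, password, *args):
    """Run an ipmitool command against a BMC over the LAN interface."""
    cmd = ["ipmitool", "-I", "lanplus",
           "-H", bmc_host, "-U", user, "-P", password, *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# Check the current power state, then power-cycle the node.
print(ipmi("10.0.20.11", "admin", "secret", "chassis", "power", "status"))
ipmi("10.0.20.11", "admin", "secret", "chassis", "power", "cycle")
```

The networked PDU described earlier covers the same need at the outlet level for gear that has no BMC, such as switches.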
For the use cases that need our internal network, we actually have a separate 10-gig link to give them access, and that also gives us some redundancy on the network. That network is also connected to our KVM, so people can come in from the internal network to reach the KVM and PDU. The one constraint on that network is that if any server in the lab wants to get out to the internet, it has to go through the corporate proxy and corporate firewall. Some of the development work our users do runs into issues with the proxies, because we're still doing development testing, and they want to avoid those proxy issues. So we set up another link that allows users to go out through a DMZ zone. One of the limitations on the DMZ, because of the constraints we have in the lab, is that we can only provision a maximum of a 100-megabit link. Even though the bandwidth is limited, it at least gives the users and developers an option to reach the internet without going through the proxy. Due to our security policy, we have to segregate these two networks. If people want to access the lab servers from that side, they have to come through the DMZ, where we have bastion hosts, VPN, and a firewall set up, and they cannot reach that part of the lab through the internal network, because those two networks are segregated. The benefit of that path is that it lets them go out to the internet for testing without the proxy issues, which gives the developers the flexibility to run different types of workloads.

In terms of storage right now, because we're deploying an OpenStack cloud, the storage backend for Nova, Glance, and Cinder today is basically just RAID configurations and LVM. We're looking in the future to include Ceph, and we're still working on that; maybe at the next summit we can talk about more options and our development there. We definitely want to look into Ceph and how we want to do that. We're also considering other projects for storage; we're looking at Swift and Manila, which are in our pipeline, to help our users do cloud storage testing in the lab using Swift, Manila, or any other storage options they want to try. We're also looking into big data use cases. Looking at the hardware we have today, every storage server has 26 drive slots, and right now we only use six of them per server. So we still have a lot of room to expand if we need more storage; we just have to add more hard disks without affecting the RAID design. That gives us a lot of flexibility to expand for future needs.

Okay, one of the things I want to talk about here is how we built the cloud today. Of course, this is not the only option; we just use it as one example that we've been testing and experimenting with in our lab. For the first use case we tried out, when we built out the cloud, we had a deployment host, infrastructure hosts, compute hosts, and storage hosts.
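As a brief aside on the RAID-plus-LVM backend Leong mentioned above: the Cinder LVM driver expects a volume group to carve volumes from, so preparation on a storage node amounts to turning the RAID device into an LVM physical volume and creating that group. This is a minimal sketch under assumptions: /dev/md0 is a hypothetical RAID device path, and cinder-volumes is the volume group name the LVM driver uses by default.

```python
# Minimal sketch: carve an LVM volume group for the Cinder LVM backend
# out of a RAID device. /dev/md0 is a hypothetical device path; adapt it
# to your own RAID layout. Run as root on the storage node.
import subprocess

RAID_DEVICE = "/dev/md0"          # assumed RAID block device
VG_NAME = "cinder-volumes"        # default volume group name for the LVM driver

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run("pvcreate", RAID_DEVICE)              # mark the device as an LVM physical volume
run("vgcreate", VG_NAME, RAID_DEVICE)     # create the group Cinder will allocate from
run("vgs", VG_NAME)                       # sanity check
```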
For the hosts themselves, the way we provision the operating system is that we're experimenting with Bifrost and Ironic to lay down the operating system on the infrastructure hosts, compute hosts, and storage hosts. Throughout the process we learned some of the difficulties of using Bifrost and Ironic, and we've tried to fix those bugs and push the fixes upstream to help the community solve those issues as well. So that's one of the experiments we've done in the lab so far. To deploy the OpenStack services, today we're using OSA, OpenStack-Ansible. There are a couple of sessions at the summit that talk about OpenStack-Ansible, and that's the tool we use to deploy the OpenStack services after the operating system has been provisioned by Bifrost and Ironic. OSA deploys MariaDB, RabbitMQ, logging, Keystone, Neutron, Horizon, Glance, and Heat on the infrastructure hosts, and then the same OSA Ansible playbooks provision the Nova services on the compute hosts, and Cinder and Swift as well. We're also looking at doing bare-metal provisioning as a service in the future, trying to create an Ironic bare-metal cloud, and we're still looking at containers down the road. And of course, as I mentioned, the Ceph storage.

This is the installation workflow for OSA. If you're using OSA, you're probably familiar with this flow; if you're not, there's a link down there, and you can refer to the install guide for OpenStack-Ansible. Basically, the installation workflow is: first you prepare the deployment host, then you prepare the target hosts, and once the target hosts are ready, you configure the deployment. All of this is done with OSA. Then you run the foundation playbook to lay out all the foundation services, you run the infrastructure playbook to install all the infrastructure services, and then you run the OpenStack playbook to install all the OpenStack services. That's the workflow from the OSA perspective, and as I mentioned, this is just one of the options we tried out and experimented with in this lab.

We're also looking at Kolla, using Kolla to deploy all the OpenStack services and comparing the differences between them: which one is more efficient and which one is easier for people to use. Those are the other things we're working on now in the lab. What else? I think that's what we've been doing so far. We're still looking at other projects, even testing Sahara or Trove; that depends on how busy we are. So with that, I'll pass it back to Mike for the summary.

Yeah, so there are a lot of opportunities for us to scale within this lab, and you'll see the work we're doing show up in many places. We do, of course, have the Intel booth, where you'll see a lot of the features we're enabling; that's where we're promoting that work, as well as a lot of the work we're doing within the community. If you're familiar with it, there's the OpenStack Innovation Center, which is a relationship we have with Rackspace, and they're also supporting a lot of the enterprise use cases.
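Stepping back for a moment to the OSA installation workflow Leong just walked through: once the deployment host and the target hosts are prepared and /etc/openstack_deploy is configured, the whole install boils down to three playbook runs. Here is a minimal sketch of driving them from the deployment host; the playbook names are the standard ones from the OpenStack-Ansible install guide, and the path assumes the default /opt/openstack-ansible checkout.

```python
# Minimal sketch of the OpenStack-Ansible installation flow, driven from
# the deployment host. Assumes the repo is checked out at
# /opt/openstack-ansible and that /etc/openstack_deploy is already
# configured (openstack_user_config.yml, user_variables.yml, secrets).
import subprocess

PLAYBOOK_DIR = "/opt/openstack-ansible/playbooks"

for playbook in ("setup-hosts.yml",            # foundation: containers, host networking
                 "setup-infrastructure.yml",   # MariaDB, RabbitMQ, logging, repo server
                 "setup-openstack.yml"):       # Keystone, Glance, Nova, Neutron, ...
    subprocess.run(["openstack-ansible", playbook],
                   cwd=PLAYBOOK_DIR, check=True)
```

Each run is idempotent in the usual Ansible sense, so re-running a stage after fixing a failure is the normal recovery path.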
We're working closely inside Intel, as well as with that OSIC team of 100-plus engineers, all working together to improve the environment. So if you do want to come by and see some of the work we're doing, this is where the Intel booth is; feel free to stop by and ask us anything about the work we're doing to enable not only Intel features but also improvements to OpenStack itself. The OSIC, the OpenStack Innovation Center, has a number of people over in the developer lounge, and in that environment you can actually see their roadmap in detail; feel free to stop by and talk with them. They can also tell you about the 2,000-node clusters that are available for the community to use to test OpenStack at scale for the enterprise. So with that, I'll open it up for questions. Any questions?

Can you hear me? Yes. Okay, great. A couple of questions. So, 15 nodes in the rack: could you go over how many nodes were in the control plane, how many compute nodes, that type of thing? Yeah, it was actually designed to be whatever we wanted for the project. Initially we set it up so that within a single rack you could actually have high availability. For the controllers, we would typically set aside three of them for that task, and then you can provision all the others with a mix of compute, storage, and networking, depending on what you want to do. When we first started this project, we tried to make it flexible, and the use cases have actually evolved over time. So to answer the question of how many control nodes and compute nodes we have: it depends on how many projects we have. From a per-rack perspective, if we're working on a single rack of 15 servers, because we want to test some of the HA features, we actually use three controller nodes, four compute nodes, and three storage nodes at the moment, and the rest are used as utility servers. One thing that's nice, too, with the ability to VLAN everything, is that we can take each one of these racks, and even beyond that, scale to larger environments; we can scale up, and we can also segment the environment to have parallel projects running at the same time.

Yeah, and I forgot to mention: it also depends on the tools you're using and how you want to architect the OpenStack network. If you use OSA, they actually use Linux bridge and they suggest NIC bonding. But if you use different tools like Fuel or Kolla, the network setup will be different, so you have to consider those options when you design your lab; it all depends on what kind of tools you want to use. On the hardware side, as we mentioned, the servers are capable of supporting up to six NICs at the moment. If, let's say, we want to do NIC bonding on every network, we can use all six if we want to, but it all depends on the tools we use. And of course, if we decide to use all six NICs, then we have to add another switch at the top, because right now we only have the one 48-port switch.

Thank you. A couple of questions about power and cooling. Can you specify how much power you had provisioned to the rack? And I also wonder if you had redundant power, like A and B power; in tier-three data centers, they say you have to use the A side and keep the B side for redundancy.
So how did you deal with that? Okay, so for our solution, on the server end of things specifically, we had 60 amps available and we ran that through two 30-amp PDUs. We had the option of running the servers with one power supply or two, which provides your redundancy because you split them across the PDUs. In our current scenario, we're running on just one power supply, because we want to test things; if things drop, let them drop naturally. We're working on fixing things, so let's let some things break. There was an additional amount of power available for the battery backup to handle the networking and the management gear like the KVM. And that allowed us, like I said, to cover about three racks' worth and still have enough capacity to ride through a power dip. Is that what you were looking for?

I'm curious if you could elaborate a little more on your HA solution, and are you using your intelligent power switches for fencing? So the HA solutions were actually options we wanted to test in OpenStack. There are many things we can do, and I'll let Leong talk about the specific tests we're running. Yeah, so it depends on what kind of use case people want to test. That's definitely part of our roadmap, but as we mentioned, we just started this lab a couple of months ago. We'll try to share all the information, maybe at the next summit, and we'll publish it upstream to the OpenStack community as well. Once we have the data ready, we'll publish the information in a white paper, either on the OSIC website or the OpenStack site, or on 01.org.

Since you didn't have a cooling issue with the CPUs and you decided to go with 2Us, was that for the storage density? And if so, with storage density increasing, do you plan to go back to 1Us when you can? We left that open because we didn't know where this was going. Both SKUs are capable of running whatever hardware; they're literally the same hardware in each box. The difference is that the 2U has additional backplanes, hardware RAID, and capacitors for more drives. But when we did the power analysis, we calculated it with all bays filled, so we have a lot of capacity left: if we add drives, or decide to add RAID, things like that, we're still fine. That amount of power covers the whole thing. And don't forget that everything also had to fit within the budget.

Yes. Oh, there's no mic over there; there's only the one mic here. Thank you. For storage, did you have all hard drives or SSDs? Did you use caching, Ceph caching? Yeah, we do have SSDs. How did you use the SSDs? SSD caching would actually be configured on your end, if you wanted to use it. What we did was make the boot drives all SSD, and the data drives enterprise-class spinning disks. So you could use SSDs to cache in front of the data drives if you wanted; that would be an option. We didn't include it, but it's something that can easily be added, and we had the power budget for it. We did not measure it, because we hadn't built it yet. The question, for those who couldn't hear, was whether we measured performance latency. Yeah, so at this point, now that the hardware is built and the software has been applied, this is where we could measure some of that and then pursue those options if we want them.

So in your slide, you mentioned five nodes were 1U and the other ten had something like 26 bays. Is the intention to have storage nodes separate from compute?
Like, some nodes purely used for storage and other nodes purely used for compute? Again, that's an option; we can do both. The idea was that we could test multiple use cases within the environment. Yes, and it depends on how we segment the environment, too. Any more questions? OK, well, thank you, everyone, for coming. Thank you.