Well, hello everyone, and welcome to this second OpenShift on OpenStack SIG meeting. The first one was part of a briefing, and Judd Maltin was the one who spoke at it — he gave us a wonderful overview of Dell's reference architecture for OpenShift on OpenStack. Because it was such an interesting session, we decided to try to host these OpenShift on OpenStack SIGs bi-monthly. Today we have Judd with us, along with Jeremy Eder, who wrote a wonderful blog post about his work testing OpenShift on OpenStack in the CNCF lab. The CNCF is the Cloud Native Computing Foundation, and it has a cluster — donated by Intel, and hosted, I believe, at a Supernap or Switch facility, one of those places — that has been a great test bed for us for scale testing. We thought that would be a good way to open up the conversation, since this is a SIG and not a briefing of sorts, and Jeremy has kindly offered to give this talk.

We do have a chat, so raise your hand there if you want to ask questions. I'm going to unmute everybody — you can self-mute if you have dogs, cats, kids, or heavy equipment in the back room — but do raise your hand and ask questions. At the end of Jeremy's talk we'll open it up for a conversation about what this SIG means to you and where we want to go with it. All right, I'll let Judd introduce himself and then we'll kick it off.

Thanks a lot, Diane. My name is Judd Maltin. I'm a principal systems engineer at Dell, and my job these days is to write a reference architecture and deployment guide for deploying OpenShift 3.3 on the latest OpenStack platform from Red Hat — to address all the issues, hopes, and dreams of folks delivering OpenShift on top of OpenStack in whatever vendor configuration you like, and to act as an emissary into OpenStack, promoting the features that are most helpful to OpenShift, Kubernetes, and that whole infrastructure. As the chairman of the SIG, I'm really interested to hear from you all about what you might need from OpenStack in order to accomplish what you need to do on OpenShift, and also what you might want to share with the group — successes, failures, war stories — so we can all gain more knowledge about this popular stack.

We have the Gathering coming up right before KubeCon in Seattle in early November; I'm looking forward to seeing a whole bunch of folks there and learning and teaching tons about what we're doing. My work is available — unfortunately, through your Dell sales rep. I'm also very proud, and surprised, that Intel bought a thousand nodes of Dell gear — the R630s featured in my reference architecture for OpenShift — so we can actually go pretty deep into the configuration of these boxes, and you'll really know what to order when you're looking at deploying this stuff. That's it, I guess. Without further ado — again, ping me if there's stuff you want to talk about and collaborate on. I'd like to thank Jeremy for going over this, and I'm excited to dig into what they accomplished at the CNCF lab. Jeremy?

Yeah, hi everybody. My name is Jeremy; I'm an engineer on the performance and scale engineering team at Red Hat, and these past couple of years
I've been working on OpenShift and container technology in general. Around May or June we became aware that the CNCF environment was going to become available, and we had a product need to get some additional scale testing done — so here we are. We were able to use that lab, and I'll go over some of the details of how we used it, why we did things a certain way, and some of the results, which I guess are the most interesting part — though for me the journey was equally interesting. A lot of work came out of it, and there's more to go; we're actually right now gearing up for an additional scale test. We're going to do these as often as we possibly can, so we're preparing a newer, revised version of this test and hopefully we'll achieve even greater numbers. Hopefully in front of you, you'll see a red slide. Yep? Okay.

I figured it might be interesting to take you through how we turn requirements into results from our perspective. That starts with the lofty marching orders from the product managers: "here's what we need to do," or "here's what we'd like you to try to get to — you tell us how realistic this is, and go try it anyway." I'll cover how CNCF got involved, and I want to share the deployment of OpenShift on OpenStack at scale that we did, which we learned a lot from and which is feeding into our product documentation and our own reference architectures along these lines. I also want to take you into some of the details of how we actually do our testing, because the results and the numbers have more meaning if you understand how we actually exercise the environment. Then some results — these are the ones shared in the blog, so if you've read the blog already, you'll have seen some of them — and then hopefully some Q&A. I didn't prepare a terrible amount, so as I'm going through this, please do feel free to ask questions. Diane, I cannot see the chat while I'm looking at the slides, so just interrupt me if there are questions, and if I'm going off in the wrong direction, refocus me.

Okay, so what did we set out to do? We get a product requirements document from the product managers in the release planning phase, in advance of code freeze for OpenShift 3.3, which was a couple of months ago now, and our goals were pretty lofty. Along the way, we've been getting more and more information and feedback from the field that density and consolidation are very important attributes of the system, so to that end we decided to try to raise the number of pods that can be supported per node. OpenShift provides an Elasticsearch, Fluentd, and Kibana based stack for consolidated logging; we wanted to exercise that at as big a scale as we could. It also provides a Hawkular, Heapster, and Cassandra based metrics stack — logging and metrics both actually run as pods on top of OpenShift — so we wanted to test that too. And then we also had a full-density exercise — really more of a control-plane exercise — at full scale, as dense as we could get.

Let me just say one thing before I continue. The support statements that come out of OpenShift and out of Kubernetes are generally around how many pods you can run, and how many pods you can run at a certain number of nodes. It's a little bit difficult to grok, but the idea is that the more nodes you have, the fewer pods you can have per node. So at the full scale of a thousand nodes, we're not doing 250 pods per node — we're doing half of that. And if you needed to get to 250 pods on a node, you would only be able to have 500 nodes. In other words, 125,000 pods is currently the maximum for OpenShift 3.3.
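To make that support statement concrete — it's one cluster-wide pod ceiling rather than independent per-node and per-cluster limits — here is a toy worked example using the 125,000-pod figure just quoted. This is an illustration of the arithmetic, not official sizing guidance:

```python
# Toy illustration of the OpenShift 3.3 support math described above:
# a fixed cluster-wide pod ceiling means pods-per-node and node count
# trade off against each other.
CLUSTER_POD_CEILING = 125_000  # figure quoted in the talk

def max_pods_per_node(node_count):
    return CLUSTER_POD_CEILING // node_count

print(max_pods_per_node(500))   # 250 pods/node if you stop at 500 nodes
print(max_pods_per_node(1000))  # 125 pods/node at the full 1,000 nodes
```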
So the idea was: okay, given those marching orders from marketing, how the heck are we going to do this, and where are we going to do it? We already had the why. We have a previous relationship with Chris Aniszczyk and Dan Kohn at the CNCF, so we discussed our use of their lab extensively, and what ended up happening was that we were allocated 300 of those physical nodes. Everyone was bursting through doors and windows like the Kool-Aid Man, saying, "okay, we've got this kick-ass gear." It's super high-end stuff — we really couldn't get higher-end gear even if we had infinite budget. Every system has an NVMe device, for example; there were 14,000-plus cores available; really, really awesome Dell gear; and plenty of RAM. Basically we wanted for nothing in terms of infrastructure. Well, I will say this: the 10Gb NICs — we probably could have used a few more of those. But anyway, that's the hardware stack: Dell PowerEdge R630s with Haswell chips and 256 GB of RAM — really great gear for trying to hit the targets we were after. In other words, we didn't have any hardware limitations, and we were able to focus ourselves on the software, which really untied our hands. When we try to do scale testing it's sometimes a mix of "we don't have the hardware to do that" or "we don't have the software to do that"; in this scenario the hardware question was answered for us, thankfully, by CNCF. So here we are — that's the left-hand side of the slide, and the bottom-right quadrant lists the versions.

Before we get to the software, I'm curious whether you know what sort of northbound networking gear was involved. For our Dell RA we have a fully cross-linked HA switch infrastructure, both northbound and east-west across racks; our RA goes up to three full racks, and this is a lot more than three full racks, so you could still get a lot of processing power. So, do you know who was supplying the networking gear?

You know, the lab was turned over to us as machines, so I don't know. I know we had problems at the network switch level with dropped packets that we were working on for a while, and that ended up getting resolved. We also had driver bugs in the Intel i40e driver that made it look like there might be switching problems when there weren't. So I don't actually know what the uplink switch models were — I believe they were Cisco, but I could be wrong.
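Since the switches were a black box, much of that triage had to happen from the host side. Here is a minimal sketch of the kind of check that helps separate NIC/driver trouble from switch trouble — the sysfs counter paths are standard on RHEL 7, but the script itself is illustrative and not part of the test harness described here:

```python
# Watch the kernel's per-NIC error counters; climbing rx_errors with flat
# rx_dropped (or vice versa) points in different directions during triage.
import glob

for path in glob.glob("/sys/class/net/*/statistics/rx_errors"):
    nic = path.split("/")[4]  # interface name from the sysfs path
    with open(path) as f:
        rx_err = int(f.read())
    with open(path.replace("rx_errors", "rx_dropped")) as f:
        rx_drop = int(f.read())
    print(f"{nic}: rx_errors={rx_err} rx_dropped={rx_drop}")
```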
This is one of those things where, the next time we go through the CNCF work, one of the main requests we had was: can we have read-only access to the switches? They were not set up in a multi-tenant fashion that would allow that in a way the ops folks over there were comfortable with, so one of the ideas was to dump the switch configs for us every day — or even more often than that — and process them so we only see what we're supposed to see, so we'd at least have a text-based output. We were looking for things like: is the switch dropping packets? Do we have link? We actually had a bunch of link errors — driver bugs related to even establishing link with the switches — and those are things that were difficult to debug without access to the switching infrastructure. That's just the nature of the way this lab was given to us at the time. So part of our feedback to them has been that we would like read-only access or something similar around the switching, and we would also love to have NetFlow-type data from the switches to understand whether we're hitting cross-connect link contention or bandwidth issues along those lines. We never did stress the networking plane to the point of pushing 10 gigabits of traffic, but from a functional standpoint it might have sped up our debugging. That's very lab-specific and really didn't have much to do with OpenShift and OpenStack — but you asked.

Okay, so: we did deploy with RHEL 7.2, and we had to upgrade to a RHEL 7.3 development kernel in order to pick up the driver fixes I mentioned earlier — bugs that were fixed upstream but not in the code level we had at the time; they're fixed at this point. Then Red Hat OpenStack Platform 8, which is based on Liberty if you're familiar with the naming scheme, plus a bunch of patches that have again been publicly released as errata. And we used OpenShift 3.3, which was at alpha stage at the time these tests were run. So that's the mix of hardware and software resources we used. For a private cloud, this is probably a fairly standard config — maybe a little bit larger, maybe the nodes are a little nicer than many people can afford — but scaling those down, I would think this is a pretty happy path for customers doing private cloud at this point. There were a lot of things in this environment that were more nice-to-have than need-to-have; the NVMe devices go in that category, though we certainly put them to good use once we figured out how to wire them into OpenStack. Okay — any questions about the hardware and software configs?
Nope, looks good to me.

So, the logical diagram we built is on the right, and hopefully you can see that blue is a physical node and green is a virtual node. What we did was create two different host aggregates, which are essentially like regions or zones, and in each one of those regions we deployed a separate OpenShift installation. The reason we did that was only because we had more gear than we needed to get to a thousand nodes, and we wanted to run different OpenShift tests that would otherwise have interfered with each other. We were able to parallelize a bunch of work by doing that, and that's why you've got the curly braces "1, 2" there — there were actually two mirror images of this, sharing the same OpenStack control plane.

So you've got host aggregate/availability zones for the infra roles. OpenShift has the concept of masters — your API servers — and etcd nodes, which are key-value stores that hold all the persistent cluster state; the masters talk with etcd directly. There's also, not on this diagram, a load balancer in front of those masters, which is a single endpoint that all the nodes talk to — that's just another VM. The diagram says it goes up to master N, but we ended up with only three masters and three etcd nodes. Then there was a "support" availability zone — or host aggregate — and in there we stuck the routers, registries, and the metrics and logging pods that were in use. The third one was a catch-all, which was everything else — in our case it ended up being just nodes. And there was actually a fourth, where we stuck a bunch of servers we had in reserve and ran our workload generators; those are just bare-metal RHEL nodes. Our workload generator currently uses JMeter, so it generates web traffic.

Some of the flavors we used are at the bottom of the diagram. All of our nodes were 4 vCPUs, 16 GB of RAM, and 32 GB of disk. Those were probably larger than we needed, but it's the kind of thing where we had a bunch of hardware and figured we would use all of it. The etcd servers and infra nodes definitely need to be beefier: depending on the type of test or workload, you generally need more CPU to handle the metrics and logging, and the router and registry. If you're doing web traffic, the routers get busy on CPU; the registries do some CPU activity but mostly disk activity. So the infra and etcd nodes tend to need a little more CPU. And finally, the most CPU-intensive ones are the masters, which are constantly busy doing command-and-control traffic with the nodes.
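For anyone reproducing this layout, here is a minimal sketch of carving out the host aggregates/availability zones and the node flavor just described, driving the standard `openstack` CLI from Python. The aggregate names and the `ocp-node` flavor name are assumptions for illustration; the 4 vCPU / 16 GB / 32 GB numbers come from the diagram:

```python
import subprocess

def os_cli(*args):
    # Thin wrapper over the standard python-openstackclient CLI.
    subprocess.run(["openstack", *args], check=True)

# One aggregate/availability zone per role, mirroring the masters/infra/
# nodes split described above (names are assumptions, not from the talk).
for zone in ("masters", "infra", "nodes"):
    os_cli("aggregate", "create", "--zone", zone, zone)

# The 4 vCPU / 16 GB RAM / 32 GB disk node flavor from the diagram.
os_cli("flavor", "create", "--vcpus", "4", "--ram", "16384",
       "--disk", "32", "ocp-node")
```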
Part of our effort in this time frame was to improve those command-and-control channels, and we've done that in a couple of ways. First of all, we've put in some caches, and we've optimized the etcd-backed watches. If you're not familiar with those: a watch is a persistent connection that a node establishes via the master, which allows it to be constantly updated with status changes from the cluster. So a node will immediately know when a new node becomes available, and it knows how to route packets to it because it has received that update. Those watches are a key point of command and control that has gone under optimization, and the main thing we've done is move from JSON-based payloads that encapsulate all of the state data over to protocol buffers, which are binary-encoded and thus much more CPU-efficient to encode and decode than JSON. Basically, we found that Go's JSON library could use some optimization, and Google had already invented protocol buffers — they're in use well beyond Kubernetes, but applying them to Kubernetes allowed us to reduce CPU usage from, I believe, 22 cores down to 15 cores or so — maybe a 30 to 40 percent improvement in the number of cores it takes to manage a thousand nodes. Between the caches and the protocol buffers, that's a phenomenal scalability improvement.

For a while, to get to these numbers, we were just throwing more hardware at it — that's why the masters are 40 vCPUs. At one point we were up to 22 cores, and we really felt we didn't want to be bottlenecked on CPU, so we threw as much hardware at it as we could. At this point what we're doing is whittling those numbers down; overall efficiency in the control plane is what we're trying to address.
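To make the watch mechanism concrete, here is what one looks like from a client's point of view — a minimal sketch using the Kubernetes Python client. The nodes themselves do this through the Go client; this is just an illustration of the pattern:

```python
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

# One long-lived connection: the API server streams node changes to us
# as they happen, instead of the client polling for full state.
w = watch.Watch()
for event in w.stream(v1.list_node, timeout_seconds=60):
    print(event["type"], event["object"].metadata.name)
```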
In order to do all of that, not only did we involve all of the relevant special interest groups in Kubernetes land — particularly sig-scale — but also OpenShift engineering, quality engineering inside Red Hat, the OpenStack engineering group, our group the performance team, and, analogous to the teams Judd is likely on, our reference architecture teams, which are writing documents that capture the best practices we learned. And finally there was a funny bit around Ansible, which I'll get into in a minute — we identified some scalability issues in their code base. If you haven't seen or used OpenShift yet, its installer is written in Ansible.

Before you go on, I was curious about the load balancers not pictured here. What was your load-balancing solution? Did you have two with a VIP? Did you opt for OpenStack's built-in load balancer? How did you go about it?

Yeah — Load Balancer as a Service in OpenStack is not yet supported by Red Hat, so what's used in this installation is the native load balancer that comes with OpenShift, which is HAProxy-based. Whenever you install OpenShift with more than one master, you have to configure a load balancer node, and you get a single load balancer with an IP address — so there is not a fault-tolerant config there in our environment. I'd have to double-check, but I'm fairly certain it's a tunable option to have multiple software load balancers. It's not so much about balancing a tremendous amount of load as it is about providing a single endpoint for all of the nodes to access, rather than doing round-robin DNS or some other scheme to spread that load out. The idea is that a single master can't handle the amount of traffic a thousand nodes generates, so you have to spread it out somehow, and the HAProxy load balancer fans that out for us — in just an equal-weight scenario.

Thank you.

So, let's say we have that environment built out. What are some of the things we wanted to test? Again, we're the performance team, and we collaborate with our quality-assurance folks to develop tests for OpenShift. Our repository is linked here; it's called SVT — I didn't name it, so I don't know why it's called that, other than that it stands for "system verification and test." Those tests are open source; you can grab them right now and look at what we do. There are four main things in there — at least from my point of view; there's actually a fair amount more — that are interesting for the subject we're covering: a cluster loader, network tests, a workload generator, and reliability tests, where we look for memory leaks and connections that don't close properly. There's a fair number of TCP and HTTP connections in the system, and we want to make sure we are not leaving sockets open to the point where you build up to that type of exhaustion — memory leaks, and so on. So the reliability tests generally run for anywhere between 10 and 14 days.
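As a flavor of what those reliability checks look for, here is a trivial sketch that samples the kernel's TCP socket count before and after a soak period. `/proc/net/sockstat` is standard Linux; the script is illustrative only — the real tests live in the linked SVT repository:

```python
import time

def tcp_in_use():
    # /proc/net/sockstat has a line like: "TCP: inuse 42 orphan 0 tw 7 ..."
    with open("/proc/net/sockstat") as f:
        for line in f:
            if line.startswith("TCP:"):
                fields = line.split()
                return int(fields[fields.index("inuse") + 1])

baseline = tcp_in_use()
time.sleep(600)  # run the workload in between samples
after = tcp_in_use()
# A count that only ever climbs across many cycles suggests a socket leak.
print(f"TCP sockets in use: {baseline} -> {after}")
```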
I didn't mention it in that list just because it wouldn't fit on the page, but a key piece of what we do involves this image provisioner. What it does is take a RHEL cloud image — a very minimal qcow2 image, or an Amazon AMI, or any other public cloud image — and turn it into an OpenShift-ready node that has no OpenShift configuration on it. It sets up the partitions, makes sure the Docker storage is set up correctly, and bakes in — I mean, this golden-image technique has been around forever; this just automates it through Ansible and wires in the OpenShift-specific bits, which are really just pre-pulling a lot of the images required to run OpenShift. If you've seen it before, there's ose-pod, ose-router, OSE itself — maybe a dozen images we need to pre-pull onto each node. The reason we do that is that with a thousand nodes, going over the network to pull all of that becomes a point of contention, so baking it into the images became a necessity, actually, in order to get anything done in the time frames we had. We pushed this up to GitHub, and you can use it yourself right now: as long as you configure an inventory file for your environment, it will spit out a qcow2 — or it will also spit out an Amazon AMI — with all of this baked in. It was a means of self-defense, really, to get work done. You don't necessarily need to do this, but it will take you a lot longer if you don't do something like it.

This takes the load off of your RPM repos and your Docker registry, correct?

Those are the primary two things you get, yeah. And we did some other stuff, too — we baked in our metrics and monitoring solution, called pbench. What that does is gather system metrics and provide graphs at the system level, so that when we run these tests we can understand the impact and efficiency. I mentioned dropping from 22 to 15 cores — we measured that with the pbench utility.

So the main two wins from a shared-infrastructure standpoint are the registry and the yum repos?

Yeah.
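The heart of that bake step is nothing more exotic than pulling the images at image-build time instead of at cluster bring-up. A minimal sketch — the image names below are representative examples, not the exact OpenShift 3.3 list that ships with the playbooks:

```python
import subprocess

IMAGES = [  # assumption: illustrative names, not the authoritative list
    "openshift3/ose-pod",
    "openshift3/ose-haproxy-router",
    "openshift3/ose-docker-registry",
]

for image in IMAGES:
    # Pulling once at image-build time keeps a thousand booting nodes from
    # hammering the registry and the network during cluster bring-up.
    subprocess.run(
        ["docker", "pull", f"registry.access.redhat.com/{image}"],
        check=True)
```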
Okay — so now we've got OpenStack, OpenShift on top, and our images deployed; let's assume the cluster is built at this point. Again: what do we do with it?

We designed some tests centered around a shared web-hosting scenario, which is kind of the bread and butter for Kubernetes and OpenShift at this phase in their life. We expect that to change drastically over the next year or two, but right now it's centered around twelve-factor application development and web stacks — MongoDB or PostgreSQL, Node and PHP, or whatever else your favorite thing is. OpenShift itself ships examples, which are templates, and we've used those and adjusted them to our needs.

We also developed a utility called the cluster loader, and it shows up in the blog. What it is is a way to express an environment that consists of all the complexities of Kubernetes — to represent what a customer or user is really going to do. It'll include replication controllers, secrets, templates, pods, services, routes, all that stuff, and it allows you to express that in a very human-friendly way and have the script go off and populate the environment according to your wishes. The link is there, and if you explore it you'll find directories with the basic pod manifest, the basic replication-controller manifest, and so on — all of it templated, to give us a tub of Legos so we can build whatever we want with these building blocks. The config directory contains — and I'll show it to you in a slide — just the guts of what the cluster loader will go off and do. It's a Python script right now, and as you can imagine we have some plans for improving it. So that's the config and content directories, including the templates; the cluster loader Python script itself, which takes only two arguments — actually one argument, "what file am I loading from," plus an optional second argument around using the workload-generator portion of cluster loader — and, of course, a utilities file with a bunch of helper functions in it. That's the framework we use to test OpenShift.

The architecture is pretty simple — probably the most simple flow chart you can have. Start, parse the arguments, create a namespace — a "project" in OpenShift parlance — and within that namespace create however many things you want to represent what your applications might be doing: quotas, users, pods, etc. Then iterate until you've reached the total number of pods, and exit. At that point we have an environment that's ready for the second phase, which is actually putting work on it. This is the first phase, and oftentimes it's the only phase that matters, because we can potentially use it not only as the loading utility but also for benchmarking: how quickly do we load applications when the cluster is under load? What are the API server response times under load? And additionally, how many CPUs does it take us to get a certain fixed amount of work done?
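As a sketch of that flow — not the actual SVT code, and with toy numbers — the first-phase loop looks roughly like this with the Kubernetes Python client:

```python
# Minimal cluster-loader-style loop (illustrative only): create projects,
# each with a fixed set of pods, until a target pod count is reached.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

TARGET_PODS = 50        # assumption: tiny numbers for illustration
PODS_PER_PROJECT = 5

def make_pod(name):
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1PodSpec(containers=[
            client.V1Container(name="pause", image="k8s.gcr.io/pause:3.1")]))

created, project = 0, 0
while created < TARGET_PODS:
    ns = f"svt-demo-{project}"
    core.create_namespace(
        client.V1Namespace(metadata=client.V1ObjectMeta(name=ns)))
    for i in range(PODS_PER_PROJECT):
        core.create_namespaced_pod(ns, make_pod(f"pod-{i}"))
        created += 1
    project += 1
```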
The test we ran inside CNCF is pretty lengthy in its guts — the actual test config is on the blog — but here's an example from our repository, and the number of projects is really all that changed. At the top left you'll see the number is only five: it would deploy five projects, each with one user, three templates, six build configurations, one image stream, and two replication-controller templates — each of those with 256-byte secrets — plus routes. The idea here is that if you have all of this within one namespace, one project, you have pretty much what a user or developer might try to deploy on OpenShift. We can bikeshed about the numbers; the idea is to provide a framework and give you the tools to experiment on your own. These configs are what we've arrived at internally for what we expect people to do, but they could be right or wrong depending on what you're trying to load onto the environment. For example, Java apps malloc their memory up front, so you'll get less density because there's no ability to overcommit with the JVM. Your project number on the same gear would be lower with JVMs than with something written in an interpreted language like Node or PHP, where you can overcommit memory.

So instead of five, we tried to go as high as we could until failure, and this is exactly what we got to: 1,000 nodes, 13,000 projects — our target was 20,000; we didn't get there in the time allotted — and 52,000 pods. I'm not going to read the rest, but the numbers are pretty impressive, I guess. And we learned a lot — about where to adjust the config and what to fix — and the deliverables were bug fixes, a lot of them, and a lot of it actually ended up being documentation as well. For example, if you're doing logging in a big environment and you want Elasticsearch to scale, there are three or four specific things you need to do. systemd has a thousand-messages-a-second rate limit by default; in a large environment, a thousand messages a second might be nothing, so you'd want to adjust those types of things. That's the type of thing we hit at scale that a smaller environment would likely never encounter.
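For the journald rate limit specifically, the knobs are `RateLimitInterval` and `RateLimitBurst` in journald's configuration on RHEL 7. A minimal sketch of loosening them via a drop-in file — the values here are placeholders to tune for your environment, not a recommendation from the talk:

```python
import os

# Placeholder values; tune for your actual log volume.
conf = """[Journal]
RateLimitInterval=1s
RateLimitBurst=10000
"""

os.makedirs("/etc/systemd/journald.conf.d", exist_ok=True)
with open("/etc/systemd/journald.conf.d/90-rate-limit.conf", "w") as f:
    f.write(conf)
# Apply with: systemctl restart systemd-journald
```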
One of the things we did encounter on the way to those numbers was that etcd took more disk space than we were initially recommending. So we've since gone from a 20 GB recommendation up to a 40 GB recommendation, even though only about 13 GB was ever consumed in our test. That's the left graph: around midnight or so, the blue line is somewhere around 13 GB of data, so we're recommending 40 GB at this point, which is probably at least twice what you would actually need for a massive cluster. And etcd space actually tends to vary based on how many images you have — sorry, not how many pods. It ends up varying quite a bit with the number of images in your registry, because we store some metadata about each image in etcd. So a lot of the disk space is not so much about the number of nodes, routes, projects, etc., as it is about the number of images. For the next-generation test, you can imagine we're going to pile on the number of images and come up with exact capacity-planning formulas for what you can expect there.

Towards the right-hand side of the graph you can see fairly busy CPUs that suddenly drop off. That was a bug we encountered, and it's actually where we ended our test, pretty much, because we had just run out of time. That's how we got to 13,000 projects instead of 20,000 — that's basically the point where things fell over, and we had to bail because the gear was being taken away. That bug is fixed in Kubernetes as well as OpenShift 3.3 at this point, so if we had the gear again, we would get well beyond 13,000 projects — and that's what we're hoping for: to get the gear again and do another run-through.

Yep — a question: at about eight o'clock in the morning on the left graph, does the disk utilization go back down because you withdrew the number of images?

Well, not images, but total content, yeah. One thing I forgot to mention is that the cluster loader also has a tear-down phase. Depending on how we've configured it, the growth occurs as quickly as possible and the tear-down occurs as quickly as possible, but there's a steady state it tries to maintain between setup and tear-down so we can examine what's going on. So the fall-off is the tear-down. Actually, another data point that was asked for was not so much how quickly we can delete things as whether there are any issues with project-deletion speed and so on. To be quite honest, we did file bugs based on that, and they're being worked, so I think that was pretty useful. And that was really the whole purpose of getting access to this CNCF cluster — it was a huge opportunity for us to do this kind of testing, and a pretty awesome offer on their part and on Intel's part. Sending requests for this type of usage is nothing more than a PR to the CNCF GitHub repo. We have some plans, as their environment expands, to start using it again for newer versions of our product. These tests and results gave us a lot of homework, and those things are not fixed overnight — they involve coordination with upstream developers and productization on our part — so it's not the type of test we could immediately repeat anyway. That gives us some think time in the interim while we're scheduling ourselves to get back on the gear, because this was a lot of stuff that we learned. Let me just skip ahead.
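Circling back to the etcd disk-sizing point above: the simplest ongoing check is just watching the volume that holds etcd's data directory. A one-liner sketch — `/var/lib/etcd` is the usual default path, but an assumption for your particular install:

```python
import shutil

# /var/lib/etcd is the typical default data dir (an assumption here).
usage = shutil.disk_usage("/var/lib/etcd")
print(f"etcd volume: {usage.used / 2**30:.1f} GiB used "
      f"of {usage.total / 2**30:.1f} GiB")
```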
We filed somewhere around 30-plus bugs around this effort, and a lot of them were in core Kubernetes. We found one kernel bug. For the most part these are solved at this point — there are some outstanding, but none related to scale and none that we consider blockers. There are actually zero blockers that we're aware of in OpenShift 3.3 at this point, given the scale numbers we're quoting for support. Going beyond that, there be dragons, most likely — and that's actually the scope of our next set of scale tests, which we're working on right now. That may be the name of the next blog post: "there be dragons."

Yeah — the idea is to push Kubernetes to its limits. I mean, our job is pretty cool: we get to break stuff and try to put the shattered pieces back together as quickly as possible. Crawling over broken glass as a service.

I should say this before I continue: I'm the one here talking with you, but there were at least six people involved in this — Kim Sinclair and Mike Fiedler being the primaries — plus a couple of Ansible folks who helped us along the way. It was really a team effort overall, definitely not just me; I was more the coordination point for everything.

So, the Ansible issues we encountered: between OpenShift 3.2 and 3.3 we pulled in a new version of Ansible because we wanted some of the features. We got those features, but we also got a performance regression. What you're looking at is actually the "after" — the "before" was so bad that we couldn't even capture stats, because the install never really completed. The issue, which is linked at the top of this slide, was essentially a recursion of Ansible tasks spiraling out of control — infinite loops where every CPU was pegged and the system would essentially fall over. That bug is resolved in Ansible 2.2, which is the version that will be included with OpenShift 3.3, so we'll be on that. Note that the y-axis of this graph is percent of a core, so 1,500 means 15 cores. This was a 16-core system, and at the spike all it's doing is reaching out and collecting facts about a thousand nodes — it takes a lot of CPU to gather all of that together, and actually a lot of memory as well. We couldn't even get to this point in the installation with some of the other versions of Ansible, so those versions aren't supported with OpenShift at this scale. They work just fine at smaller scales, but when you start scaling out to the thousands you start seeing these issues. So what we've done is package a version of 2.2 with OpenShift itself.

Okay, those are the bugs, and I think that might be it. All right — I see a bunch of people on, and I'm wondering if anybody has any questions. Nobody asked any in the chat.

[At this point a participant, Sandeep, began a question, but his audio was breaking up badly; only fragments came through — something about Docker essentially creating easy-to-use packaging of layers and binaries into a format that's easy to distribute, and then creating a runtime that made it easier to run those containers.]

Hey, I'm having a tough time here — I didn't catch that.
"So essentially you had—" Sandeep, I'm not able to make it out, I'm sorry. Can you type your question in? Yeah, the microphone's not working very well. Sandeep, your microphone is not working correctly — can you type your question, or your comment, into the chat? Yeah, he was breaking up pretty badly there.

In the meantime, I'd like to learn a little bit more about your OpenStack deployment. Was this TripleO and Director running typical OSP stuff? And did you run into anything interesting?

Yes, it was typical OSP stuff. The things we ran into, because we were using OSP 8, had all already been fixed upstream, so we pulled some packages back. We had a small issue with Heat, and OSP-d — well, not OSP-d, the cluster — needed a slightly newer version of RabbitMQ, because there were some scalability issues in the versions we were running. The idea is that if we'd been running later versions of the code we would not have hit these issues, but at the time we ran this, 8 was the latest GA version of OpenStack from Red Hat, so we did the best we could with the latest GA. We weren't necessarily focused on OpenStack as much in this test — we were kind of just using it as a cloud, because we wanted to get to a thousand nodes and we only had 300 physical ones. In future rounds of this type of testing we're going to do a lot more stack integration. For example, we've got these Heat templates — we'll use those, with the latest version of OpenStack, Newton, which is version 10 in Red Hat land, and the latest version of OpenShift, to do the deploy. We'll be integrating with Cinder and Gluster, and we'll likely get rid of the double VXLAN encapsulation — trying to benefit a lot more from running on an OpenStack cloud than we did in this test, where it was mostly about providing VMs for OpenShift to run on. So yeah, Heat was a little bit of an issue, RabbitMQ was a little bit of an issue for a while, and then, to be quite honest, we were blaming OpenStack and OpenShift for a lot of problems that were really bugs in the Intel driver — and we actually had to flash firmware on a bunch of the Intel NICs as well. That was because we were among the first folks to get on the CNCF lab. They were extremely helpful, but it didn't save us the time of having to ferret those issues out — you can imagine, without access to the switches, this was a bit of a mystery for us. It took a couple of extra weeks to nail all of that down, and we finally got to the point where everything was super stable. So Heat and RabbitMQ were the two issues we encountered.

There's one more question that popped up: Arash is asking whether the whole thing works on RDO, to try it out for a PoC.

We've done it on OSP 8, and I cannot imagine there's anything in OSP that would preclude you from running RDO — I really don't think so. You know, to be honest with you, we're using the products here for a reason. OpenShift Origin on RDO is not something we're necessarily focusing on; we're more focused on upstream Kubernetes scalability than Origin scalability, because we like to fix issues upstream as soon as we possibly can. From the OpenStack standpoint we do the same thing: we work everything upstream, and RDO becomes an integration point. So I'm not aware of any restrictions around that — your mileage may vary, but hopefully you're lucky. And if you do it, we'll definitely get you to talk about it — let us know what happens.
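A brief aside on the double VXLAN encapsulation Jeremy mentioned wanting to eliminate: running the OpenShift SDN (VXLAN) inside OpenStack tenant networks (also VXLAN) pays the roughly 50-byte VXLAN header tax twice. A back-of-the-envelope sketch, with illustrative numbers:

```python
# VXLAN adds roughly 50 bytes of outer headers per encapsulation layer
# (IPv4 figures; the MTU values here are illustrative, not from the talk).
PHYS_MTU = 1500
VXLAN_OVERHEAD = 50

tenant_mtu = PHYS_MTU - VXLAN_OVERHEAD   # OpenStack tenant network MTU
pod_mtu = tenant_mtu - VXLAN_OVERHEAD    # OpenShift SDN inside the VMs

print(tenant_mtu, pod_mtu)  # 1450, then 1400 usable bytes for pod traffic
```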
Is there anybody else out there? I know, Sandeep, your microphone wasn't quite working — it looks like he dropped off; I thought he might come back.

While we're pausing here and people are thinking about whether they have questions: we're going to have our first face-to-face meeting of the OpenShift on OpenStack SIG at the OpenShift Gathering on November 7th, at KubeCon in Seattle, and I typed the link for registering for the Gathering into the chat, so we'd love to have you come. Judd will be there; I think Jeremy's going to be there for part of it. At lunch we have an hour and a half set aside for all the SIG meetings, so if you're going to KubeCon, please come the day before and join us. The URL to get there is commons.openshift.org/gathering — we'd love to have you there.

And it looks like that was all the questions. The question I now have for you — and I think Judd has it too — is: what would people like to hear as the next topic? We're going to do these bi-monthly, so in two months we'll do another talk on OpenShift on OpenStack, or something related to OpenStack that we should be looking at — talking more about the load balancers, say. If there's a topic that's near and dear to your heart, sign up for the mailing list, which is also on the commons.openshift.org site, and let us know. I haven't created the mailing list yet — that's one of my tasks to do today — so I will add you all to it if you sign up. But is there anything else, Judd — anything near and dear to your heart that you think we ought to cover in the next one?

Yeah, a couple of things, actually: the productization of the OpenShift-on-OpenStack project — their code is really good. I was wondering whether you had used that project in your work to create the necessary Heat templates and drive them. That was really my question: how much of the overcloud Heat templates did you have to hack up from what OSP delivers out of the box?

For this test, we didn't use the OpenShift-on-OpenStack Heat templates, though I believe Heat was used to deploy some of the nodes we had. But I will say this, Judd: we have prototyped OpenShift 3.4 on OpenStack 10 with those templates. We have at least five patches that we're carrying now to make that happen, so it's not bad at all, and our next version of this work will definitely be using those templates.

Cool — because I'm especially interested, for my customers, in upgrade paths, and having well-defined templates makes upgrades somewhat tractable. You don't want to have to redeploy your entire network in order to upgrade through OSPs and through OpenShifts — or even through, say, Dell BIOS updates or driver updates. Thank you so much.
Yeah, I'm looking forward to meeting you again — we actually saw each other at Red Hat Summit. And yeah, folks, hit me with ideas for future calls. We've got a lot of knowledge within Dell about managing and scaling OpenStack and OpenShift networks. Storage is really strong at Dell, too — considering we also merged with EMC, we have a wide variety of enterprise and SMB/mid-market storage products that we've been testing with OpenStack and OpenShift, and we can let you know how they're working and what kind of integration is going on. Whenever I deploy, I'm deploying onto EqualLogic boxes for all my storage, so I've got a slightly different layout, and I'd like to see where folks are hitting problems — or finding success stories — around Cinder integration with different storage back ends. There's a lot of interesting stuff going on in this relationship, and as Red Hat looks more at integrating OpenStack services with OpenShift services, there's going to be a lot more to talk about. So thanks so much, and I really hope to see you folks in Seattle.

Yeah, that would be great — and thank you, everybody, for joining us today. I'll reach out to Sandeep and see if I can figure out what his question was, and we will do another one of these in two months, so sometime in December; we'll pick a date and post it on the calendar. You can always find the upcoming events at commons.openshift.org/briefings — that's where my calendar hides, and we add all the SIGs, briefings, and upcoming events there. Looking forward to meeting you all in person at the Gathering in Seattle, and also to going to KubeCon. Take care, and we'll talk to you all soon. Thanks, Jeremy.

Thank you.