Hello, everyone. Thanks for joining us today, and welcome to OpenInfra Live. OpenInfra Live is an interactive show sharing production case studies, open source demos, industry conversations, and the latest updates from the global open infrastructure community. This show is made possible through the support of our valued members, so thank you to them. My name is Kristen Barrientos and I'll be your host for today's show. We're streaming live on YouTube and LinkedIn and we'll be answering questions throughout the show, so feel free to drop them into the comment section and we'll try to answer as many as we can. Some of the most popular episodes of OpenInfra Live are the large-scale OpenStack shows, where operators of large-scale OpenStack deployments come and discuss operational challenges and solutions. Today the large-scale OpenStack show is back and ready for a deep dive with Samsung SDS. Joining today's discussion, we have Felix Huettner, Thierry Carrez, Arnaud Morin, and our Samsung SDS guest, Dan Pike. Welcome, everyone, and I'll pass it to you, Thierry. Thanks, Kristen, and thank you all for joining us today. I wanted to start this episode with a quick introduction on the large-scale SIG and what we are doing there. The aim of the large-scale SIG, the large-scale special interest group, is to facilitate running OpenStack at large scale and answer all those questions that OpenStack operators have as they need to scale up and scale out their deployments. And as part of the activities of the large-scale SIG, we are doing these shows on OpenInfra Live to bring together operators of large-scale deployments, but also people that are more at the start of their large-scale journey.
We are producing documentation at the large-scale SIG that helps people go through the various stages of the large-scale journey: starting with making the right configuration choices, then how do you monitor your systems, how do you scale up, what are the options for scaling out, and then how do you upgrade and maintain those environments. Today, we have the chance to have Dan Pike from Samsung SDS, who is at the start of this large-scale journey with big goals. And I wanted to start with you, Dan. Can you tell us a bit about yourself, about Samsung SDS and your objectives with this project? Yeah, sure. So my name is Dan Pike. I'm an executive vice president here at Samsung SDS. I run our cloud engineering development team, and this is really the team that builds our cloud platform. Samsung SDS as a company was originally started to provide infrastructure in a standardized way and very efficiently, primarily for the Samsung group. So we operate over 20 data centers globally. We operate the data centers themselves. We manage them. We manage power, manage backup, manage uptime. But we also offer managed services as well. So we do things such as cloud migrations and cloud modernization. If a customer needs to update their applications using a microservice architecture, we can do that. We can help them with their digital transformation. Those are the types of things that we offer as services in addition to the infrastructure. Primarily, our customer base has largely been the Samsung group, but we've also been expanding quite a bit outside the Samsung group as well, both domestically here in Korea and globally. So our cloud platform is really critical to the future success of our company in enabling us to manage hardware much more efficiently. Okay.
And what were you using before, and why did you choose OpenStack for this project? Yeah, that's a great question. So we really started with OpenStack about 10 years ago, when we first started building out our cloud. And so we hired a lot of OpenStack engineers. We really looked at this and we really wanted to build that cloud platform based on OpenStack. But due to time constraints and various other issues that came about at the time, we essentially abandoned that plan and went with building a cloud based on commercial software. So our current cloud is based on commercial private cloud software right now. What we'd like to do going forward, longer term, is, you know, we've said: hey, we recognize some of the pitfalls and some of the things that were less ideal about going with that type of solution. We decided to rearchitect and build a new solution based on OpenStack. So we've been working on this solution for the last two years. This year, we actually had a little bit of a derailment, in the sense that the resources that I had assigned for OpenStack got diverted away to do some things with generative AI, unfortunately. But we do have plans right now for 2024 to restart this project and really get this rolling again. So we actually have much bigger plans for 2024 to get this back on the road. We plan to launch the initial beta of our platform based on OpenStack in July of 2024, and really have a fully functioning version that's in general availability, hopefully, by the end of next year. So that's been our goal. It's a very aggressive goal, but those are the things that we need. And the reason why we're here talking to you, and the reason why we've been having this conversation, is that we operate at a very large scale.
So we need to ensure not only that this platform will scale properly, but we also need things such as high availability, multiple availability zones, failover, disaster recovery. Those are the types of things that our enterprise customers really need in a cloud platform. And I think, since you are now in a quite unique position in that you know your existing environment and what you need to build, can you share a little bit about what this platform will be used for and what things you expect from your customers? Yeah, so, you know, we're a little bit unique in that we do get a lot of new business every year, but we also have a very large footprint of existing business that will likely stay for the time being, for the foreseeable future. In terms of scale, it's a wide variety of applications. They range from internal applications that are used here within the Samsung Group to external apps that non-Samsung companies use within their enterprises as well. I guess the one thing that is common between most of our applications is that they tend to be enterprise applications. So think about things such as ERP-type systems, internal billing-processing-type systems. It's really those types of systems, as opposed to perhaps another vertical such as gaming or something like that, right? We also have a big financial sector, as well as expansion into the public sector, government and things like that. So these are different cloud pools and setups that we've set up. And we plan to migrate these from our existing stack onto the OpenStack platform, once that's built, over time. So, in total, our footprint right now, we're looking at over 50,000 VMs, 10,000-plus servers. But the goal eventually is to migrate as much of this as possible.
But if there are certain things that are more difficult to migrate for various reasons, those will probably come a little bit later, and we'll probably do the easier ones first. It's really a pleasure to see a new company joining the OpenStack community like this and starting to build a new infrastructure from scratch on OpenStack. Is it something that your customers were also asking for, or is it something you decided to go for? What were the key decisions behind building on OpenStack in your company? Yeah, that's a great question, but it's a little bit more the latter. So it's not really something that our customers have been asking for. Our customers tend to be a little bit more abstracted from the actual implementation of the foundation there. So our customers may need a VM, they might need some managed services for, like, a database, they might need some data processing services. They might want to run things like that, right? So we offer those types of services for our users, but whether it's running on top of commercial software or on OpenStack, that is typically less of a concern for our users. So the main impetus and drive around this is really two things. One, working with commercial software has its pros and cons. There definitely are some pros in terms of being able to go to market a little bit faster than maybe had we rolled things on our own, and maybe some of the support in those types of areas. But our hands are tied a little bit in terms of new features that we need. We need to convince their product team to build these new features. And as much as they're cooperative and we're a valued customer, it's just a different thing. If we can look at the code ourselves, and if we can actually invest in engineering to build these features, that's kind of a different ballgame, right?
And a lot of the commercial tools out there are really about private cloud, right? And the things that I'm talking about with you are a little bit more public-cloud-esque, even though we're an enterprise cloud. We're not the type of cloud where a user can just come in and punch in a credit card and get some VMs and start using our service. That's not how Samsung Cloud Platform works. It's an enterprise cloud: you work with our business teams, and if it's a good fit, then we'll set you up with an account, and it's, you know, post-paid billing and things like that. The way that our cloud is set up, it's really set up for this type of enterprise cloud. We've done the work to say: long term, we need to own our own cloud. We need to be able to build the right features that we need. And rather than being dependent upon some other third party, as other people do, let's have ownership of the future of our cloud. That second reason is really the main reason why I've been pushing for building this platform so heavily. And I think one of the main things, beyond owning it and building features, is also that if there's an issue, you can put as much power or as much energy into it as you need. Yeah, and that's a great point. If there is an issue, or if there's an area where perhaps we need to scale in a certain way that maybe other customers are not, then we need to focus there. And we can focus our engineers to do that here, right, which is great. One of the fears that we have, though, is that sometimes our engineers or operations teams can be a little bit more apprehensive, because they don't have this sort of commercial company with a support contract that they can rely upon if there happens to be an outage or an issue, right.
So at this point, they're like: well, we need to take care of this ourselves. What if we can't figure it out? Or what if we have problems actually fixing this thing? Then what do we do? And so that's the kind of expertise that I think we need to build in-house, but we also need to rely heavily on the community to try to get some of that assistance as well. So there are some pros and cons, definitely, that come with that. And there's some growing up and maturing that we need to do as a company as well. So you said you want to build this cloud internally. I imagine it puts a lot on the team itself. You need to hire and to build that team from scratch. How is it, dealing with the current team working on your commercial infrastructure? Yeah, so what actually makes it difficult is two things. One is, yes, you're exactly right in terms of staffing the team and the hiring. That makes it difficult. We have an existing team, though, that needs to continue to manage and build features and upgrade the existing cloud that we have today, right. So it's not like I can pull everybody out and say: hey, our current cloud is frozen, no new features, while we spend a year or two building, building, building. Since both need to go in parallel, that's actually the hard part that we've been juggling. So the approach we've taken here: one of the things I mentioned earlier was that we had started with OpenStack 10 years ago, and we had hired over 100 OpenStack engineers at that time. Some of them, believe it or not, are still here. They've been the ones leading the current team here, and you can kind of guess the hunger that they have, because they've been waiting for so long to build this platform. Now, we definitely do have open headcount. We've been hiring as well.
But a lot of our work has been sort of cross-training, having these folks as senior architects, and also relying upon third-party consulting companies as well. So we've been working with other companies to assist us in building this OpenStack platform. And one thing that I can definitely share as an experience from us: you will have these components that you know you use heavily and you might have knowledge of them. But there are always these components that nobody ever thought might break, and then they break, and you need to invest in things that you thought would just work. Yeah, and that's actually what does scare us, right? Because if that were to happen, how do we fix it? We'll go through the source code, but we may need some assistance. So we do have some partner companies that we've been talking to that can also help with some of the support-type issues that we have. We've been speaking to companies like that that have offered that type of support, saying: hey, look, we have a lot of experience, we're top contributors in OpenStack, and we're happy to work with you in terms of helping troubleshoot if the need were to arise. That's kind of the worst case, though. Worst case, we have an outage and we need to figure things out. Typically, it's more like: hey, there's an issue, we see some degradation here, we're adding hardware to it right now, but let's try to rearchitect so that we can fix this issue. I think that's more typical, rather than a full-on outage where we're caught without our pants on trying to figure out what's going on. I think it's less that. In terms of deployment — so you said you would deploy your OpenStack next year.
Is there any key component you want to deploy and have available, or which components do you plan to have at the beginning on OpenStack? Yeah, so step one really is building all of the primary IaaS features, so the key components of OpenStack that we've been looking at. We built our current environment, the one we've been testing and playing with right now, primarily based on Antelope and a little bit of Bobcat, and there are a few components that are on master right now. So there's a whole list of components that we've been deploying and that we see as kind of our beta, right? It includes identity, so of course we're using Keystone; image processing, so things like Glance; key management, Barbican; Cinder, Designate, Horizon, Manila, Nova, Placement, Masakari — all of those are what we kind of see as our foundational layer there. We're also using Neutron, and then for testing we have Tempest and Rally. So that's sort of our core environment that we're building now. We have a managed service, a platform as a service, and a managed Kubernetes service that we've built for our current cloud, and we're trying to reuse as much of that as possible for the OpenStack platform. That's probably what will go into, not the beta, but the next sort of full release. It's not finalized yet, and we'll probably move things in and out as needed. But really, we're trying to get something that's fairly stable. We actually have something fairly stable right now, but we don't think it's quite ready to put an actual test application on it quite yet. So your lab is ready, but it's not yet available to your customers, right? That's correct. It's in a lab right now. And it has those types of environments — we have three environments right now. We have our LCM lab.
So that's where we test all of our lifecycle management tools — the build process, how to patch, how to upgrade — using that process. And then we have a Dev and a QA. That's really all we have right now. So staging and production, that's what's getting prepared right now, and that's where we'll be launching it. I guess that's where the scaling part actually becomes interesting. I actually had a question around the target deployment footprint. You mentioned 20 data centers available to Samsung SDS. What is the footprint in terms of number of CPU cores, or number of servers, or number of VMs that you're targeting with at least the phase one deployment? And how much do you expect to grow beyond that after? Yeah. So one thing I mentioned a little bit earlier was the sort of total goal that we're looking at, at least right now, and it grows over time because there will be new workloads being added. But right now we're looking at a footprint of roughly 50,000 VMs, which, with our current standard VM size and rack layout, roughly translates into 10,000 servers. That doesn't mean that's what we're going to start with. We'll start with a few racks, really. And we won't start with all the data centers globally; we'll start domestically here in Korea. But we will start with multiple data centers, because one of the early requirements is to have the multiple zones and the multiple regions, right? So here in Korea, we have two different regions with three zones. We'll expand to those. We haven't purchased the hardware yet, but we're looking at a few racks in each one, primarily just to test, to make sure things do fail over. And it's enough that we can test and then build upon.
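The multi-region, multi-zone layout described above can be sketched as a small data model. All region and zone names below are invented placeholders, and mapping three zones to each region is an assumption for illustration, not the actual Samsung SDS topology:

```python
# Illustrative model of the topology discussed above: one cloud, multiple
# regions in Korea, each with availability zones that map to separate data
# centers. Names and counts here are invented, not the real deployment.
TOPOLOGY = {
    "kr-region-1": ["kr1-az1", "kr1-az2", "kr1-az3"],
    "kr-region-2": ["kr2-az1", "kr2-az2", "kr2-az3"],
}

def failover_targets(region: str, failed_zone: str) -> list[str]:
    """Zones in the same region that could absorb workloads when one zone
    fails -- the kind of failover path a few racks per site lets you test."""
    return [z for z in TOPOLOGY[region] if z != failed_zone]
```

With a few racks in each zone, a failover drill amounts to taking one zone down and checking that workloads land on `failover_targets(region, failed_zone)`.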
And maybe as a general question: how do you want to set up these regions? Do you want to set them up in a way that they are completely separated from each other, so they're basically separate OpenStack clouds? Or should it be one large one? Yeah. You know, that's been the question that we've been struggling with for really the last year, year and a half. Because the easy answer would be: let's just treat them as separate OpenStacks and then figure out how to tie things together in some upper abstraction layer, right? The direction that the architecture team has gone is that they prefer not to do that. There are a lot of pros and cons with that, but they prefer to bridge these data centers together so that it looks like one cloud, one OpenStack, right? That means things like, you know, Ceph and identity, like Keystone, need to be shared, right? So they're not necessarily easy architecture decisions and easy architecture implementations, but that's the direction we're going. Which is why we also have to be in charge of our own lifecycle management tools, because those tools set up that architecture. But to quickly answer your question: we treat them as one cloud, not separate. One region, one OpenStack region, right? Yes. That's a good architecture. So you have one region split across multiple data centers. That's the plan, that's what you are going to build. So a region could be its own OpenStack, and another region could be another OpenStack, but a region will have zones inside it, and those should all be tied to the same OpenStack, right? It could be that, or we could have both regions all as one OpenStack. That we haven't figured out yet. So you don't know yet if it's separate regions, but with the same Keystone or the same Glance, or stuff like this? Yeah. I mean, ideally, I think I'd want them to be one across everything.
But the one thing we definitely do not want is a separate OpenStack per data center. That's what we're trying to avoid. And why? Why is that important for you? Primarily for management reasons. Managing identity across independent systems like that seemed a little clunky in terms of building an interface above it. There are things, such as control planes, that are separated out, and for performance reasons we do want storage to stay separate per site, so that it doesn't have to go across. But we think overall it'll end up making things a little bit easier if we stretch across the multiple data centers for management purposes. And also for data plane features — for example, layer two networks that run across multiple data centers. At least for us, that's a quite strict, or important, requirement. Maybe it's also one for you. Yeah, and we also think: if we have to start doing things like keeping data in sync across multiple places, and something gets out of sync, how do we troubleshoot that? Those are the types of things we're trying to avoid, if at all possible. So, while we're still on the topic of architecture, and before we dive more into the details of how you plan to operate: you mentioned earlier that over the last year there was a lot of diversion as resources moved into generative AI. Did that end up changing the requirements you have for the infrastructure? Because obviously AI is driving a lot of new demands on infrastructure. So I was wondering if that actually triggered changes in the design or the type of services that you plan to deliver through this deployment, or if they're completely separate concerns?
I mean, there's been some impact, in the sense that among the types of managed services that we are offering, we're now adding generative AI services, right? Ways where customers can come in, choose an LLM model among the public ones, as well as ones that we create, add RAG for their own data to it, and then be able to use that model in their own applications, privately. Those are the types of applications and platforms that we've been building on top of our platform for our users to use. But it has not affected the actual infrastructure itself. These are just additional services that we've had to bring to market fairly quickly. And because they needed those resources, it was: okay, well, let's pull the resources, unfortunately, off of the OpenStack project to build these generative AI services. And that's been moving — they're not done or anything, but they're moving. And for next year it's: okay, look, we can't keep pushing the OpenStack stuff back, so let's get that back on track. We do have a big commitment to get this project back on track. Oh, you're actually muted, Thierry. Oh, thanks. I wanted to give a quick reminder for our audience that we're streaming live, so you can ask questions in the comments of the platform you're using to view this. Don't hesitate to drop questions, and we'll try to answer as many as we can while the show is running. So please ask any questions you have for Dan or Arnaud or Felix in there. Who has another question? Arnaud, Felix? Maybe we're going to — oh, sorry, go ahead. No, I was going to jump on the network again. You started asking this, Felix, but have you already chosen a network backend for your OpenStack deployment? Because, splitting your infrastructure into multiple data centers, I think you'll need to rely on a very...
...performant network infrastructure. Have you taken a look at the network part of OpenStack yet? Yeah, that's actually where we've been spending most of our time: designing the network topology and the network architecture. But, you know, we actually need a lot of help there too. That's where our guys have probably struggled the most. They've studied a lot, like the standard Neutron features, and OVS is really what we're planning to go with in terms of that architecture. With that said, a lot of it is built, at least from what we've seen, primarily for private-cloud-type features. To make this more exposed, kind of like a public type of offering with multi-tenancy across multiple data centers, even globally — I think there's some additional work that may be needed there. It's an area where, quite frankly, we need a lot of assistance. And you've sort of identified it pretty well: we've viewed that as probably the most critical part to get right before we actually start writing code and launching. So, just to be sure: you said you want to go with the Neutron OVS plugin, where you have Neutron agents on the nodes, and not with OVN and things like that? Correct. That's the plan. That's the current plan of record that the team is moving towards. Just my very personal recommendation: go with OVN, and don't go with the Neutron OVS agent solution. Okay. Yeah. And it's not set in stone, so we're very happy to hear comments like that from your experience. Yeah. We just had very positive experiences regarding stability, and especially regarding recovery time in case of errors. On OVN, right? Okay. So on our side, we have a custom network driver, so I can't say which one is the best on the community side.
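For reference, the OVS-versus-OVN choice discussed here largely comes down to which ML2 mechanism driver Neutron is configured with. A minimal sketch of the two `ml2_conf.ini` variants, under the assumption of a stock Neutron setup (the `192.0.2.10` addresses are documentation placeholders, and real deployments carry many more options):

```python
from configparser import ConfigParser

# ML2/OVS: per-node Neutron agents, typically VXLAN tunnels between nodes.
ML2_OVS = """\
[ml2]
mechanism_drivers = openvswitch
tenant_network_types = vxlan
"""

# ML2/OVN: no per-node L2/DHCP/L3 agents; Neutron programs the OVN
# northbound database and ovn-controller on each node does the rest.
ML2_OVN = """\
[ml2]
mechanism_drivers = ovn
tenant_network_types = geneve

[ovn]
ovn_nb_connection = tcp:192.0.2.10:6641
ovn_sb_connection = tcp:192.0.2.10:6642
"""

def mechanism_driver(ini_text: str) -> str:
    """Parse an ml2_conf.ini fragment and return its mechanism driver."""
    cp = ConfigParser()
    cp.read_string(ini_text)
    return cp.get("ml2", "mechanism_drivers")
```

The faster recovery Felix mentions comes from this structural difference: with OVN the per-node flow state is reconciled from the OVN databases rather than rebuilt by restarted Neutron agents.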
What we're currently running is based on OVS. Anyway, we are not building an OpenStack across multiple data centers yet. It's in our plans, but we will go with our custom driver for this anyway. But that's a good challenge that you are building there, and I'm pretty sure it will work, based on OVN or maybe OVS as well. Yeah. I don't know how much we've looked at OVN, so it's definitely something that I'll take a look at. I mean, we looked at Open vSwitch, we looked at Tungsten Fabric, we looked at a few things like that, but we'll definitely take a look at OVN, since that's been a good recommendation. Okay. And yes, we've also had discussions regarding, let's say, connectivity across multiple regions, or features like that. And let's say we also don't yet have a perfect solution, and we are not even sure if we want to offer this to our users, because in some cases it might not necessarily make sense, and we'd rather leave that to our users. But an option for OVN might be OVN interconnect — just to drop this name here; we have not tried this out at a larger scale. Hi. We have a question from the audience, on the topic of multi-region and multi-deployment: what would the performance concerns for Horizon be? From their experience, on an OpenStack in a single DC at a scale of around 50 servers, Horizon is pretty slow. So yeah, they usually observe slowness at a higher number of servers. What would be your experience, Felix and Arnaud? Go on. So, at least for us, we are not actively using Horizon anymore. We don't offer it to our users, because we offer a more simplified approach and Horizon is quite detailed, and for administration we rather use the CLI anyway. On our side, we use Horizon. We provide Horizon to our customers. It's not the only web UI we are providing — we are also building a custom one — but Horizon is working pretty well. We don't see this limitation at 50 servers, like the comment on YouTube says. It can scale to very much more.
I think it's not limited like that. I'm not even sure it's limited by the number of servers; it's rather a limitation on maybe the OpenStack API side, where you can configure the number of elements the API should return on each call. And I think Horizon takes care of that pretty correctly. So for me, it can scale pretty well. But again, we split our OpenStack into multiple regions with only one Keystone, and so far it works. We can scale up to more than 1,000 hypervisors, and it still works like a charm. What's been good for me to hear: our plan is that we actually have an existing cloud management console that we use for our current cloud. The primary difference is that in addition to all the IaaS features, like your VMs and your networking and storage, it also has all of our managed services, our managed databases, and our managed features on top of that. Our plan is to reuse that as much as we can, and not use Horizon, actually. So I thought that might open a whole different can of worms, but I was kind of happy to see that it seems like it's a common thing to not necessarily use Horizon. Not because we see a problem with it, but just because we wanted to add some of the features and have our own custom console and UI. And, let's put it that way: Horizon is for people that know what OpenStack is and what the OpenStack API does. And if you need more simplification, because your users don't actually care about it — which might also be your case — then it might not be the option you're searching for. Okay, should we move to operational questions, more around how you currently deploy your OpenStack lab installation? Do you plan to use the same deployment tools for the main deployment that will have to be ready in six months? What's the way you plan to do the deployments? Are you using some kind of framework? Yeah, so we've actually been spending a lot of time working on our lifecycle management. So we build our Kubernetes manifests; we use OpenStack-Helm.
So we use Helm charts, and we build everything into containers. We build the source code into container images using Kolla and Jenkins. We customize, and we use Argo CD. So we use pretty much a GitOps type of framework across our operations to do the deployment of OpenStack. We found that it works pretty well, at least for us, in the sense that we can take advantage of some of the benefits of Kubernetes and containers — rolling updates and things like that is what we try to do right now. That's what we've been doing so far, and that's part of what we've built so far with our lifecycle management. Who wants to ask the next question? So does that mean you also integrate every node you have into your Kubernetes cluster, so including everything that's actually a hypervisor, or just the control and network services? You know, I believe so — probably something I can double-check with our architecture team, but I think every node is part of our Kubernetes cluster. Yeah, we do a similar thing. Currently we just see limitations if you want to get to hot-reloading services or things like that. Okay. We'll take a look at that to see if there's an issue at scale. Do you run anything in that regard? But I think for you, OpenStack and Kubernetes are a little more far apart. We deploy Kubernetes on top of OpenStack, and we are also deploying Kubernetes to deploy OpenStack, so we do both, actually. And yeah, it works. It's okay for all the API stuff, the control brain of OpenStack. It works like a charm. Thank you. I wondered a little bit how large the current testing or lab environments are. Sorry? I wanted to ask how large the current lab environments are. Yeah, so right now our lab environments are fairly small. We did separate them into a few physical racks, as well as separating some of the network topology, just for some of the testing. But I would say it's not really for performance testing. It's more for just development.
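The GitOps flow Dan describes (Kolla-built images, OpenStack-Helm charts, Argo CD keeping the cluster in sync with Git) can be sketched as an Argo CD `Application`. The repository URL, chart path, and values file below are invented placeholders, not the actual Samsung SDS setup:

```python
import json

# Hypothetical Argo CD Application deploying a Keystone Helm chart from Git.
# Argo CD watches the repo and reconciles the cluster toward this desired
# state; repoURL/path/valueFiles here are placeholders for illustration.
keystone_app = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Application",
    "metadata": {"name": "keystone", "namespace": "argocd"},
    "spec": {
        "project": "openstack",
        "source": {
            "repoURL": "https://example.org/infra/openstack-helm.git",
            "path": "keystone",
            "targetRevision": "main",
            "helm": {"valueFiles": ["values-prod.yaml"]},
        },
        "destination": {
            "server": "https://kubernetes.default.svc",
            "namespace": "openstack",
        },
        # Automated sync is what makes this "GitOps": drift is corrected
        # and removed resources are pruned without a manual deploy step.
        "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
    },
}

manifest = json.dumps(keystone_app, indent=2)  # what you'd commit (as YAML)
```

In such a pipeline, Jenkins builds the Kolla images, a commit bumps the image tag in the Helm values, and Argo CD rolls the change out.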
And that's really been the size of what we've been doing during this development phase. But before we go live and actually have to check for scale, we will need to do a lot more of that type of scalability testing.

In terms of scaling up between now and June, which is your objective, I wanted to ask a question of Felix and Arno. Going from that lab environment to 50,000 VMs in six months, is it something you think is totally doable, or does it depend a lot on how much work has been done up front? From your experienced point of view, is it optimistic? Is it realistic?

Actually, just to be clear, we are planning to launch the beta in July, and we don't plan to launch at that kind of scale, anywhere near that kind of scale, at the start. That's been more of our North Star. That's been our goal: if we're ever able to migrate everything over, that's the scale we'll have to operate at. So just to be clear, it's not what we're trying to launch in the next six, seven months.

Yeah, otherwise I would say that's a little, let's say, ambitious. That would not work. Scaling fast is possible, it's doable, but it also depends on how much hardware you can plan for and have in your data center. It also depends on whether you have really tested live migration, and whether your operations team is experienced with managing OpenStack and doing work on instances, network debugging, and all of this stuff. So if your team is ready, it's definitely possible.

Yeah, and just to be clear from a company perspective, we don't necessarily have a lot of small applications. We tend to have a smaller number of larger applications, so uptime and scalability tend to be very important, in the sense that our customers are very open to delaying migrations and things like that in exchange for making sure that we do all of the testing ahead of time and have no problems once it's running. So we are actually very conservative when it comes to these things.
So I don't want to give the wrong idea that we're being super aggressive and trying to get everything done and the whole thing migrated really quickly. We're trying to get something out the door so we can put a test application out there. Maybe one of the smaller, less important applications first: things like internal websites that exist within the Samsung Group and are used for communication, something like that. We'll probably start with things like that, and then we'll start moving into the heavy-hitting ERP type of systems that I've been talking about. So we'll definitely take it fairly slow, because we do want to make sure that we don't create any negative perception of the platform itself.

Siyoung, too, has a question: how are you configuring Keystone in multi-region? What is your strategy for replicating the Keystone DB?

Yeah, you know, it sounded a little bit like what Arno spoke about earlier. Our plan is actually not to replicate the Keystone DB across multiple regions like that. We want to have one Keystone with one shared database. That's what we're trying to do, at least, but I don't know, Arno, if you're doing something different.

So we are deploying Keystone in multi-region across the world like this, and it works like a charm, actually. We have a single Galera cluster for Keystone handling the write requests on the database, and all the replicas around the world are replicating data and are read-only. So as soon as we need to get a token or validate a token, we can talk to a read-only Keystone, but as soon as we need to introduce a new user or a new endpoint, or write stuff into the database, we talk to only the one cluster. And it works pretty well. It's definitely doable. It's only a question of architecture. Yeah, I think that's been the plan.
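The read/write split Arno describes can be sketched as a tiny routing rule: token reads go to the nearest read-only replica, while anything that writes goes to the single primary Galera cluster. This is a toy in-memory sketch, not real Keystone or Galera configuration; the endpoint names and operation list are made up for illustration.

```python
# Hypothetical endpoints: one writable primary, read-only replicas per region.
PRIMARY = "db-primary.region-eu"
REPLICAS = {
    "eu": "db-replica.region-eu",
    "us": "db-replica.region-us",
    "ap": "db-replica.region-ap",
}

# Illustrative subset of Keystone operations that mutate state.
WRITE_OPS = {"create_user", "create_endpoint", "delete_project"}

def route(op: str, region: str) -> str:
    """Pick the database endpoint for a Keystone operation."""
    if op in WRITE_OPS:
        return PRIMARY                      # all writes hit the one cluster
    return REPLICAS.get(region, PRIMARY)    # reads stay local to the region

print(route("validate_token", "us"))   # db-replica.region-us
print(route("create_user", "us"))      # db-primary.region-eu
```

Since the vast majority of Keystone traffic is token validation (reads), this kind of split lets each region serve tokens locally while keeping a single source of truth for writes.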
I didn't necessarily talk about the replica, read-only database side of things, but we do plan to have one source of truth, one database that stores all of that. And the reason we wanted to stretch it this way was to avoid exactly the question that comes to mind here: making sure that multiple databases stay in sync, and if one record is written in one place and another is written in another place with the same primary key, how do we reconcile that? Do we force people to only log into one region? Those are the types of things we're trying to avoid, which is why we want to stretch it this way.

99% of the requests done on Keystone are read requests anyway, so it's very easy to scale, and you only redirect the write requests to the correct Keystone database, and that's good.

Okay. I think we should make some progress on the operational questions. I had one around the size of the team that's currently working on the lab environment, and how many people you expect to have working on the deployment when you start running in production in June. Can you share a bit more detail about how many people are working on this?

Yeah, so we started this year with about 25 engineers on it, but we ended the year with about 10, because we lost some engineers, like I said earlier. In 2024, we expect to have around 40 engineers working on this project. Now, this last year we also did some outsourced work; we worked with an external consulting company as well. But in 2024, we plan to do a lot of the work with a subsidiary, as well as 40 full-time engineers on the team. In 2025, we expect to be fully operational and live, actually migrating more and more applications over, and we plan to have over 100 engineers devoted just to this at that point.
It's not as many as some other places, but we hope that's enough to at least get this off the ground and running.

I didn't get all the numbers because of network issues, but you said something like you are 10 for now, but you plan to be like 50 or 40, right? Yeah, so this year we ended up at around 10. 2024 will be about 40, and 2025 will be about 100. Okay, so you are going fast. Going fast, yeah. Some of them are new hires, but a lot of them are existing engineers within the company, so that's a good thing about it.

So Arno is dealing with a real scale situation today. So you were saying, Arno? I was saying that to manage our OpenStack deployment, we are between 40 and 50 currently in the team, I mean on the technical side. And we always have someone available behind the computer in case of emergency, for on-call duty. So yeah, that's pretty much the size of the team that we have.

Have you already identified issues? Do you already have custom downstream changes that you've made to the code to make it work in your lab, that you would have to maintain over time, or is it just running pure OpenStack code at this point?

So right now we do not. We've made some changes and we've contributed them back to the community, and the community has adopted them, so that's been great. I expect that we likely will have some custom changes that we'll need to make. My personal philosophy is that whatever changes we make, we should offer to contribute back to the community. Now, whether the community needs them or not, that's a second issue. So there will likely be a list of certain features and things that we add to the code, more specific to some of our security requirements or our use cases, that the community may just say don't belong in the actual services themselves. Those we plan to store in a separate Git source code repository.
One of the fears I have is that I don't want us to get too far behind the official releases. So I do plan to have engineers who are strictly dedicated to making sure that our code stays up to date with the latest release as it comes out. It's a lot of additional work on our side to make sure that we can do that, but it's critical that we don't fall so far behind that catching up becomes impossible. So we do plan to have this separate code repository for the custom changes we are likely going to need. As of right now, in the testing that we've done, we haven't had any, but I don't expect it to continue that way.

So I guess that's also the main answer to how you're planning to do upgrades. Excuse me, what was that? The question of how you plan to manage and roll out upgrades. Yeah, so our current plan for upgrades is to go SLURP to SLURP, so Antelope and then the next SLURP release. That's our current plan: go SLURP to SLURP, and make sure that we test our own changes against the next release before we launch it. Now, for patches, the reason it's not a one-and-done thing, and why I plan to have engineers strictly devoted to checking this compatibility, is because of the patches that come out over time. But our current plan of record right now is to go SLURP to SLURP.

Well, we did have a question from the audience: what are some of the best practices for version upgrades of OpenStack in larger deployments? I don't know if there is any one best practice there. I can share what we do; I don't know if it's the best, but we usually do not have time to upgrade every six months. It's too much for us, so we decided to skip some of them. It's the fast-forward upgrade strategy, if I remember correctly. So we apply all the upgrades, but in one shot. And it works correctly.
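The SLURP-to-SLURP plan mentioned above (SLURP being OpenStack's Skip Level Upgrade Release Process, where every other release is a supported skip-level target) can be sketched as a simple release-path computation. The release numbers follow OpenStack's actual naming; the helper functions themselves are hypothetical illustrations.

```python
# OpenStack coordinated releases since the SLURP cadence started.
RELEASES = [
    "2023.1",  # Antelope (SLURP)
    "2023.2",  # Bobcat
    "2024.1",  # Caracal (SLURP)
    "2024.2",  # Dalmatian
    "2025.1",  # Epoxy (SLURP)
]

def is_slurp(release: str) -> bool:
    """Every x.1 release from 2023.1 onward is a SLURP."""
    return release.endswith(".1")

def slurp_path(current: str, target: str) -> list[str]:
    """Hops needed to go from `current` to `target`, landing only on SLURP
    releases (skip-level upgrades are only supported SLURP -> SLURP)."""
    if not (is_slurp(current) and is_slurp(target)):
        raise ValueError("skip-level upgrades must start and end on a SLURP")
    i, j = RELEASES.index(current), RELEASES.index(target)
    return [r for r in RELEASES[i + 1 : j + 1] if is_slurp(r)]

print(slurp_path("2023.1", "2025.1"))  # ['2024.1', '2025.1']
```

The point of the SLURP cadence is exactly what Dan describes: an operator who cannot upgrade every six months can hop once a year between SLURP releases while skipping the intermediate one.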
The only downside is that sometimes we have to shut down the APIs, so the control plane. We just need to let our customers know about this, and we do a maintenance operation for a few hours, usually, while we upgrade the OpenStack APIs, mostly. And that's basically what we do. I don't know, Felix, what do you do?

I think we have now managed to upgrade nearly all services at least once, because last time we did it, we basically jumped a bunch of versions while migrating workloads. For the smaller services like Keystone and Glance, we never had any kind of issue just running them through the planned online procedure that's published. For Cinder and Nova, that also worked quite well; for Nova, we relied on being able to just restart nova-compute even though there are still VMs running on the hypervisor. For Neutron, we basically take Neutron out of the, let's say, complexity equation by using OVN. So our main interesting upgrade part becomes OVN, and Neutron itself is now a schema change and a restart of the API; there's not a thousand computes that you need to touch. So your biggest issue is now upgrading OVN rather than Neutron itself? Yeah, but it works quite well for us: we just live-migrate everything off each hypervisor to empty it, install the new OVN version, and fill it up again. That takes a while, I guess it took like one and a half months, but it runs in the background and you just need to babysit it, so I think that's good.

One of the biggest issues in upgrading that we had in the past was related to either databases or RabbitMQ. Mostly because when you have to upgrade databases, or when you do fast-forward upgrades, you have to apply a lot of changes to the database in a short window, so you have to shut down the APIs to do that correctly. At least, we decided to shut down the APIs on our side.
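Felix's drain-and-refill procedure boils down to a simple loop: empty one hypervisor via live migration, upgrade it, move on. Below is a minimal in-memory sketch of that scheduling logic; the host and VM names are placeholders, and real code would call the Nova live-migration API rather than moving list entries around.

```python
def rolling_upgrade(hosts: dict[str, list[str]], capacity: int) -> list[str]:
    """hosts maps hostname -> list of VM names; capacity is the max VMs a
    host can hold. Returns the order in which hosts were upgraded.
    Assumes the cluster always has enough spare capacity to drain one host."""
    upgraded: list[str] = []
    for target in list(hosts):
        # Live-migrate every VM off the target to any host with headroom.
        for vm in list(hosts[target]):
            dest = next(h for h in hosts
                        if h != target and len(hosts[h]) < capacity)
            hosts[target].remove(vm)
            hosts[dest].append(vm)
        # Host is now empty: safe to install the new OVN/agent version.
        upgraded.append(target)
    return upgraded

hosts = {"hv1": ["a", "b"], "hv2": ["c"], "hv3": ["d"]}
order = rolling_upgrade(hosts, capacity=4)
print(order)                                # ['hv1', 'hv2', 'hv3']
print(sum(len(v) for v in hosts.values()))  # 4 -- no VM lost
```

As in the real procedure, the upgrade order is serial and slow, but nothing user-facing stops: the VMs keep running on whichever host currently holds them.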
About Rabbit: we did a massive upgrade on RabbitMQ, and we recently changed a little bit the way the RabbitMQ configuration is done for OpenStack. To do that correctly, we had no choice other than shutting down the whole cluster, and if you shut down the whole RabbitMQ cluster, you basically shut down your OpenStack APIs. Those are the two biggest issues with upgrading that we had recently.

We're getting closer to the end of the show, so I want to make sure we get the opportunity to ask our last questions. I personally wanted to ask Arno and Felix what they expect the first scaling issue that Dan will hit will be. Is it going to be a surprise with RabbitMQ? Is it going to be some Neutron issue? What would be your favorite common scaling hurdle that you want to warn him about?

It will definitely be either a RabbitMQ or a networking issue, for sure, and it depends mostly on which network component you choose and which RabbitMQ configuration you choose as well. There were changes recently merged in oslo.messaging, and if you choose to go with the latest changes on the RabbitMQ side, you should be safe on that part, I expect. About the networking part, I'm pretty sure you will have some issues as well, even if you use OVN, but that's how you will grow your team and how you will finally be able to debug and work on scaling your networking stack.

Dan, do you have a question for Felix and Arno? No, but I did read a lot of the large-scale SIG notes from the past couple of years, and I read a lot about RabbitMQ issues and configuration things in there, so it's something we've been looking out for. That doesn't mean we're not going to run into issues, we probably still will, but if something comes up that we're expecting, we'll likely know which area to look at.
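As a concrete example of the kind of recent oslo.messaging RabbitMQ options being referred to, newer releases support RabbitMQ quorum queues. A sketch of the relevant service configuration might look like the following; which options your deployment actually needs depends on your oslo.messaging version, so treat this as an illustration rather than the specific change Arno's team made.

```ini
[oslo_messaging_rabbit]
# Make queues survive broker restarts.
amqp_durable_queues = true
# Use RabbitMQ quorum queues instead of classic mirrored queues;
# supported in recent oslo.messaging releases.
rabbit_quorum_queue = true
```

Switching queue types is exactly the kind of change that cannot be applied in place, which is why it forced the full-cluster shutdown described above.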
The main worry here has really been around the networking part. That's really been our biggest worry, and the biggest area where we just don't know what we don't know. Part of what helped is that we'll definitely have a team take a closer look at OVN, see how that will work, and see whether we should look at a different approach there. But no other specific questions; I think it's been great.

It's great to hear that you've been able to use the content that the large-scale SIG has been producing and documenting. It's a good confirmation, for those of us who have been involved with the large-scale SIG for a while now, that the documentation we work on producing is useful for newcomers to the OpenStack large-scale club. So that's great news. Arno, did you have any question for Dan? No question, only that you should feel free to join us in the large-scale SIG and start contributing changes to the documentation as well, after you scale and after you deploy. If you find something new, we're really happy to have new stories. Felix, any question for either Arno or Dan?
Yeah, no, but I think the same thing. And I guess there's the OpenStack operators mailing list for audience questions afterwards; I'm at least lurking in there most of the time, and I think Arno, you too. Great. I mean, I think we're definitely going to run into issues, so I'll hopefully be bringing up some interesting problems for the large-scale SIG to help address in the future.

So I guess we have a last-minute question from the audience. There was a question from Diwan Kim, which was actually a question from the OpenStack Korea community, regarding Octavia, so I don't know exactly which of our three guests will be able to answer it best. The question was about amphorae in Octavia: how do you deal with massive numbers of amphorae, which are part of an Octavia deployment? Do you design one huge, big-enough subnet for the LB management network?

Are you using Octavia yet, Felix, Arno, or Dan? We planned to, but we haven't run into this issue yet, obviously; we do plan to use it as our load balancer. We also ran into this issue in the past, and we did something to plug additional subnets into the network. I'm not sure I would do that again, because I think it was quite ugly, but it helped back then; since then we have migrated away from Octavia. On our side, we deploy Octavia as well, but I don't remember an issue with this big subnet, so I don't know, I'd have to take a look. It's not a problem yet; maybe it will be, but it's not yet.

I think we're about at the top of the hour, so I'll pass the mic back to Kristen to close the episode. Thank you again, Dan, for joining so late in Korea; it was great to have you on the show. I like this format, where we were also able to give advice at the very early stages of a massive, very ambitious project like the one you have. And thanks to Felix and Arno. This is going to be our last OpenInfra Live show for the large-scale SIG Ops Deep Dive for the
year, so I wish you all a happy holiday season, and back to you, Kristen.

Thank you, Thierry, and I want to thank Dan and everyone for joining today's discussion, and thank you to our audience for watching and asking some really great questions. I have a little update: if you want to hear from more open source users, we have some exciting OpenInfra event updates. The first regional OpenInfra Summit Asia has been announced for September 3rd and 4th, 2024. Big thanks to Hyunsoo, Ho-Chul, Jeff, and the rest of the Korean organizers for putting it together. We also have another update about OpenInfra Days Europe: there will be a special edition of OpenInfra Days Europe in May and June next year. Some of the information is still being finalized, so you can find it on the OpenInfra Foundation blog, along with sponsorship information; we'll drop the link in the comments for everybody to see. We also have a PTG event for next year, set for April 8th through 12th, which will be held virtually; team sign-up is available at openinfra.dev/ptg. And one last thing: if you have an idea for a future episode, we want to hear from you. Submit your ideas at ideas.openinfra.live, and maybe we'll see you on a future show. Thanks again, everybody, for joining us today, and thanks to our guests. We'll see you on the next OpenInfra Live. Bye!
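As a back-of-the-envelope footnote to the amphora question raised during the show: each amphora VM consumes one address on the Octavia load-balancer management network, so the subnet prefix length caps how many load balancers the deployment can run. A small standard-library sketch, with illustrative prefixes rather than recommendations:

```python
import ipaddress

def usable_amphorae(cidr: str, reserved: int = 5) -> int:
    """Rough count of amphorae a management subnet can hold, minus a few
    addresses set aside for gateway/DHCP/infrastructure (the `reserved`
    figure here is an arbitrary example)."""
    net = ipaddress.ip_network(cidr)
    hosts = net.num_addresses - 2        # drop network + broadcast
    return max(hosts - reserved, 0)

for cidr in ("10.0.0.0/24", "10.0.0.0/20", "10.0.0.0/16"):
    print(cidr, usable_amphorae(cidr))
# 10.0.0.0/24 249
# 10.0.0.0/20 4089
# 10.0.0.0/16 65529
```

This is why the "one huge subnet" question matters: a /24 runs out quickly at scale, while a /16 comfortably covers tens of thousands of amphorae, at the cost of one very large L2 segment, which is what pushes some operators toward plugging in additional subnets instead.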