I got them. I hear you, thanks. I see a lot of you sent some instructions to follow; let me just do them really quickly. Terry just announced our session as the operators' session finished up, so it may take a minute or two for people to shift over. Great. And I might take that minute to set up this recording. We're also at the same time as the client gap meeting, and Mari and Victoria are covering Manila there.

So Tom, did you do the host ID thing, with this custom live streaming service, to get in today? Is that what you mean? I got an email from Jimmy, and he asked me to use the host ID to join. And no, I did not do that; I just clicked on the link. I see. Did you join that way? I actually didn't get that prompt, but let me drop and join back in. I will hold out here. Awesome, thank you.

For anyone on the call: we're still setting up a little bit, and I know some people were in the previous sessions and are on their way over. We had an adventure yesterday morning with our forum session, but this one seems to be going quite well in terms of setup. Goutham's just figuring out how to join with the host ID, so hopefully we can record from there.

Hi Tom, can you hear me? Good morning. Yeah, hi, Vida. I could see you guys but couldn't join with Firefox, so I switched browsers. I guess I'm okay now. Sure.

For those of you who do not have it, I'm putting the etherpad link in the chat. While we set up, if you get a chance, please add your ID or name under the attendees section in the etherpad. Thanks, Carlos, I see you doing it. That way we can know who's here and follow up with people later. Yeah, for sure. You can see there a list of topics; read ahead and see if any of them connect with your concerns. And you should of course feel free, it's an etherpad, it's collaborative, and this is a forum, to add things that are on your mind as operational concerns for Manila adoption. This is of course also a fine forum for asking questions about Manila itself and whether the service would be appropriate for you to use.

Hi, Mike. Thank you for your comment here. Maybe I can speak, if my mic isn't muted. Yeah, I can hear you. You just remarked that you're going to launch Manila in your cloud and that you have a CephFS cluster, if I read past the autocorrect. Go ahead.

So yeah, we're going to launch on a CephFS cluster; autocorrect mangled CephFS on me. Anyway, we're about to launch a Manila share service. We've been running a cloud for a while, and we have an RBD cluster already running, and we built a separate Ceph cluster to run object store and CephFS off of, so that we don't have to worry about RBD clients getting trounced by CephFS traffic. We've rolled out Manila on top of it, and I'm wondering if there are any security concerns we should be worried about. One person on our team was worried that, because we have to open up access to the Ceph client network, all the clients would then be able to see the mons directly; there's no proxy between OpenStack and the clients. Is there anything to worry about there?

Yeah, let's come back to that question in just a second. Goutham, I see you are back in. Yeah, sorry about that.
I was just checking whether this is streaming live on the summit app, and it is. So I guess someone else did this for me; I see Aaron is a host, so thanks, Aaron, because I was trying to figure out whether I should set this up for streaming. Anyway, it's already streaming, so we're going to come back to Mike in just a second.

We were just talking informally while we waited for you to start up the session, Goutham. I will just say my name is Tom Barron, and I was the PTL of Manila a couple of cycles ago. Goutham, whom I'm talking to, is the PTL now, and he can introduce himself; the project is in much better hands now, but I still help out a little bit every now and then. And then we'll come back and talk about Mike Cave's operational concern in just a minute.

Absolutely. All right, so thank you so much for joining, everyone, and I hope you're having a great day so far at the summit. I'm Goutham Pacha Ravi, a software engineer at Red Hat and the current project team lead for the OpenStack Manila team. Many of my teammates are here, long-time contributors to OpenStack Manila. This is the first time I'm doing one of these sessions; Tom's done a bunch of them, so we're going to be talking in tandem, I guess. We have a bunch of questions for you, and the session is basically for you and us to interact, network, and ask questions, and then maybe we can take some feedback back to next week, where we contributors are meeting for a project team gathering, all virtually. You're welcome to join us there as well; I'll leave some details on the etherpad. The etherpad link is now in the chat, as well as on the discussion comment thread. And if you're watching this on the comment thread, please feel free to join us on the Zoom call so you can actually talk to us live, or chat with us using the Zoom chat. We will take all of that back to the etherpad as well, but it's always nice to have two-way communication.

So thank you, Goutham. Mike Cave just introduced himself and brought up a concern. Mike, would you repeat yourself? I would really appreciate it.

Yeah, sure, no problem. My question was around the security concerns that were raised by one of our teammates. When you enable Manila inside of OpenStack, that allows the users access to the Ceph client network, because they now need to be able to talk to the mons directly, as well as all the OSDs. So the question is: are there any security concerns around having that level of access? And is there maybe any way to set up proxy networks inside of Manila, so that the clients don't actually see the monitors directly? I don't think there's a way to do any of this, but are there security concerns now that the clients can sniff the traffic on the client network? Anybody who has access to that Manila network now has access to the Ceph client network and the data traveling over it. So that's the big concern, and then access to the monitors.

So I'll say a couple of words about this. There are a number of aspects to this question. First of all, CephFS itself is an interesting file system protocol in that it relies on a smart client, and it's implemented to do things like enforce various delegations, and to enforce quotas. And those are CephFS quotas, not Manila quotas: when you make a Manila share of a certain size and it's implemented with CephFS, it's a CephFS quota on the back end that's enforcing the size.
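To make that quota point concrete, here's a minimal sketch of checking a share's back-end quota from a client VM. It assumes a kernel-mounted Manila CephFS share at a hypothetical mount point; CephFS exposes the quota as an extended attribute on the share's root directory, and it's the smart client that enforces it.

```python
import os

# Hypothetical mount point for a Manila CephFS share on a client VM.
SHARE_PATH = "/mnt/myshare"

# The share size Manila provisioned shows up as the
# ceph.quota.max_bytes extended attribute on the share root;
# the CephFS client is what actually enforces it.
quota_bytes = int(os.getxattr(SHARE_PATH, "ceph.quota.max_bytes"))
print(f"Back-end quota for this share: {quota_bytes / 2**30:.0f} GiB")
```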
So at Red Hat, to give an example, we have lots of customers: first of all, some customers that run public clouds, and secondly, customers that run enterprise private clouds, where the tenants, regular users with member privileges in a Keystone project but who are not administrators, are not really trusted. They're not trusted to muck around with the infrastructure. So we document how to expose the Ceph public network in our deployments only to trusted tenants. Regular tenants are untrusted, and they're untrusted, of course, in the sense that I'm not trusted to manage your bank account or you mine; it's not a personal character assessment. We can talk about that later. Sure, if you want to leave your account details in my chat over here... I don't know what to do with that. That's fine. Untrusted tenants are not allowed on that network directly, and we deploy an NFS gateway for regular users for that reason.

So CephFS clients need to be run by trusted tenants. With trusted tenants, you need to make sure you have out-of-band communication with them beyond the APIs: you know what they're deploying, you know that their CVE fixes are up to date, that type of thing. It depends on your cloud. Just to give you an example, CERN, whom you may have heard of, and who features large at this summit, has a large set of trusted tenants, because their researchers are using their cloud. First of all, they have a good way to keep the software up to date, and so on. And then, you know, they're working on fundamental problems of the universe rather than trying to hack into the infrastructure. Yeah, and they also run flat networks over there, so they don't run through Neutron either; they just expose their networks differently to their clients, as opposed to us. Well, we run it through the Neutron agents.

And do you mind telling us a little bit about your application and your OpenStack deployment? Are you commercial? Are you research? What are you doing?

Yeah, so we're a research cloud. We run the largest cloud site for Compute Canada, obviously in Canada. We've got multiple HPC sites, and we're the cloud-only site, or the biggest cloud-only site. It's a 40,000-core cloud; our RBD cluster is six petabytes, our CephFS cluster is 13 petabytes, and we have about 650 tenants. We do some HPC, but mostly portal work. Our clients can come in and ask for an allocation on the cloud. They do require a federated ID, so they are, quote, trusted in that sense when they get their project set up, but then they can go ahead and allow access, obviously, once they get going, to anybody who may or may not be part of the consortium.

Right. So your colleague may have a point. I mean, I don't have an opinion on this, so make up your own mind, but essentially, if you have a Ceph client, you're able to access storage infrastructure. You may not have the keys and so on to do anything with it, but you could theoretically mount DoS attacks on the network. And with respect to your own tenants, someone could substitute in a different client, or have an out-of-date client that has a CVE, et cetera.
So those are the types of concerns that come with native CephFS. We, as I said, use an NFS gateway, NFS-Ganesha, today. We're also working on, and we will have a session at the PTG with Nova next week about, a virtio-fs approach, where remote file systems like CephFS can be exposed to a compute node. The CephFS client will be running under administrative control on the compute node, and the remote file system is then exposed over virtio-fs through the hypervisor, more or less like you do with RBD today. This is a work in progress, early stages; virtio-fs has just made its way into production Linux kernels, although it's been in the works for several years. That will also be an approach to addressing this, and we think it has promise of being more scalable than the NFS gateway approach, because today the NFS gateways can really only run in an active-passive or active-standby type setup, and we don't have a way to scale that out. We run it in a Pacemaker cluster, and that limits the scaling ability.

Sure. I mean, currently, for our use case, the number of clients will be small and their requirements on size will be large. So I guess one follow-up question: if we've turned on CephFS, and it's open to the clients in the share menu and on the APIs, do they automatically now just have access to the client network?

Not unless they have... I mean, you'd assume that the Ceph cluster is exposed by some sort of provider network into OpenStack, right? And unless the clients have access to create ports on that provider network, they do not have direct access to the CephFS shares. So if they want to mount the shares onto their VMs, they will need to plug into that storage network, the Ceph public network that you're exposing. Okay, so we can lock it off that way as well. Yes. And in fact, in the downstream Red Hat OpenStack Platform 16.1 documentation, you will find guidance that Goutham Pacha Ravi wrote about how to selectively expose that network to certain tenants rather than others. There's nothing specific to Red Hat about that; it's a technique anybody could use.

Awesome, thank you very much for that. We are going ahead with this regardless of the concerns; we feel fairly confident in our user base that we can get this done. But it's interesting just to kind of play out the scenario a little bit here. Also, it's 6:30 in the morning where I am, so I'm a little out of it. So thanks very much for answering my question. Oh, nice. So you're in the Vancouver area? I'm on Vancouver Island, yeah. Oh, nice, beautiful place. Goutham is in a slightly less advanced country a little bit to the south of you, but in the same time zone. Oh, have I heard of it? Yes. So I live in Seattle. Nice.

Awesome. So, any further concerns on CephFS? Mike, I did throw in some links; I think the one in the Red Hat guide is a little more comprehensive, because of all the flexibility that's possible with OpenStack networking. I don't think the upstream documentation is going to touch on the specific steps to take in a cloud that you're building, so I would recommend looking at the Red Hat guide just for the literature, not as a recommendation to use RHOSP or something.
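The selective exposure being described is ordinary Neutron RBAC rather than anything Manila-specific. A minimal sketch with openstacksdk, where the cloud name, network name, and project name are all hypothetical:

```python
import openstack

conn = openstack.connect(cloud="mycloud")  # clouds.yaml entry (hypothetical)

# The storage provider network and the tenant project to trust;
# both names are made up for illustration.
network = conn.network.find_network("ceph-public-net")
project = conn.identity.find_project("trusted-project")

# Instead of marking the provider network shared with everyone,
# grant access to this one project only.
conn.network.create_rbac_policy(
    object_type="network",
    object_id=network.id,
    action="access_as_shared",
    target_project_id=project.id,
)
```

Only projects granted such a policy can create ports on the storage network and mount native CephFS shares; everyone else stays off it.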
Fair enough. And we're at a university here, the University of Victoria, so we have full access to Red Hat's documentation as part of our licensing. So that's nice. Oh yeah, great, cool. And there's nothing secret there; it's just that you get out of Manila proper and into deployment concerns with this, and they're important concerns. Okay, thank you, Mike. Goutham, I'll shut up now if you'd like to talk again. No problem. Thanks, Tom. Thanks, Mike.

Okay. So, did anyone else here have a concern they want to bring up? We can drive this via questions, and we have some for you, but if you have some, we'll prioritize those and talk about those, because they're always more interesting. You know, one thing maybe to check also, Goutham, is how many of the people present have actually already deployed Manila, and how many are considering deploying Manila and are here because they want to learn about that possibility. Yeah. So, is everyone here familiar with Manila? Have you all deployed Manila? Can we do some sort of a roll call? If anybody here is looking to explore it like Mike is, would you talk to us about your use cases? That would be something interesting to us. Or everybody else is still sleeping, like me.

Well, I know I'm going to be a little bit rude here, but I know that Maurice Escher is awake. I will see if he's paying attention at the moment, but if he is, he has a fairly large deployment of Manila and might have some operational concerns in the following list; helping us prioritize them from his point of view could get the ball rolling here. Carthaca, are you here by any chance? He can get back at me later for calling on him. He may be dealing with a large-scale issue at this very moment; it's not nice to laugh. Anyway, he works at SAP, and they have a large-scale Manila deployment, several of them in fact, and he has driven a number of the performance and scale issues that we've worked on over the years.

I guess maybe that's a question too: scalability. When we launch ours, we'll probably be launching with fewer than ten clients running on it, but this could easily scale out to as many clients as would like to have access; we're not restricting it. Are there any gotchas to watch out for as we go forward, like how many users should be using the service, bandwidth and throughput concerns, that kind of thing?

So, since Manila is not operating in the data path, you're just interacting with Manila to do the provisioning and the management of your shared file systems; the service plane itself is not in the data path. That eliminates the question of how many clients can be connected: it's something Manila cannot control, and it falls back onto the storage system or storage solution itself. So technically it's as much scale as, in your case, the Ceph cluster can offer. The control-plane side is where Manila is involved in scaling.
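To illustrate that control-plane versus data-path split, here is a rough sketch of the Manila-facing part using python-manilaclient; the Keystone endpoint, credentials, and share type name are placeholders:

```python
from keystoneauth1.identity import v3
from keystoneauth1 import session
from manilaclient import client

# Placeholder credentials for a tenant user.
auth = v3.Password(
    auth_url="https://keystone.example.com/v3",
    username="demo", password="secret", project_name="demo",
    user_domain_id="default", project_domain_id="default",
)
manila = client.Client("2", session=session.Session(auth=auth))

# Provision a 1 GiB CephFS share; "cephfs" is a hypothetical share type.
share = manila.shares.create(
    share_proto="CEPHFS", size=1, name="scratch", share_type="cephfs",
)

# Grant a cephx identity read/write access to the share.
manila.shares.allow(share, "cephx", "alice", "rw")
```

Everything after this point, the actual reads and writes, flows directly between the CephFS client and the Ceph cluster; Manila only brokered the provisioning and the access rule.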
And that control-plane scaling is actually where Maurice, for instance, whose name I see up there at the moment, has had some interaction with Manila itself. Maurice, are you there by any chance? I see your name. He can't unmute, apparently. I am now; is it working now? Yeah, yes. And thank you so much for volunteering to talk.

Yeah, we have a pretty big Manila installation. At SAP, we are offering an OpenStack cloud internally for our developers. The special thing, maybe, is that we are using NetApp appliances as back ends, and we have a lot of customers who want to use the CIFS protocol, so I see myself supporting that a lot, and that's also something where I put a line in the etherpad. What I see myself doing a lot is looking up errors in the logs, because my customers can't find them themselves. Even for simple issues, some typo in the DNS IP of the security service for CIFS, or a wrong password, I always have to look it up in the logs. So that's the thing I do a lot, and I hope this could get better somehow. Go ahead, Goutham.

No, that makes sense. I don't know how many of you were at the user messages forum session yesterday; that one was for Cinder, the block storage service, and the concerns were very similar. There is this user messages feature that can allow users to take corrective action without needing to talk to a cloud deployer or an administrator. But Manila suffers from exactly the same problem as Cinder does: we introduced user messages, I guess, about four or five releases ago, and users are familiar with it, but we have not covered all the possible recoverable cases. And the one you just mentioned seems like a bug to me. If they have a typo in their security service, that is something the share driver would be able to detect and report back to the Manila API, and then a user who is watching their messages and takes a look at why something happened can go ahead and delete that security service and create a new one. Yeah, it would really help to get this to a state of better self-service, so they don't have to open support tickets and wait a long time for an easy error to be fixed.

That makes sense. I don't see my friends from NetApp on this call, so I will remind them, and we can probably open a bug and fix this specific case. But this sort of thing is actually something we'd really like folks to start using and reporting bugs on. User messages can be extended extensively, and we even have the ability to backport them to the releases you are using; we've made changes to user messages and added them all the way back to Queens, because we know a lot of folks are still running OpenStack Queens. So all the infrastructure is there; we just need to get the coverage. Opening bugs when you see these cases will be really helpful, and fixing them should not be that hard, so we'll make pretty good progress. And if you know what the message should say, either put it in the bug or submit the fix yourself, but you don't have to propose a fix in order to file the bug. Yeah, it would just help to bubble up the NetApp error. Yeah, especially if you have operational experience and you know what would be the most helpful thing to present to users. Yeah, not everything makes sense for them.
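For reference, the self-service flow being discussed looks roughly like this: the affected user lists their own asynchronous failure messages instead of asking an operator to grep the manila-share logs. A sketch with python-manilaclient, with placeholder credentials, and assuming the client and server negotiate API microversion 2.37 or later, where user messages were introduced:

```python
from keystoneauth1.identity import v3
from keystoneauth1 import session
from manilaclient import client

auth = v3.Password(
    auth_url="https://keystone.example.com/v3",
    username="demo", password="secret", project_name="demo",
    user_domain_id="default", project_domain_id="default",
)
manila = client.Client("2", session=session.Session(auth=auth))

# Each message records the resource that failed and a user-facing
# explanation, e.g. a misconfigured CIFS security service, so the
# user can correct it without opening a support ticket.
for msg in manila.messages.list():
    print(msg.resource_type, msg.resource_id, msg.user_message)
```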
Maybe a bit more context for others who are trying this as well: we are running on Docker containers and Kubernetes, and that's working pretty well. Apart from the manila-share service, everything else is scalable, so you can run multiple containers of it. It works well for us. Yeah, that "apart from" was an interesting remark that you just made; can you elaborate a little bit on that?

Yeah, the share service is not highly available, because it controls one back end and all communication to it, and you can't have two share services running that would then pick up the same messages. Yeah, so the share service is not safe to run active-active today; that's, I guess, the way I would phrase it. So for example, when our customers deploy it, it is highly available, but as a single instance, in an active-standby way, because we use Pacemaker and Corosync to control it. You, on the other hand, are more advanced and are running it in a Docker container, and you're saying that because it's not active-active, you cannot scale the deployment under Kubernetes the way you can with, for instance, the manila-api service or the manila-scheduler service, which can run active-active. Yeah, exactly. And the startup time of the share service is important to us as well. Yes. And especially, as the head of the storage SIG in Kubernetes has put it, Kubernetes itself is not an HA service: it guarantees eventual consistency and eventual recovery, but not quick failover. So getting an active-active service would enable you to keep the service up while Kubernetes controllers, or whatever, discover that your scale number is not right and start up another instance of the service. So you would like an active-active share service, and for us to prioritize that work.

So I can speak to that a little bit more. Active-active support has been at the back of our backlog for a while now. At the very initial stage, it was possible to deploy this active-active, and even today you could, except we've not tested it, and that's why we don't claim support for it. When you have two services running active-active, the way you configure them is to make them look and feel exactly alike, so that RabbitMQ, when it's communicating with these active-active services, knows to pick one of them fairly, and you don't get into the situation of the same message or the same action being repeated by two different services. The problem comes with the repeated polling actions that happen, as well as with what happens when one instance goes down in the middle of an operation. Say there's a long-running operation, you're performing some share replication, and there is a thread in the share manager service waiting on the back end to report back, and in the middle of it something breaks and the service dies. We then have a problem determining whether that action can be rescheduled safely to another share manager service managing the same back end.

This also extends into another set of problems: there are storage solutions that may not be prepared for two management threads talking to them at the same time, and drivers may have made some assumptions about concurrency on the storage system. For example, say you're creating snapshots: there is some coordination in Manila to prevent you from overwhelming a storage system, but at the same time you might have instances where you snapshot two different parts of your file system, that is, two distinct shares, at the exact same second.
And the last time we discussed this with the driver authors, that was a concern for some of the storage systems. They would have to step in and have some sort of process locking added to their share drivers specifically, and test it against their storage systems specifically, by deploying active-active and running through some tests. That is why we haven't been able to make broad claims of supporting active-active: because of these deficiencies. And we are certainly looking for help in this regard, because it's been on the backlog but hasn't bubbled up to the top of the priority stack; there have been other user requests with more weight on them. But like you said, the scalability of the service is very desirable.

We did address the startup concerns a couple of releases ago, I think. Initially we used to reconcile all of the service's resources when the share manager service went down and came back up, but now we don't: we ask the share drivers to take control of that, and if a share driver detects that something is wrong, or that something has changed in the configuration or in the back-end storage, the driver will push an update. This does not happen at startup; it happens in a deferred fashion after the service has started up, so existing resources are being reconciled while new requests are being honored. That way your cloud is still operational, and nothing happens to availability.

So I'm very interested to talk more about this at the PTG next week. At some point, a couple of cycles ago, we made a list of small items that need to be covered to continue along this path, and we made some progress, but we're not there yet. So if you are interested in helping work on that, please let us know and come to the project team gathering, and we can definitely go over it and discuss specific issues, challenges, testing, and other aspects. Yes, that's good; thanks for these insights. Thanks, Maurice. We'll be there next week.

Awesome. I did want to talk about availability a little bit more, on the serviceability side. Like Maurice is doing, you can run any number of copies of the manila-api and manila-scheduler services, but not the manila-share service, because of all of these concerns. And when the service stack breaks down for some reason, things get lost. This is a concern that, for example, Liron, who I think is on this call, has seen; he's tested Manila extensively with RHOSP, and he's seen a lot of cases where he intentionally causes a failover, and resources get stuck in whatever state they were in when the service went down; when the service comes back up, that particular resource is still unfinished, and it's not getting rescheduled or anything of that sort. If you have concerns like this, we'd like to know. I think Liron opened a few bugs, thanks, Liron; I don't have links handy, but I think he was specifically observing access rules not being applied when the service is restarted. Have any of you seen other concerns like these that we should take a look at?
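Circling back to the coordination gap described above: the per-driver work would amount to taking a distributed lock around back-end management calls so two manila-share instances can't race. A minimal sketch using tooz, the OpenStack coordination library; the etcd endpoint, member ID, and lock name are made up:

```python
from tooz import coordination

# Each manila-share instance would join the same coordination
# backend under its own member ID (all names hypothetical).
coordinator = coordination.get_coordinator(
    "etcd3+http://controller.example.com:2379", b"manila-share-1")
coordinator.start(start_heart=True)

# Serialize snapshot operations against one back end across all
# running manila-share instances.
lock = coordinator.get_lock(b"snapshot-backend-netapp-01")
with lock:
    # A second instance blocks here instead of issuing a concurrent
    # management call against the same storage controller.
    pass  # ...call the storage back end...

coordinator.stop()
```

With something like this in every driver path that mutates back-end state, plus the per-driver testing described above, the broad active-active claim would become defensible.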
And this recoverability matters not just when the service stack goes down because of some issue, say network problems, the Docker container just died, or the node running the manila-share service went kaput. It also matters when you're going through an upgrade and there are operations in flight, because we want to make upgrades as smooth as possible, and ideally you don't need any downtime when performing an upgrade if we're handling things properly. That said, if you have a massively distributed cloud where you're running multiple services, we do not recommend a rolling upgrade, not with Manila at least. We do expect that the database connections are shut off and that no services are actively writing to the database, so that we can perform the database upgrade, and then you can fire the services back up again. That's the current upgrade story. But as with most OpenStack services, nothing will affect existing resources controlled by Manila while you're going through an upgrade: if clients are connected to their shares and are actively writing or reading data, they should not be disrupted at all during the service upgrade, because all we're doing is probably toggling metadata to take advantage of the newer features in the new release.

So, Goutham, we're out of time, and I notice that someone else just took over as host, so we have to clear the room. But the etherpad is still open, and Goutham and I will put our email addresses in there, as well as our IRC nicks. So please join us; this conversation can continue on the etherpad. Great, yes, on the etherpad, and please do join us at the PTG next week; I'll put links on the etherpad for that as well, so we can continue these sorts of questions and discussions. Thank you to both of you; appreciate your time. Absolutely, thanks for joining. Thank you.