Good morning and welcome. My name is Peter Rajnoha. I'm from the LVM team at Red Hat, but currently I'm working on the Storage Instantiation Daemon, or SID for short, which is a new initiative, a new project. That's why I would like to introduce what it is all about, what the goals are, and what problems we are trying to solve.

Before I start talking about SID itself, I would like to talk a little bit about some background, which is important for understanding why we started SID at all. It's about events in udev.

So, just to recap: what are uevents? We have kernel space and user space, and we need user space to know about changes that are happening in the kernel to a device: device addition, removal, change, renaming, and so on. We have a netlink interface for that, over which we send uevents. These are the so-called kernel uevents. They are multicast, so we can have several receivers in user space.

Who provokes these uevents? It could be the kernel itself, or it could be user space as well, and this is done through the uevent file which you can find in the sysfs file system. It's also possible to send uevents between two processes in user space: the yellow one is the unicast way, and with the green one we can also send multicast in user space. We'll see how it is used.

So, as I said, uevents are event notifications. We have three basic types: kernel, user space multicast, and user space unicast. Each uevent comes with an environment, a set of key-value pairs that tells you more about the context: what the uevent is about, what it is notifying you about, and any other information you need to process this notification, this message, in user space.

We have eight uevent action types.
This is actually telling you what you're being notified about, so it's the device addition, removal, change, or move. This is one of the basic key-value pairs. And as you have seen, all uevents go through the netlink socket.

Then we have udev. udev is the udev daemon, which is the primary listener of these uevents in user space. It uses that information to support dynamic device management. So what does this mean? It means it monitors the netlink socket for kernel uevents and it processes udev rules. I listed all the things you can do with udev rules here for reference, but what it all boils down to is that you need to set device node permissions and create some symlinks to these device nodes, the nodes under /dev. That's what the udev daemon was primarily designed for.

Then it stores records in the udev database, which means all the key-value pairs that came with the kernel uevent as well as all the key-value pairs that you might have added while executing udev rules. And then it regenerates these kernel uevents as so-called udev uevents, which are actually kernel uevents plus all the context you have added while executing the udev rules. Others are able to monitor them, and they can choose whether they listen to raw kernel uevents or to those enriched udev uevents.

So let's see this in a picture, in action. We have the netlink interface and the uevent file. You have the udev daemon, which is one of the primary listeners in user space for these events. There could be other listeners, of course. There's a udev worker running: when a new event comes, the main daemon just passes the notification on to the worker. The worker runs udev rules, which include builtin commands and external commands, writes the udev database, and regenerates the udev uevent. So there are also user space listeners who can listen to that.
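To make the uevent environment concrete, here is a minimal Python sketch of how a raw kernel uevent could be split into its header and its key-value pairs. It assumes the usual kernel layout, a header of the form `action@devpath` followed by NUL-separated `KEY=VALUE` strings. The function names and the sample payload are made up for illustration, and the listener covers only raw kernel uevents (netlink multicast group 1), not the udev-enriched ones, which carry an extra binary header.

```python
import socket

# Value from <linux/netlink.h>; defined here because Python's socket
# module may not export this constant.
NETLINK_KOBJECT_UEVENT = 15


def parse_uevent(payload: bytes):
    """Split a raw kernel uevent into (header, environment dict)."""
    fields = [f.decode("utf-8", "replace") for f in payload.split(b"\0") if f]
    header, pairs = fields[0], fields[1:]
    env = dict(p.split("=", 1) for p in pairs if "=" in p)
    return header, env


def listen_kernel_uevents(count=1):
    """Print `count` raw kernel uevents (Linux only; not called below)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_KOBJECT_UEVENT)
    sock.bind((0, 1))  # pid 0: kernel assigns one; group mask 1: kernel uevents
    try:
        for _ in range(count):
            header, env = parse_uevent(sock.recv(65536))
            print(header, env.get("ACTION"), env.get("SUBSYSTEM"))
    finally:
        sock.close()


# Illustrative payload, not captured from a real system.
SAMPLE = (b"add@/devices/virtual/block/dm-0\0"
          b"ACTION=add\0DEVPATH=/devices/virtual/block/dm-0\0"
          b"SUBSYSTEM=block\0DEVNAME=dm-0\0DEVTYPE=disk\0SEQNUM=4242\0")

header, env = parse_uevent(SAMPLE)
print(header)
print(env["ACTION"], env["DEVNAME"])
```

The `ACTION` key is the action type mentioned in the talk; everything else in the environment is the context a listener has to work with.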
So that's basic udev in action. Now let's talk more about storage. There are some specifics when we are trying to deal with storage.

The ideal scenario is that you have a single device which is usable right after the add event. It's great when that happens, but usually that's not the reality. The reality when you're dealing with storage and all the various storage virtualization schemes like device mapper, MD, and so on, is that you usually have more actions to perform before the device is usable. It's simply not usable right after the add event. So the add event doesn't tell you that the device is usable.

You need to perform some initialization sequence. For example, in some cases that might include zeroing the data area of a newly created device, or adding some signatures to the device, and so on.

Then there's the multi-step activation scheme. For example, in the case of device mapper, you have your device added to the kernel, its representation is added to the kernel, but it's just a dumb device. It doesn't actually do anything; you can't perform any I/O on that device. You need to load some configuration into it, which is, for example, the device mapper table, and then you make that configuration available. So that's three actions, and it doesn't map to the add event. After the add event you just have a device which is not usable. That's important to know. So you simply need to tackle this in udev somehow.

And then we have grouping and layering. These two are actually related, because when you're dealing with storage virtualization techniques you group several devices together, and when you group these devices together you actually want to create a new layer on top of this group of devices to represent this group.
So with grouping and layering together you actually create stacking. When you're dealing with storage you can stack devices arbitrarily, which means you can go iteratively and stack one layer on top of another.

How do we do this? Usually devices contain signatures. I mean, the storage device itself contains signatures or metadata, or there can be external configuration that defines the next layer in the stack. What we use for that is blkid. That's the very basic scan we currently perform within udev rules. It is only used for signatures or metadata stored directly on disk. Then we have multipath: we need to consult multipath to detect multipath components, because this is external configuration. Multipath doesn't have any signatures or metadata attached directly to the disk. Or it might be a detached header location, just like Ondra mentioned before for LUKS encrypted devices, and any other special cases for storage specifics.

Okay, so on one hand we have udev and uevents, and on the other hand we have all these storage specifics. This brings us to a clash, and there are some problems we need to deal with. Over the years of dealing with udev and storage we've run into certain issues, so I collected a few of them.

One is the overloaded uevent action type. As I said, we have eight action types, but only four of them are actually used frequently for storage: add, change, remove, and move. Add, remove, and move are quite clear because they notify you about device addition, removal, or rename. But then you're left with one change event to notify about all the other changes. So what you need to do is add more context to the event environment, the key-value pairs, and that's sometimes a problem.

Why this is a problem is quite clear: when you're trying to react to a uevent in the udev rule language, the udev rule language is simply not a general-purpose language. So it's restricted.
You can't do everything you need. That brings us to the next item, which is that you actually need to call external commands to perform some extra actions because the rule language is restricted.

There are other things, like the rules and keys being global. You can install udev rules from various sources. They may overwrite each other's values, and there's simply no order in this. Whether this works well is just about how those teams communicate together.

Then, of course, sometimes you need to access the udev database directly from within udev rules. This is not quite easy. We have a udev rule for that, the IMPORT{db} rule, but it's clunky and error-prone.

This brings us to another problem, which is identification of the current state. If you don't have enough information, you have problems identifying what state the device is in and what you need to do on that device based on the notification you just received.

And there's no direct support for grouping at all. All udev does is execute actions per device; you get notifications per device. You don't have any abstraction on top of that in the form of grouping, and that's what we said we actually need for storage virtualization techniques.

There's also no standard way of marking a device as ready, usable, public, private, or temporarily private. We do use that, but everybody uses their own way. Those of you who have looked into udev rules may know there's SYSTEMD_READY.
There's DM_ACTIVATION and various other keys that you need to know about to tell when the device is ready and usable.

Also, work done within the udev context may not be appropriate, because you perform lots of actions just to know what state you are in, and you need to call external tools. That brings us to another problem: you need to complete all your actions within a certain timeout, because udev has a timeout for all its actions, for all its udev rule processing. If you hit the timeout, udev just kills the worker, and you don't have any fallback action which you can perform. So you simply lose the notification, you lose the state, and that makes it very hard for the next event you receive to get things working again.

Then it's also about scheduling separate work. Even if you would like to make the work within the udev context as minimal as possible and you want to do some delayed actions, you need to have a way to schedule these actions and to synchronize with them somehow.

So, simply, udev is not primarily designed for this. As I said, it's designed to handle nodes and symlinks in /dev. udev does that just fine, but we need a different approach for storage, for our needs.

So what does that look like? Let's look at our picture again. This is the usual sequence of what happens when you receive a uevent. We need a change in this area to minimize udev rule processing and to make it much more straightforward for us to recognize the state, to react to events, and so on. We need to change this part as well, because we need to be able to do that.

So what are these changes? There's the new SID daemon sitting on top. There's the SID udev builtin command, which communicates with SID when it receives the event, and SID returns some information back. Also, SID listens for these udev uevents as well.
We'll see why. And then this part also needed some change. It has actually been in the kernel since 4.13. Basically, it's about adding the possibility to define more keys when you synthesize a uevent, when you provoke that uevent from user space. Before, it was only possible to create a uevent with a certain action, but you couldn't define any more key-value pairs that you needed.

So let's look at the Storage Instantiation Daemon and its components. There's the SID daemon itself, which is layered on top of udev. It keeps its own database; we'll see that too. It executes storage-specific event handling and processing; we'll also see how it does that. Then we have the udev builtin command, which serves as the bridge between udev and SID. And for the future, of course, there's the library interface that we might use to access the SID database, and the SID command-line interface.

So let's look at SID and what I call its stages of processing. There's the identify stage: when udev rules are executed, there's the SID udev builtin command, which creates a bridge between udev and SID, and the rules call the "sid identify" udev builtin command. This is the upper part of the picture we've seen before, with the udev worker. I just cut off the other half because it's not important here.

So we have "sid identify", which actually creates a SID worker with its own SID database snapshot. This is important: because you're working with a database snapshot, you have a consistent view of the state. Then it executes stage A (we'll see what that actually is) and returns the results back. Then we have "sid checkpoint" at the end of udev worker processing, which causes the worker's SID database snapshot to get synced with the master database. Yep, that's here. And then SID receives the udev uevent after the udev worker has finished, and that executes stage B, which contains all the delayed actions
we have scheduled.

So let's look at identify, the "sid identify" stage A. That's the first part. I also call it the udev stage, just because it runs at the same time as the udev worker is running, so we are actually processing that uevent at that moment.

In stage A we start in the idle state, which means SID is not doing anything; it hasn't received any notification yet. When we receive the notification, we are in the initialization state, which means we need to perform some initialization, access the database, and set up some values. This actually belongs to the core part.

Then we have the identification phase, which identifies the device: what type of device it is. It's separated into block and type phases. I need to explain block and type a little bit: the block part is always executed, no matter what the type is, and the type part is only executed for certain types. The green parts are actually the module hooks that you can define to handle these notifications.

Then we have another meta state, which is scan all, which is separated further. There's scan pre, again with block and type; that is before you start any scanning or accessing the device, so we can make decisions here. Then you have scanning itself. This is the core part of the scanning, for the current layer, which is again block and type, and for the next layer, block and type.

Now I need to explain current and next. Imagine you have a stack of devices and you're just now processing the current device, the one for which you have just received the uevent. So that's the current. Then you have next, which is the layer right on top of that device. That's what's going to be activated next; that's what you are expecting next. And that's actually a sliding window: we are just sliding through that stack with current and next. So what is next now is, next time,
the current, and so on, as we slide the window.

And then you have the scan post meta state, again with block and type module hooks for the current layer and block and type module hooks for the next layer. In scan post you simply schedule your delayed actions, or you set what symlinks udev should create, or anything else you need to return back to udev. And this is just going round and round, so you have clear states going from one to another.

Actually, what you can do is map udev rules to these states. With udev rules you don't have any order defined; anyone can insert udev rules in any order. But with this scheme you have the identification state, then pre-scan, scan, and post-scan, and that's when you return the results back to udev and when you schedule your actions.

Then, when the udev worker is finished, SID receives the notification from udev, the udev uevent, and that starts stage B. Stage B again starts with idle, and it just performs actions for the current and next layer. So that's stage B.

And maybe a little bit about the database, just briefly. It's a key-value database with various backends. That's what you have available when you're defining SID modules. Values can be either simple or vector types. As we've seen, there's snapshot separation, so you have a consistent view of the database state. It also supports delta synchronization of vector values. What this means is that you don't need to synchronize the whole vector, just the deltas. Vector values are actually arrays of values you need to store, and SID does that for you.
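The delta synchronization idea can be sketched in a few lines of Python. This is only an illustration under assumptions, not SID's actual implementation: the function names are hypothetical, vector values are modelled as plain lists, and the benefit shown (two workers changing the same vector from separate snapshots without overwriting each other) is one plausible reason for syncing deltas rather than whole vectors.

```python
def vector_delta(old, new):
    """Return (added, removed) items between two versions of a vector value."""
    old_set, new_set = set(old), set(new)
    added = [v for v in new if v not in old_set]
    removed = [v for v in old if v not in new_set]
    return added, removed


def apply_delta(master, added, removed):
    """Apply a delta to the master copy, leaving unrelated items untouched."""
    gone = set(removed)
    merged = [v for v in master if v not in gone]
    merged += [v for v in added if v not in merged]
    return merged


# Master record: hypothetical members of a device group.
master = ["sda", "sdb"]

# Two workers start from their own snapshot of the same master state.
snapshot_1 = list(master)
snapshot_2 = list(master)

# Worker 1 adds a member, worker 2 removes one.
worker_1 = snapshot_1 + ["sdc"]
worker_2 = [v for v in snapshot_2 if v != "sda"]

# At sync time only the deltas are applied to the master.
for snapshot, result in ((snapshot_1, worker_1), (snapshot_2, worker_2)):
    added, removed = vector_delta(snapshot, result)
    master = apply_delta(master, added, removed)

print(master)  # ['sdb', 'sdc']
```

Had each worker written back its whole vector instead, the second write would have silently dropped the first worker's change.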
You have separate key namespaces. There's the so-called udev namespace, which maps to udev: whatever you read from the udev namespace is actually what you would read in udev rules, and whatever you write to the udev namespace will end up written to the udev database at the end of stage A. Then we have the global, module, and device namespaces. These are the additions when you compare that to udev, so you can separate your database records into these based on what you need to do.

Then there are per-module protection flags. That's also important, because you can have your values protected so that other modules won't overwrite them. This isn't possible with udev, which sometimes causes problems. You can even make a value private, and you can make it reserved. For example, if you have a module which you know is going to write a certain value, you can reserve that key, and no other module is going to overwrite it. So this is the protection that SID adds. These are the enhancements on top of udev that we actually use to handle storage, and where we can define modules in a much more straightforward way.

Then you can make your records persistent, which means that they actually get synced with the master record in the SID database for next use. You can also use temporary values if you need to, but it makes sense to make them persistent.

Okay, so that was the database, and that's probably it. It was just a brief overview of what this project is about, what it is trying to do, and what problems it tries to deal with. So if you have any questions, please.

Yep. Stage A is written. There's a GitHub page; it's at the end of these slides.
I'm currently working on stage B, so it's still under development. My ultimate goal is for this to be usable for all storage. So if there are any other specifics that we need to deal with, and if there are any storage developers here, please consult that with me. Ultimately, I would like to make this central for all storage, and what I would like is for the only thing udev has to do to be creating those symlinks, with everything else being here. So that would be central for storage.

The gain from this would be that, as you have seen, there's the SID database, and the SID database keeps what the storage stack looks like. If you have the library interface, you could access this information. You could have, for example, notifications like "please let me know when this group of devices is complete". It's just enhancing how the notifications happen. It's adding more structure to that, so it's layered on top of it. So yes, ultimately udev just creates symlinks, and all the storage is handled here by modules.

We already have that in the udev database, so I don't need to do that. I have access to that database through the udev namespace, so you don't need to think about this because it's already imported automatically by SID for you. So you can access everything you can access in udev rules, but there are the extra namespaces that you can use as well. udev is one of them.

Yep.

And yes, like, number one is: if you're gathering up the different ones, maybe one of them is missing.
There's gotta be a timeout or something before you give up and say: that device is dead, I'm gonna boot up anyway in a resilient mode without that.

Yeah. You might see the first half. Sure, sure. As for the first one, waiting for completeness of a certain group of devices: that's actually stage B, where we can schedule some timeout and a waiting loop. Of course, that could be configured. That's still work to do because I'm just working on stage B, but yes, that's going to be supported.

And the second one, multipath: yes. Actually, that's the very first module that I have written. I wrote it together with Ben Marzinski, who is dealing with multipath, and he created a library for me. So I just get that information from multipath, and I know that this certain device is configured as a multipath component, so I know whether it can be used directly. I can mark that device as "please don't use this device, don't do any writes to it", simply marking that in a standard way. So yes, this is a plus when you compare it to udev: modules can use libraries to access this information, whereas with udev rules you could just run an external command. So it's just narrowing down the possibility of timeouts happening. So yes, these two are...

Yeah. Okay, if you need any information, it's here.