Welcome, everybody, to the next talk, "MUSE: MTD in user space", from Richard Weinberger. Please find a seat and then let's get started.

So then, let's get started. I'm a little loud, I feel. That's better. Yeah, hello. My name is Richard. I am co-founder of sigma star gmbh. We are a small consulting company based in Austria. We mostly do stuff focusing on Linux and security, and I have a special interest in Linux itself, a lot of its components, and security. I also ended up maintaining various components in the Linux kernel, and one of them is UBI and UBIFS, the subsystem and filesystem for raw NAND.

Today I would like to talk a little bit about MUSE. Actually, the goal is not that I promote MUSE. The goal is that I explain to you the journey of how I ended up with MUSE and how this can help you as well, because MUSE is based on FUSE, filesystem in user space, and in my opinion it's rather underestimated how powerful FUSE is and what cool stuff we can do with it. So first I will talk a little bit about why MUSE, then why I chose FUSE to implement MUSE (confusing words, I know), then a few details on the limitations, and then more details on FUSE itself and a small guideline on how you can use FUSE to create your own subsystem in the same way. So that's also a bad name, FUSE and MUSE, but anyway.

As I'm one of the raw NAND guys, so the MTD guys, I do a lot of testing of MTD device drivers and also MTD filesystems. That means somebody has a problem, and UBI or UBIFS or JFFS2 is acting weird, and then they send me a bug report, and sooner or later I get an image of the raw NAND device and have to make sense of it. Usually I don't have the hardware, so I can't flash it to the exact device.
I can just look at the image. Looking at the image in a hex editor is kind of tedious. I do it sometimes, but usually I try to have the customer device more or less simulated, so that I can load the NAND dump and then try debugging UBIFS or UBI and see why it's not working. We already have plenty of simulators in the kernel: we have mtdram, block2mtd, and nandsim, and we also have support for virtual flash devices in QEMU. So there is already a lot of tooling, but it didn't really fit for me.

First, a few details on the existing tooling. mtdram is, as the name states, an MTD device that operates in RAM. It's actually a vmalloc'ed memory region. So when you configure mtdram to simulate, say, a four-megabyte flash device, it does a vmalloc of four megabytes. I think it's obvious that this doesn't scale. It's good enough to emulate a SPI NOR device, although it's of type MTD RAM, and you can use it well to debug issues in JFFS2, because JFFS2 was designed for really small NOR flashes, and when you just need to vmalloc a few megabytes, it's good enough. One of the bigger drawbacks is that you can only have one instance. You configure the mtdram emulator by loading the kernel module and passing, as a kernel parameter or as a Kconfig option, the size of the chip you want to have. So only one. And there's no way to have fault or error injection; it always works flawlessly. That's okay for inspecting existing NAND images, but I would also like to do runtime testing. So when I change something in JFFS2, I want to do testing in a virtual environment and inject some errors, for example bit flips or bad blocks or whatever, some of the behavior you would expect from a dying flash chip, and this is something that is not there.

There's a second device that can emulate a RAM-typed MTD.
That's good enough for SPI NOR, or NOR in general. It's block2mtd. As the name states, it operates on a block device, so basically you can turn an existing block device into an MTD. Usually you don't use your real hard disk as an MTD; you use a loopback file and then feed it in as an MTD. It's more or less the same as mtdram. It also implements the plain MTD interface, so it does not implement, for example, the NOR flash MTD subsystem or the NAND flash MTD subsystem. It's really just an MTD. In contrast to mtdram it allows multiple instances, but these too can only be configured at module load time. There is some hackery around that: you can configure it using sysfs, so you can change the kernel parameters in sysfs. But this doesn't really work, because the MTD subsystem itself has no knowledge about hotplug, so when you remove the MTD device, you are in deep trouble. But it's good enough for UBI and UBIFS testing, because you can have larger devices with quite a few megabytes. Since it's emulating a NOR-style device, though, UBI will go into NOR flash mode, and you don't get any of the NAND-specific handling, so UBI doesn't exercise it. When you want to debug that in UBI, block2mtd is not what you want to use. Same as with mtdram, there's also no support for fault or error injection.

Then there's a third one. It's called nandsim. This is the most used one, I would say. It operates completely in RAM, but instead of allocating a whole memory region for the whole device, it's kmalloc'ing individual NAND pages. There is some support for swapping out pages to a file, but this is a little bit hacky. You have to pass a file name to the kernel module when loading it, and you cannot configure nandsim in a way that it can use an existing image.
It can only swap out data. It's a little bit hacky and it breaks over time. The biggest difference between mtdram and block2mtd on the one hand and nandsim on the other is that nandsim is actually emulating a NAND chip itself. It emulates all the NAND commands the NAND framework would send to a chip. So it's really the chip hardware implemented in software, and that's why it interacts directly with the MTD NAND framework. The NAND framework sends regular NAND commands to the simulator, and it acts on them.

Since it's simulating a NAND flash, it's really nice for testing UBI and UBIFS. You have wear leveling and all those NAND specialties. It also has support for ECC, so it supports Hamming and BCH, of course only in memory. Now you may think, why would I have ECC for a NAND chip in memory? It turned out to be really useful for finding a double write of the same page, because then the ECC checksum doesn't match anymore, and you can catch that using nandsim as well.

There is also partial support for error injection. You can tell nandsim to randomly add bit flips or other faults, but it's configured in a static way. You can also configure nandsim to delay certain operations, for example that a NAND erase should take a certain number of microseconds, or that page reads and writes should be delayed. And you can configure partitions. Using the kernel command line, you can configure nandsim to add multiple partitions, so it's more or less able to have multiple instances, but again only in a static way. As you may have guessed, emulating a NAND chip in software is slow. That's why nandsim is also not the fastest one.
It's good enough to test UBI and UBIFS, but it's really not fast, and it turned out to be rather error-prone.

There's also an interesting effect. You cannot tell nandsim, "give me an MTD device of size X". The thing is, nandsim is really tied to the NAND framework itself, and the NAND geometry is computed from a NAND identifier. So you have to say, "nandsim, please emulate a NAND chip with these identifier bytes", and then the identifier parsing constructs from that the page size, the erase size, and the number of erase blocks. This is communicated to the kernel using the real READ ID command. So it's really the chip itself. There are some tables, and even in mtd-utils we have a script that will generate some NAND identifiers for you, but the problem is that modern NAND chips also have the ONFI parameter page, and in this page there's the real geometry, and then the identifier is not the real geometry. So when you want to emulate a chip where the NAND ID is not right but the ONFI page is, you are in trouble. Three or four years ago I created a patch for nandsim, it's actually on the mailing list, so that nandsim can emulate an ONFI parameter page. But we decided not to merge it and rather to think about a new way of doing NAND chip emulation. That's why I'm here.

nandsim was good for finding errors in UBI and, of course, also in the MTD NAND framework, because the NAND framework is exchanging NAND commands with nandsim. But these days we mostly find bugs in nandsim itself. So it's more or less a maintenance burden, and I'm trying to get rid of it.

The third class is QEMU. QEMU can also emulate flash devices. It's mostly used to get UEFI support.
I think it supports only several NOR flashes, but I'm not so sure about that. For my case it's too inflexible, it's not really a gain, and there's also no way to have fault injection.

So at the end of the day, I was unhappy with the state of how we can have a virtual MTD. Having all these MTD devices in hardware is also not an option, because I cannot endlessly flash real hardware. It's slow, it will wear down my NAND chips, and they will die sooner or later. So I want to have it virtual.

First, I would like to have an option to add and remove MTD devices at runtime. So add a few NAND chips, remove a few, just to have a decent test farm. That was the first topic on my list. Then I would like to have emulation for NOR and NAND style MTDs. I would also like to have support for various image formats. When I take a dump of a NAND chip, I use the nanddump command, but there are many other ways to capture the contents of a NAND chip, vendor-specific ones or low-level ones from the bootloaders. Sometimes you get it with out-of-band data, sometimes without, and sometimes the format is vendor-specific. So I thought about how I can add an interface that will support many different formats. And I would also like to have more control over error injection and fault injection. So not just "make every Nth read fail with a bit flip", but something more controllable.

But that's all stuff that you cannot easily do in kernel space. So that's why I thought, okay, let's think about a new MTD simulator. First I thought, let's split it: make the kernel part really simple and stupid, and try to do all the hard work in user space. When I have a user space library, then adding a vendor-specific format should be easy. When everything is done in kernel space, adding a new proprietary format is not so nice and takes a lot of time, but in user space
I can do whatever I want. That was the main idea, and then I thought about how I could implement it. First I thought, okay, let's create a new MTD driver and add some interface to user space, for example a miscellaneous character device, and do all the communication over that. While implementing the first proof of concept I said to myself, no, that can't be the right way. There must be something I can reuse, and I don't want to add yet another kernel to user space interface. That's too complicated. Maybe I can reuse something and save time.

And I thought, yeah, let's use QEMU. There's this awesome virtio stuff, so let's do an MTD virtio driver. In theory that should work. But then I need something that I can control from user space, and also from QEMU. So I need a second channel: a channel between QEMU and MTD in the kernel, and an interface between QEMU and my control program. It turned out that this is way more complicated than I wanted it to be, and binding it to QEMU also did not sound that nice. So I said, yeah, let's try something different.

This was the point where I was chatting with David, one of my co-workers, and he said, "Can't you just mock the MTD character device in user space using CUSE?" And I was like, oh, CUSE, there was CUSE. What's CUSE again? Oh, it's character device in user space, a special mode of FUSE. I had completely forgotten about CUSE. So who of you knows CUSE? That's more than expected.

The basic idea behind CUSE goes back to the time when Linux gained support for ALSA, the Advanced Linux Sound Architecture. There was still the old framework. It was OSS, the Open Sound System.
It was OSS the open source sound system And they have been many programs that that could only communicate to OSS and there were certain of control devices in depth to control OSS and then the then the ILSA guys thought, okay Let's create something new that we have a wrapper That then we can actually emulate OSS in user space and that's how was Q's was born So they have been able using Q's to implement OSS in the user space So it turned out that Q's is just a special mode of fuse or a subset So it is a character device that can Can ask user space for every read write IO control on whatever Command and you can then implement this like a user space files them so for for bare character device that works really well, but the problem is an MDD is More than I'm playing character device when you even when I really Reimplement all the MDD characteristics Using Q's the colonel still doesn't know that this Q's device is really an MDD and when I have Something like UBI that stacks on top of an MDD device. It won't work on a character device What I thought that's a nice idea. Maybe I can do something like Q's in different way and Then I looked into the implementation of Q's and thought it was just a few hundred lines of code that's made Use of the Q's framework to actually emulate a character device and that's how News was born so as short recap Q's is files them in user space That means you can implement a files them as user space program for every files them Operation user space has to answer so read write IO control start. It's a very generic framework I'm sure many of you have used SSHFS or NTFS 3G. That's a very common file systems that are implemented in user space using Q's and By looking into what Q's actually already offers. 
I realized that it is good enough to implement a flash device as well. To be honest, flash devices are rather stupid and simple. There's no zero-copy I/O or other fancy modern stuff. It's just plain read and write commands using buffers, and that's doable with FUSE as well. Then I thought, okay, let's give it a try.

In FUSE you have multiple operations. For example, you have FUSE_READ and FUSE_WRITE for the read and write commands. I have added just a few more commands that are MUSE-specific: a read, a write, an erase, a method to mark blocks bad or check for good blocks, and also a sync. And I also made sure that we support out-of-band data and ECC.

Actually, one of the hardest parts was the MTD lifetime. If you remember, one of the top items on my wish list was support for adding and removing MTDs at runtime, and the MTD framework had no idea about that. But when I have FUSE, the user space program will start and crash at some point. I mean, when the user space program crashes, the MTD will be gone. So I had to make sure that MTD can deal with all these dynamics. That was one of the harder parts.

So far, my feature list is that I want to have snapshots, which can be done completely in user space, and that you have custom image support.
So not just nanddump, but any format you can think of. There's an interface so that you can add your format if you want. And also that you can record and replay all operations, which is useful for power cut testing. For example, I record all NAND operations of a file read or write and then stop at some point. Let's say I had a hundred NAND operations and I replay only 50. Then UBIFS has to be in a position to recover from that. Currently we do this just by randomly power cutting and hoping to catch the bad code paths. But when I can really replay every single command and then check whether UBIFS can recover after each step, I can catch every corner case.

Also fault injection: from user space I can say, "now, after this type of write, add a random bit flip", so I can make it much more sophisticated and not just based on randomness. And of course, when the implementation of the MTD is in user space, I can also do fuzzing. So I can actually have a fuzzer harness that will fuzz UBIFS and change the NAND content on the fly.

This is still in progress. Right now I have a working emulator that can record and replay commands, and it's good enough that UBIFS is happy. It was less work than expected. I had the first prototype within a single day. FUSE had really a lot of stuff I could reuse. I was really impressed.

I consider the kernel part done. It has roughly one thousand lines of code. It's not mainline yet, because I'm still playing with user space, and as long as I'm not really happy with user space, I don't want to finalize the kernel part. Adding more FUSE commands is something you really have to justify, and I want to be sure that the set of FUSE commands I've added is really what I need, and that I don't say after half a year that I need more commands. So I'm playing with user space currently. It's written in C.
It's based on libfuse low-level, but I'm also playing with Rust, just because.

I've talked now about MUSE for testing and inspecting NAND issues, but there's also a corner case I have to talk about. I don't want to, but I have to. The thing is, when MUSE is mainline, it's also possible to implement a real MTD driver in user space. For example, when you have a SPI NOR device, you can use spidev and communicate using SPI commands in user space, and then also implement the MTD in user space. So in theory you can then create your MTD driver without making it open source. People have asked me for that. I do not recommend it, but technically that should be possible. That's one of the reasons why the other NAND guys, or flash guys, and I are not so sure whether we should merge MUSE or not. It depends. You can then create, of course, really slow drivers, but you can keep them closed source. That's one of the downsides. It's the same as with FUSE: many FUSE filesystem drivers are not open source, because only the kernel has to be GPL.

So now I have talked a lot about MUSE, but actually I want to tell you how you can do this for your kind of device. First we need to talk about FUSE in more detail. FUSE is a client-server architecture where user space is the server and the kernel is the client. This is sometimes really strange when you read the library code, that the kernel is the client and you are the server. On the kernel side you always have a rather generic driver. FUSE implements the VFS interface, of course, to have a filesystem, CUSE implements a miscellaneous character device, in my case MUSE implements a generic MTD driver, and in the new case you add your own stuff.

All the communication between the client and the server is based on requests. The requests are made by the client, in this case by the kernel. Each request contains an operation, and these are the operations
I talked about before, and they are part of the kernel API, so you cannot change them later. You have to come up with a set of supported operations, and you can only add operations. It's also important to note that it is always the client that makes the request, and the server just reacts to it. It doesn't go in both directions.

Each request pairs an operation with an input and an output structure. The type of operation is only an integer. With it you send a structure that describes your operation in detail, and then you reply with another struct, which is the answer. Take, for example, the command FUSE_WRITE. That's the FUSE command that's executed when you try to write to a file. When the write system call goes through the kernel VFS, at some point the generic FUSE filesystem driver in the kernel will create a FUSE_WRITE request and send it to the server. It sends the server a struct fuse_write_in. That's the input for the server, and it describes the write request: for example, you have the file handle, the offset, the size, a few flags, and the padding. Using this struct, the FUSE server in user space knows which operation should be done and with which parameters. Then it does the operation and answers with the out structure, and in the case of a write that is just the size: it answers with how many bytes it wrote.

It's important to note here that you have to add padding, because everything has to be 64-bit aligned. It's also important to use data types here that are valid for both 64 and 32 bits. That matters when you have a compat mode user space, so the kernel runs in 64-bit and user space in 32-bit. So it's really important to get these structures right. When you have to make changes here, you need a new operation.

The request is actually a set of multiple I/O vectors, and usually the answer contains at
least three vectors. There's the fuse_out_header. It's a very generic header that tells whether the operation worked at all or not. It could also state an internal error. Then you add your very specific output structure for the operation, in this case fuse_write_out. It describes the actual result of the operation. And optionally you can add a payload, so a buffer with a length. For a FUSE_READ command you would answer with a payload, because you have read something. It's also important to note that the request can contain a buffer too. When you are writing, the request for the write command contains a buffer with the data you want to write.

So this is a little guide for you on how to create your own framework. First, you have to look at all the existing FUSE operations and structures and see whether you can reuse them. The chances are good that you can reuse some of them. If not, you have to add new ones. They are in the kernel tree in the uapi folder. That means they are UAPI; they are not changeable. If you break something there, Linus will shout at you. So make sure that when you define them and upstream them, everything is good. You cannot change them. You can just add new operations.

Second, you need a control character device. For example, FUSE has /dev/fuse. When you open this device, you can mount a new FUSE filesystem.
You can create your own. It's just setting up the FUSE communication channel, and then the kernel will send the INIT operation to your server, and then the communication is ready to process requests.

Next step: you have to add a generic device driver to the kernel. In my case, it was a very generic MTD device driver that just implements the MTD subsystem, takes all the MTD read and write commands, and transforms them into FUSE requests. And then you have to glue all this into libfuse low-level. Of course, you can also handle the requests directly, but then you have to do a lot more work, because you have to encode and decode all the FUSE operations on your own. But that's the main way to do it.

So, I will have to speed up a little bit. Here's an example for the isbad command. The input is just an address, the address of a block, and user space will then decide whether this block in the emulator is good or bad, and then it will answer with a result. It can be zero or one. And here again there is padding, because you have to make sure we are aligned to 64 bits. In the kernel part we have the isbad callback of MTD, and all it's doing is creating a FUSE request. There are plenty of helper functions in the kernel. Here I'm reusing the fuse_simple_request function to create a request for me, and instead of talking to the hardware, I'm sending a request to user space. In user space, the library checks whether the program implements this function. If not, it replies with ENOSYS. The application then actually has to implement the function. Here I'm really sneaky: I just decide based on randomness whether the block is good or bad and send a reply. And on the libfuse low-level side you need a helper that actually constructs the FUSE reply for you.
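The isbad exchange described above can be sketched in user space roughly as follows. This is only an illustration, not the code from the slides: the names muse_isbad_in, muse_isbad_out, emu_isbad and build_isbad_reply are hypothetical, struct out_header merely mirrors the layout of the kernel's struct fuse_out_header, and a fixed bad-block table stands in for the random decision mentioned in the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Mirrors the layout of the kernel's struct fuse_out_header. */
struct out_header {
	uint32_t len;    /* total reply length, header included */
	int32_t  error;  /* 0 on success, negative errno otherwise */
	uint64_t unique; /* copied from the request being answered */
};

/* Hypothetical input structure: address of the block to check. */
struct muse_isbad_in {
	uint64_t addr;
};

/* Hypothetical output structure: 0 = good, 1 = bad, padded to 64 bits. */
struct muse_isbad_out {
	uint32_t result;
	uint32_t padding;
};

/* Fixed-width types plus padding keep the layout identical for
 * 32-bit and 64-bit user space. */
_Static_assert(sizeof(struct out_header) == 16, "header layout");
_Static_assert(sizeof(struct muse_isbad_in) % 8 == 0, "64-bit aligned");
_Static_assert(sizeof(struct muse_isbad_out) % 8 == 0, "64-bit aligned");

/* Assumed emulator geometry, chosen only for this sketch. */
#define EMU_ERASE_SIZE (128u * 1024u)
#define EMU_BLOCKS     64u

static uint8_t emu_bad[EMU_BLOCKS]; /* 1 = block marked bad */

/* Decide whether the block containing addr is bad. */
static int emu_isbad(uint64_t addr)
{
	uint64_t block = addr / EMU_ERASE_SIZE;

	return block < EMU_BLOCKS ? emu_bad[block] : 1;
}

/* Fill header, output structure and the two I/O vectors of the reply.
 * Returns the total reply length in bytes. */
static size_t build_isbad_reply(uint64_t unique,
				const struct muse_isbad_in *in,
				struct out_header *hdr,
				struct muse_isbad_out *out,
				struct iovec iov[2])
{
	memset(out, 0, sizeof(*out));
	out->result = (uint32_t)emu_isbad(in->addr);

	hdr->len = (uint32_t)(sizeof(*hdr) + sizeof(*out));
	hdr->error = 0;
	hdr->unique = unique;

	iov[0].iov_base = hdr;
	iov[0].iov_len = sizeof(*hdr);
	iov[1].iov_base = out;
	iov[1].iov_len = sizeof(*out);

	return hdr->len;
}
```

A real server would hand the two vectors to the FUSE channel (for example with writev on the control device); a read-style reply would add a third vector for the payload.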
You see all the I/O vectors, but that's just an implementation detail.

So, as a quick summary: FUSE offers a really nice and powerful framework for communication between user space and kernel space. It's really versatile and offers much more than just filesystems. You can do really complicated stuff, and adding support for MTD was way less work than I expected. As I said, my first proof of concept was done within a day, and for the QEMU and virtio approach, I gave up after, I guess, three days. Just as a rough estimate.

So that's it on MUSE. I hope you had some fun. I had. If there are questions or comments, please go ahead.

Yeah, there's a question.

So, I am Pavel Machek. I wrote the network block device a long time ago. I wanted to ask: do you have any solution to prevent deadlocks?

To prevent what? Deadlocks? Yeah, the deadlock prevention happens in FUSE. The good thing is that FUSE has a way to make sure the communication cannot block. When the communication fails, then on the kernel side you get an error code, ENOTCONN, and then you know that the user space part is gone or deadlocked. So it's basically the same as with filesystems.

Okay, I'm not sure, but you really don't want to use this in production, and you may have problems if you write too much data to your device, because you will run out of memory.

Yes, that's why I said its main purpose is testing and the like. And of course, when your user space server is hostile and decides to block every write request forever, then you are in trouble, sure.

It's more the case that if the device is in heavy use, the kernel can run out of memory and swap out your server. Not a hostile server. We may want to talk.

Yeah, sure, but that's a big problem with any kind of user space. It's the very same when you decide to use NTFS-3G as your root disk. Then you're in the very same trouble.

Okay, terrible idea. Any more questions? I don't see any hands.
Let me check the virtual hands, but unfortunately there are no virtual hands either. Yeah, if there are any other questions, just talk to me. Thanks for your attention. Yeah, thank you again. Thanks.