I guess we start now, so welcome to my talk, an introduction to ZFS as a solid foundation for your data. Everything I'm talking about today is about the OpenZFS project, so depending on whether you use OpenZFS or the commercial implementation by Sun or Oracle there might be some differences, but since OpenZFS originally branched off the Oracle implementation, what I talk about today should apply to both.

Maybe to give some motivation, let's think about what it means to actually store data on disks, because it's not a solved problem. Disks are spinning rust; they are faulty all the time, and most of the time you don't even notice. You can have phantom writes, where the physical disk arm with the write head on top moves slightly too far, so it will overwrite data in a different track. There are reports by the different disk manufacturers actually giving some statistical foundation to such problems. There is bit rot, maybe from radiation from space. And now to what this could lead: you could have flaky power, for example on your home NAS, or your laptop just drains its battery too quickly, and with traditional file systems you will have to run fsck after that to verify the integrity of the on-disk data structures, which is very time-consuming. Firmware bugs: those shiny new SSDs have a lot of new code inside, and these bugs do occur. Then volume management: you might think, well, I use soft RAID and LVM to mitigate all these problems, but volume management is just ugly, at least traditional volume management; there are exceptions, but in general, at least that's my impression, and I'll give an example of that soon. And partitioning is very, very inflexible; I assume almost anyone in the room here had, or will have to some day, resize a partition that just wasn't big enough, or needed another partition for something else. This list could go on and on; I think we would find plenty more examples.
So, an example of how to set up traditional, moderately redundant storage on Linux would be this, which is rather ugly. We'll walk through it (I'll spell out the commands in a sketch further below). We create a soft RAID using the mdadm utility; the syntax is okay, but you have to learn it. We create the /dev/md0 virtual device, we name it md0, and we create a mirror with the two devices sda1 and sdb1. Then /dev/md0 is exposed to LVM: we create an LVM physical volume, we create a volume group on top of it, and then we create a logical volume of 100 gigabytes, which will be exposed as /dev/mapper/vg0-root, and on this virtual block device we create an ext4 file system. This is a fairly standard setup, enforced by many installers.

So what layers are we putting together here? We have the kernel device driver, which exposes the raw block device, usually in /dev, and then we stack volume managers together: first MD, the Linux software RAID, and then LVM. Each of these layers exposes a similar interface, a virtual device like /dev/md0 or /dev/vg0/root, and data on these virtual devices is addressable through an offset into the device. Fairly simple. The beauty of this setup is that you can just put a file system on top of it, and the file system implementers only have to worry about block-addressable devices; to the file system, all the disks you stuck together look like one contiguous big block device. Very easy for the implementers, but it comes with significant drawbacks, in particular the inflexibility we talked about, and providing data integrity features isn't simple in these setups. To save some time, I think we can summarize: we could do better, and I think we can do better, certainly with ZFS.

With ZFS, what we had on these slides is exactly one command. Let's disassemble it a little: zpool is the tool to manage pools of storage, "zpool create" creates a pool of storage, "tank" is the name of the pool, and then we say what we want: a mirror of the devices sda1 and sdb1, and that's basically all we're asking for. To have a look at how ZFS presents this to the user: "zpool status tank", very simple syntax, and we see we have this pool tank, no problems reported, no scrub requested so far, and it's composed of a mirror vdev grouping together sda1 and sdb1, no checksum errors, as expected. So this maybe as a little teaser of how nice ZFS can be to the administrator.

The goal of this talk is not to give you one hour of command line examples, because I think I would fail dramatically at that; command line examples are actually hard. I will give some at the end, but the goal of this talk is to give you an overview of how ZFS is designed internally and how the key features are realized, and if you want to play with it practically, you can visit the workshop; I will give a note about this later on. The ZFS design goals I basically copied from the original paper from 2003: simple administration, pooled storage, an always consistent on-disk state, and error detection and correction. We already saw the part about simple administration, and we already saw this pooled storage thing, but we'll go into that in more detail later. The feature checklist, so to say, has many, many points, but I think the most important one: no fsck whatsoever. With ZFS you will not have this, because there is no need for an fsck if you somehow manage to have an always consistent on-disk state.
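Spelled out, the two setups I just compared would look roughly like this; treat it as a sketch, the device names and the 100-gigabyte size are just the example values from the walkthrough:

# the traditional stack: mdadm mirror -> LVM -> ext4
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
pvcreate /dev/md0                    # expose the RAID device to LVM as a physical volume
vgcreate vg0 /dev/md0                # volume group on top of it
lvcreate -L 100G -n root vg0         # fixed-size logical volume
mkfs.ext4 /dev/mapper/vg0-root       # file system on the virtual block device

# the ZFS equivalent, then a look at the result
zpool create tank mirror sda1 sdb1
zpool status tank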
You have a very simple administration interface with very powerful semantics inside: you can organize big storage systems into individual file systems, you have an inheritance model that is very flexible, you can delegate data sets to unprivileged users on your system (if you're not on Linux; ZFS on Linux does not have this feature yet), and you can delegate data sets to containers, to zones, that's the Solaris thing, or to jails on FreeBSD; the administrator inside the zone can then create data sets inside it but not outside, and support for this is implemented fairly generically.

You have clever redundancy: as we already saw, you can create mirrors, you can create RAID-like setups, and we'll go into more detail later on. You have data integrity, full stop; basically nothing in ZFS is not backed by the data integrity machinery inside. In an age where storage gets bigger and bigger and we use more disks, the probability that some bit will flip in your storage system becomes significant, and you might not want to risk that. I think in the next years data integrity will become even more important, also on consumer and end-user systems.

ZFS gives you dynamically allocated file systems: you don't have the mess you have with LVM, where you allocate 100 gigabytes of your virtual storage and have them as a fixed partition on top of that. That's not how ZFS works; you have the pooled storage, and the file systems dynamically allocate storage from the pool. Of course you still want to limit the usage of certain file systems, so you can have quotas or reservations on them. For example, if you have a dedicated file system for log files, because you want to reserve a certain amount of space on your system to write log files, that's totally possible; so no drawbacks with the pooled storage approach. You can have zvols: zvols essentially take all the data integrity and other features I explain here and apply them not only to file systems but also to virtual storage devices, so like LVM, where you can just export another virtual block device, you can do that with ZFS too, but with all the key features of ZFS applied to it as well.

You have cheap, consistent, copy-on-write snapshots and clones, and that's probably the selling point for ZFS beginners. You can solve many, many problems if you have fast snapshotting. A key example would be system updates, where you want to quickly freeze the state of the system, then perform the upgrade, and roll back to that state if something goes wrong during the upgrade. You can take that even to a new level and say: I will snapshot the state of the system, fork the snapshot into a writable copy, perform the system update on the writable copy, and try to boot from it, and if this went well, I will just promote this writable copy, this clone, to be the new root file system. You can play with this feature later if you want; it's really impressive. These snapshots are of course not bound to the individual machine where this ZFS pool runs: you can replicate snapshots via a pipe to another machine. So you can send the snapshot through a pipe, which might be an SSH connection to your server somewhere else, receive that snapshot there, and either keep it as a stream dump file or pipe it back into the ZFS instance on the other side, and then you have replicated your file system (a rough sketch of this workflow follows below). This is way faster than rsync, and you actually have a consistent backup, because the snapshot is copied and not the live data an application might be working on.
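To make the snapshot, clone and replication workflow concrete, a minimal sketch; the data set names, the snapshot name and the remote host are made up for illustration:

zfs snapshot tank/root@pre-upgrade                 # freeze the state before an upgrade
zfs rollback tank/root@pre-upgrade                 # roll back if the upgrade went wrong

zfs clone tank/root@pre-upgrade tank/root-new      # or: fork the snapshot into a writable clone
zfs promote tank/root-new                          # and make the clone the new "real" file system

# replicate a snapshot through a pipe, e.g. over SSH
zfs send tank/root@pre-upgrade | ssh backuphost zfs receive backuppool/root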
You have per-data-set configuration and deduplication, and these are other very nice features; the deduplication part in particular is very important for certain applications. About compression: log files are quite repetitive, as you might know, the system name at the beginning of every line for example, or a log file that only contains the same string over and over again, which should not happen but it does; this is easily manageable by ZFS, and it won't consume much extra storage. You can speed up pool operations with ZFS: if you have an all-HDD pool, then you're basically bound to the configuration of the pool and to the speed of the individual hard drives. You can try to stripe hard drives together and then mirror those, you can create complex setups, but what you can also do is speed up your I/O operations by designating devices to be read caches or write caches, and this is integrated into ZFS; it's not some crazy hack like flashcache. And finally, variable block size, an often underestimated feature, but it's actually very nice that you can adjust the block size of an individual file system to the workload you have on it. For example, if you have large files that are read in a stream-like fashion, say a movie streaming server, you might go with very large block sizes to save overhead in the metadata and housekeeping. In contrast, for a database application you might want very small block sizes, and this is totally doable on a single zpool. I could continue on and on, but we are already 40 minutes in, and the implementation chapter we are coming to right now will take a little bit longer.

So again, an overview of the traditional layering we already had: we have the device driver at the lower part, which basically serves as a hardware abstraction and exposes the physical device via logical block addressing. You have the volume manager on top of this, or, in the case of the Linux demo I gave before, multiple volume managers stacked together, and these all, or mostly, work through logical block addressing again; they are virtual devices, but the interface doesn't change. Only the very last part, the file system, actually implements on-disk data formats; before that it's dumb block-based storage. The file system exposes itself through the syscall interface to the applications; I assume you are familiar with the standard C library for interfacing with files. The applications themselves see files as offset-addressable storage, or they can map them into memory, but let's stick with the standard calls for now.

So the traditional on-disk format, or one example of a traditional on-disk format, is index-based allocation with inodes. This is meant to be a disk; I did my best drawing it with Inkscape, I hope you forgive me. The file system usually keeps some headers spread over the disk to implement at least a little bit of on-disk redundancy for the header structures, and then individual files are represented by inodes. These inodes are a contiguous chunk of storage on the disk containing some basic metadata.
For example the mode of the file, so whether it's executable or readable, the owner of the file, that is the user ID and the group ID, the other metadata you're probably familiar with, and also information about how many bytes are actually stored in that file. Then, how is the actual data stored? You use levels of indirection for this, because you won't store the data alongside the inode; the inode itself, the name stands for index node, and you really don't want your index to have entries of variable length in the end. So what we do is introduce a level of indirection: we have the direct blocks, and these point directly to data blocks on disk. If we have very small files, which fit into the first three or n direct blocks, then accessing byte 512, for example, means we walk through the list of direct blocks, and because we have a fixed block size we just know: I need byte 512 and I have a 512-byte block size, so it won't be in the first block on disk, it will be the first byte of the second block; we walk this list, look up the reference, and find the actual block on disk. That works; but if you have very large files, then these direct block references won't suffice, and you have single indirect, double indirect, or even triple indirect block pointers. So what happens, for example, if we want to reference the byte at one megabyte: we walk through the direct blocks, see that this offset won't suffice, walk into a single indirect block, dereference this block, and this block contains even more block pointers, the same kind of information as stored in those three or four direct block pointer slots; we walk to the correct entry, which we can compute, this is fairly easy, and finally we dereference the appropriate block pointer, walk to the block on disk, fetch it, and then we have our data block at one megabyte offset. So this is a very simple and understandable approach, but it all happens in the file system layer. What if you had a second disk that is not addressable by the inodes on the same disk? You have to find some solution to this; you can, of course, but it's not really the appropriate place in this format.

So, coming back to the traditional layering, we'll have a look at how ZFS is layered differently. As you might see here, we have more layers and differently named layers. A quick note beforehand: these three layers are part of ZFS; of the others, this one is in the kernel, and this one is the syscall interface, which is also part of the operating system and the kernel, but not part of ZFS. We should focus on the interfaces between these different layers, because these are very different in ZFS. On the lower part you have the kernel device driver exposing the block device. Then you have the storage pool allocator, which, as the name suggests, cares about pooling the storage you have available, and the way it does this is to provide an interface similar to virtual memory. We'll come to the details later, but for now it must suffice that it implements data virtual addresses that are not specific to any individual device. If you give the SPA a data virtual address, it will look up the appropriate block on the appropriate disk, and if you configured it to use disk redundancy, it will also care about that part; so if you give it such an address, it will resolve whether that is on one side of the mirror or on the other side of the mirror, and hence you also need the I/O orchestration in this particular layer.
Next, the data management unit. It heavily leverages the storage pool allocator, because if we solve these redundancy problems, and, as we will see later, also the checksumming problems, in a lower layer, then the upper layers don't have to worry about them. That is actually very nice, because what the data management unit does is complicated enough: it implements an interface to the upper layers similar to how you interact with files in a regular C program. It exposes flat objects, addressable through an offset, and the modifications you do to an object are grouped into transactions; so the data management unit actually implements the on-disk integrity and on-disk consistency features of ZFS. Above the data management unit we have the ZFS POSIX layer, which is relatively thin because it can leverage all the layers below; it implements the actual files with POSIX semantics, and these are exposed through the traditional vnode interface to the kernel.

So how do we do this, how are these individual layers organized? We need some key techniques, which should be familiar to people who did more than one semester of computer science, but it is really important to recap how these few little key techniques enable most of ZFS's features. First one: block pointers. We already had something similar with the inodes described before, but block pointers sit at a lower level, at the SPA level. This is how a ZFS block pointer looks; don't be afraid, it is very big, these are bits, so 64 bits times 16 rows is 128 bytes, a bold statement one could say. Let's think about what we can see here. A block pointer points to another block of ZFS storage, and maybe the first thing that comes to mind is: where is the pointed-to data actually stored? We see a field that is designated for a virtual device, we'll come to this later, and what else do we see: an offset, a very big offset actually, 63 bits, and we have three of these (sorry, I didn't move the mouse to the right screen), so three vdev-plus-63-bit-offset pointers; we can point to a lot of storage here. What else do we have? Some birth times; we come to this later, but the block pointer itself carries to which transaction group, to which checkpoint of the file system, it belongs, and this is a very important feature for snapshots. And a checksum: these are 256 bits for a single checksum, and this is the checksum of the block that this pointer points to. I think this is all we need for now, because with these pointers we can build a common data structure, a tree of block pointers. The idea is to have a root of the tree pointing to lists of block pointers, each individual block pointer in a list points to another block, and this continues until you get to the leaves of the tree, where the actual data is kept. You might already realize that with the block-level checksums we have in the block pointers we can implement a recursive data integrity check on this tree: if we walk the tree from top to bottom, then each time, before we do anything with a block we just dereferenced, we check the contents of the data we read from disk and validate that its checksum matches the checksum we had in our block pointer. If the checksum matches, we know we read the correct data; if it doesn't match, it means that either the checksum or the block was corrupted, and we can report this error to the upper layers, so there are no silent bit flips any more.
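As a small aside from the administration side, the checksum algorithm behind all this is an ordinary per-data-set property; a minimal sketch, assuming the tank pool from the earlier example (sha256 instead of the fletcher4 default is just an example value):

zfs get checksum tank          # usually reports "on", which means fletcher4
zfs set checksum=sha256 tank   # newly written blocks get SHA-256 checksums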
And since all this happens in the lowermost layer of ZFS, the storage pool allocator, all the layers building on top of the SPA benefit from this feature: solve this problem once, solve it right, and you benefit from it a lot in the upper layers without having to worry about it there. So, block-level checksumming, we talked about this. The final thing that realizes the on-disk consistency in ZFS is the copy-on-write approach. Copy-on-write is fairly simple and is again an idea borrowed from virtual memory, so to say, but it's a very good idea. What we have is this tree of block pointers on the disk, and let's think about what happens when we modify a block of data. We have one particular invariant in the block pointer, which is the checksum, and when we modify a block and do not want to violate that invariant, we need to write a copy of the block somewhere else, where there is free space, because if we modified the original block in place, then, when walking the tree, we would read invalid data. So we copy the block and put it somewhere where we have free space. Well, now the checksum of the data has changed, so we need to copy the first level of indirection above it as well, compute the new checksum, put it into the copy, and write the copy; and as you can see, the original, leftmost block pointer still points to the original data block, but the new one points to the copy. We ripple these changes up the tree until we reach the root of the tree, and what do we do there? This block pointer is called the überblock in ZFS, and we overwrite it in place. The asterisk is there because of course we don't really overwrite it in place; that would be terribly stupid, because if we lost power in the middle of that write, then all the data integrity features we want to build would basically be destroyed. So in ZFS we use a ring buffer approach, so to say, rotating the überblock over different locations on the disk; this part is somewhat similar to what traditional file systems do.

Okay, so with these basic ideas we can walk through the modules that make up ZFS, and I hope I don't scare you with a big diagram, but I think when looking at it in a bit more detail it's actually fairly simple. In particular, what you should see is this straight line; we already walked it once when we talked about the layering in the beginning. We have the kernel device driver, we have the SPA (I'm sorry, this should be a group, it actually splits up into two layers, but for now let's say we have the storage pool allocator here), we have the ARC, which is a cache in between, we have the DMU, we have the ZPL, so the file system implementation, and then this part, which is actually not part of ZFS but is the vnode interface described earlier; it's basically how the operating system resolves a particular path to a concrete file system, because you can have different file systems in an operating system. And finally we have the syscall boundary and userland. So this line should be more or less familiar to you by now; but the other modules, some of which we already talked about, we should have a look at in a bit more detail.
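If you want to look at these on-disk structures yourself, the zdb debugging tool that ships with ZFS can dump them; whether each of these options is available depends on your OpenZFS version, so treat this as a sketch:

zdb -l /dev/sda1   # dump the vdev labels stored on one device
zdb -u tank        # print the currently active überblock of the pool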
So, vdevs: what are vdevs? Virtual devices. The storage pool allocator stripes across its set of root vdevs, we'll come to that later on; the SPA will spread writes to the pool across all root vdevs assigned to it, and the vdevs implement the access to the actual physical storage and also the disk redundancy features, and we come to these in a moment. To realize many different kinds of setups, a building-block approach is used, and I hope you can see this, I think it's large enough: you can compose vdevs into a tree of vdevs. So your situation might be that you have one 100-gigabyte disk and two 50-gigabyte disks and you want to somehow achieve redundancy; that would be relatively hard with traditional volume management, but in ZFS you just create a mirror that is composed of the one disk on one side and a stripe of the two disks on the other. I think it's a fairly intuitive interface, and with very few building blocks you can realize a lot of different setups (a few example commands follow below). These are the vdev types that are actually implemented: we have disk vdevs, which are basically a direct adapter to the kernel interface for a block device; GEOM, which is a FreeBSD-specific thing; you can have file-backed vdevs, which is very convenient when you are learning ZFS, and we'll use that later on; you can have cache vdevs, for example the read cache of the ARC is implemented with this feature; and you can have disk redundancy through mirrors or RAID-Z. Maybe a word about this: you can have traditional mirroring and striping, which is a very fast approach, you can get very high IOPS with it; on the other hand, if you want to make the most of the storage available, then you might go for RAID-Z1 or RAID-Z2, depending on your redundancy needs. RAID-Z1 is a little bit like RAID 3 or 5, but with logical blocks and without the traditional problem that comes with RAID 5, the write hole, where power dies in the middle of a write to the RAID and the RAID has no way to decide: my parity doesn't match the data, what should I do? The RAID doesn't know, because it doesn't know anything about the semantics of the bytes it stores. With a more integrated approach like ZFS, where the individual layers know a little more about each other, we can circumvent this write hole. If you want two or three disks of redundancy, then you just pick RAID-Z2 or RAID-Z3; there is work on RAID-Zn, but so far it has not been done. And finally, you can have very fast resilvering times if your pool is not very full, because if the volume manager, the SPA, actually knows about the on-disk structure, then in a resilvering process, where one disk just died and you replaced it with another, you only have to replicate the bytes that actually carry semantics, metadata and data, not, like a RAID, stupidly copy all blocks because it doesn't know better.
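A few example invocations of this building-block idea; the disk and file names are placeholders, and you would of course run only one of the create commands per pool:

zpool create tank mirror sda sdb mirror sdc sdd   # two mirrors, writes are striped across them
zpool create tank raidz1 sda sdb sdc              # single-parity RAID-Z
zpool create tank raidz2 sda sdb sdc sdd          # double-parity RAID-Z

# file-backed vdevs are handy for learning and experiments
truncate -s 1G /tmp/d0 /tmp/d1 /tmp/d2
zpool create demo raidz1 /tmp/d0 /tmp/d1 /tmp/d2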
So the storage pool allocator implements checksumming, compression, block-level deduplication, disk redundancy, and the I/O orchestration, and this, I think, makes a lot of sense: checksumming has to be implemented at this level, it would be stupid not to; compression, since the design goal is block-based compression (everything else would be more complicated), also makes sense at this layer, and that way even the metadata is compressed, because ZFS metadata is kept in the upper layers; and block-level deduplication is realized at this layer as well. Finally, we expose all these abstractions through an interface similar to virtual memory, to heap memory: we just ask the SPA for storage of a certain size, and it allocates appropriate storage and distributes it according to the vdev configuration.

With this all done, we should have a look at the DMU again. As we said, the DMU interface is objects and offsets into these objects. The term object is a little bit irritating, because it has little to do with object-oriented programming or anything like that; think of objects as flat files that are addressable through an offset. When interfacing with the DMU you just say: I want the byte at offset one megabyte, and you get it. The objects organized by the DMU are kept in private namespaces called object sets, or data sets. Cross-references are possible, but they are only realized by the DMU; from inside one object set you cannot reference into another one. And, as I already told you, the transaction interface is also implemented here. In code, and this is an extract from a header file of the DMU, you would create a new transaction, then maybe perform some operations, like a write on the object set, on an object addressed through a 64-bit integer, at some offset: write the contents of this buffer of this size to the object. The DMU will realize this as soon as you call dmu_tx_commit, and which transaction should it use? This one. As soon as the transaction is committed it is grouped by the DMU, and after a certain number of transactions has been collected, they are committed to disk as a single checkpoint, with the rippling copy-on-write tree scheme described before.

Next up, actually a very simple layer, the data set and snapshot layer. The DMU is used for many, many different purposes, as you might see from all the arrows pointing at it, it's really used by a lot of different modules, so what we need is some kind of meta layer to organize the individual object sets, and this is done in the data set and snapshot layer. We have file systems in there, zvols in there, snapshots of file systems and zvols, and clones; and maybe some day there will be an interface so that a database application, for example, can use the DMU directly, instead of going through all the crazy stuff of the file system and implementing transaction safety itself; there are plans for this, but so far it's not realized. Since we have so many different types of access to the DMU, we need a container for these object sets, and the name people came up with is the meta object set. Don't be afraid, this is an excerpt from a talk given by Kirk McKusick: you have one überblock pointing into the meta object set layer, so you have the root of the meta object set, containing maybe a file system, zvols, clones, and some additional metadata like space maps, I won't go into detail on that one. Each individual one of these meta object set entries references an object set where it organizes itself appropriately; all these object sets are managed by the DMU, while the structures themselves are mostly the concern of the data set and snapshot layer. I think this is fairly straightforward.
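From the command line you mostly see this layer through the different data set types; roughly like this, and the zdb call is just one way to peek at the underlying object sets if your version supports it:

zfs list -t filesystem,volume,snapshot -r tank   # file systems, zvols and snapshots of one pool
zdb -d tank                                      # dump the data sets / object sets as the DMU sees them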
Now, the thing I have omitted so far: the adaptive replacement cache and the L2ARC. You can imagine that with all these layers of indirection, if we did synchronous disk accesses all the time, for each individual dereference, ZFS would be terrible. The most elegant solution people came up with is: feed it a lot of RAM; that's basically what you have to do, because if you keep much of the metadata structures in memory, then you can resolve these indirections very quickly and only access the disk for the actual data. The ARC is the adaptive replacement cache. It is physically indexed, so not indexed by DMU object IDs or something like that, it really sits just above the SPA, so the utility of the cache is maximized. The eviction strategy has to be adjusted a little, because if you used traditional least-recently-used eviction, then streaming a movie off the disk would evict the entire cache; so the strategy combines recently-used and frequently-used lists, and also tracks which blocks were just evicted from the cache in so-called ghost lists. Read it up on Wikipedia if you want; it was originally invented at IBM, I think, a very interesting approach, and it solves specifically the problems you have when implementing a file system cache. Important to know: the ARC is very tightly integrated into ZFS, and also very tightly integrated with the L2ARC, and it is naturally separate from the system page cache. So, for example, memory maps, which can be cheap on a traditional file system utilizing the system page cache, are naturally more expensive here, because there are two copies in memory. You can configure it, but in the default configuration the ARC will consume much of your system memory; I think the default is half of your system memory, with some minimum value. You might consider tuning this variable, but in general, the more RAM you feed it, the better. And finally the L2ARC, basically a side note, though the implementation is a little more complicated: if your memory does not suffice, or you have, for example, a pool consisting only of HDDs, then you can use an SSD to speed up read operations by caching additional data on that SSD as well, and ZFS support for this is integrated and very easy to set up.

With these lower layers all done, we can think about how the user-facing features of ZFS are implemented, in particular the ZFS POSIX layer. As we already know, it implements files with POSIX semantics, and to do this it heavily leverages the DMU and the ZAP; what these mean, we'll come to in a second. I think from an engineering perspective it's a very beautiful thing that you can implement a file system with very few lines of code, because you have built up proper abstractions below. The ZFS attribute processor, the ZAP, is heavily leveraged by the ZFS POSIX layer, but also all over the place in ZFS, because what it realizes is the most basic data structure there is, a key-value store. It's nestable, so you can represent complex data inside, and it's used to store the metadata of individual files in the ZPL, it's used to store directories, also in the ZPL, and there are special algorithms and data structures inside to speed up directory lookups.
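Stepping back to the ARC sizing and the L2ARC for a moment, on ZFS on Linux this looks roughly like the following; the 8 GiB cap and the device name are just example values:

# cap the ARC at 8 GiB (module parameter; persist it via /etc/modprobe.d/zfs.conf)
echo $((8 * 1024 * 1024 * 1024)) | sudo tee /sys/module/zfs/parameters/zfs_arc_max

# add an SSD as an L2ARC read cache to an existing pool
sudo zpool add tank cache /dev/sdc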
Data set attributes like compression, the mount point, things like this, are also stored in there and used internally by ZFS, and pool attributes are stored using this layer as well, so it's a really general-purpose layer providing infrastructure to the other modules. Then the ZIL, maybe worth mentioning: this copy-on-write tree structure I talked about in the beginning, rippling up from a data leaf to the überblock and creating a new transaction group, a checkpoint, on disk, it's just not feasible to do that with DMU checkpoints for every single POSIX operation that is meant to be synchronous and to guarantee consistency, like an fsync, because they are too heavyweight; there are applications that do an fsync every few hundred milliseconds, and you really don't want to take a checkpoint every time. The solution the ZFS developers came up with is to implement some sort of journaling, but implemented in the semantics of the ZPL and the zvols, because that's really the only place where this is needed. So the ZPL will record "write to DMU object x at offset xyz" and keep the modification in a separate block, and this block can be used later on to integrate the change into the tree structure. And, like with the L2ARC, you can designate a specific vdev to be the ZIL, and although I don't use it personally, many people use NVRAM or SSDs to speed up their write operations with this, because, intuitively, if every synchronous POSIX syscall like fsync goes through the ZIL, it could become a bottleneck in your pool, and you have to prepare against that somehow. The ZIL is only very small, you don't need much space for it. How it works is that operations are queued up until enough changes are collected or enough time has elapsed, and then the changes accumulated in the ZIL are integrated into a transaction group. So nothing to really worry about during daily use, but if you encounter performance problems, this might be a way to solve them.

Finally, some specialties of ZFS, the zvols, in short: they also leverage the DMU. Basically, a zvol is a block device; what does the DMU do, it exposes objects with offsets, and block devices are also addressable through offsets, so implementing zvols is really a matter of a few hundred lines of code. What you get with this: all the data integrity features described earlier, and you can export these zvols via iSCSI or use them as local or remote VM storage, very convenient (a short command sketch for the ZIL device and for zvols follows below). Okay, the management of all the ZFS configuration is mostly done through /dev/zfs: the two userland utilities, zpool and zfs, issue ioctl calls to /dev/zfs, and the kernel acts accordingly; you configure the vdevs this way, and this device is also used to send or receive snapshot streams, it's just an ioctl call to /dev/zfs and you get the snapshot bytes out of it.

So this was the walkthrough of the ZFS modules. To recap all this, do you need a recap, or can we go to more live demos? Maybe raise your hand if you want a walkthrough of the individual modules again, or whether you directly want to see ZFS in action, how it actually looks on the command line. Option one, please raise your hand; option two, please. Oh, okay. (Audience question, mostly inaudible, about performance problems.) Okay, yeah, I will go into that later in the question section, because it's a fairly specific issue.
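Before the live part, a small sketch of the two things just mentioned, a dedicated log device for the ZIL and a zvol; the device name and the size are example values:

sudo zpool add tank log /dev/nvme0n1   # put the ZIL on a dedicated fast device (NVRAM or SSD)
sudo zfs create -V 10G tank/vm0        # a 10 GiB zvol, exposed as a block device under /dev/zvol/tank/vm0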
Okay, so we will skip this example and directly jump to how ZFS looks in action. And I have a question, yes: the question is about the necessity of ECC memory, whether you need ECC or not, because if you read about it on the internet, the advice from the large users is often: use ECC or everything will burn. So, I repeat the question: the question was whether you need ECC or not. I think this is a discussion that really exceeds the ZFS community, and it becomes religious very fast. I don't have the experience in data center operation to give any detailed comments on this; I use ECC in my personal storage machine, but I don't think I am qualified to give any statistically founded hints about this.

Okay, so, ZFS in action. I will need to reconfigure my screen so I can mirror it to you; I hope you can see it, okay. What we will do is create a few sparse files which will represent the disks in our pool. So we create six drives, six sparse files of 100 megabytes each, and we create them in the temp directory; now we have these files in the temp directory. Next we create the zpool, and because this is Linux I have to use sudo, because ZFS user delegation does not work in ZFS on Linux right now. We create a new pool called tank, which is a RAID-Z of disk0, disk1 and disk2, and we maybe also want spare disk3. I opened a watch on the right side; is the font big enough, yes, I think so; there you'll see an overview of how the zpool is architected. Okay, so the question was whether the individual disks have to be equally sized: I think if you feed it with different disks or partitions it will choose the minimum, for a mirror configuration, and I think for RAID-Z too, because you need to spread the parity so you can always guarantee that one disk can fail and you still have the data. However, there are auto-expand features in zpool, it's basically a property you can set, but the basic idea is that you feed it with equally sized disks, and if you want to grow the pool later on, you buy new drives, put them in a separate enclosure, connect them to the server or to the machine where ZFS is running, then grow the pool and remove the old disks.

Okay, so we created the pool, and maybe we put some data inside. What I would suggest is that we write a bunch of zeros to the pool. Before we do this, we might want to look at where the pool is actually mounted, so we do "zfs list tank", and we see tank has some used data, some available storage, and the mount point is /tank. You might ask why there is so little storage available: because of the 100 megabyte disks, I think we are past that size nowadays; there is some metadata that is stored on the disks, but I assure you, you won't pay a penalty of half the storage if you have four terabyte drives or something, this is really because of the demo right now and because the files are so small. So we have a file system mounted at /tank, and we create a file there containing lots of zeros. We use the dd tool for this, /tank/zeros, and we create a 100 megabyte file, and because all files in this pool are so far owned by root, we do this as root too. So in /tank we have a 100 megabyte file containing all zeros. Fine.
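Roughly, the commands of this part of the demo, written out as a sketch with the file names I used in /tmp:

for i in 0 1 2 3 4 5; do truncate -s 100M /tmp/disk$i; done   # six sparse backing files
sudo zpool create tank raidz /tmp/disk0 /tmp/disk1 /tmp/disk2 spare /tmp/disk3
zfs list tank                                                 # usage and the /tank mount point
sudo dd if=/dev/zero of=/tank/zeros bs=1M count=100           # 100 megabytes of zeros as test data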
So what happens now if one disk in the pool has corrupted data? Because we are using files, it is easy to simulate data corruption, by using dd again, but this time writing random bytes into the disk file, which is basically data corruption. We will use /dev/urandom, and in this case we will actually also corrupt the ZFS headers, so the entire drive will be marked as defective. I'm sorry that I just blanked the entire screen, I hope you remember the context a little: we write random bytes to disk0; let's make it a practical example and say 50 megabytes on this drive flip, and most of these 50 megabytes flip at the beginning of the drive. Generating randomness takes a little time. Naturally, because this data corruption happened silently, ZFS doesn't know about it; how should it, the disk would have to inform ZFS that its bits flipped, and the disk isn't that clever. So what you usually do is one of two things. Either you access a particular file on the pool, and because ZFS, during the file access, as we know from the layering architecture, walks through this tree of blocks, it will at some point encounter a checksum error and return an appropriate error through the syscall to the user. The other way to detect errors, and I strongly encourage you to do this if you use ZFS in your personal setups, is periodic ZFS scrubs. A scrub is basically a command to ZFS that it shall walk its entire on-disk data structure and verify all the checksums; it's just walking the entire tree and verifying checksums. We will tell ZFS to do the latter, so we tell the zpool to scrub all its data, and since the refresh interval of the watch is five seconds, we should see the corrupted data shortly. Okay, there we have it; maybe I can increase the font size a little. What do we see here: the pool tank is degraded, and we get very nice status information; ZFS status output is actually mostly usable, and these URLs really do contain useful information. disk0, where we just put the random bytes, is marked as unavailable, and its data is corrupted. Well, luckily we have the RAID-Z, so the pool is only degraded; it will continue to serve I/O, possibly with worse IOPS because it has to reconstruct the data each time, but we are still operational. However, we might want to fix this particular problem, so we will replace the faulted disk0 with our spare disk3, and to do this we issue the replace command, as also described up here: zpool replace, we pick the pool, which is tank, and we replace the device disk0 with the spare disk3. Let's wait for the watch to update itself. Okay, and what do we see here: we see that a spare has taken over for disk0, and ZFS tells us that it resilvered 50 megabytes of data; so basically we overwrote 50 megabytes of data, and these were corrected again. And since we have now determined that disk0 is permanently defective, we can just as well detach it from the pool, and it disappears. Now we have a fully operational pool again. Finally, we can clear all the error messages we got for tank; okay, the scan status line stays, but otherwise, if an error occurs that you decide to ignore for the moment, you can clear the messages this way. Also interesting: you can display the history of commands that were applied to the pool, so you can follow exactly how the pool was created, which operations were performed at which particular point in time, and which disks were replaced. A very convenient feature.
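Again written out as a rough sketch of what just ran on screen, with the same file names:

sudo dd if=/dev/urandom of=/tmp/disk0 bs=1M count=50 conv=notrunc   # silently corrupt one backing file
sudo zpool scrub tank                            # walk the whole block tree and verify every checksum
zpool status tank                                # the pool is now DEGRADED, disk0 unavailable

sudo zpool replace tank /tmp/disk0 /tmp/disk3    # resilver onto the hot spare
sudo zpool detach tank /tmp/disk0                # drop the permanently defective device
sudo zpool clear tank                            # clear the error counters
zpool history tank                               # every command ever applied to this pool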
So now that we have set up the pool correctly and replaced the faulted disk, maybe we can look at what the interface of the zfs command looks like; it's similar, and similarly intuitive. Okay, we have a question first. Yes, you can attach additional spares afterwards; the spares are basically meant to hop around between different pools, from what I know it's mostly a convenience feature, maybe heavier ZFS users can report on this differently, but you could just as well add another disk to the pool and then use that one. I'll repeat the follow-up question: you can either use the spare manually, or, I think in more recent implementations, there is also the autoreplace feature on the zpool that you can turn on, and that will automatically replace failing drives. Yes, another question is whether a pool can repair itself without replacing the drive: it's self-healing, so as long as there is the ability to do repairs without replacing a drive, then yes, the pool will self-heal and automatically deliver correct data. For example, while the pool was in the degraded state, it would still have answered I/O operations correctly, it would still have delivered the data you want to read. Okay, and the question was about this particular case: yes, this was a non-optimal example, because there was not just a single bit flip in some of the data structures, I actually destroyed the part where the überblock of that device is stored, and this renders the device permanently damaged, so it has to be replaced. Yeah, question: what if, in a real system, the überblock had been misread because of, say, a faulty cable, and I replace the cable and the disk is still okay, can I tell it that it just has to restore the drive from what it has left? Yes, if a cable breaks or something like that, the on-disk structure is still present, and with the birth time in the block pointers, and also the birth time in the überblock, a disk that is reattached after having been detached from the pool for a while only needs to receive the changes from the point where it was detached up to the current state of the pool, and ZFS does this automatically. Another question, and the question is whether you can make the pool smaller by just removing one disk: yes, you could do that, but you lose the redundancy, so it depends. I also did some hacks with this: with one-disk redundancy, removing a disk to attach it to a new pool, because you don't always have spare disks available to create a new pool. You see, ZFS is really meant for larger-scale operations, and with the big block pointers you have seen, it is really meant to address a lot of data; the data structures inside theoretically scale up to zettabytes, that's where the expanded name of the file system comes from, and when you are storing such huge amounts of data, a few disks are not that big of a problem; so it's a little enterprisey. Okay, deduplication: the question was how much of the 100 megabytes of zeros actually ends up occupying space on disk. On this pool I did not enable deduplication, because the deduplication feature actually uses a lot more RAM than just the ARC, which already uses a lot of RAM. You can imagine how deduplication works: you use the checksum as an index into a deduplication table, and just before you write a block to disk you check: this checksum, have I already written it? If so, you only store a reference to the original block on disk.
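For reference, the knobs that came up in these questions; the extra spare uses one of the remaining demo files, and the usual caveat applies that deduplication wants a lot of RAM:

sudo zpool add tank spare /tmp/disk4   # attach another hot spare to the pool
sudo zpool set autoreplace=on tank     # start a replace automatically when a new device shows up in a failed slot
sudo zfs set dedup=on tank             # block-level deduplication, keyed by the block checksums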
So this feature is not enabled on this pool. However, if I had enabled compression on this data set, then the 100 megabytes of zeros would have come down to a few bytes, essentially just the metadata one level up; the actual data blocks wouldn't really have been there. Enabling compression is a very simple operation; if you want to specify the algorithm you can, for example, use gzip at its various levels, or the default, which is what you get when you just turn the flag on, LZ4 or LZJB, a stream compression algorithm specifically designed for use in ZFS. Any more questions on this one? Otherwise I would continue with the individual data set features. Yes, that's the question: it is append-only compression, so to say; every block written after you enable it will be compressed, but existing blocks are not touched again. One argument why everything is not recompressed when you enable the feature: you could estimate how long it takes, but doing it immediately would produce a lot of I/O load and create a lot of new transactions at the same time. So it's append-only, and if you want to convert a data set with a lot of data to a compressed one, you would create a new data set with compression enabled and then send a snapshot of the uncompressed data set over to it; it's very easy, and this is the way to do it. Other questions? Okay.

So maybe we can get a more complete picture of the individual data set properties, and there are really a lot of them. "zfs get" gets you the properties of a data set, "zpool get" gets you the properties of the zpool. So for the data set tank: we see this is a file system, it was created at this particular date, and you get some basic statistics. These are actually fairly tricky, due to the copy-on-write semantics of ZFS; I always have to look up the man page to determine exactly which amount of data is used by which kind of blocks, because if I took a snapshot, how do I account the storage that the snapshot shares with the current state of the file system? A remark from the audience: this is also a problem elsewhere, and especially df -h is really unreliable; is it also hard in ZFS to determine how much free space is left? There is a remark in the man page in particular on how df -h interacts with this; in short, I don't use df -h for this, I use zpool get. But at least you should see it in my free-space bar here; I don't know whether that is reported by the file system or by df, I would have to look it up, but my machine runs on a ZFS root. Okay, so the remark from the audience is that the storage reports from ZFS are apparently more accurate than in btrfs, and there is some kind of background story to that: there was a use case with very many small files where very accurate accounting was wanted, so the accounting was made very accurate. Okay, well, let's walk through some of the properties, because some are actually very interesting. Here, the quota and reservation features I talked about earlier: you would actually set these properties per data set.
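As a sketch of those two properties, using the tank/logs data set I create in a moment; the values are just examples for this tiny demo pool:

sudo zfs set quota=50M tank/logs         # hard upper bound on what this data set may consume
sudo zfs set reservation=20M tank/logs   # space guaranteed to this data set out of the pool
zfs get quota,reservation tank/logs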
And maybe, because we are a little bit short on time: all the data sets I'm talking about, you have not really seen them yet. The flexibility to create new file systems basically looks like this: I want to create another file system inside the tank pool, and ZFS uses a path-like notation for this. For example, I want to create a logs file system, so I just hit this command, and then let me show, recursively, all the data sets of the tank data set: you see it's now not just the tank data set but also a child, which shares the same pool of storage. This is really convenient, because I could, for example, create another data set called movies, list it again, and I have all these different file systems; to the operating system these look like different file systems, and I can set all the individual properties we saw before on them. For example, for the logs I know logs are easily compressible, so I set compression to gzip-9, because computation is cheap, and this just happens; I know that all data written to the tank/logs directory from now on will be compressed using gzip-9, saving lots of space. In contrast, I could increase the block size of the movies data set to reduce the overhead. And on systems other than Linux, for example on FreeBSD, which I use on servers, you have all the delegation features available to you, so you can designate a data set to a jail, and all of this is also handled by the zfs utility. I think time is up; I would be willing to accept more questions if there is no consecutive event in this room. Okay, another question, yes, about the encryption topic; maybe this is a good closing point. ZFS was originally developed at Sun, and Sun decided to open source it with OpenSolaris; then Sun was bought by Oracle, and the version of ZFS that would have been released just after Oracle bought Sun actually contains encryption. So the commercial version of ZFS supports encryption; the OpenZFS implementation on most platforms doesn't, so far. However, the infrastructure for this is in place, and the most natural fit for it will probably be the storage pool allocator; I think there is one fairly recent implementation where it is done there. But there is a big discussion in the development community about how to implement this properly, not necessarily regarding the actual crypto implementation, but more the key management, because many operating systems actually boot off ZFS and you would have to integrate all of that as well. And no, Oracle doesn't share this code. Okay, I think time is up; we can talk later on, and thank you for your time.