Some of you might have noticed it: the slides say Philipp Reisner, CEO of LINBIT. That's not me, I'm just using his slides. My name is Robert Eitner, but I do work for the same company, so at least that part is true. The company is LINBIT, and we are working on a product that some of you might have heard about, and maybe some of you have used: DRBD. It's a Linux kernel module that has had a focus on block-based replication of storage. Today's presentation is about the management layer that we're developing right now, which is called LINSTOR. That is a cluster management system that allows you to manage DRBD resources and, in the future, also resources that are not based on DRBD, so all kinds of storage technologies.

A few words about LINBIT. LINBIT has been around for about 19 years now, starting around 2000 with DRBD as a master's thesis project, and it has since made its way mostly into high-availability setups. That's how it all started for us, so most of the clusters I was talking about were traditionally two-node clusters: you had one system where you would write some data in a high-availability setup, and you would mirror the same data to another system. That's where we originally come from. Now everything is moving to cloud and virtualization systems, systems like OpenStack, Kubernetes, OpenNebula; maybe you have heard about those products, and maybe some of you use smaller setups like Proxmox as well. All those projects and products are based on multi-node architectures; those are all clusters. That's where we're moving to, and that's the change I would like to present today.

Linux has a couple of storage technologies already built in, and that's what we use for DRBD. We have always used those storage technologies as the backend for DRBD, so that's what we replicate: any kind of block device. For those who are not too familiar with Unix terminology, block devices are anything on Linux that provides storage, basically every storage device: SSDs, USB sticks, hard disks, even old floppies. All those storage devices show up as so-called block devices, and DRBD is able to replicate anything that is a writable block device.

Let me go over some of those technologies quickly, so that you know what's already available and what we basically use as the basis for LINSTOR and DRBD. We use those existing technologies to create backend storage for our replication product.

Probably one of the more widespread ones is LVM, the logical volume manager, which was originally implemented in AIX, IBM's Unix. The implementation on Linux is not a port, it's a re-implementation, written from scratch, but the idea is basically the same. You have a couple of physical volumes, again block devices; those might be hard disks or SSDs, and you can also use other storage technologies that Linux provides, like a RAID system, and plug the entire RAID into LVM as a physical volume. You can combine multiple physical volumes into a volume group, and on top of that volume group you can create logical volumes. That's what you actually use to put your data on, and that's what we replicate. There is also the possibility to take snapshots of those logical volumes. The advantage here is that a logical volume can have a different size than the physical volumes it is backed by, so it's very flexible: you can have three or four physical volumes and create 20 logical volumes from them, or only one that is as large as three physical volumes combined; it will just spread out across all those volumes. A short sketch of the usual commands follows.
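To make that concrete, here is a minimal sketch of the basic LVM workflow just described, driving the standard LVM tools from Python. The device paths and the volume names (/dev/sdb, /dev/sdc, vg_data, lv_demo) are invented for illustration and would differ on a real system; only run something like this against scratch disks.

```python
#!/usr/bin/env python3
"""Minimal sketch of the basic LVM workflow: PVs -> VG -> LV -> snapshot.

Device paths and names are placeholders, not real recommendations.
"""
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Turn two raw block devices into LVM physical volumes.
run(["pvcreate", "/dev/sdb", "/dev/sdc"])

# Combine the physical volumes into one volume group.
run(["vgcreate", "vg_data", "/dev/sdb", "/dev/sdc"])

# Create a logical volume; it may span both physical volumes.
run(["lvcreate", "--name", "lv_demo", "--size", "50G", "vg_data"])

# Take a snapshot of that logical volume.
run(["lvcreate", "--snapshot", "--name", "lv_demo_snap",
     "--size", "5G", "/dev/vg_data/lv_demo"])
```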
Most people are probably familiar with that, and I'll skip a few details here and there, because the original version of this presentation was a bit longer. Those slides will be available for download, so you can look up all the details that I skip.

What's new, and what's being used more and more, is the thin provisioning feature. You can create a logical volume that is used as a thin pool for further logical volumes, so you can have a mixed setup: you can create some logical volumes that are backed by actual storage, and you can create a thin pool and put logical volumes on top of the thin pool. Then you don't have to have all the physical storage that you pretend to have. For example, if you only have 400 gigabytes of physical storage, you can still create a volume and pretend it has one terabyte of storage; as long as you don't actually write to it and fill it up, it will appear as if it were a one-terabyte volume. That's what many people do nowadays: the storage they deploy, for example for virtual machines or containers, looks a bit larger than it actually is.

Then there is the RAID feature; you've probably seen that as well: MD, the software RAID that is also in the Linux kernel, with RAID levels 0, 1, 4, 5, 6 and 10. You can use that on top of LVM, or you can use it below the volume group, so all those technologies can be combined with each other, because it's all just block devices.

There's also a caching layer, a bit of a newer feature: dm-cache. That's another layer of virtual block devices, more or less a filter, where you get SSD caching for your hard disk drives. So that's a tiered setup: you have slower hard drives where all your data lives, and the SSD cache caches some of the data that is used frequently, so that you have faster access to it.

Deduplication (VDO) is another new layer. It was bought by Red Hat; it was originally a commercial project and has since been open-sourced and put under the GPL by Red Hat. It's yet another layer, and you can combine all those layers to build your storage architecture. That's what we make use of in LINSTOR. As an illustration, a sketch of such a combined stack follows.
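Here is a small sketch that combines the layers just mentioned: an MD RAID 1 used as an LVM physical volume, a thin pool on top of it, and a thin volume whose virtual size is larger than the physical capacity. Again, all device paths, names and sizes (/dev/sdb, /dev/sdc, vg_thin, tpool, lv_big) are invented for illustration.

```python
#!/usr/bin/env python3
"""Sketch of stacking block layers: MD RAID -> LVM -> thin pool -> thin LV.

Devices, names and sizes are placeholders; only use scratch disks.
"""
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Mirror two disks with MD software RAID (RAID 1).
run(["mdadm", "--create", "/dev/md0", "--level=1",
     "--raid-devices=2", "/dev/sdb", "/dev/sdc"])

# Use the whole RAID array as a single LVM physical volume.
run(["pvcreate", "/dev/md0"])
run(["vgcreate", "vg_thin", "/dev/md0"])

# Create a 400 GiB thin pool inside the volume group.
run(["lvcreate", "--size", "400G", "--thinpool", "tpool", "vg_thin"])

# Create a thin volume that pretends to be 1 TiB, backed by the
# 400 GiB pool; space is only consumed as data is actually written.
run(["lvcreate", "--virtualsize", "1T", "--thinpool", "tpool",
     "--name", "lv_big", "vg_thin"])
```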
Now we're finally approaching the topic of replication and DRBD, and this slide is about iSCSI; DRBD is somewhat similar to that. With iSCSI you have an initiator and a target, and there is even a list of different technologies and implementations on Linux, the current one being LIO; that's what everyone uses right now. And that is somewhat similar to what DRBD does.

There's one slide in between for ZFS on Linux; I'll mention that very quickly as well. I heard there has been a problem with Linux 5, license-wise I think, because one of the kernel functions that ZFS uses changed to only being available to GPL code. It's already being solved by the ZFS developers. ZFS has been available in the Ubuntu distribution, so Ubuntu comes with ZFS, and some of our Ubuntu customers use it. It works pretty similarly to LVM if you look at it from a superficial point of view; it has similar features, and it can actually do a lot more than LVM, but the basic idea of having a volume management layer is pretty much the same.

Now, here is DRBD. Going back to the iSCSI comparison: you can think of DRBD as a RAID 1 system between two nodes, as if you had an iSCSI initiator and an iSCSI target, and your local disks were building a RAID 1 with the initiator side, and then your target basically does the same: you're mirroring from one node to the other node. It's like RAID 1 over a network. As I said, it used to be between exactly two nodes; that has changed with DRBD 9. Instead of initiator and target we call them the primary and the secondary, so we have at least one secondary node that replicates everything that the primary writes.

The interesting thing about it is that it is synchronous replication. Everything that gets written on the primary, everything that you write to the disk or the SSD or whatever kind of storage technology you have below, is also written on the secondary node by the time DRBD returns control to your application. In more technical terms: as soon as the write syscall returns to the application, you have the guarantee that the data has been written on both of those nodes. That's the idea behind it, and that's the default mode. You can also use it in an asynchronous mode, where it avoids the network latency and gives less of a guarantee as to what has been written and what hasn't.

So the basic idea is: you have a primary node that has local storage, and you have a secondary node that has a similar kind of local storage. It doesn't even have to be the same kind of storage; you could have an LVM backend on the primary side and write directly to an SSD or hard disk partition on the secondary side if you want. It doesn't have to be exactly the same configuration on all nodes; that's also what makes it interesting.

You can also create consistency groups. We are not only dealing with volumes and replicating each volume completely separately from all other volumes; there is also the possibility to create a resource that contains multiple volumes, and all those volumes are replicated across only one link, a link in that case being just a TCP/IP connection. The idea behind that is that some data sets need to stay consistent with each other, for example the actual database and the database logs that a database writes, but you might want to have those on different volumes. So the idea with consistency groups is that whenever replication is interrupted, because for example one of the nodes is not available anymore, maybe after a power outage, both volumes stop at exactly the same point in time, so they are consistent: the database itself has the same state as the database logs, and there is no time difference between those two volumes. That's the idea behind consistency groups.

With DRBD 9, DRBD has also learned to replicate to more than one peer; it's up to 32 now, actually, and that's probably a somewhat theoretical value, because no one is going to replicate the same data 31 times; that would be quite a storage overhead. What we normally see at our customer sites is three-way replication, maybe four-way, and very seldom maybe six-way if you have multiple data centers. The typical case is still two or three replicas, so a primary and one or two secondaries, something like that, per resource. You can have multiple resources, and each of those resources can have multiple volumes. That's the idea behind DRBD. A minimal resource configuration for a setup like that is sketched below.
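As an illustration, here is a sketch of a two-node DRBD 9 resource file, following the documented drbd.conf format as I understand it. The hostnames, addresses and backing devices (alpha, bravo, 10.0.0.x, /dev/vg_data/lv_demo) are invented; a real file would live under /etc/drbd.d/ and match your actual nodes.

```python
#!/usr/bin/env python3
"""Sketch of a minimal DRBD 9 resource file for two nodes.

Hostnames, addresses and backing devices are placeholders.
"""
from pathlib import Path

RES = """
resource r0 {
    device    /dev/drbd0;
    meta-disk internal;

    on alpha {
        disk    /dev/vg_data/lv_demo;
        address 10.0.0.1:7789;
        node-id 0;
    }
    on bravo {
        disk    /dev/vg_data/lv_demo;
        address 10.0.0.2:7789;
        node-id 1;
    }

    connection-mesh {
        hosts alpha bravo;
    }
}
"""

Path("/etc/drbd.d/r0.res").write_text(RES)
# Typical next steps on each node:
#   drbdadm create-md r0   (initialize the DRBD metadata)
#   drbdadm up r0          (bring the resource up)
# and on one node only:
#   drbdadm primary r0     (promote it to primary)
```

A resource with several `volume` sections instead of the single implicit volume would give you the consistency-group behaviour described above, since all volumes of a resource share the same replication link.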
Another feature that only makes sense with DRBD 9 (it was actually possible in DRBD 8 as well, but it didn't make a lot of sense with only two peers) is diskless nodes, or, as we call them, DRBD 9 clients. You can use that almost as a replacement for iSCSI. The idea is that you can read and write storage that you do not have locally; the actual physical storage only exists on a remote node. So the client reads from the secondaries and writes to the secondaries, or actually to all secondaries, and you don't need an additional layer like iSCSI to get your data from a storage node to another node that doesn't have storage. We can basically replace the iSCSI layer by just using a DRBD 9 client.

The same thing also happens if a disk fails suddenly: if you have a DRBD 9 node that used to have local storage and the disk fails, or the entire storage subsystem fails, it can just detach transparently, and all your applications keep running as if they had local storage. It simply writes to the secondaries, and obviously reads from them as well.

I'll skip most of this: there are some additional features built in for clustering, because most people obviously use DRBD in high-availability setups or, as of today, in cloud setups. There are mechanisms for fencing and for quorum; those are all there to guard against certain cases where one of the nodes fails or the network fails. There is split-brain detection, for when you have two different forks of the data because you activated multiple copies without them being able to see each other. We can recover from that, even incrementally, so we do not need a full resynchronization. Those are DRBD features.

Our further roadmap for DRBD itself is mostly focused on some performance optimizations. That's also one of the pro arguments for DRBD: it is very fast, because it only deals with block replication. It doesn't deal with any file system details, and it doesn't deal with any layers on top of that, like key-value stores or databases or anything like that; it is just another block device layer, like a filter. We are also building an erasure coding layer into DRBD, so that you would not have to keep full replicas of all the data; you could have multiple nodes and spread the data so that you have n+1 copies, things like that. Long-distance replication is also something that has been available for quite a long time in DRBD; there is an additional product to support that.

And that's also new: there is a Windows version of DRBD coming up right now; it is currently in public beta test. DRBD used to be a Linux product, but it will be available on Windows as well. As I said, there is a public beta out by now; you can download it and just test it, and it is compatible with the Linux version, so you can replicate from Windows to Linux or from Linux to Windows. It is still very tightly tied to the Linux code, so it is not a complete fork of our code base; it still merges from the Linux version.
There is a compatibility layer on top of the Windows NT kernel that allows us to keep most of our other code pretty platform-independent, so that it works on Linux and on Windows. All the features that go into the Linux version or into the Windows version will end up on the other platform with only a one-day offset: if we do a new Linux version, you get the updates in the Windows version the next day. There are some details you can get from the slides afterwards, booting from WinDRBD and things like that.

And then, finally: what is all this LINSTOR talk about? That's the interesting question today. As I said, we originally come from the high-availability market, where we only had two nodes. Now we have lots of customers that run cloud systems: OpenStack systems, Kubernetes containers, OpenNebula systems, Proxmox systems; systems where we need to plug our product into some other product, like OpenStack, so that you cannot even see it. With DRBD in a two-node setup, people just wrote a configuration file on the first server, copied it to the other server, and then they were good to go. That's nice if you have two servers and maybe ten resources, because you set it up once and then you're done. But if you have 100 servers and something like 2,000 resources, then you probably wouldn't want to write all the configuration files yourself. You probably wouldn't even want to touch the storage system itself. You would like to have a plugin where you just select "create new virtual machine" in OpenStack or something like that, or you have it deployed automatically in Kubernetes: you just create a container that has a persistent volume claim, and it will automatically create replicated storage for you. That is the idea behind LINSTOR.

What LINSTOR does is take all those technologies that I just briefly mentioned, like LVM, the deduplication layer (VDO), bcache, RAID, and DRBD itself, and allow you to set up the system so that you can deploy storage resources that make use of those layers in different configurations, completely automatically, and plug that into some other kind of product like OpenStack or Kubernetes. It builds on the existing Linux storage components, or, if we use it on Windows in the future, then obviously on the Windows storage components, of which, interestingly, there are not as many. It automates the process of creating a logical volume, putting, for example, an encryption layer on top of that, then putting a deduplication layer somewhere into the stack, maybe creating a RAID system out of it, and then putting that on one node and replicating it to other nodes, which might even have a different storage setup.

It is also built to be, to some extent, able to run multiple tenants; I'd rather call it multi-user capable. You can define different identities and roles in LINSTOR and give them certain permissions with access control lists and privileges. That is split: we have a discretionary access control layer and a mandatory access control layer, so you can more or less isolate some of those roles from each other by administrative action, in a way that they cannot override with their own access control lists, even if they own the object. And all of that is open source software, of course, under the GPL, hosted on GitHub and also on our own servers, though most people just download it from GitHub nowadays. You can find the source code online on the LINBIT GitHub.
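Before moving on to the deployment scenarios, here is a rough sketch of the Kubernetes case mentioned above, a persistent volume claim that gets its storage from LINSTOR. The storage class name "linstor-r3" is hypothetical; it stands for whatever class the cluster administrator defined for the LINSTOR driver.

```python
#!/usr/bin/env python3
"""Rough sketch: requesting LINSTOR-backed storage from Kubernetes.

The storage class name "linstor-r3" is a made-up placeholder.
"""
import subprocess

PVC = """
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: linstor-r3   # hypothetical LINSTOR-backed class
  resources:
    requests:
      storage: 10Gi
"""

# Hand the manifest to kubectl; the storage plugin then asks LINSTOR
# to create the replicated volume and binds it to the claim.
subprocess.run(["kubectl", "apply", "-f", "-"],
               input=PVC, text=True, check=True)
```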
Well, let's look at some scenarios that LINSTOR makes possible. One possibility is that you have a storage layer, a collection of storage nodes, and a collection of hypervisor nodes where you run your actual virtual machines; that's what you actually want to do. The storage nodes are supposed to provide storage remotely, so that you can run the virtual machine as if the storage were locally available on the hypervisor. What we do in that case is deploy the actual physical storage on two or three nodes, depending on your setup; most people do two or three replicas of the data. Then we put a DRBD diskless client on the hypervisor that runs the virtual machine. From the point of view of the virtual machine, it seems as if there is local storage in the node: there is a block device it can open and write to, but in reality the data is on the storage nodes. And obviously you can easily move to another hypervisor by just moving the diskless client from one hypervisor to another and migrating the virtual machine over there; it still connects to the same storage nodes. That would be one scenario, for example if the hypervisor fails. The other scenario would be one where one of the storage nodes fails, temporarily because it has lost power, or permanently because the storage itself failed, the hard disk or the SSD, or it was lost in some kind of disaster. That is completely transparent to the hypervisor running the VM, because the diskless client is connected to multiple storage nodes; if one of them fails, it just uses the other one. You don't even notice it locally.

The more interesting setup, probably for most of the people who use cloud products, is this one, which at least the large companies call hyperconverged: you have physical storage in the same node that is also the hypervisor running your virtual machines. Even in that setup you could have local storage below your VM, and you could also still use diskless clients, so the storage does not necessarily have to be on the same system as the hypervisor. Normally it is, because that's a bit faster: with local storage you can read directly from the local disk. So normally you would keep your virtual machine wherever the local storage of that virtual machine is. But there are also some scenarios that might be interesting, for example if you migrate a virtual machine, maybe because you want to rebalance the load, or because you have too many virtual machines on one hypervisor that take up too many resources, too much main memory or too much CPU time. That might be a case for live migration. One possibility is just to move the virtual machine to another hypervisor without having local storage there, because you can still do the same thing as with the split system of storage nodes and hypervisor nodes: you can still run in diskless mode. But what's interesting is that you can also put a new replica under a running virtual machine. If there is physical storage available on the hypervisor, and the virtual machine running there doesn't have a local replica, you can transparently create one. Even while the virtual machine is running on the diskless DRBD device, you can basically insert the disk below the entire stack, and it will resync with the other replicas until it is up to date. It will even go into a mixed mode where it reads all the data that is already available locally from the local disk, and what's not yet available locally, because it is still being replicated or resynced, is read from the other nodes that have all the data.

Then obviously you can remove the storage from one of the nodes where you don't need it anymore, so that you don't accumulate more and more replicas. If you have three replicas, you can create a fourth replica temporarily, and when the resync is done, LINSTOR will automatically remove the storage from one of the other nodes, so that you always keep the same replica count, for example three replicas in this scenario, not two and not four, whatever your replica count is. The idea is that all of this is done automatically. There are commands for this in LINSTOR: there is a command to create a resource, and another command to migrate a resource automatically, so that it creates a temporary new replica, does the resync, waits until the resync is done, and then removes the replica from the node it came from. A rough sketch of what the basic commands look like follows.
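The following is a sketch of the basic LINSTOR client commands for the setup just described: a few nodes, an LVM-backed storage pool, one replicated resource and a diskless client. Node names, the volume group (vg_data), the pool and resource names, and the sizes are invented, and the exact flag spellings should be checked against `linstor --help` for your client version.

```python
#!/usr/bin/env python3
"""Sketch of basic LINSTOR client commands (names are placeholders)."""
import subprocess

def linstor(*args):
    cmd = ["linstor", *args]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Register the nodes that run a LINSTOR satellite.
for name, ip in [("alpha", "10.0.0.1"), ("bravo", "10.0.0.2"),
                 ("charlie", "10.0.0.3")]:
    linstor("node", "create", name, ip)

# Define a storage pool backed by an LVM volume group on each node.
for name in ("alpha", "bravo", "charlie"):
    linstor("storage-pool", "create", "lvm", name, "pool_hdd", "vg_data")

# Define a resource with one 20 GiB volume and let LINSTOR place
# two replicas automatically on suitable nodes.
linstor("resource-definition", "create", "res0")
linstor("volume-definition", "create", "res0", "20G")
linstor("resource", "create", "res0", "--auto-place", "2")

# Attach a diskless client on the third node (e.g. a hypervisor).
linstor("resource", "create", "charlie", "res0", "--diskless")
```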
And this is the architecture; this is how the system itself works. There are two components: LINSTOR has a controller module and a satellite module. The satellite module runs on every node that is able to provide storage, or that is supposed to consume storage by running diskless DRBD, so it doesn't have to have physical storage; it could be diskless. And there is at least one controller node, with only one active controller at any time. You can have standby controllers, so that you can fail over if the controller node dies for whatever reason.

Both components are written in Java. It's just a standalone Java process; it does not run in any kind of web application container or anything like that, it's plain Java. It comes with an integrated H2 SQL database; that's where the entire setup is stored. But there is the possibility to use another database instead, so you do not have to use the integrated H2 database if you already have database servers somewhere. We also support PostgreSQL, MariaDB and DB2; those are the three databases we currently support. So you could have some central database and use your company's database server, which also makes failover somewhat easier with the LINSTOR controller, because it just reconnects to the database whenever it is moved.

On top of the LINSTOR controller, and I'll just mark it up here, is where all our infrastructure and cloud products plug in; that is the API library layer. There are API libraries available for LINSTOR that allow you to create plugins and drivers quite easily. It's basically an object-oriented library: you can create a LINSTOR resource, create some volumes, prepare everything, and then tell the library to send that to the LINSTOR controller, which will execute your request. The LINSTOR controller does that in a transaction-safe way: either your transaction, say creating a resource with multiple volumes, is accepted with all the volumes that you configured, or it is rolled back, both in the database and in memory, so there are no inconsistencies. A very rough sketch of what using such a library looks like follows.
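The sketch below illustrates the pattern just described: build up the resource description locally, then hand it to the controller in one request. It is written from memory of the python-linstor high-level API; the class, method and attribute names are approximate and should be treated as placeholders to be checked against the library's documentation.

```python
#!/usr/bin/env python3
"""Very rough sketch of driving LINSTOR through an API library.

Names are approximate (from memory of python-linstor); verify them
against the library documentation before using anything like this.
"""
import linstor

# Describe the desired resource locally ...
rsc = linstor.Resource("res0", uri="linstor://controller.example")
rsc.volumes[0] = linstor.Volume("20 GiB")   # one 20 GiB volume
rsc.placement.redundancy = 3                # ask for three replicas

# ... then submit it; the controller picks the nodes and creates the
# backing storage plus the DRBD configuration in one transaction.
rsc.autoplace()

# A diskless client on a hypervisor could be attached later,
# roughly like this (method name assumed):
rsc.activate("hypervisor1")
```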
The protocol is another interesting feature here. I don't know whether anyone here knew our previous product, which was only supposed to manage DRBD; it was called DRBD Manage, and it used different protocols for communication between nodes and for communication between DRBD Manage and the command-line client. That has changed in LINSTOR: there is only one protocol that is used for all communication. Communication between LINSTOR and the command-line utilities, communication between LINSTOR and OpenStack, OpenNebula, Kubernetes or Proxmox, and also communication between LINSTOR controllers and LINSTOR satellites: all of it uses the same protocol, which is a TCP/IP-based protocol, with SSL if you run it in production. You could also run it in plain text, but the protocol is still the same.

What's coming up, probably quite soon, since it's currently being worked on, is an additional protocol, and that's simply an HTTP or HTTPS REST API that will allow you to make basically the same requests to the LINSTOR controller as the native LINSTOR protocol allows. So there will also be a REST API directly available in LINSTOR, without requiring a web server around it or anything like that.

The architecture of LINSTOR itself is modular in many ways. We can exchange the database layer, we can exchange the protocol layer, we can load different storage plugins. There are lots of possibilities, lots of plugins in LINSTOR; even the APIs themselves, all the functions that LINSTOR has, are more or less loadable plugins that plug into LINSTOR and do something.

Obviously, data placement is interesting. You wouldn't want to select nodes manually if you have an OpenStack cluster of 60 nodes and you want only three replicas; you have the problem of selecting on which nodes you would like those replicas deployed. That is normally done automatically by LINSTOR, depending on how much space there is on each node, so it would obviously not choose a node that doesn't have enough storage to even create the backend storage for your resource. But there are also additional features that allow you to control where resources are created, and one of our customers is already using that; I'll mention it in a minute. You can put any kind of tag on a node; that's completely arbitrary and completely under the control of the user, and then you can make rules based on those tags: for example, put certain resources in a certain fire protection zone, or put these two resources on the same server, or never put these two resources on the same server, or not on servers in the same rack, things like that. So you can form zones and create rules to tell LINSTOR how to deploy all those resources. A short sketch of how such node tags can be used follows.
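As an illustration of tag-based placement rules, here is a sketch that tags nodes with a zone and then asks LINSTOR to spread replicas across different zones. The tag name (Aux/zone), the node names and the values are arbitrary inventions, and the exact property and flag spellings should be checked against `linstor --help` for your client version.

```python
#!/usr/bin/env python3
"""Sketch of tag-based placement rules (names and flags are placeholders)."""
import subprocess

def linstor(*args):
    subprocess.run(["linstor", *args], check=True)

# Tag each node with the fire protection zone it is located in.
linstor("node", "set-property", "alpha",   "Aux/zone", "zone-a")
linstor("node", "set-property", "bravo",   "Aux/zone", "zone-a")
linstor("node", "set-property", "charlie", "Aux/zone", "zone-b")
linstor("node", "set-property", "delta",   "Aux/zone", "zone-b")

# Place two replicas and force them into different zones.
linstor("resource", "create", "res1",
        "--auto-place", "2", "--replicas-on-different", "Aux/zone")
```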
You can also create rules for network paths. There is a selection of network paths, because we allow you to configure multiple network paths for different resources. You don't have to run every resource through the same network path: you might have resources that are less important performance-wise, and you might have resources that need another level of performance, or you might have certain security requirements that require you to run certain resources through different networks, things like that. We recently also added support for DRBD's multi-pathing. There are various possibilities to select a path for a resource: you can put the rule on the resource definition itself, or you can put it on the storage pool, so that, depending on which storage pool a resource uses, for example the fast SSD storage pool, it uses a different network. One of those storage pools might use a faster network than another storage pool that might be based on hard disks, for example.

I think most people have probably already seen what it supports. We have lots of connectors: Kubernetes, OpenStack, OpenNebula, Proxmox, XenServer; those are all available. What's new is that we also have a CSI plugin, the Container Storage Interface, for Kubernetes; that's the most recent addition to the supported plugins. Currently both are available, so you can still run the FlexVolume plugin, or driver, but we are obviously recommending the move to the CSI driver.

And that's our roadmap. The plan is to support setups that do not even use DRBD. LINSTOR started as a product that was supposed to support automated configuration of replicated storage, but we'll also make it available if you just want to use LVM on lots of clustered nodes, or if you just want to build an MD RAID, or if you want to use NVMe over Fabrics instead of DRBD; you can even run mixed setups, obviously. Also a lot of discovery: automatic hardware and volume group discovery, so we're trying to do more automation in the product. And the REST API, which I already mentioned.

Finally, a case study that we're running with Intel, a pretty large company, I think you know that one. Intel has based its Rack Scale Design, Intel RSD, a big project at Intel, on LINSTOR, and currently also on DRBD; that will move to NVMe over Fabrics. That project has driven much of the development in LINSTOR, due to the requirements that Intel had with selecting different network paths and different fire protection zones. Many of those features were driven by real-life requirements in the project with Intel, so many of them have been tested quite thoroughly already, although we're officially still in beta status. It has been used more or less in production already by some of our vanguard customers.

And that's about all about LINSTOR.
We have, I think, a few minutes for questions, three minutes. So if there are any questions; I will also be available afterwards outside. Yes?

[Audience question, inaudible]

Okay, so the question is whether there is some recommendation from LINBIT as to how to configure the backend storage that DRBD and LINSTOR use. I'd say, in general, it depends a lot on what exactly your setup and your purpose are, so if we know what exactly the setup is, we would be able to make some recommendation. But generally, what we see, and what seems good to us, is this: especially in those cloud systems you have rather cheap hardware with lots of cheap disks, or maybe SSDs, you're replicating to different nodes anyway, and your entire design is based on the idea that a single node can fail or even get lost permanently. So it makes sense to just use a single disk or a stripe set, some kind of volume group, and that's what most customers do nowadays. In the high-availability setups we sometimes see customers that even do RAID 1 mirroring locally and then replicate to another node, so that's four copies already. All of that is possible; it depends very much on the setup and the purpose of your cluster. But what we see a lot is just a bunch of disks in various storage pools. And what most customers also like is the ability to use different storage pools, either just to organize their data, or to make sure that if one pool overruns, so if it exceeds its storage capacity, it doesn't stop another pool from working correctly, or because the pools have different backends, like a faster one based on NVMe and a slower one with large and cheap archive hard disks.

Time's up, so I'll be available outside if someone has additional questions. And I still remember that I should ask the audience to please rate the talk on the FOSDEM home page. Yep.