So my name is Hannes Reinecke. I'm working for SUSE in various capacities: I'm team lead for storage engineering, I'm kind of the storage architect, I'm an upstream contributor for Linux SCSI and Linux NVMe, and I'm a member of the NVM Express group, plus various other things which I forgot to mention. So if anyone has any questions, don't hesitate to ask. I'd be happy to answer them if I'm able to do so; I'll probably answer them anyway, even if I don't know what I should be saying. Right. And thank you all again for coming. That was a rather quick rescheduling: I was planned for another slot, but then I messed up and they had to reschedule. So here we are.

This talk is about NVMe over TCP encryption. Now the obvious question is: you're doing encryption, and you're doing TLS. What a surprise, what a novel topic. Why bother? Well, the thing is, NVMe over TCP has been around for a few years now, and we really want the connection encrypted, because it's normal TCP and literally anyone can listen in, and anyone can actually forge packets. We had a very nice incident where someone was blanking out packets: we got packets of the right size, but the content was zeros all over, which confused the hell out of the protocol, and we had a fun time tracking that one down. And the problem is, yes, there are TLS libraries, but we can't really use them.

So, NVMe over Fabrics. You will probably all have heard about NVMe, and most of you will have an NVMe device in your laptop nowadays. NVMe over Fabrics is basically the logical next step: it extends the PCI interface out onto fabrics, so that you can connect to remote devices. That is precisely there to make storage array vendors happy, because suddenly they can attach their storage arrays via NVMe and thereby be faster at charging their customers even more money. And NVMe over TCP is the obvious next move. Thanks to Sagi Grimberg from Lightbits Labs, who invented, or rather standardized, the whole thing. It is just a way of mapping NVMe packets onto TCP, and it is purely in software: you don't need additional hardware, you can run it on any device you choose, and you don't even need to reconfigure the switch. That's unlike RoCE, which is similar in spirit; RoCE is RDMA over Converged Ethernet, which means it emulates RDMA on normal Ethernet, but in order to do so you actually need a special switch configuration for the whole thing to work. For NVMe over TCP, that is not required, which makes it quite similar to iSCSI. And that makes it ever so appealing for cloud vendors and virtual machines, because you don't need any configuration, or, to be precise, all the configuration you need is carried within the image. So you can easily move your image around and you will always get connectivity from that image to the target, which is quite handy, especially for cloud instances. But unlike iSCSI, the protocol itself is parallelized, because it's NVMe, and NVMe gains most of its performance advantage from the simple fact that it is parallelized. If you just run one queue, you get similar performance to iSCSI, because at the end of the day there's only so much you can do, and only so much you can do wrong.
So with one queue you have similar performance, but as you use more queues, you increase the performance. As I said, one of the targets are VMs and VM instances, which means you move into the cloud and deploy your image there. But then you have to get your storage from somewhere, also within the cloud. So you know where you're coming from, but you have no idea whom you are talking to; it could be literally anybody. And what's more, literally anybody can listen in. A funny thing happened to us when my company decided to move to Office 365: just by moving the mail server over to Office 365, the spam volume increased ten-fold. That has nothing to do with someone listening in the cloud; no, that was sheer coincidence, obviously. So you really want to do something to validate, A, the other end, and B, that the data you got is really the data you're expecting. Encryption is the way to go.

Okay, so let's do it. Let's encrypt, because, as I said, this is TLS. How hard can it be? There are standard mechanisms for this: you either wrap your entire program with something like nginx (I probably spelled it wrong, but anyway), or you use one of the standard libraries which are there, like GnuTLS or OpenSSL, modify your program to make use of that library, and that's it. So again, why are we talking here? Well, because you can't possibly use the approaches that have been used before; you have to come up with something else. The problem is that NVMe over TCP has an ever so slightly awkward design for how you establish a connection. It's not that you just open a socket and fire packets across it; you rather have to jump through various hoops to get everything aligned and created. You start off by creating a TCP connection. That is for the control side, in NVMe-speak the admin queue. Once you have that connection, you negotiate the queue parameters, like how many entries you will have on that queue, and so on and so forth. Once that is done, you're required to send an NVMe command over the wire, which tells you how many I/O queues you should create. When you get that back, you create a TCP connection for each and every queue and run the connect on each of these I/O queues. So there's a bit of going back and forth before you have a valid connection. And all of these steps are encapsulated into one user-space call (there's a small sketch of what that call looks like below): user space calls into the kernel and says, create the connection, the kernel does all of this, and then you get a status back, yeah, I did it. Which is okay, but it also means you have created several TCP connections, all of them running on their own socket, and you don't even see these sockets, because they are completely internal to the driver. Which means all the nice user-space libraries and all the nice user-space tooling we have don't work, because all of them require a socket, which we don't have. This is a schematic of how the whole thing works: you send the command, create the TCP connection, establish the admin queue, create the NVMe device, send the connect command down, learn how many queues you have to create, create the queues, open the connection for the first queue, open the connection for the second queue, and so on and so forth. And once you're done with all of that, you return to user land and say, yeah, I'm done.
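As a rough illustration of that single user-space call: nvme-cli essentially writes an option string to /dev/nvme-fabrics, and the kernel driver then does all the socket handling internally. This is a minimal sketch, not nvme-cli itself; the target address and NQNs are placeholders, and the exact option set depends on your kernel and nvme-cli version.

```c
/* Hedged sketch of the single user-space call behind "nvme connect".
 * All TCP sockets for the admin and I/O queues are created inside the
 * kernel driver; user space only sees this one write and a status.
 * Address and NQNs below are placeholders, not values from the talk.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *opts =
        "transport=tcp,traddr=192.168.0.2,trsvcid=4420,"
        "nqn=nqn.2014-08.org.example:subsys1,"
        "hostnqn=nqn.2014-08.org.example:host1";
    int fd = open("/dev/nvme-fabrics", O_RDWR);

    if (fd < 0) {
        perror("open /dev/nvme-fabrics");
        return 1;
    }
    /* This one write triggers admin queue setup, the connect command,
     * and the creation of all I/O queue connections in the kernel. */
    if (write(fd, opts, strlen(opts)) < 0)
        perror("nvme-fabrics connect");
    close(fd);
    return 0;
}
```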
Now, the specification says that, being a modern specification and all, we just do TLS 1.3, because everything else is deemed insecure, so don't bother with that. And, being clever, it defines two ways of doing things: either you start TLS before starting the connection, or you start off by asking the server, and if the server says yes, you should be doing TLS, then you switch over to TLS.

Right. But this also means that you have to create the connection inside the kernel, which means you can't use the user-space TLS libraries. But the kernel already has an internal TLS implementation, so we could just use that, right? Where's the problem? It would be easy if we had a full TLS library in the kernel, which unfortunately we don't. We do have kernel TLS, but it only does the encryption: if you know how to encrypt, you can pass the key material into the kernel, and the kernel will happily encrypt for you. But the kernel doesn't tell you how to generate that key material. The intention is that you run an additional program in user land which does the TLS handshake, generates the key material, and pushes it into the kernel, and then everything works; there's a small sketch of what that user-space side looks like below. The other problem is that this was originally a proof of concept from Google, who wanted to speed up their own services. They did that, but then figured, oh, you know what, we can do everything in user space and be even faster. So they essentially abandoned the whole implementation, and that's why it is still stuck at TLS 1.2, which makes it ever so awkward.

So, well, this is a problem: can we use the in-kernel TLS or not, and what do we do? We had been thinking about it, and then Chuck Lever from Oracle had a cooperation with a company called Tempesta, which actually has an in-kernel TLS implementation. So we gave it a go and had a look, and yes, it might work, but it gets really hard, because you have to negotiate the TLS parameters inside the kernel, and quite a few of those parameters are actually policy, which means you would have a policy engine in the kernel, which is really not something we'd like to do. Plus, it is a really massive piece of software which we would have to push into the kernel and would have to validate. So we weren't sure this was a good fit. In the end, we decided to go for a combined approach: we have a user-space daemon doing the TLS negotiation on our behalf, which is how the in-kernel TLS was originally designed to be used. The problem there was that we had to pass the socket to user space, and it has taken us about a year to come up with a protocol allowing us to do so. But finally that protocol is there, so we can hand the socket to user space, and now we have a daemon running in user space handling the TLS handshake on our behalf. This whole thing is geared up for inclusion in 6.4. Okay, so that's the TLS side.
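Picking up the kernel TLS point from above: once some user-space component has done the handshake and derived the traffic keys, configuring the kernel's TLS record layer on a connected TCP socket looks roughly like this. This is a generic kTLS sketch from the user-space side, not the NVMe driver code (which ends up doing the equivalent from kernel space), and all key material values are placeholders.

```c
/* Hedged sketch: enabling in-kernel TLS (kTLS) on an already connected
 * TCP socket after a user-space handshake has derived the key material.
 * Key, IV, salt and record sequence values are placeholders.
 */
#include <linux/tls.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_TLS
#define SOL_TLS 282             /* from include/linux/socket.h */
#endif
#ifndef TCP_ULP
#define TCP_ULP 31              /* attach an upper-layer protocol */
#endif

int enable_ktls(int fd,
                const unsigned char key[TLS_CIPHER_AES_GCM_128_KEY_SIZE],
                const unsigned char iv[TLS_CIPHER_AES_GCM_128_IV_SIZE],
                const unsigned char salt[TLS_CIPHER_AES_GCM_128_SALT_SIZE],
                const unsigned char seq[TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE])
{
    struct tls12_crypto_info_aes_gcm_128 ci;

    /* Attach the "tls" upper-layer protocol to the socket. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
        return -1;

    memset(&ci, 0, sizeof(ci));
    ci.info.version = TLS_1_3_VERSION;
    ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
    memcpy(ci.key, key, sizeof(ci.key));
    memcpy(ci.iv, iv, sizeof(ci.iv));
    memcpy(ci.salt, salt, sizeof(ci.salt));
    memcpy(ci.rec_seq, seq, sizeof(ci.rec_seq));

    /* Push the negotiated key material for both directions; from here
     * on, plain send()/recv() on this socket is encrypted and decrypted
     * by the kernel (or by a NIC offload, where available). */
    if (setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci)) < 0)
        return -1;
    if (setsockopt(fd, SOL_TLS, TLS_RX, &ci, sizeof(ci)) < 0)
        return -1;
    return 0;
}
```

This record-layer boundary is also where the hardware TLS offloads discussed later can hook in.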
Now, what do we do for the actual implementation? The spec, in its infinite wisdom, decided to settle on PSK for TLS. A PSK is a pre-shared key, which is essentially the same thing you have for your wireless: you just have to know the key, and once you use that key, everything works. Okay, good. To handle this in Linux, I decided to use the kernel keyring: you put the key into the keyring and take it from there. This has the advantage that the key can be externally provisioned, meaning I can, and in fact have to, provision the key before starting up the connection. I can also have an external entity, like a management program, provide us with the key; we can allow other vendors to build a centralized management program and thereby reduce the cost of distributing the keys to the systems. There is already support in nvme-cli for generating the keys and for storing them in the keyring, and this is what the keys look like: there is a prefix telling you, hey, you know what, this is an NVMe TLS key, and then there is just a base64 encoding of the actual key data.

And this is how it looks: that's a dump of the keyring, where you can see all the keys which are there. You see, hey, we actually do Secure Boot, right. And there is the keyring which I added, the .nvme keyring, and there are two keys in it, two PSKs, and this is the identity of those keys. The reason we need the identity is that this is what gets sent over the wire during the TLS handshake: the ClientHello carries a PSK identity, and the server has to figure out whether that is an identity it wants to use or not, meaning the server has to figure out which key to use. So the server just looks at the identity and grabs the key matching that identity (there's a rough sketch of that lookup below). Quite simple and quite easy.

So that's where we end up. The entire handshake is done by the tlshd daemon, the TLS handshake daemon, surprise, which lives on GitHub nowadays. It does the handshake, and that's all. It receives the socket via the netlink protocol and then executes the ClientHello or ServerHello, depending on which side you're starting on. It runs the TLS handshake, pushes the negotiated key material into the kernel to configure the socket for TLS, and once that is done, it passes control back into the kernel and the driver can continue. Upon completion, it even sends the key serial number down to the driver, so the driver knows which key has been used for this connection.
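As for that server-side lookup: here is a hedged sketch of fetching the PSK matching a given identity from the kernel keyring via libkeyutils (build with -lkeyutils). The keyring name .nvme comes from the talk; the key type "psk" and the shape of the identity string are my assumptions and may differ in the actual driver.

```c
/* Hedged sketch of the server-side key lookup described in the talk:
 * take the PSK identity offered in the ClientHello and fetch the
 * matching key from the kernel keyring. The ".nvme" keyring name and
 * the "psk" key type are assumptions, not verified driver behaviour.
 */
#include <keyutils.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const char *identity = argc > 1 ? argv[1] : "<psk identity>";
    key_serial_t ring, key;
    void *payload = NULL;
    long len;

    /* Locate the NVMe keyring (assumed to be named ".nvme"). */
    ring = find_key_by_type_and_desc("keyring", ".nvme", 0);
    if (ring < 0) {
        perror("find .nvme keyring");
        return 1;
    }

    /* The key description is the identity sent during the handshake. */
    key = keyctl_search(ring, "psk", identity, 0);
    if (key < 0) {
        perror("keyctl_search");
        return 1;
    }

    len = keyctl_read_alloc(key, &payload);
    if (len < 0) {
        perror("keyctl_read_alloc");
        return 1;
    }
    printf("found key %d with %ld bytes of PSK material\n", key, len);
    free(payload);
    return 0;
}
```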
And this is how it works. This is the normal nvme connect command used to start up the connection, and the only difference to a normal connection is the --tls there, just saying, okay, use TLS. Good. Then NVMe/TCP opens the connection for the admin queue and starts off by sending the message to the TLS handshake daemon. That's how it looks: there you can see 'start TLS', and that's the key serial number. Then the handshake daemon kicks in: I got a message, and I got this identity, all right, now I start the client handshake. A bit later you see: these are the parameters, these are the algorithms I'm using, hey, I'm doing TLS 1.3, and yes, I'm doing PSK, and that is the HMAC it selected for doing the actual encryption. Then the handshake is done and control is passed back to the kernel, and you see the kernel saying, hey, I'm done, woohoo, handshake complete. Then it figures out, oh, I need to create two I/O queues, so the whole thing starts again for the I/O queues: the NVMe/TCP driver opens the connection for the I/O queues and starts the handshake for them. Same story here: connect the queue, start TLS, the TLS daemon kicks in, does the handshake thing, figures out, all right, I'm using this identity, then handshake is completed, this is what I came up with, handshake is done, I'm passing control back to the kernel. This continues for each and every I/O connection we have, and then it says, yeah, okay, I'm there. Finally you see: the queues are mapped and I'm done. From then on everything works just like a normal connection. Once TLS is established, it behaves exactly like a normal NVMe connection; there is really no difference for the application or the I/O path. This is completely transparent to the application.

So, is that all? Well, not quite. There is something still left to be done. As I mentioned initially, the spec has two ways the connection can be established. One is starting TLS first and then doing the NVMe thing. The other is starting unencrypted, figuring out what the target wants us to do, and then starting TLS. This is even more complex, because you first have to start off unencrypted and then switch over to TLS, which requires a bit of coordination between the host and the server so that both sides agree on what should happen now. And it turned out that the spec had some, well, issues, to put it like that, which made it impossible to implement this the way the spec describes, as the spec actually contradicted itself in certain parts. So there is a spec update ongoing right now, and once it is finished and published, I will implement this secure channel concatenation, as it's called, so that the other way is implemented too.

Then there is the problem of TLS offloads. The nice idea I had initially, before I started the whole thing, is that there is a TLS implementation in the kernel, and that allows you to engage the offloads which are there: some network cards do provide you with a TLS offload. At this time we have support for Mellanox, that's the mlx5 cards, and for Chelsio. That is the good thing: if we're using the in-kernel encryption, we can transparently switch over and engage the offload. The bad thing is that both of them are doing TLS 1.2, and there's actually a check in the code: are you running TLS 1.2? If you're not, go away, you can't do it, thank you very much. But then we looked at the protocol, and it turns out that the wire format for encrypted data is identical between 1.2 and 1.3. So the question really is, why is that check there? Digging further, it unfortunately turned out that on the Mellanox side you have to program the TLS version when setting up the offload, so the Mellanox card actually knows whether it needs to do TLS 1.2 or something else. There might be issues there; I need to talk to Mellanox to see whether this is a real problem or just something one has to do. On the Chelsio side, however, you can't even program the TLS version.
So the Chelsio card would never know which TLS version it is supposed to be doing. So why, again, do we need that check? The Chelsio card would just work if we didn't have it. So that is something left to be tested, and then let's see whether I manage to get this to work.

The reason why I'm so intent on the TLS offload is that we, or rather I, want to go to higher speeds. As you are well aware, 10 gig Ethernet is not the end of the line. In fact, quite the contrary, it is the very, very low end of the line. 10 gig is quite common, 25 gig is becoming more and more prevalent, and you get quite a lot of 100 gig devices even now; we have several of those in our lab. And Mellanox even has 200 and 400 gig parts. So, again, where's the problem? These are just fast Ethernet, so why can't you use them? Well, it turns out that beyond 25 gig you are faster than your CPU: the frames arrive too quickly for the CPU to be able to process them. Which is a tad unfortunate, because what exactly are you going to do if you can't process the frames? The only way you can even remotely reach the bandwidth at higher speeds is by engaging offloads, typically the normal TCP offloads like LRO, large receive offload, coalescing and whatnot, all these nice things that reduce the number of frames you have to process. You really need them; otherwise your system will freak out beyond 40 gig. And if that is the case for normal traffic, where you essentially just have to read the frames, then encryption is completely out of the game, because encryption is a really, really heavy operation in terms of CPU; it's not something you do lightly. So if you want to reach any decent speed, you will have to engage offloads, and you will have to have crypto offloads, which is why I'm so intent on using them.

The other thing is that these are pre-shared keys, meaning some data which arrived at my end in mysterious ways, and I just have to trust that it will be fine. It is literally just a dump of data, so you can't even inspect whether it is correct. So there's a fair chance that eventually such a key might become compromised, and you really want something like a key refresh, i.e. replacing the key you're using and restarting the connection, just to ensure that even if the key had been compromised, you're not still running with it: you can replace it and you will have a new key. That again is something the spec tells you, yes, you should be doing; it just needs to be implemented to make it work. And the other item is X.509. A PSK, as I said, is just a data dump, which is horrible because you don't even know: right, here's the data, but which machine does it map to? Well, you know, just look at the data and you'll know that this hex dump is for that machine. Yeah, sure. So the plan is to move to X.509 certificates, because then you have a proper identification within the certificate, a proper structure, so you actually know what the certificate is about. This is still an item pending with the specification committee to get specified, and once they have it, we will move over to that. So, and that is already it.
As you've seen, we can now do TLS-encrypted connections for NVMe, which I personally really, really like, because finally, yes, we can have encrypted, secured I/O connections in the cloud. And what's more, the interface to use it is actually quite simple: it's not that you have complex configuration and parameters to set here and there. No, you generate the key, distribute it to the other side, use the --tls switch, that's it, done. That is really, really nice. And we can use offloads. And more importantly, this is the very first implementation, so you are the first in the world to be informed about it. We beat everyone, including all the hardware vendors. In fact, quite the contrary: they have been asking me, are you done? Can we start implementing? We just thought, hey, that's cool: they are actually listening to us, and what's more, we are setting the standard. Suddenly everyone has to check, all right, can we connect to Linux? And if we can't, we have to fix our implementation. Isn't that cool? I thought that was really, really grand. And now it just needs to go upstream, but I have been informed that this is just a matter of time. So thank you very much for your attention, and I won't hold you from your beers any longer. Any questions? Yes. There's a mic, if you would use it. Yeah, it's a really tiny switch.

All right, thank you for that, that was really interesting. Our infrastructure right now is running ConnectX-6, all 100 gig, on our servers. We've done some offloading stuff, but not with TLS. I'm curious, what was NVIDIA's and Mellanox's response when you went back to them with your findings? Were they receptive to making changes to support this in the future?

That is something I would like to know too, yes. Currently the response was nothing. I got in touch with the developer who is normally responsible for the implementation, but it is really slow going. Apparently TLS encryption isn't that high on their priority list. Let's see whether we can engage with them a bit more to get things started. But you're welcome to try; I would love to see how things hold up at 100 gig, and what bandwidth we actually get out of it.

Yeah, absolutely, because the numbers you mentioned, around 40, 50 gig, that's exactly what we started seeing, so we started looking at offloading.

Oh, I'm so good; that was completely out of the blue, I have never done any measurements. It's just a rough rule of thumb.

Yeah, we've seen that on 10 gig and on 20 gig, so I guess it's around 40, yes. So we started offloading things like our virtual switching and OpenFlow rules down onto the DPU in our lab. We've not rolled that into production, but these are different ways we try to get over that ceiling, because even though we've got all this capability, we can't really leverage it. Anyway, thank you very much, it was very interesting. Thank you.

Of course, he turned it back off. We're at the wrong conference to talk about NVIDIA and whatever their opinion is; this is the open source conference, it's not really what they're here for. Same for cybersecurity. So thank you for the presentation, I only caught the end of it, but I'm not surprised to hear that. You're doing cybersecurity? Yes, I'm the cybersecurity officer. Don't run away.
Because we actually tried to have a session about security and key handling for this over at LSF, but the organizers forgot to invite the security folks. Well, it happens. You still had a nice discussion. Might have been a bit pointless, though, but yeah, we had the discussion. So yes, this is definitely something we need to look at, because the real problem is: how do I get the keys, and how do I ensure that the keys I get, which essentially are just data blobs, are the keys I had been expecting?

You might want to look at how SF did their encryption as well, if you want any inspiration; they have had built-in encryption for the last few versions.

No, that's not really it. The encryption itself is relatively trivial. The question is, what do I do before I get to do the encryption? Because the key needs to be stored somewhere. I'm using the kernel keyring, and once the key is in the keyring, everything is nice and dandy, because it's the keyring, which hopefully protects it for me. That is fine. But what do I do before putting the key into the keyring? It needs to be somewhere, and as this program runs on my computer, chances are the key will actually be on my computer, such that I can run the program. And the key being on my computer more often than not means the key is on the file system. How else would the program be able to read it?

I see what you mean, a chicken-and-egg kind of problem.

But if the key is on the file system, why again do I bother trying to protect it? Literally everyone can read it. You see, there are some, well, ever so awkward things which we might want to look at. But yeah, don't run away.

Okay, thank you. Anyone else?

So I think one of the things you mentioned earlier was that you have the socket in kernel space and the user doesn't see it, and eventually you have the kernel forwarding it to user space. Does that open up more user-to-kernel interface, exposing more security risk? I have no idea about the context, just curious.

You spotted it. No, there is no security issue; we're just having another protocol exposing kernel internals to user space, where could there possibly be a security implication? That's completely safe, don't worry. We have no idea. Yes, of course it opens up another security issue, because we're passing kernel internals to user land. And yes, of course we will need to look at it, because we're passing a socket, a structure which not only exposes internals but also allows you to talk into the kernel. So yes, it is.

Sorry, one more question. The other thing I was curious about: what if you had a user-space implementation of the NVMe stack? Would that just solve the problem, since then you have everything in user space?

No, it would not, because SPDK is dead. Sorry.

Okay, so it's more of a practical thing, for practical purposes now, I see.

I mean, as soon as you have a user-land implementation, all of these issues don't really apply, because you have a user-land application and you can just link against GnuTLS and you're done.

What's the context for SPDK being dead? Because our company actually uses SPDK pretty heavily in the infrastructure.
Well, you have the drawback that you're speaking to a kernel developer, and we're having this eternal battle with the SPDK folks, because in the end we are continuously improving things and making sure that not only can we talk to everything, but also that we can talk fast with everything. The SPDK argument that everything will be faster if you move to user space doesn't really hold, because, sure, it's faster if you burn an entire CPU just listening to a single device. Yes, of course you will be fast, but you're burning the entire CPU. And what's more, and sorry, this is a bit of a favourite topic of mine, so this might get a bit longer, feel free to leave if I bore you: as I mentioned, the performance increase you're getting nowadays comes from parallelizing heavily. In fact, most modern cards will allow you to have an interrupt per CPU, so essentially queues per CPU. Each CPU can do direct I/O to the card, which means you get optimal performance because each CPU can do things. Great, okay, fine. But who exactly is generating the data you're about to send? There must be someone generating that data. But hang on, we just used all our CPUs to send the data, so who, again, does it? That is a bit of an issue nowadays: we can be fast by using all our CPUs, but if we do that, then we can't do anything other than just being fast. And that is basically the SPDK argument: yes, you are fast if you're using your CPUs for doing I/O, but someone has to generate the data, or, in the case of the kernel, someone might want to run things like bash or DHCP, for which you, surprise surprise, need a CPU. So you can't just be as fast as you want; there will always be a cut-off: okay, I need some resources for management tasks, so I really need to do scheduling, and that's not what SPDK does.

This is probably totally out of context, but years ago I was in a team doing some PoCs with NVMe, when Samsung had just come out with brand-new NVMe stuff that they said could do like a million IOPS.

We are now at 15 million IOPS in the kernel, sorry.

So that was vSphere, ESX, VMware stuff, and I think we tested with the NVMe driver, without using SPDK or user-space polling, and when we wanted to reach a million IOPS, we were maxing out four CPU cores, or probably more, just for that, because of the context switches and all this stuff.

Yeah, and as I indicated, that is an area which has been heavily improved since then. As I said, the latest results from Jens Axboe at Meta, Facebook, whatever, who is driving the performance effort, are now at 15 million IOPS with the kernel, just to prove the point. So no, it has nothing to do with kernel versus user space as such; it is more a matter of how careful you are when designing your system and lining everything up, because that's what you really have to do: you really have to look at your hardware to ensure everything you do is laid out most optimally. If you don't, that's your performance gone.

All right, thanks for the information.

Thank you for the talk. Would you please clarify, is that solution included in the SUSE distribution?

Yeah, so you can check SP5.

And what about cross-platform?

Well, this is an upstream implementation; with a bit of luck it'll land in 6.5, and once it's in 6.5, do whatever you like.

Okay, thank you.

All right, thank you very much, it was very interesting. Let's continue the discussion afterwards, and see you.