Our next speaker is Claudio. The OpenBSD hacker has been all over the network stack and is now attacking storage with iSCSI daemons, the OpenBSD way. So take it away, Claudio.

Good day everybody. I'm going to talk about iscsid and vscsi, or, okay, I'm moving over here. And vscsi, the device that actually makes it possible to build this iSCSI initiator. All of this more or less started in 2009, so quite a long time ago. I decided to do a talk about this now because I finally think iscsid is, in my opinion, production-ready. I'm running it on my private server now and I have no issues with it, so this is why I'm talking about it now.

What will we cover? First, I will give a little bit of an introduction to what iSCSI is. Then I will talk about vscsi, our virtual SCSI controller, and how it is used to build up iscsid, which is the initiator implementation that we now have in OpenBSD. The last point is the magic needed to actually be able to boot and shut down with iSCSI in the whole setup.

So what is iSCSI? In short, it's SCSI over the internet. There are two RFCs that more or less cover the base protocol. It was mainly designed as a cheap solution for SANs, because everything Fibre Channel was extremely expensive. The idea was also that you could reuse an already available network infrastructure: if you already have a lot of Ethernet ports and all the technology around them, you can deploy iSCSI much more easily than you can do the same thing with Fibre Channel. It's networked disk storage, so it's block access. It's not a network file system; for that we have NFS and SMB and some other things like AFS.

What is SCSI? SCSI is the Small Computer Systems Interface. It's really old. It is a protocol to access whatever I/O devices there are. It started with disks, then CD-ROMs, tape drives, scanners. There was even the possibility to run Ethernet over SCSI, so you could start doing everything over it. There are also various physical implementations of SCSI: at the beginning it was parallel SCSI in various forms, then SAS, serial attached SCSI, was introduced. There is Fibre Channel, and IEEE 1394, FireWire. USB mass storage normally uses SCSI as a transport encapsulation. There are even software emulations: in OpenBSD, our SATA and RAID controllers actually show up as SCSI disks, even though they're probably not. The driver itself then translates these SCSI commands into the native ATA commands, or, for the RAID controllers, into whatever kind of messages they need to talk to the actual block devices behind them. So that's a SCSI device, and that's a SCSI device.

SCSI itself is a request-response protocol. There are multiple targets; normally a target is something like a disk, or it can be a chassis or whatever. A target can have multiple logical units, and there is an identifier for them, the LUN, the logical unit number. The initiator at the beginning tries to discover all these LUNs and is then able to address them by their logical unit numbers. The initiator sends command structures to the target; people normally call them command descriptor blocks. They carry the information: this is a read, this is a write, this is a discovery, all this information. The communication, and this is important, I think, is always initiated by the initiator. This is why it's called the initiator.
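To make the command descriptor block idea concrete, here is a minimal sketch of a READ(10) CDB as an initiator would build it. The opcode 0x28 and the field layout come from the SCSI block commands specification; the struct name itself is just for illustration.

	#include <stdint.h>

	/* Minimal READ(10) CDB sketch; layout per the SCSI block
	 * commands spec (opcode 0x28), struct name illustrative only. */
	struct read10_cdb {
		uint8_t opcode;		/* 0x28 = READ(10) */
		uint8_t flags;
		uint8_t lba[4];		/* logical block address, big endian */
		uint8_t group;
		uint8_t length[2];	/* transfer length in blocks, big endian */
		uint8_t control;
	};

The initiator fills in the big-endian LBA and block count, and the target answers with the data and a status.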
It's quite logical. This is a little bit different from some other protocols, and I think it simplifies things, but it also makes other stuff harder, because it's not really easy for a target to send something to the initiator to indicate an error or similar.

What iSCSI does is more or less pack all these SCSI transfers, everything that is normally on either the serial line or the parallel lines of a SCSI chain, into TCP streams. There is more or less a one-to-one mapping of a TCP session to a logical unit number, or actually of an iSCSI session to a logical unit number. Every disk that you export through iSCSI normally has its own session. It's possible to have multiple connections per session. I have not really found all that many systems that actually provide that, but it is possible, and it would give you more performance and error resilience, because you would be able to issue multiple commands at the same time. An important part is also the authentication and capability negotiation: when the session is established, there is a handshake going on to figure out who can provide which features.

This is the iSCSI connection state machine. By default you're either in "free", as in the session is unused, or you're logged in. All the other states are just there to handle the connection setup, which goes through here, or error recovery, which is all these various states over there. A lot of the complexity of iSCSI actually comes from handling error conditions. What happens if the TCP connection is torn down because there is a problem with the Ethernet or whatever? How do you recover from these things? Can you still guarantee the reliability that SCSI has, that a command that is issued actually finishes? There are various ways of handling these error cases. Interestingly, most systems that I know implement very simple error recovery: normally they just abort all the transactions, tear the session down, and hope that the kernel then reissues the command a second time.

iSCSI goes over TCP, normally on port 3260. There are 18 message types defined. Of these 18 message types, five are actually SCSI specific, and eight are for session and task configuration. Then there are a few special messages that also go a little bit into this session handling: two of them are NOP messages, more or less pings that you can ping-pong to see if the session is still alive, and the asynchronous messages are for error handling. So the bulk of the messages are not about transporting data; they are about bringing sessions up, bringing sessions down, extending sessions, and all these things. It is also funny that the base RFC already defines 22 different knobs that you can push and twiddle and tune in the hope that you get better performance, or worse performance, or that you can interoperate with the device, and things like this. That means there are a lot of different code paths that you need to try out, and that was an issue for a long time. At the beginning I had an issue in this negotiation of options: it can happen that you are talking to the other side and it actually seems to work, and then suddenly you send a message that was a little bit too big or something like that, and suddenly nothing works anymore.
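To give a flavour of those knobs, here are a few of the text key=value pairs that get negotiated at login time. The key names are from RFC 3720; the values are just one example of what an initiator might propose.

	HeaderDigest=None
	DataDigest=None
	MaxRecvDataSegmentLength=65536
	MaxConnections=1
	InitialR2T=No
	ImmediateData=Yes
	FirstBurstLength=65536
	ErrorRecoveryLevel=0

Both sides have to agree on every key, which is exactly where the "message a little bit too big" class of interoperability bugs comes from.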
All these tunable options don't make it easy, and it would sometimes be simpler to have a smaller base version that is easier to implement.

Now let's talk about vscsi. vscsi is our virtual SCSI controller. It more or less exposes the SCSI bus, or the SCSI subsystem, to userland. This means userland then has to implement what would be a SCSI controller. In our case this is the iSCSI initiator, which transports these SCSI messages to the other side, where the actual controller is sitting and doing all the work. For iscsid this is just handling stuff like moving information in and out, but it would also be possible to create a userland SCSI controller that talks to a local file. As an example, if you want to do some crazy encrypted disk storage or something like that, this would be a possibility.

The kernel passes the SCSI commands from the mid-layer to userland through this vscsi interface. This means the SCSI disk driver in the kernel creates a SCSI command and issues it into the mid-layer, and the mid-layer delivers it to vscsi, which queues it and transports it up to userland through a fairly simple interface. It is all ioctl based. It doesn't have any read or write function, which is a little bit funky, but it makes sense for this device. There are six ioctls. The first dequeues the next SCSI command. Then there is one ioctl to send back the response message of the SCSI command. There are two ioctls to read and write data, and then there are the two last commands, which are needed to probe and detach SCSI devices from vscsi and from the kernel's SCSI bus. So if you remove a session in iscsid, it detaches the device from the kernel, and with that the disk disappears and is no longer available.

It is also interesting that the vscsi data read and write commands need to be looked at from the point of view of the disk device driver. If I'm writing to the disk, this is a vscsi data write, and if I'm reading from the disk, it's a vscsi data read. That is a little bit reversed from what you would expect from the userland side, because I'm calling the read ioctl but I have to pass data in on the read, so the naming is a bit funky.

Here is an example of how you would talk to vscsi in a very simple driver. First you create the various ioctl structures. Then you do a probe request, in which you specify the target; you normally use LUN zero, because this is the first ID per target, and you tell the kernel to probe this device. The kernel will then start issuing vscsi I2T commands to do the discovery. So that is the next thing that happens: you get the I2T command, you look at the command structure, and from that you know what it is, whether it's a write command or a read command, and based on that you do various things. Here is the write case: we see that it is a write command, we allocate the memory that we need to fulfill this write, and we do the data ioctl to fetch all the data, because the write data is already available in the kernel. We get the data out, hand it to our write function, and clean up.
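Put together, the probe and the write case look roughly like this. This is only a sketch: the ioctl and struct names follow vscsi(4) as I remember them, so treat the field names as approximate rather than as the actual iscsid code.

	#include <sys/types.h>
	#include <sys/ioctl.h>
	#include <dev/vscsivar.h>

	#include <err.h>
	#include <fcntl.h>
	#include <stdlib.h>

	/*
	 * Sketch of a minimal vscsi(4) consumer: probe a disk, then
	 * serve write commands. Names per vscsi(4) from memory.
	 */
	int
	main(void)
	{
		struct vscsi_ioc_devevent	probe;
		struct vscsi_ioc_i2t		i2t;
		struct vscsi_ioc_data		data;
		char				*buf;
		int				 fd;

		/* Opening the device is the only step needing root. */
		if ((fd = open("/dev/vscsi0", O_RDWR)) == -1)
			err(1, "/dev/vscsi0");

		/* Attach target 0, LUN 0 to the kernel's SCSI bus. */
		probe.target = 0;
		probe.lun = 0;
		if (ioctl(fd, VSCSI_REQPROBE, &probe) == -1)
			err(1, "VSCSI_REQPROBE");

		for (;;) {
			/* Dequeue the next SCSI command from the mid-layer. */
			if (ioctl(fd, VSCSI_I2T, &i2t) == -1)
				err(1, "VSCSI_I2T");

			if (i2t.direction == VSCSI_DIR_WRITE) {
				/* The write data already sits in the kernel;
				 * fetch it with the data-write ioctl. */
				if ((buf = malloc(i2t.datalen)) == NULL)
					err(1, NULL);
				data.tag = i2t.tag;
				data.data = buf;
				data.datalen = i2t.datalen;
				if (ioctl(fd, VSCSI_DATA_WRITE, &data) == -1)
					err(1, "VSCSI_DATA_WRITE");
				/* ... ship buf to the backend, then complete
				 * the command with VSCSI_T2I ... */
				free(buf);
			}
		}
	}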
In the case of a read it's a little bit different, because here we know we want to read something. So I first get the data, I read the data, and I fill out the tag information. The tag is important to bind all these various messages together: the I2T at the beginning gives me a tag, and I need to tag every message that is part of this command with the same tag, so that the kernel can put things back together, because it is actually possible to do multiple reads to finish one command. Once it's finished, as a last step I do the T2I ioctl to tell the kernel that the command is done. So in the read case we read the data in, we fill out the buffers that we then pass in via the ioctl, and with that the kernel gets the information. At the end we finish the command with the T2I ioctl. There, again, we have to tell it the tag, which command it was, and we give back the status, whether the transaction was successful or whether there was an error condition; with that we more or less give the kernel feedback on what happened. The last example is how you detach a disk. It's very similar to attaching: you specify the target and the logical unit number and call the vscsi detach ioctl. With these you are able to do more or less everything that you need to do with a SCSI device.

Where are we, and what is needed? vscsi is pretty much finished. It's actually funny: when I was doing this slide, two days later there were two commits to vscsi, so the "no recent changes" is a little bit of a lie, but it works and it does the job well. Sometimes it would be nice to support multiple vscsi consumers. It would be nice to be able to run maybe two iscsids at the same time, one on the vscsi0 device and one on vscsi1, so that you could restart one while the other keeps up a second connection to the same disk, and then use the multipath features of our kernel to be able to restart daemons and things like that without losing the connection in the meantime. Sometimes I had the feeling that it would be nice to have proper blocking I/O modes and use read and write and such, but I more or less came to the conclusion that the ioctls actually work very well. In iscsid we're using libevent anyway, so everybody is polling or kqueueing, and that makes any kind of blocking I/O mode unnecessary. The idea is that you poll until you get an I2T command from the kernel, and that starts the whole process of getting some data or writing some data.

So why did we start with iscsid? We're definitely not the first ones doing an iSCSI initiator. I know that FreeBSD has an in-kernel initiator. At the time I started, NetBSD was working on a userland initiator, a generic iSCSI implementation using FUSE and ReFUSE, but as far as I know they now also have an in-kernel initiator. "We think we can do better", as I would normally say. We're trying to do a simple implementation. We wanted less kernel code, so less parsing of messages or anything in the kernel, privilege dropping, not having to keep superuser rights for anything, all these things that OpenBSD is normally about.
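The matching read-and-completion side of the same sketch, again with vscsi(4) names from memory: the data-read ioctl pushes the data fetched from the target into the kernel under the command's tag, and T2I completes the command.

	/*
	 * Continuation of the sketch above: complete a read, then
	 * detach. Struct fields per vscsi(4) from memory; approximate.
	 */
	void
	complete_read(int fd, struct vscsi_ioc_i2t *i2t, char *buf,
	    size_t len)
	{
		struct vscsi_ioc_data	data;
		struct vscsi_ioc_t2i	t2i;

		/* Push the data read from the target into the kernel.
		 * The tag ties it to the original I2T command; several
		 * data-read ioctls may be needed for one command. */
		data.tag = i2t->tag;
		data.data = buf;
		data.datalen = len;
		if (ioctl(fd, VSCSI_DATA_READ, &data) == -1)
			err(1, "VSCSI_DATA_READ");

		/* Report completion and status for this tag. */
		t2i.tag = i2t->tag;
		t2i.status = VSCSI_STAT_DONE;
		if (ioctl(fd, VSCSI_T2I, &t2i) == -1)
			err(1, "VSCSI_T2I");
	}

	void
	detach_disk(int fd)
	{
		struct vscsi_ioc_devevent detach;

		/* Mirror image of the probe: the disk disappears. */
		detach.target = 0;
		detach.lun = 0;
		if (ioctl(fd, VSCSI_REQDETACH, &detach) == -1)
			err(1, "VSCSI_REQDETACH");
	}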
And yeah, we also have, I'm saying here a mild form, but sometimes it's not really a mild form, of not-invented-here syndrome.

Now, what is it not? It is not yet another ospfd or bgpd clone that uses multiple processes and sends imsgs back and forth and all these fancy things that we have copied into many daemons by now. But it uses some parts of that: we have a control socket, we have an iscsictl command, but we're not really using imsgs; it's done a little bit easier and a little bit differently. It's also not necessary to have multiple processes, because we can just chroot and drop privileges and everything is fine. You only need superuser rights to open the vscsi device, and after you've done that, it's done; you don't have to do anything privileged anymore. We can chroot away, we can drop the privileges, and that's it. There is no need to do stuff like DNS resolving or anything else that would have to happen outside of the chroot. One of the tricks we did is that the whole config parsing is done by the iscsictl program and then passed through the control socket to the daemon, so the daemon doesn't need to reload the config and doesn't need a way to open that file. So the chroot actually works.

Yeah, the reconnect: he was asking how we handle reconnects. Reconnects are in a way simple, because we don't do DNS lookups. The config that we get from iscsictl only has IP addresses, so on a reconnect we just open the socket, bind, connect, and get a new session.

No, we don't do that. This may be something to think about, but at the moment it's not done. It also doesn't really make sense to drop the file descriptor limits to a minimum, because you don't know whether more sessions are coming: if you afterwards reconfigure, if you change your iSCSI config to open up more sessions, you would not be able to do that. At the moment we just allow it to have as many FDs open as it wants, or at least whatever the hard limit of the system is.

The good thing, and this is why we have vscsi, is that it reduces iscsid to very simple tasks. For iscsid it means that the messages coming from the kernel are complete black boxes: I never look at what the kernel actually sends me, I just take it, pack it into TCP, and send it off. The impact on the kernel is extremely minimal; the vscsi code itself is about 600 lines of code and is extremely simple. And iscsid itself is also fairly simple, because it is only responsible for very simple tasks: it needs the session and connection state machines, it needs to make sure that it can open the connections, and it packs and unpacks the SCSI messages, and that's it. It tries to be as simple as possible. That is probably also the reason why I was able to write this thing at all, because I have no idea how SCSI itself works as a protocol. After years of poking around with it I now have a somewhat better feeling, but I'm definitely not a SCSI expert. That's David, who also wrote the vscsi implementation; he probably knows way more about SCSI than I do.

The important thing here is that iscsid has to handle a lot of data copying in and out, so the buffer management is somewhat important. The idea was that iSCSI messages are very well structured, also because of the idea of being able to use RDMA and other techniques, so they normally have well-defined sizes and such.
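As an illustration of that split, here is a hypothetical iscsi.conf that iscsictl would parse and push over the control socket. The keywords are from my memory of iscsi.conf(5) and may not be exact; the point is that only resolved IP addresses ever reach the daemon.

	# parsed by iscsictl, shipped to iscsid over the control socket;
	# keyword spelling from memory of iscsi.conf(5), check the manual
	target "disk0" {
		targetaddr 192.0.2.10	# IP only, no DNS in the daemon
		targetname "iqn.2001-04.com.example:storage.disk0"
	}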
So the buffer management in iscsid is based on how a PDU is built up. The goal is that we don't copy data around more than we actually need to. We're already copying a lot of data from the kernel to userland and back, so we try not to copy it around in userland just for fun. And in the end the idea was that we also get easier data management, but I think that's always a goal.

On the right we have, more or less, the iSCSI message format on the wire. At the top is the basic header segment, which is always present; this is the 48-byte thing that's always there. Then there is an optional additional header segment, which carries additional stuff; this is especially used during capability negotiation and similar things. Then there is a header digest, which is more or less a checksum over the basic header and the additional header; then a data segment, which is present if you're actually passing data in and out; and the data digest. A lot of these things are optional. In iscsid, as an example, we're not using the digest fields at all; the plan is there, they're defined, but we're not using them.

The idea I came up with is this: I looked a little bit at how the vectored I/O calls work, and I tried to use them heavily. This is why struct pdu is built similarly: it consists of five segments of data that, combined, give you the iSCSI message. With that I can use one write call, one read call, to move one packet at a time.

As I already said, most transfers are initiated by the initiator, especially once everything is up and running. It's just the initiator sending, or actually our kernel giving commands and passing them all the way over, and waiting until a command is finished. So everything in iscsid starts with the I2T message. This is the first thing that comes in, and based on that iscsid starts the transfer: it allocates the PDU buffers that are necessary, it does the scheduling of the task, it assigns the command to a certain session or connection, and then starts running it.

This is roughly how it looks. At the beginning we have this I/O event coming in; then iscsid does the vscsi I2T ioctl. In this case it's a read operation, so it tells me direction read. iscsid then creates the PDU, assigns it to a task, and queues it on the task, which is part of the session. The scheduler more or less decides which connection is available, puts it onto that connection, and then the SCSI request actually goes over the wire to the target. The target prepares everything and starts sending the SCSI data-in messages back, which iscsid translates into vscsi data-read ioctls; this is where the data actually moves back. At the end the status comes back: we close the command for vscsi and the SCSI subsystem by sending the vscsi T2I ioctl, and we clean up the task inside iscsid.

For a write this is similar, but a bit different. It starts more or less the same way, it's a write, but then we send the request. This is now a non-immediate write operation.
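A sketch of that five-segment PDU built around vectored I/O. The segment split mirrors the wire format described above, but the struct and constant names here are made up for illustration; they are not iscsid's actual definitions.

	#include <sys/uio.h>
	#include <unistd.h>

	/*
	 * Illustrative PDU built around writev(2); the segment split
	 * mirrors the iSCSI wire format, names are made up.
	 */
	#define PDU_BHS		0	/* basic header segment, 48 bytes */
	#define PDU_AHS		1	/* optional additional header segment */
	#define PDU_HDIGEST	2	/* optional header digest */
	#define PDU_DATA	3	/* optional data segment */
	#define PDU_DDIGEST	4	/* optional data digest */
	#define PDU_MAXIOV	5

	struct pdu {
		struct iovec	iov[PDU_MAXIOV];	/* unused: iov_len 0 */
	};

	/* One vectored write sends the whole PDU without first
	 * gathering the segments into a contiguous copy. */
	ssize_t
	pdu_send(int fd, struct pdu *p)
	{
		return writev(fd, p->iov, PDU_MAXIOV);
	}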
In this case we actually send the SCSI write request saying: we're going to write to you now. The server more or less sends back: yes, I'm ready, you can start writing. Then I fetch the data and send it to the other end, normally getting back a "you can send more data now" until all the data has been sent over, if it was a very large buffer. At the end, when everything is finished, we again do the vscsi T2I and close the task.

There are also a few shortcuts, because as you can see there are a lot of round trips going on here. By using some of the knobs among the configuration options, you can do an immediate write: once you get the vscsi I2T that is a write, you can immediately do a data write as part of the initial SCSI request, so you already include a bunch of data that is sent over the wire with it. If the buffer is small enough, you end up with just one round trip: you send the request to the target with the data already in it, and you more or less just get the status back. So it's much less work. But as I said, because there are multiple ways of doing it, you also need to implement more or less all the ways of doing it.

So where are the problems with the startup? At the moment iscsid can only provide non-boot disks. This is important, and it is necessary, because it needs a running network and a running userland, since it is a userland daemon. This means you need at least init and rc already running, so you're somewhere between single-user mode and being completely multi-user. In our case, at the moment, root and /usr need to be mounted, because iscsid lives in /usr/sbin; but it would be possible to make it a static binary and move it to the root partition, so you could mount /usr off iSCSI. In my opinion this is not really the biggest issue, because you can easily boot off local disks, as is normal on servers, or use NFS to do a network boot and at least get the root file system with root and /usr.

The actual tricky bit is that even though you're mounting this additional stuff very late in the boot process, you still need to make sure that you reliably check and mount the iSCSI-exported file systems. If there was an unclean shutdown, what needs to happen is that the file system check runs, and while it is running no other daemons start up, because they may depend on the file systems that iSCSI provides being mounted already. This was something we had to figure out. We came up with a reasonably easy way of doing it. We already had a hook, because with NFS the situation is fairly similar: you normally mount the local disks first and then remount everything, including the network file systems, once the network is up and running. So we modified things a little around there: before the second mount -a of all the disks, we added another file system check run that checks the iSCSI-provided disks. Then we extended fstab, mount, and fsck so that they have a better understanding of which file systems are only available when the network is running. fstab now has a "net" option, similar to the "noauto" option, to indicate that a file system is only available if we have network.
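An iSCSI-backed disk in fstab would then look roughly like this; the device name and mount point are of course made up:

	# hypothetical iSCSI-backed file system: skipped by the early
	# fsck/mount passes, picked up by the -N passes once the
	# network and iscsid are up
	/dev/sd1a /data ffs rw,net 1 2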
And mount and fsck now have an additional -N argument. If they are run with the capital -N, they will exclusively handle the "net" file systems, everything that has "net" in its options; if they are run without it, they ignore all the "net" lines. This is not completely perfect at the moment, especially when fsck needs manual intervention, but it works well enough to be reliable.

On shutdown, things were a little more tricky, at least I thought so at the beginning. init kills all userland processes before it syncs the disks. That means we were killing iscsid before we actually flushed the file systems, so we were never able to cleanly shut down or reboot a machine. That's not good. And especially with our multipath support, what actually happened was: the file system disappeared, but the kernel still had dirty buffers, and it just sat there waiting and hoping that the disk would reappear. That never happened, because iscsid was gone. So there were two options for handling it: we could try to unmount the disks before iscsid is killed, or we keep iscsid running until the disks are synced. At the beginning I thought the second option would never work, but I did a quick test. I went into our kill code and put a magic hack in that just said: if the command is iscsid, don't send it kill signals. So you had an unkillable iscsid, and I did a shutdown. Funnily enough, the file system was clean when I booted the system up again. So this works: it's good enough to just keep iscsid running on shutdown, and on reboot everything is fine. What I did then is add a process flag to indicate that a process should not be killed by the kill(-1) that init uses to kill everything. iscsid sets this "no broadcast kill" flag on startup, and with that everything is fine.

The funny thing is that this idea of having processes that stay alive during the shutdown phase could help keep some other processes running for longer too. It would be possible to have, say, syslogd keep running until the very end, until the dying gasp of the machine. So this may be something that we will do in OpenBSD. The thing is, the kernel forces the unmount at the end anyway: it just goes and uses the big hammer and closes all the file descriptors. So it actually works. You may not get all the messages, but you will get more messages, because syslogd will be killed last, so you will get the kill messages from all the other daemons that were killed beforehand.

We don't really have... what we can do is kill -1, or you can go to single-user mode, but that also kills all the userland processes. We don't have a system where you can go and selectively say: kill everything that uses this or that file system. We don't have support for that.

In the end, what still needs to be done? I think the code still needs to be cleaned up a bit more. The finite state machine that I'm using is a bit funky; I wrote most of the code in about four days during a hackathon, and it has some strange ways of passing messages back and forth. There are still a few extensions that we don't really support. I would really like to test multi-connection support.
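Conceptually, the resulting boot ordering looks something like the sketch below. The exact invocations in OpenBSD's /etc/rc differ; this only shows where the extra -N passes slot in.

	# 1. early boot: local disks only ("net" lines are ignored)
	fsck -p
	mount -a

	# 2. network comes up, iscsid starts, sessions log in

	# 3. second pass, exclusively the "net" file systems
	fsck -N -p
	mount -a -N

	# 4. only now start daemons that may depend on those mounts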
I think our initiator actually supports it, but I haven't found any target that does more than one connection per session. That's a bit sad; maybe I need to find some really expensive gear that actually supports it. And then, long term, there is definitely the idea of not having to copy the data in and out of the kernel all the time: passing the messages through the kernel, keeping them in the kernel, and maybe even doing RDMA. Funnily enough, on my laptop, over loopback, I'm able to do about 160 megabytes per second through iscsid. So the speed is there, even though we're copying all the data through userland.

That's more or less the last slide. I want to thank David for pushing me to write iscsid and for providing vscsi, which actually made it possible. And I also want to thank Theo and Philip for helping me figure out how to do the startup and shutdown dance. Yep. Any questions?

iscsid and vscsi allow a remote target to communicate with the SCSI subsystem in the kernel. What measures are there to prevent the remote target from exploiting some bug in the kernel?

So you want to know what prevents somebody from abusing vscsi and iscsid to inject something into the kernel? Yes. There is not all that much extra checking around. I rely on a working kernel, on TCP/IP being safe, so that if I open a connection to a target, that connection is not intercepted by somebody else. The idea is also that you normally don't run iSCSI over the internet, and if you want to do that, you should use something like IPsec to make sure that nobody injects traffic into your data stream. For vscsi, the thing is: you need superuser privileges to open that device, and if you have superuser privileges, you can already read all the data you have on the disks, fiddle with the kernel, and do whatever you want. So I think there we are actually safe. Any other question?

What kind of targets did you test against?

I did a lot of work against the NetBSD iSCSI target code; that was mostly so I could work on it on my laptop. There were people playing around with the FreeNAS implementation. I myself now have a Transtec iSCSI disk shelf that more or less provides a lot of disk space for my server, and this is where most of the traffic happens now. Apart from that, I know that some people were also using the Solaris iSCSI target implementation, but not much more, I think.

Any other questions? Let's thank our speaker then.