[unintelligible]
...things like sandboxing: you might actually create a process and then use unshare to basically limit its set of privileges. It depends on what you're doing.

The main purpose of namespaces is to provide isolation for various resources. If you add all the namespaces together, this means you can actually begin to get virtualization.

The network namespace is a fairly simple one. [unintelligible] So it's a blank namespace; there's nothing in it. If instead I just run a shell inside this, and now run ifconfig -a, you'll see here that there's actually a local loopback interface sitting there, but it's not configured. I can, if I want, give it an IP address, whatever, and say up. It actually doesn't matter what I give it. So in this case I can now ping that, and I'm now pinging my virtualized network interface. I can't actually ping, for a random example, that. I can't get to that because this is a totally empty namespace; the routing table is empty. So yeah, it's quite a powerful thing, but it does mean that you need to do quite a lot of setup to make a network that works, and we'll get to that a bit later.

So then there are various other ones.
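
As a rough sketch of the empty network namespace described above, the Python below forks a child, asks the kernel for a new network namespace with unshare(2), and lists the interfaces that child can see. The function name interfaces_in_new_netns is made up for illustration, the flag value is copied from <linux/sched.h>, and the call normally needs root (like the sudo used in the demo); when the kernel refuses, the function just returns None.

```python
import ctypes
import os
import socket

CLONE_NEWNET = 0x40000000  # flag value from <linux/sched.h>


def interfaces_in_new_netns():
    """Return the interface names seen inside a fresh network namespace.

    Hypothetical helper: the namespace change happens in a forked child,
    so the calling process keeps its own network. Returns None when
    unshare() is unavailable or not permitted.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWNET) == 0:
                names = [name for _, name in socket.if_nameindex()]
                os.write(write_fd, ",".join(names).encode())
        except (AttributeError, OSError):
            pass  # not Linux, or the kernel refused
        finally:
            os._exit(0)  # never fall back into the parent's code
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()
    os.waitpid(pid, 0)
    return data.decode().split(",") if data else None


if __name__ == "__main__":
    print("host interfaces:", [name for _, name in socket.if_nameindex()])
    print("new namespace:  ", interfaces_in_new_netns())
```

Where the call is permitted, the second list typically contains little more than lo, which matches the unconfigured loopback interface shown by ifconfig -a in the demo.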
There's IPC, so System V IPC: message queues and semaphores and things. You can virtualize those so that people can't access the queues or whatever outside their container.

There's the mount namespace which, like I said earlier, was the first one that was added to Linux, and it's actually quite a useful one to use. If I just go out of here... right, that's me. So if I now add another flag here, so I've got new mount, I can then go into this and do weird things like begin to unmount file systems, and that will not be reflected outside this container. So if I want to virtualize the proc file system, for example, I can do that, and I'll get to that in a second.

UTS is also an interesting one. So this is... it's called uname, which might give you another hint as to what it is. It's basically the host name of the system and related things, so I can, within the container, change its host name, and that won't be reflected outside the container.

So if I run this now, I'm root inside this container. It's not actually very isolated from the host here, because I haven't run with all the namespaces yet, so there are things I could do: if I typed reboot now, I would reboot the host machine. But if I run hostname, you'll see that this is actually my machine; it's got a host name here. If I just change that, I now have a new host name, and if I exit this, I haven't actually changed the host name outside the container. So, simple things like that, which you don't really think about; there are lots of little details in Linux that contain state of the whole system, and these namespaces help with that.
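
The host name demo above could be sketched roughly as follows, assuming Python on Linux. The helper name hostname_in_new_uts_namespace and the default name "container-demo" are invented for this example, and the flag value comes from <linux/sched.h>. The host name is only changed after unshare(2) has succeeded, so the host's own name is never touched; without root the helper returns None.

```python
import ctypes
import os
import socket

CLONE_NEWUTS = 0x04000000  # flag value from <linux/sched.h>


def hostname_in_new_uts_namespace(new_name="container-demo"):
    """Set a host name inside a fresh UTS namespace and report it back.

    Hypothetical helper: a forked child unshares its UTS namespace and
    only then calls sethostname(). Returns the name seen inside, or None
    if the kernel refused either step.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUTS) == 0:
                socket.sethostname(new_name)
                os.write(write_fd, socket.gethostname().encode())
        except (AttributeError, OSError):
            pass  # not Linux, or missing privileges
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()
    os.waitpid(pid, 0)
    return data.decode() if data else None


if __name__ == "__main__":
    before = socket.gethostname()
    print("inside :", hostname_in_new_uts_namespace())
    print("outside:", socket.gethostname(), "(unchanged:", before + ")")
```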
So one of the most interesting ones is actually the process ID namespace. At the moment, if I go back inside this container and run this, I'll see all sorts of crap that I'm running as me. So if I then go back out of this and add a PID namespace, and look at what my process ID is, I'm now 1. So my shell is essentially init, and there's nothing else running in here.

The one slightly confusing thing is that if I now run ps, I still see a load of stuff. That's because what ps is actually doing is looking at the /proc file system, and I haven't touched that. I've created a new mount namespace and a new process ID namespace, but I haven't actually virtualized the proc file system. So what I can do, very simply, is just mount proc, and because that's done from within the container, the kernel now knows that that is me, and it gives me a different view on /proc. If I then run ps again, I see that there are two commands.

So that is one of the things about this low level of containers: you can very easily get into a state where you make a mistake and somehow don't virtualize something, which means that something you expect to be isolated from the outside world isn't. So yeah, you have to be a bit careful with that.

So also, now we are in a process ID namespace, we should... and this is always a bit of a scary test, but why not? It's a live demo; it's not going to go wrong.
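
To illustrate why ps kept showing host processes until proc was remounted, here is a minimal Python sketch of how a process list can be derived from a proc mount: every all-digit directory name is one process ID. The function name visible_pids and its proc_root parameter are assumptions made for this example, not how ps is actually implemented.

```python
import os


def visible_pids(proc_root="/proc"):
    """List the process IDs exposed by a proc mount, sorted numerically.

    Hypothetical helper. The result depends entirely on which proc
    instance is mounted at proc_root, not on the caller's PID namespace.
    """
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []  # no proc file system at that path
    return sorted(int(name) for name in entries if name.isdecimal())


if __name__ == "__main__":
    pids = visible_pids()
    print(len(pids), "processes visible; my PID is", os.getpid())
```

Run inside a new PID namespace with the host's /proc still mounted, a listing like this still shows every host process; after mounting a fresh proc from inside the namespace, only the namespace's own processes appear.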
I'm in a process ID namespace. It now says this one's going down for reboot now, except nothing actually happens. What it actually did is, because there's a process ID namespace, it sent process ID 1, which happens to be my shell in this case, a HUP signal. That's all it does. If you're in a process ID namespace, what you should really do is have some program behaving as init that handles HUP and exits the whole container for you, or whatever it needs to do.

And then kind of the most interesting namespace, which is actually quite a recent thing in Linux, is... I don't know why it's doing that; doesn't matter. So there's another namespace called the user namespace, and this is where things get quite interesting. If you've noticed, I've been running all these with sudo, so I'm actually running as root on the host in order to get some limited set of permissions as the user. But if I get rid of that... right, so I'm not allowed to run clone when I'm not root. That makes sense. However, if I now add another option, which is new user, which adds to the clone system call a request for a new user namespace as well, I'll find that it just works.

Now this is a bit weird, because I wasn't root outside the container, but I am now root inside the container. What's actually happening here is that the kernel has been told to map the user from outside the container to a different user ID inside the container. This is a bit complicated, because it means the kernel has to keep track of two user IDs, and there was quite a lot of work to make this work. So, for example, you'll find this won't actually work on most Linux distributions right now, because XFS took a long time to get patched to deal with this two-user-ID concept. I think they've now got that patch into the mainline Linux kernel, but it's not yet appearing in distributions, so it'll be a while before that is available.
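
A minimal sketch of that "not root outside, root inside" effect, assuming Python on a Linux kernel with unprivileged user namespaces enabled. The helper name uid_inside_new_user_namespace is invented here; the mapping is written through the kernel's /proc/self/uid_map file, one line of the form "inside-id outside-id count". Many distributions restrict this, in which case the helper returns None rather than failing.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # flag value from <linux/sched.h>


def uid_inside_new_user_namespace():
    """Report the UID a process sees after mapping itself to root.

    Hypothetical helper: a forked child creates a user namespace, maps
    the caller's UID to 0 inside it, and sends back os.getuid().
    Returns None if any step is refused.
    """
    outer_uid = os.getuid()
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER) == 0:
                # Map UID 0 inside the namespace to our UID outside it.
                with open("/proc/self/uid_map", "w") as uid_map:
                    uid_map.write(f"0 {outer_uid} 1\n")
                os.write(write_fd, str(os.getuid()).encode())
        except (AttributeError, OSError):
            pass  # not Linux, or user namespaces are restricted
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()
    os.waitpid(pid, 0)
    return int(data) if data else None


if __name__ == "__main__":
    print("UID outside:", os.getuid())
    print("UID inside :", uid_inside_new_user_namespace())
```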
But this is a really powerful thing, because it means that if I want to sandbox a process, I can essentially have a container that is a full Linux system. It's a little bit difficult to do some things because you're not root, so you can't really route an IP address to a container that you've created as your user, because you don't have permission, unless root is running some special thing that lets you do this. So there are some gotchas, but basically you can treat this as an isolated container.

So for sandboxing, if you've seen what Chrome does, it has a setuid helper at the moment, and sandboxing is done via that so that it can have root privileges. Potentially, once this becomes more commonplace, you won't need a setuid helper, and the kernel will do all this for you, which is kind of nice.

So there's this interesting comment, which maybe you can't read, so I'll read it out: "It may, however, mean that unprivileged users may now have access to exploits in the kernel that were formerly accessible only to root, as this mail on a vulnerability in tmpfs mounts notes." So it does mean that the kernel is now a bit more exposed, and the attack surface is a bit larger, because things like file systems, which you couldn't previously mount directly as non-root, you can now actually use. Actually, you can't mount most file systems, because they just don't support this, and you need a real block device and various other setup, so it's not too bad. But there are things that weren't accessible before which now could be accessible. Whether that turns out to be exploitable, I don't know; we'll see.

So this is also kind of interesting. The namespaces are just created by the clone flags, but they don't really have a name, so you can't go in and debug them. So there's an extra entry in proc, under the directory ns, for each namespace, so you can go and look at a process and work out what namespace it's in.
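
As a small sketch of what those ns entries look like to a program, the Python below reads the symlinks under /proc/<pid>/ns and splits each target, which has the form "net:[4026531992]", into a namespace type and an inode number. The function names parse_ns_link and namespaces_of are made up for this example. Two processes are in the same namespace when the inode numbers match.

```python
import os
import re

# Link targets look like "net:[4026531992]".
NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split 'net:[4026531992]' into ('net', 4026531992)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def namespaces_of(pid="self"):
    """Map each entry under /proc/<pid>/ns to its namespace inode.

    Hypothetical helper; returns an empty dict if the directory cannot
    be read (no such process, no permission, or not Linux).
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        return result
    for name in names:
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # skip entries we cannot read or parse
        result[name] = inode
    return result


if __name__ == "__main__":
    for name, inode in sorted(namespaces_of().items()):
        print(f"{name:20} {inode}")
```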
Here I'm inside the... so if I just go to /proc/self/ns, this is just the ns entry under this particular sandbox. That's not that readable. Whoops. So maybe that's better. You've got here a name for each of the relevant namespaces, and they're symlinks to, well, a special name which includes an inode number.

You can then use this with another tool called nsenter. So if I just remember where I am... if I then go and find that shell, which is that process ID, and then go nsenter... whoops, I don't want to do that. So this is just the help for it. There's an option, whoops, too far back, there's an option for various things, so set the working directory and things. So if I then ask it to take the process ID that I've got... sorry, I'm in screen; I can't scroll properly. Give me a second. OK, never mind. So if I ask for this particular process ID... I've actually just confused my shell here, because it doesn't know what I'm doing, but I've now run this nsenter command with the process ID of the other process inside this sandbox, and I've essentially become that user from another process. So, for example, if you're debugging a container, this is a way you can get an entry point into it without needing special support from a higher-level thing. And if I just run id here, I've suddenly become root, because I've just changed into the user namespace of this container.

So yeah, this is kind of an interesting thing: a Linux container is not really isolated from the host in the same way that a VMware guest would be. I mean, obviously you trust the person running the host machine for a VMware thing, because there are many ways that you can manage the thing, but you can't really go in at the level of a single process and say, "Oh, what's that process doing?" With a container, you can technically run a debugger from outside the container on something
inside the container. So you obviously still have to trust the administrator, but it's very easy for them to go and poke around. That has a benefit, but also, if someone were to be malicious, it would be very easy for them to just inject something into a single process inside the container, and you can't really do much about that. But then again, it is useful.

So just to step back slightly: user namespace support has actually been disabled in various distribution kernels as well, and Debian has a custom sysctl that turns it off, because they're a bit paranoid about the potential security impact. I think it will get to a stage where it's cleaned up enough in the kernel that people are more willing to rely on it, but maybe not yet.

The other interesting thing about this user namespace thing is, if I carry on poking around in /proc, there's also... sorry, let's have completion... there's also this interesting thing called uid_map. This tells it to map UID 1000 outside to UID 0 inside the container, so it's a way of mapping process IDs, sorry, user IDs. You can use sequential ranges of user IDs and so on, and there's support in recent versions of the shadow password tools to deal with having sub-user accounts and other things that you need to make this work.

So there is actually quite a lot of support for containers across the whole of the Linux userland utilities. It's quite nice in a way, in that it's spread out and generic, which means you can use just one bit of this if that's useful to you. But it does mean that there isn't really a central point of documentation for all this, and it's a bit disjointed.

So, going back to the slides. That's kind of it about namespaces. Another thing that's used is cgroups. Cgroups, as opposed to namespaces, are a bit more rigid, in that
there's a special file system that you have to mount, and this can then have various... sorry, so yeah, you mount it under /sys/fs/cgroup by convention, and you have various controllers. There's a memory one, a devices one, a block I/O one, and a freezer one.

Freezer I'm not going to talk about much, but it's quite interesting in that it allows you to freeze a whole Unix process, a bit like suspending it. You can then get a special external tool that will serialize the state of that to disk, so you can save a process, including things like its active TCP connections, and restore it on another machine. The support for this isn't yet very good, but it's something that could potentially be used to make a container-based virtualization platform where you can switch processes between machines transparently, and things like that. So that's kind of interesting.

But if I just exit this... so this is just the cgroup file system here, and as you can see, there are the ones I mentioned, and a few others as well which support various things like hugetlb and so on. One of the interesting things is that there's a memory.limit_in_bytes. This isn't set up at all; that's just a very large number, I don't know, maximum 64-bit int or something like that. There's not that much memory on my system, unfortunately.

If you then go and create a cgroup... the way cgroups work is a bit weird, in that I just make something in here. So if I make something called foo, and then I change into foo... nah, that didn't work. It should populate things with files; I apparently haven't set this system up quite right yet. So I'll use contain again, because this does it for me. I can go memory.limit_in_bytes, and give that 10 megabytes. Now, cgroups need a name, so I have to give this a name; I'll just call it test.
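
The steps just described (make a directory for the group, then write a limit into memory.limit_in_bytes) can be sketched in Python as below. This assumes the cgroup v1 layout used in the talk, mounted at /sys/fs/cgroup/memory; on cgroup v2 systems the equivalent file is memory.max. The function names create_memory_cgroup and add_process, and the root parameter, are invented for illustration, and this is not a description of how the contain tool works internally. Writing to the real hierarchy needs root.

```python
import os

# cgroup v1 memory controller mount point (by convention).
CGROUP_V1_MEMORY_ROOT = "/sys/fs/cgroup/memory"


def create_memory_cgroup(name, limit_bytes, root=CGROUP_V1_MEMORY_ROOT):
    """Create the cgroup `name` and cap its memory; return its path.

    Hypothetical helper. On a real cgroup mount, making the directory
    creates the group and the kernel fills it with control files.
    """
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "memory.limit_in_bytes"), "w") as limit:
        limit.write(str(limit_bytes))
    return path


def add_process(path, pid):
    """Move a process into the cgroup by writing its PID to cgroup.procs."""
    with open(os.path.join(path, "cgroup.procs"), "w") as procs:
        procs.write(str(pid))
```

For example, as root, `add_process(create_memory_cgroup("test", 10 * 1024 * 1024), os.getpid())` would place the current process, and anything it starts afterwards, under a 10 MB limit. The root parameter exists so the same code can be pointed at a scratch directory when experimenting without privileges.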
And if I then just run something very simple, like bash... oops. Ah, cgroups I do actually need root for. So yeah, I'm now inside a container which should be limited to 10 megabytes only. So if I do something like... so that's just allocating 10 megabytes, and it'll just print OK at the end. And then it says "Killed".

The interesting thing about this is that it's actually using the kernel OOM killer. As far as the kernel is concerned, it's a thing that's out of memory, and it uses the OOM killer to kill it, so you get a kill -9, just like you would if you ran the system entirely out of memory. So it's kind of nice in that it's just integrated into the kernel and just works.

If you're familiar with ulimit, that's not a great thing for limiting memory, because you have to limit the amount of virtual memory, and people might need to map large files which don't actually use much memory. But this accounts for some amount of kernel memory as well, so in theory, if you do something evil that results in filling up the kernel as well, then it will kill you anyway. So it should be a nicely isolated container. Of course there are things which may not be perfect, but generally it works quite well.

So yeah, memory.limit_in_bytes is what I just talked about. You can also use cgroups to control what devices are allowed. So this says deny all, and then allow one particular device, and 1:3 happens to refer to /dev/null. The interface for this is a bit weird, and you ideally don't want to be setting this up yourself too often, but you can, if you want, go down to this level and say you're not allowed to use /dev/null, or whatever else you want to do. Sorry, you're not allowed to use anything other than /dev/null.

So far we haven't really talked about anything at the file system level, and arguably that's
actually one of the most important bits of virtualization, which is giving people a different view on the file system. There isn't really any special support for this in namespaces or cgroups, trying to think, but these are part of the Linux system: chroot, and potentially pivot_root if you need to use that, can be used. So basically you just create a new mount namespace and then chroot into it.

And the nice thing is... if we go back slightly, here I'm running this as a user, so no need for any extra permissions. I am root, though, so I have a convenient Debian chroot just here, and I can then chroot into that, and I'm still root, and I'm now inside a chroot. So that's kind of nice in that it's fairly simple to use, and I can use it as a user.

This means that if you're familiar with things like fakeroot for Debian building, you can replace this with something that uses the kernel. fakeroot works by LD_PRELOADing a library that basically lies to anything that calls it, saying that it now owns these files, and that remembers file ownership. It's a bit of a weird hack, because there are things where that can go wrong. So this is quite nice: it's using the kernel support, but it can be done isolated, as a user, so you don't need root in order to build a package that needs root to build, if that makes sense.

So yeah, it's very Unixy really, because there are all these parts that you're putting together, and if you put them together right, you end up with a very nice system whereby you can play around with the exact aspects that are virtualized, or virtualize more or less, or whatever you want to do.

So really, you probably don't want to be using this directly yourself. It's kind of interesting to play with, but systemd, behind the scenes, is already using cgroups, so if you are using a fairly recent Linux system
you probably are already using cgroups without realizing it. There's a utility that comes with systemd called systemd-nspawn, which allows you to run a particular system inside a container like this. If you look at the man page for that, there are some very simple examples at the end that say: this is how to run debootstrap to get a Debian root file system, and then you can run systemd-nspawn with the right options, and you end up with a file system that is isolated. Then you can run apt-get install, or build your package, or whatever you want to do inside that.

Docker and LXC kind of go together, so Docker underneath is actually using LXC. If you're not familiar with Docker, it's a very nice system where you basically give it very simple recipes and it will go away and install whole systems for you. It makes use of extra features on top of containers, like union file systems, so you can have a very simple file system to start with, then add a web server to that, and then save that under a name and recreate it later. So it's really quite nice; it builds on all this and adds a management layer that isn't really present in the low-level stuff.

There's something called Let Me Contain That For You, which is written by Google and sort of allows fairly low-level control. It's still rather a work in progress, but the idea there is to complement Docker and various things.

And then there's something called pflask, which is quite a nice simple thing, similar to the contain thing that I've been using as an example but a little bit more full-featured. They have some good examples of things you can do: you could run a web browser in a way that it can't actually see your home directory. You can use bind mounts to hide your home directory from the web browser, so it's still running technically as your user, but you've hidden any files
that you're scared some weird website might try and steal from you, or whatever. So it's a level of isolation, but limited, in that it's still able to talk to your X server and things. So yeah, that's quite nice.

So that's mostly the end. There are two good article series on Linux Weekly News. The first one is, I think, a series of seven articles about namespaces, and it goes through process ID namespaces up to user namespaces and all the other ones. And then there's also a quite recent series on cgroups, because they've actually begun to change how cgroups work: they've turned out to be quite complex and didn't really work for everyone, systemd is now kind of managing them, and there's a bit of discussion about exactly what's going to happen there. So that's quite an interesting series to read.

That's pretty much it, so if anyone has any questions, feel free. I think there's a mic over there.

Yeah, just wondering what the performance hit is when you containerize a new process. Like, how many containers can you spin up before you have problems?

So it's only really a Linux process, so there isn't a huge... I mean, if you do things like create a new mount for every one, of course you'll be using more space in the kernel for that. One thing I didn't really mention is networking: if you start wanting to isolate them on the network, you then need to give each one a separate IP address and all stuff like that, which gets a bit more complex. But if you were just running up a container that had input and output and nothing else...

So would it make sense to spin up a new container to service, like, a web request, or is it...

So that would be a bit like running CGI, in that you'd essentially be forking for every request. So it wouldn't be great for every web request, but maybe you could do something like, if you had several
workers for your web server, you'd keep those in a separate container. But usually the idea is that you run a particular server in a container, and then that's an isolated thing, so you wouldn't usually go to that level. But they are fast enough that if you did want to do something... for example, I've got an IRC bot where you can just give it random Unix commands to run, and it runs those in a container. I'm relatively happy to trust that; I probably won't put it on a public channel, but it's on a channel with a few friends on. It's quite useful for playing around, and it's fast enough to make that kind of thing work. So yeah, any more?

Right. The namespaces, are they nestable? If you create, let's say as a regular user, something that you're then root within, can that create its own virtual...

Yes. Yeah, that works.

OK, any more? OK, well, thank you very much.