Okay, I guess we can start. The topic of this presentation is how we are speeding up software like ps and top. It's a long, very long story how we came across this and what we're doing. Has anyone in the audience seen that ps is slow? Yep. Me personally, I had a complaint that ps was not working, so I logged into the system, took a look at it under strace, and I saw that it was working, just slowly getting through all the files in /proc. But from the perspective of the user it was not working, because ps gathers up the information first, then it sorts it and so on, so there is no output for quite a long time.

So the agenda here is: I'll give you a short introduction about the company I work for, the projects that I'm involved with, and myself. Then we'll see what the current interface for accessing information about processes is, and its limitations. Then I'll show a similar problem, actually one problem, that was solved before. Then I'll present the solutions, both bad and good, and we'll see what the performance of the good solution we are coming up with is. So, marketing forced me to put this slide: I work for Virtuozzo.
This is a company that has been around for a long time, and we are an industry pioneer: we did containers before it was cool, before they were called containers. We have lots of partners and lots of workloads running; some of these partners, I believe, are actually present here at this conference. The company was founded back in '97, and we are in the process of separating from the parent company. The headquarters is in Seattle; we have offices in London and Moscow. It's about 170 employees, more than a hundred of those are actual engineers, and we have 15 kernel hackers on board. Not that kernel hackers are not engineers; they're actually like the best engineers. And we are sponsoring some of the key open source initiatives, or contributing, or otherwise involved in them. So Virtuozzo is the same company that used to be known as SWsoft and Parallels and Odin; it's the very same company, just a sequence of names.

About me: this is the first time I'm inserting this slide about myself. I've been a Linux user since '95; back in the day that meant very old kernels and very funny hardware and software. I've been involved in development of containers since about 2002; actually, they were called virtual environments, as opposed to virtual machines, and the term container came a bit later. I'm the principal author of utilities such as vzctl and vzpkg; the last one, I believe, is sort of a precursor of Docker. Then I was leading the whole OpenVZ project from its very beginning in 2005 up to last year, and then I moved on and I'm doing some of the research stuff. I also happen to be a long-time SCALE speaker: my first time here was exactly ten years ago, when I was doing an introductory talk about OpenVZ. That was actually my first talk, and that was actually my first time in the US, because I was living in Russia at the time. Anyone here who attended SCALE 4x?
Yeah, wow, so that's two of you. All right. My Twitter handle is my family name, kolyshkin. And that's all about me, I guess.

So OpenVZ is a full container solution for Linux. Unlike Docker, we do full containers, system containers: we run a whole distro inside each of the containers. It looks more like a VM, but it's a container. It's been developed since the last century, it was open sourced in '05, and that's how the OpenVZ project was born. We have had live migration for containers since 2007. One of the goals of the project at the time, and it still is, was this: back in the day there was no container functionality in the kernel at all, so we had to patch the kernel heavily, a lot, and then we found ourselves with lots of kernel patches. We felt that we'd better merge those patches into upstream Linux, to let everyone else use them and to ease our work of porting to newer kernels. It ended up that we have merged over 2000 kernel patches into the Linux kernel, and that makes us the biggest contributor to the container functionality in the kernel. That is the stuff that enables LXC, Docker, CoreOS, and of course OpenVZ.

OpenVZ is now being reborn in the form of Virtuozzo 7, and it's becoming yet more open: less proprietary components, more open. Virtuozzo 7 is currently in beta, and you're all welcome to give it a try.
I think beta three is about to be released.

So CRIU is another project I'm involved with, and this talk is mostly about some small work that we've done with CRIU. CRIU was born as a sub-project of OpenVZ, in order to replace the OpenVZ in-kernel checkpoint/restore mechanism; I'll tell you about it later. It's a bit more than three years old. The whole point of CRIU is to be able to save and restore sets of running processes, the processes running on your Linux system. It's like a sort of hibernation, suspend-to-disk for a notebook, but not for the whole system, just for a set of processes, and that set of processes might accidentally be a container. So this is what CRIU does: it saves the complete state of the running processes, and later you can restore it, and you can restore it on a different machine, and that's called live migration. CRIU is currently integrated into OpenVZ, of course, and Docker and LXC. Some of you may have seen the Docker demo where they play Quake and migrate the Quake server from Europe to the US; this is all done with CRIU. Being able to checkpoint and restore sets of running processes is a prerequisite for live migration, but the focus of CRIU is a bit wider than that; for example:
You can do things like saving periodic state of a very long-running computational process, HPC. Say you have something that needs two weeks to be calculated, it's running, there's a power failure in between, and you lose a week of your work. Instead of modifying the application, you can use CRIU to periodically checkpoint its state, say once every hour, and in case something goes wrong, you can restore from that checkpoint instead of losing a week of work. Same thing for games: you can have that magical save button in a game that lacks it. You can do things like updating the kernel, or doing something with the hardware that requires a reboot or a power-off: instead of stopping everything, you checkpoint everything, then you do your stuff, like rebooting into a new kernel, and then instead of starting everything up, you restore it; that way it's faster. And if you do it within a few minutes, all the network connections will remain, so your users will see it not as downtime but as some sort of unusually long delay. Using live migration, of course, you can do load balancing between nodes in a cluster. You can speed up application startup: for example, we did a test with the Eclipse GUI that took about a minute and a half to start; we checkpointed it, and instead of starting it we restore it, and it takes like five seconds instead of a minute. I believe some companies are trying to do that with Android phones. You can do things like reverse debugging, going back in time: you have a checkpoint, and then you can always return to that checkpoint and test from there, not from the very beginning of the application. And one more feature that we have in CRIU: using CRIU you can inject faults into an application.
So, for example, you can close an open file descriptor of any application and see whether it handles that correctly, stuff like that.

So the main idea behind CRIU is basically this. We had that task of merging all the OpenVZ kernel stuff into upstream Linux, and one part of it, I think about one third of the code, was checkpoint/restore for containers. We tried hard to merge it, but the code goes across all of the kernel, except maybe for drivers; it's very invasive, and no single subsystem maintainer wanted to see our code in their beloved subsystem. And we were not the only ones who tried to implement checkpoint and restore in the Linux kernel and merge it upstream: there was a guy who spent a few years of his life trying to do the same, and he failed as miserably as we did. So we decided to hack around it and reimplement the whole thing in user space, or mostly in user space. This is how CRIU was born.

The idea is: for checkpointing you need the state of the running applications, and there are a lot of existing mechanisms to gather that state from the kernel. There's the whole /proc with various bits and pieces of information about the processes running; there's the ptrace mechanism used for debugging; and there's the netlink socket, which you can use to get information about networking. Then there is an interesting feature called parasite code injection, which lets you insert your own code into a running process and run it as if you were that process. Some bits of information about a process are special:
you can only get that information by being that process; this is why we have parasite code injection, and this is how we use it in CRIU. Of course, not all the information is there, and when it's not, we have to amend the kernel, we have to add some functionality to the kernel to provide the extra information we need in order to get the complete picture of what's running. So far we have achieved that with about 170 kernel patches in total, which is pretty small, I guess. As of kernel 3.11, the kernel is sufficient to run CRIU; it has everything, like 99% of it. There are some corner cases that we attacked later, and there might be some more patches from the CRIU project coming into the kernel, but most of it is in kernel 3.11, if the CONFIG_CHECKPOINT_RESTORE option is set. That config option appeared because upstream kernel developers were not quite convinced that it's possible to do checkpoint/restore in user space, but they let us try it, and they said: okay, everything you put in there should be under this define, and if you fail, we just remove all this code together with all the defines. Fortunately, we succeeded, and most of the distros set this option by default now.

So, to the topic of the talk. The current interface for getting information about processes is mostly the /proc/PID interface: for each process you have a directory whose name is the process ID, and in that directory you have about 40 different files telling you some information about the process. And there are, I don't know, about 10 directories and some more stuff in there, so it's more than 45 entries.
Maybe 60 or 70, I don't know. This thing has been there since the very beginning, I believe, and it works for everyone, but it has some limitations. First of all, as we found out by profiling CRIU, it takes a lot of time to read /proc, because for every small file in there you need at least three system calls: open, read, and close, and then you repeat that ad infinitum, because there are so many files and there are so many processes. This is the same thing that ps is doing; that's a lot of context switches, a lot of syscalls.

The next problem is that pretty much every file in /proc has its own unique format. Some files are presented like tables with a header; some files are like tables without a header; some files are just a sequence of numbers and strings; some files are like name, colon, value, and stuff like that. Basically, you have to write your own parser for each of those files, and this is what we are doing.

A slightly lesser problem is that the format is text-based. The kernel has it all in binary form, and then it prints it out for us in text form; then we translate it back into binary, for example when reading numbers and UIDs and stuff like that, and then we translate it back to text for printing.
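That open/read/close pattern is easy to picture. Here is a minimal Python sketch of it; this is illustrative only, not CRIU's or ps's actual code, and everything except the real /proc layout is my own naming:

```python
import os

def read_files(paths):
    """Read each file with the minimal open/read/close sequence and
    count the system calls: at least three per file."""
    contents, syscalls = {}, 0
    for path in paths:
        try:
            fd = os.open(path, os.O_RDONLY)   # syscall 1: open
        except FileNotFoundError:             # process exited meanwhile
            continue
        try:
            data = os.read(fd, 65536)         # syscall 2: read
        finally:
            os.close(fd)                      # syscall 3: close
        contents[path] = data
        syscalls += 3
    return contents, syscalls

if os.path.isdir("/proc"):
    # One file per process here; ps reads several files per process,
    # so the real count is roughly 3 * files_per_process * processes.
    pids = [d for d in os.listdir("/proc") if d.isdigit()]
    _, total = read_files(["/proc/%s/status" % p for p in pids])
```

With 10,000 processes and, say, four files each, that is well over 100,000 system calls just to draw one screen of top.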
All that is a lot of translation back and forth. Ideally, we would get just the binary stuff right from the kernel. The upside is that the text format lets you read the files from /proc using just cat and see what's going on.

A third problem is that there's not enough information in there. For example, if you take /proc/PID/fd, it shows you the file descriptors that are open and the files they are associated with; you see that 0 is stdin, and so on. The problem there is that due to overmounts those file names can be irrelevant, and there is no way to figure out what the position in the file is, what the file open flags are, and so on. We solved that by adding /proc/PID/fdinfo: for every file descriptor you have that additional information there. But there is much more to it; this is just one example of the kernel not providing enough information.

A very big problem is that some of the formats of the /proc files are not extendable. Here is an example: these are the mappings of the current process, the regions of memory that are mapped. It has the addresses, the protection bits, some other information, the major and minor device numbers, and the file name; this is the library that is mapped into the cat process. The problem is this last field: the file name is optional, because there are anonymous mappings. And if it's optional, that means you cannot add any more information after it, because it's the last field. Now we need VM flags in here, and we can't add them, because this format is basically set in stone: you cannot change it without ruining all the backward compatibility. Fortunately, we do have VM flags in a different file, called smaps. So this file, not maps but smaps, has the information that we need, these VM flags. Unfortunately, it also has statistics, like how much memory is used, which we don't need.
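The extendability trap is easy to demonstrate with an illustrative parser; this is a sketch, not CRIU's code. Because the pathname column is optional and comes last, every correct maps parser has to look roughly like this, and appending a new column after the pathname would break all of them:

```python
def parse_maps_line(line):
    """Parse one /proc/PID/maps line:
       address perms offset dev inode [pathname]
    The pathname is absent for anonymous mappings, so it must stay
    the last field; nothing can be appended after it."""
    fields = line.rstrip("\n").split(None, 5)
    addr, perms, offset, dev, inode = fields[:5]
    start, end = (int(x, 16) for x in addr.split("-"))
    return {
        "start": start, "end": end,
        "perms": perms,
        "offset": int(offset, 16),
        "dev": dev,
        "inode": int(inode),
        "path": fields[5] if len(fields) > 5 else None,  # optional!
    }
```

This is why the VM flags ended up in smaps instead: a name:value file can grow new keys, while this fixed-column line cannot grow new columns.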
We don't need those statistics, so we discard them. The problem here is that it takes a lot of time to gather this information, which we immediately discard. So this is the next problem: sometimes files in /proc are slow because of things like this.

Let me show you an example. cat all the /proc maps files, read them all; that's pretty fast, 0.02 seconds. Not much; I guess around 200 processes or something. Yeah, 200. Now let's do the same with smaps. You see, it's one tenth of a second, maybe even more; that's like, I don't know, five times slower, and it's just about 200 processes. Imagine a system running thousands of containers; you would have that, and it would be way over a second to read it all. And this is for information that, well, it's okay if we need it; it's not okay if we immediately discard it: first we ask the kernel to produce it, and then we throw it away. So the problem is that sometimes files in /proc are slow because of extra attributes that we are not always needing. This is the same explanation that I just showed you.

We had a similar problem in CRIU when dealing with sockets, dealing with networking. The usual place to get information about sockets is /proc/net: there's /proc/net/netlink, /proc/net/unix, /proc/net/tcp, and the packet sockets file. Of course, each of those four files has its own unique format; some of them try to look like tables, some of them try to look like this, with tons of fields, so they all look ugly, but still human-readable. It's basically the same set of problems as I just described: there might be not enough information in there, the format is complex, and it's an all-or-nothing approach: you either request all the information or no information at all. Fortunately, there is a solution already in the kernel: the so-called netlink socket, which was used to get information about TCP sockets, and it was called tcp_diag at some point.
I think it was later called inet_diag, because it was renamed. So we generalized it and added all the other types of sockets in there, and now it is known as sock_diag, and we don't need those /proc files anymore: we just ask the kernel for what we need through that netlink socket. There are two very good things about the netlink socket. First of all, the format is binary and extendable; it is designed that way: you can just add some new fields without ruining backward compatibility. Second, you can specify explicitly what kind of info you need from the kernel, and about what. You don't get all or nothing; you don't get extra bits that you are throwing away, just what you need.

So we looked at that and thought: why don't we add tasks to the same interface? And that was a bad solution to the problem. Basically, we extended the netlink socket with something called task_diag: we ask for information about some processes through a netlink socket, we say we want to know this and this about this set of processes, and it gives it back to us. The first problem is that the netlink socket is actually designed around networking, and it knows everything about network namespaces, but it doesn't know anything about the PID namespace, or the user namespace, or other namespaces, and it's not clear how to amend it for all this; you would need to extend the format or do something like that, and it's probably not a good thing, because a netlink socket is about networking. The other problem is that netlink sockets have some sort of weird security mechanism: when we create a socket, the kernel saves the credentials, and later uses those credentials to figure out whether we are allowed to do this and that or not. But a process can later drop privileges: you can start as root, open the socket, and then drop root, but that is irrelevant to netlink.
The kernel has saved those credentials, and it will still think we are root. And finally, it's one interface, and it is used to both get and set information about networking; when we use the same interface to get process attributes, we could potentially abuse it to, for example, add IP addresses to the interfaces. There is no mechanism to restrict us from doing that. This is what makes netlink bad for solving our task with /proc. There is yet another example of using netlink, abusing netlink, for this kind of thing, and it's called taskstats: it's basically statistics about running tasks, not available from /proc but available through the netlink socket. I think it needs to go away, but it's still there, and we don't want to add yet another bad thing to the kernel.

Instead, we are proposing this interface: a file called /proc/task_diag. It's a transactional file: you write a request to it, and you read the response back from it; if there are two users, every user gets its own reply. It uses the same good netlink message format; it's binary and extendable, and it's good because of those two characteristics. It lets you get information about a specified set of processes; more on this later. And we made sure that we group the attributes. An attribute is some value, like, for example, PID or UID or program name, and we group those attributes into groups, and the interface only lets you request information by group: you specify "I want this group and this group and this group", and it gives you all the attributes in those groups. The thing is, if you add an attribute to a group and it makes that group slower, you are doing it wrong.
You need to separate it out into a special group; everything that slows us down should go into a separate group, and we adhere to that principle. Finally, netlink messages are limited to 16 kilobytes, but you can have many packets, so if there is more than 16 kilobytes, the kernel just splits it for you.

This is what we have right now, and we think it solves the problem that we came across, and it also solves the problem of slow ps, and some more. But this is work in progress; the current status is that we are about to send the patches to the kernel. We already sent a few iterations of the patches using the netlink socket, the bad solution; we have discussed it, we understood that it's a bad approach, and we're slowly moving to this one. We're about to send the patches to the kernel, so it's not yet there, and because of that, it might change. So it's a work in progress.

This slide, I'm not sure I really want to show it; it describes the format of the netlink message. Basically it says that it's easy to add new attributes, it's easy to add new groups, the format is completely extendable, it's binary, and it's pretty simple; it's very easy to parse from, say, C.

So, what are the ways in our new proposed interface to specify which processes you want information about? First of all, you can dump all: get all the processes in the system; that's what ps would use by default, or top, for example. Then you can dump all threads, which means all the processes plus all the threads; the distinction is that all processes means all the thread group leaders, while all threads means all the other threads as well. You can ask it to dump the children of a specified PID.
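The netlink message format from that slide boils down to a length-prefixed header followed by type-length-value attributes. Here is a simplified sketch of that encoding, in Python rather than C for readability; the layouts follow struct nlmsghdr and struct nlattr, though real code would use the kernel uapi headers:

```python
import struct

NLMSG_HDRLEN = 16  # struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid

def _align4(n):
    return (n + 3) & ~3   # netlink attributes are padded to 4 bytes

def pack_attr(atype, payload):
    """One attribute: u16 len, u16 type, then the padded payload."""
    return (struct.pack("=HH", 4 + len(payload), atype)
            + payload + b"\0" * (_align4(len(payload)) - len(payload)))

def pack_msg(mtype, flags, seq, attrs):
    """Netlink message: header with total length, then the attributes."""
    body = b"".join(attrs)
    return struct.pack("=IHHII", NLMSG_HDRLEN + len(body),
                       mtype, flags, seq, 0) + body

def parse_attrs(body):
    """Walk the attribute stream. A reader that doesn't know a type
    just skips it by length: that is what makes the format extendable."""
    attrs, off = {}, 0
    while off + 4 <= len(body):
        alen, atype = struct.unpack_from("=HH", body, off)
        attrs[atype] = body[off + 4:off + alen]
        off += _align4(alen)
    return attrs
```

An old parser that meets a brand-new attribute type simply ends up with an extra entry it ignores, instead of a parse error; that is the opposite of the fixed-column /proc/PID/maps situation.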
A process like apache, for example, would use that: get me all my children. Or CRIU would use it: get me all the processes in this container, starting from its init. Then you can ask for dumping all the threads of a specific PID; that doesn't include children, only threads. And finally, just one process: I want information about this one specific process.

And here are the groups of attributes, basically what you can ask for. The base group includes all the different IDs and the command name; it's sort of the same as /proc/PID/status. Then you can ask for credentials: all the primary and supplementary UIDs and GIDs and the capabilities. Then you can ask for stats about the processes; this is the same as taskstats, which I mentioned as another example of that usage of the netlink socket. This information is not available in /proc, and we very much hope that taskstats will be replaced by this. Then there's the task_diag VMA group, which is the equivalent of /proc/PID/maps, and VMA stat adds all that extra information that takes long to gather, which we don't need but someone might.

Now, the performance comparison. For that I have a VM running with this very kernel with the patches that we are going to send; it's 4.4.0 plus our patches. Let me see what we have here. Not too much. So, for 10,000 processes, you can see that ps is already slow. What I'm about to do now: this task_proc_all tool just opens /proc/PID/status, reads it, and closes it; it doesn't even parse it. There is no real output, it just prints the total number of entries that it read, and you can see it's about 0.05 seconds. This is the current interface, /proc. Now let's do the same thing with diag, using the new interface: it's at least five times faster getting the same information, and parsing is also easier.
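Putting the dump strategies and attribute groups together, a request could be built along these lines. Every constant name, the numeric values, and the struct layout here are illustrative guesses based on the description above, not the final ABI; the patches were still being reworked at the time of the talk:

```python
import struct

# Dump strategies described above (hypothetical numeric values):
DUMP_ALL, DUMP_ALL_THREADS, DUMP_CHILDREN, DUMP_THREADS, DUMP_ONE = range(5)

# Attribute groups, requested as a bitmask so several can be combined:
SHOW_BASE = 1 << 0      # IDs + command name, like /proc/PID/status
SHOW_CRED = 1 << 1      # UIDs/GIDs, capabilities
SHOW_STAT = 1 << 2      # taskstats-style accounting
SHOW_VMA  = 1 << 3      # like /proc/PID/maps
SHOW_VMA_STAT = 1 << 4  # the expensive smaps-style statistics

def pack_request(strategy, pid, show_flags):
    """A small fixed request: which processes, and which attribute
    groups. Everything not asked for is never computed (hypothetical
    layout: u64 show_flags, u32 strategy, u32 pid)."""
    return struct.pack("=QII", show_flags, strategy, pid)

# ps-like usage: every process in the system, only the cheap base group.
req = pack_request(DUMP_ALL, 0, SHOW_BASE)
```

The point of the sketch is the last line: ps would send the all-processes strategy with only the base group set, so the kernel never computes the smaps-style statistics at all.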
Although I don't think parsing takes much of the time: instead of 30,000 system calls, we do just a few. So this is it, at least five times faster. This is the new proposed interface.

One other performance metric I'm going to show you comes from the perf project, the guys who are writing the perf tool. They had similar problems with /proc, and when they found out that we were doing something about it, they asked us to share the patches, and they tested them. This is an email from David Ahern, who did some testing, and in his testing the first test is like five or six times faster and the second test is like ten times faster. So this is what he found; call it independent performance testing.

And that's about it. As I said, the current status is that we are about to send this upstream, and it's not the first iteration; it's actually the third iteration, because the first two were netlink-based. This is just a little bit of the work that we are doing with CRIU; there is much more of it. I think I could do like a week of talking about such things.

This is the last slide. I wanted to give you a link to the current code, and I don't have a slide for that anymore, but generally CRIU is available from criu.org, OpenVZ is available from openvz.org, and this work is in my colleague's GitHub account, github.com/avagin, but you will see it on the Linux kernel mailing list pretty soon. That concludes my talk; if there are any questions, I'll be more than happy to answer, and I think I do have some CRIU and OpenVZ stickers, if your notebooks can accommodate one or two more. Thank you.

Question time. Also any generic questions about OpenVZ or CRIU or containers. Yes?

So basically, no, it doesn't go to Linus Torvalds himself. Usually this stuff is not even discussed on the Linux kernel mailing list but on some of the subsystem mailing lists, although this one might be a subject for the memory management list; but there's no proc mailing list.
So this was actually stuff for the generic Linux kernel list, but usually it is discussed on a subsystem mailing list first. Every subsystem maintainer has his own tree, like there's the Linux gadget tree or the Linux net tree, with its maintainers; it is discussed there, it is agreed upon, it gets reiterated over and over, you polish your stuff, and they're pretty sensible when it comes to API. Once it's all agreed upon and polished, it gets merged into the maintainer's tree, and the trees that don't have any maintainer go through Andrew Morton, who has his own tree and maintains everything that's not maintained. Then these big guys, the subsystem maintainers, the lieutenants in other words, send the stuff to Linus, and Linus very rarely complains about it; he complains if it arrives right before the merge window, but otherwise he just trusts the maintainers. This tree of trust is there; otherwise Linus would just die, you know, under the pressure.

No, there's no option in the kernel yet, because we are about to send it. We sent this stuff with the netlink socket, and we decided, well, they decided in the end, that it's a bad approach and this is a good approach; we just need to split the patches, provide a nice README, and so on, and we already have tests for that. And we already have other people, like from the perf team, who are about to support us with this. Overall, I think that, maybe not in its current form, but it will make its way into the Linux kernel this year.

Yeah, so it would be like 4-point-something, I don't know. Yeah, something like that.

All right. Thank you so much.