Today I'm going to tell you about a solution for OS deduplication. The project is called SIDUS — I'm never quite sure how to pronounce it; it's not a real word, just like "Linux". It stands for Single-Instance Distributing Universal System. It's not my work: all the credit really goes to my collaborator Emmanuel Caminay, who is based in Lyon, France. He works at an HPC center — a High Performance Computing center — as a sysadmin and research engineer. He's very, very technical and has been teaching me quite a few things. I was interested in this project initially as an academic, in the scope of reproducible science. We applied to SciPy 2013 and got selected, so tonight you're a test audience for me: I'm practicing the talk I will give the week of June 24th.

The idea is to run a given OS on many stations. Just like you're used to having a given virtualenv for a project, or for a given workshop when you're teaching students, you want to control your experiments — the runs you're doing as a scientist, for instance — and know exactly what your OS is. In our case, in practice, it's Debian with all the base packages plus Python and the scientific Python libraries, because we think that's the proper environment for doing scientific research.

The typical context is cluster nodes, but it can also be a park of workstations that you manage or administer, or a computer room for students — a bunch of self-service stations. The idea is that the OS is stored remotely, but you still use the local resources. It's not a thin terminal: you can make full use of the local resources of your workstation, and you can also choose which ones to use, so you can segment them at will. But you really reduce the amount of storage needed for the OS, because that is taken care of remotely.
So, just to motivate the project again: it has applications in teaching, where instead of emailing your students the day before, you can just install this and deploy a given OS at a given version, and you have the flexibility of updating it and changing the configuration. You can test new equipment without going through installing an OS on it, and you can probe corrupted equipment — if a machine isn't booting properly by itself, you boot it over the network. And I'm focusing on what's at stake for reproducible science: making sure that things run the same way, quantitatively and qualitatively, on cluster nodes. People always tell you HPC nodes are clones of each other. Typically they are identical, with the same OS locally installed, but over time you never really keep that identity. The point with SIDUS is that whenever you reboot, everything is reset and identical again, so just like you control a biology experiment, you control a numerical experiment.

There's an issue — it's not on my slide — about persistence: say you really like your configuration, and when you reboot you don't want to erase and lose everything you've set up. I would not be able to explain to you technically how it works, but it's something with iSCSI storage, and the documentation is available; I'll give you the link at the end of the presentation.

So I'll jump into an application, to talk about some Python as well. What happened in practice was to take 20 interconnected stations — so 10 client-server pairs — using two configurations (I'm showing you only one here) and running three sets of experiments, on three different days.
And because this is Montreal Python, and because that's my language, I decided to do an IPython notebook for this presentation and to analyze those data — I'm really not a sysadmin, I'm a data analyst. I think I taught this a couple of months ago, so I can go fast: you import pandas. I'm also learning statsmodels — I see Mr. statsmodels is here — and I'm jumping into it right now, because I'm seeing the limitations of pandas and I really need to fit some regression models, so don't worry.

So, the CSV: "x2" is the given configuration, "09" is the day, and we'll take "10" and "11" as the second and third experiments. We have all these columns: DEST and SRC identify a given client-server pair, so there are ten of them — 11 paired with 1, 12 with 2, 13 with 3, et cetera. All the other attributes qualify performance, because that's what we were interested in evaluating: speeds in kilobytes per second for rewriting, for reading, et cetera. That's what's in the data. Here `describe` happens to show the same thing as `head` — and it shows something different if you forget the parentheses — but anyway.

The "write" attribute for the first pair — a given client with a given server — has those numbers, in kilobytes per second. It's one experiment run several times, so I consider it a time series. I plot it, one line per client-server pair, and we see huge discrepancies. I could have shown you the test script that was run, because it's in Python — well, it's a Python script with a lot of bash in it, but still Python. All these tests were reading, writing, and so on.
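The notebook steps described above can be sketched with pandas. The file layout, column names, and numbers here are invented stand-ins (the real benchmark CSV from the talk isn't shown), but the shape matches the description: one row per measurement, a (DEST, SRC) pair identifying each client-server couple, and speeds in kB/s.

```python
import io
import pandas as pd

# Stand-in for the benchmark CSV (hypothetical values): each row is one
# run, with the client-server pair given by (DEST, SRC) and speeds in kB/s.
csv = io.StringIO(
    "DEST,SRC,write,rewrite,read\n"
    "11,1,52000,48000,61000\n"
    "11,1,51000,47500,60500\n"
    "12,2,30000,29000,35000\n"
    "12,2,31000,28500,34000\n"
)
df = pd.read_csv(csv)

# Treat each client-server pair's 'write' runs as a small time series.
by_pair = df.groupby(["DEST", "SRC"])["write"]
print(by_pair.mean())

# describe() (with parentheses!) gives count/mean/std/quartiles; without
# the parentheses you only get the bound method object, not the summary.
print(df["write"].describe())
```

Plotting one line per pair then is just a matter of pivoting the runs into columns and calling `.plot()` on the resulting frame.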
So here we have huge dispersion. All these client-server pairs seem to move together, but those two are completely off, and within a given pair you also have lots of variability — look at the blue line. To represent this variability another way, I decided to show a box plot for all the attributes — write, rewrite, et cetera — for the first client-server pair only.

If we want to talk about statistics — my background is not even in proper statistics but in statistical physics — that would be statistics over time, for each line considered separately, and likewise here, because I'm considering one client-server pair. If you want to talk about the variability between the different lines, that would be ensemble statistics. There are ergodicity ideas in there that I'm fascinated with, but I haven't had time to dig into them.

And because my collaborator, Emmanuel, used SIDUS, he completely controlled all the software — so where could this variability come from? He was able to identify a hardware issue, something with the configuration: basically, you need to set all your cluster nodes to max performance so that they perform in a stable, controlled way. He did that, and indeed he could then observe much less variability and higher performance. I'm using the same scale — up to 1.2 × 10⁶ — on this diagram, here, and for the third set of experiments. Plotting the box plot, you see it's squished: the variability is very much more under control. I also display the basic statistical properties with `describe`, this time for the first client-server pair and the third experiment.
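The time-versus-ensemble distinction above can be made concrete in pandas. This is a sketch on synthetic data (the numbers and pair names are invented): time statistics summarize each pair's series down its column, while ensemble statistics summarize across pairs at each time point.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: 3 client-server pairs, 50 'write' measurements each (kB/s)
data = pd.DataFrame(
    {f"pair_{i}": 50000 + 2000 * rng.standard_normal(50) for i in range(3)}
)

# Time statistics: aggregate down each column -> one mean/std per pair.
time_stats = data.agg(["mean", "std"])

# Ensemble statistics: aggregate across columns -> one mean/std per time point.
ensemble_stats = data.agg(["mean", "std"], axis=1)

print(time_stats)          # 2 rows (mean, std) x 3 pairs
print(ensemble_stats.head())  # 50 rows x 2 columns (mean, std)

# A per-attribute box plot like the one on the slide would be
# data.boxplot()  — commented out here since it needs matplotlib.
```

If the two kinds of statistics agree in the long run, that is exactly the ergodicity question mentioned in the talk.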
So basically, if you want to use a cluster and really trust that your runs will be treated the same way by all the nodes — even if you're told they are clones — the nodes have to be set to a max-performance configuration to actually behave similarly.

And here we have some work in progress on the documentation — it takes a while to load. I may eventually translate some of it into English. It's not that I'm lazy; there are other issues: we submitted a paper to a Linux magazine, and they keep it exclusive until they decide whether to publish it, so I cannot just copy-paste what we already submitted. About documentation, I have a really funny thing I thought I'd share: I was at AdaCamp yesterday in San Francisco, and at one of the lightning talks the speaker referred to documentation as "spoilers". We had a really great laugh — but that's not going to help everyone who procrastinates about finally writing documentation.

To sum up qualitatively — some marketing for SIDUS. It's universal: all of this is platform-independent, and it works on x86 and x86_64 architectures. It's efficient: the install takes a few minutes (you can go into the details if you're admin-oriented and actually want to use it), and we should make it available soon — I think it fits on a 16-gigabyte USB key. It's scalable: it was tested successfully on 100 nodes. It's multi-purpose: we chose Debian because we thought it's the proper environment for scientific computing, but you could consider something else. I should have given results for the second configuration, but I was doing this aboard the plane last night, so I will have it done by SciPy and we'll have more material there. Thank you very much.