So, welcome everyone. Before I go into what exactly ReFrame is and why we use it at CSCS, I want to give you some idea of the background we had. We do a lot of regression testing at CSCS to guarantee quality of service. It was very nice that we just had the preCICE talk, where he was talking about the difficulties they have in guaranteeing the performance and the sanity of the software. That is exactly the problem this open-source tool addresses, and we can offer it not only to preCICE but to the entire community.

At CSCS, because we invest a lot in guaranteeing the quality of our computing services, we test every single computer every day. We test everything. How did we use to do that? We used to write a big shell script that would go into everything and test whether the performance was correct and whether the numbers were produced correctly. It was really tough to maintain, because we were doing a nice thing but with a not very nice software methodology. So we started to develop a better framework for the testing, called ReFrame.

ReFrame is written in Python, in Python 3, and we try to make it easy for anyone to use, from software developers to CSCS administrators to any normal user who just wants to run on our cluster and test their software. The way we try to do that is basically by abstracting away our computing systems, or the internals of the computing systems. Some things are difficult for users on HPC systems: we have modules, we have different MPIs, and each MPI is highly customized for that compute node, that computer system. When you change cluster, you have to use another MPI, because that one is highly customized for the other system. So you have to keep track of all these things.
So we try to simplify writing regression tests with ReFrame by abstracting away these concepts. That way, the person who is writing the checks, say I want to write a preCICE check, doesn't need to think about all that. I just have to think about the logic: what does my program need as input? What does my program produce as output? How can I check that output, and how can I measure the performance?

We have been developing ReFrame internally since 2016. We went public, and now in 2019 we have 18 forks and 35 stargazers. Last week it was 32. It's a bit tough to count these things because they are updated all the time, because people are actually using it. And one of the reasons people are using ReFrame is that we have very nice design goals.

One of the design goals is productivity. Whenever a user reports a bug and we want to be able to test for that bug later, what do we do? We write a test. ReFrame allows us to write tests easily, on the fly. Another is portability. We have systems ranging from a basic Linux box with Red Hat, CentOS, or SLES installed to highly customized Linux like on Cray systems, so we want our tests to be portable. We don't want to write a test on one system and have to rewrite it for another. One thing I like about ReFrame is that we have been running it daily on our systems and we have never seen a Python stack trace. Why? Because it is really robust. We heavily test not only the systems but also ReFrame itself: we use ReFrame to test ReFrame.

One of the key features of ReFrame is, as I said before, the separation of the programming environment and the system, so you don't really have to care much about that. For those who have experience with HPC: you always have to run jobs in HPC.
You don't really log in to the compute node; you have to go through a scheduler, what's in general called a workload manager. You can say, I want a thousand compute nodes, and you get a thousand compute nodes. And then how do you run your simulation there? You have a launcher, srun in our case, because we use Slurm, and you run MPI calculations across all those thousand compute nodes. Some of these compute nodes may be GPU-enabled. We have a pretty big system with 5,300 GPUs, and the same system has more or less 1,900 CPU-only machines. To run on one kind or the other, we have two sides, what's called partitions: you go to the GPU partition or to the non-GPU partition. In ReFrame we also want to abstract away these concepts. We don't want users to be freaking out about, hey, I have to be here, I have to be there, I have to run this way or that way. We just abstract away a lot of things.

Another goal of ReFrame is to guarantee very complete documentation. If you go to our website, you will find the Read the Docs pages, where you can see how to start using it, and we have tutorials.

On HPC systems we have a lot of different people working. For instance, I work in the user support team, and I have a PhD in chemistry. I'm not a computer scientist. So how can I be developing with ReFrame, how can I be using it? The reason is that we separate the people who run ReFrame from the people who develop tests. We try to make it easy for everyone who doesn't have a PhD in computer science, let's say for someone with minimal knowledge of Bash. So we have two different APIs. One is called the front-end API. It is about how to run ReFrame, and we want it to be as simple as possible.
So we only have some flags, and the minimal flag you need in order to run it is -r. The people on this front-end side are, in general, users who want to run ReFrame, or system administrators. We also have a service at CSCS where you can run ReFrame through Jenkins, so you can have a robot running your code, taking preCICE and checking the performance of preCICE whenever you make a pull request, for instance.

For those who are developing ReFrame tests, it is not so difficult either. What you have to do is basically write a Python class, inherit from RegressionTest, and decorate it with a small decorator called simple_test. This simple_test lets the test go through what's called the pipeline, which I'll show you later, and it also gives you all the functionality internal to the API, like the system abstraction and the environment abstraction. The environment abstraction in HPC is not only about the environment variables you have, but also about the modules system you have. Some systems have Tmod, some have Lmod, some Tmod4. If you don't know what I'm talking about, those are environment managers, module managers, with which you can load different versions of software. And when you are running at CSCS you don't have to care which type of modules system is in use.

Now you could say, okay, these guys are doing a lot of abstraction, so this thing must be really difficult to use. In practice, it's not. The simple class I talked about looks, in general, like this. Okay, I'm cramming the lines because I had to fit it on the slide. But as you can see, you have your class, with an __init__ function, inheriting from RegressionTest. You write a description, just to say nicely what your test does.
Then I say I want to run on Piz Daint, the big cluster with some 5,000 GPUs that I told you about, and I want to run on the GPU part of the cluster. And then I can say I want to test the GNU compiler, the Cray compiler, and the PGI compiler. I can say the input file of my test is this .cu file, so it is CUDA source from NVIDIA. ReFrame identifies that when you give a single source file: it looks at the extension of the file and calls the appropriate compiler, nvcc in this case. We compile the application using basic flags, -O3, and you can give executable options.

On our systems, you basically need to know which module you have to load for CUDA, but you can do a mapping. You can always say that the CUDA toolkit means loading this specific module. On the big cluster I told you about, we have just one GPU per node, so you do the test with one GPU. But you can say here, if I am on system A, use two GPUs; if I am on system B, use four GPUs.

The way we guarantee the values are correct is with what's called sanity patterns: you extract information from your test's output. In this case, if the line "time for single matrix" is found in stdout, the test has passed sanity, because that is actually the last line of the output. Then we look at the output, and it says something like "performance", in gigaflops per second. We look for this line, extract this pattern, convert it to a float, and then we are able to compare. This number here, 50, is just a dummy number. If the measured number is within 10% of 50, we pass performance. So as you can see, we try to move all the logic of doing the grep, extracting things, converting to numbers, and comparing into ReFrame.
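A minimal sketch of that sanity-then-performance logic in plain Python may help here. It is not ReFrame's actual API: the sample output format, the regular expressions, and the helper names extract_float and check_run are assumptions made for illustration. Only the "time for single matrix" line, the dummy reference of 50, and the 10% tolerance come from the talk.

```python
import re


def extract_float(pattern, text):
    """Return the first captured group of `pattern` in `text` as a float, or None."""
    match = re.search(pattern, text, re.MULTILINE)
    return float(match.group(1)) if match else None


def check_run(stdout, reference, tolerance=0.10):
    """Return (sanity_ok, performance_ok, measured_value) for one run's output."""
    # Sanity: the run only counts as correct if the expected final line appears.
    sane = re.search(r"time for single matrix", stdout, re.IGNORECASE) is not None
    # Performance: find the (assumed) performance line and convert it to a float.
    perf = extract_float(r"Performance\s*=\s*(\S+)\s*GFlop/s", stdout)
    # Pass only if the measured value lies within `tolerance` of the reference.
    within = perf is not None and abs(perf - reference) <= tolerance * reference
    return sane, within, perf


# Hypothetical program output; the real test's output format may differ.
sample_stdout = """Performance = 52.5 GFlop/s
Time for single matrix multiplication: 0.004 s
"""

sane, within, perf = check_run(sample_stdout, reference=50.0)
print(sane, within, perf)
```

With the sample above, 52.5 lies inside the 45 to 55 window around the reference, so both checks pass. A run that printed a performance line but never reached the final "time for single matrix" line would fail sanity regardless of its number.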
You might say: nice, these tests we can write, but can I really use this for my preCICE, for my scientific applications? Yes. We have an application at CSCS that is being developed every day, and they test it. They use CMake in this case, and this is how complex it is for them to write a test. They say: I need CMake, and I need this version of GCC, because they are specific about the version, and I need to compile my code with these options. And voilà, they get the code built.

Because we are using Python, we can now use the amazing computer science concepts that we could not use in shell. One of them is object orientation, so I can inherit. I can say, you know what, now I have the MPI version of my test, and to compile with MPI I need this flag. So now I have two tests: the previous, basic one, and a second one that does the MPI version. And we can go on. Besides simple_test there is also a parameterized test, which gives me a factory: instead of having to write a factory class or factory function to spawn several tests, I can parameterize them. In this case, I parameterize over different builds, compiled for the Haswell architecture, for Broadwell, and for native, which is whatever architecture you are currently on. So you can see it's really simple to write tests in ReFrame.

The pipeline you get by adding simple_test or the parameterized decorator is a fixed one, and it is more or less the following. You start the front end: you say reframe -r. ReFrame goes into the folder you give it and looks at all the Python files inside that are ReFrame tests. Then it starts running: it picks a test and checks, hey, is this test meant to be run on this system?
If yes, we continue. If no, we just skip the test. The same applies to every condition you can impose. We can ask: does this test support the programming environment? On my laptop, do I have a Cray compiler or not? If not, I just skip the Cray test and continue to the GNU test, which I do have, or the Clang test that I have on my laptop.

These steps run on the machine where you launched ReFrame. At our supercomputing center we also have compute nodes, to which the scheduler dispatches the execution. So all this analysis is done on the node where ReFrame was started, and the test itself can run either on that local node or on a compute node. We take your test, we verify it's valid, and we set up the environment, so we guarantee that the test always runs in the same environment. We compile your test if necessary, and we launch it.

Now, my laptop does not need a workload manager. My laptop is just my laptop, and I don't want to install things I don't need. So on my laptop I can say my workload manager is the local scheduler: ReFrame just emits a batch script and runs it locally. This way we can run anywhere from my laptop to Travis.

When we talk to the people from operations, they say "the system is sane", and for them sane means the system is producing the correct numbers and the correct performance. In ReFrame nomenclature we are a bit different. Sanity means we are producing the correct science, that is, the correct numbers. Performance means we are performing within the range of values we expect. So we have separate steps for checking sanity and checking performance.
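The selection and staging logic described so far can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than ReFrame's implementation: the Check dataclass, the functions select_cases and run_case, the stage names, and the system and environment names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    valid_systems: tuple    # systems this check is meant to run on
    valid_environs: tuple   # programming environments it supports


def select_cases(checks, system, available_environs):
    """Yield (check, environ) pairs that are allowed to run on `system`."""
    for check in checks:
        if system not in check.valid_systems:
            continue  # not meant for this system: skip the whole check
        for env in check.valid_environs:
            if env in available_environs:
                yield check, env  # e.g. skip "cray" on a laptop, keep "gnu"


# Assumed stage order, following the walkthrough in the talk.
STAGES = ("setup", "compile", "run", "sanity", "performance")


def run_case(check, env, stage_fns):
    """Run the stages in order; return the first failing stage name, or None."""
    for stage in STAGES:
        if not stage_fns[stage](check, env):
            return stage
    return None


checks = [
    Check("cuda_matmul", ("daint",), ("gnu", "cray", "pgi")),
    Check("hello", ("daint", "laptop"), ("gnu", "clang")),
]
cases = [(c.name, e) for c, e in select_cases(checks, "laptop", {"gnu", "clang"})]
print(cases)

# Stand-in stage functions: every stage succeeds, or only performance fails.
always_ok = {stage: (lambda check, env: True) for stage in STAGES}
slow = dict(always_ok, performance=lambda check, env: False)
print(run_case(checks[1], "gnu", always_ok))
print(run_case(checks[1], "gnu", slow))
```

Keeping sanity and performance as separate stages is what lets a report say that a test produced the right numbers but ran outside its expected range.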
And afterwards, whether your test has passed or not, we clean up the stage, continue, and report at the end. You can also have tests that only compile, if you don't want to run anything: if you want to compile a library, you can just compile it. And if you want a run-only test, because the application is already compiled on your computer, you can just run it. We have predefined test types for that.

The way you define the path where your checks are is the -c command-line option. You can also write configuration files that define what your laptop is, what one HPC center is, and what another HPC center you run on is. The configuration file allows us to map different systems and run the same test on all of them.

So what does it look like when you run ReFrame? It looks like this. If all the tests pass, it just says run, OK, passed; run, OK, passed. I can run the example test, for instance. If something doesn't pass, the run says failed, and at the end it says the entire regression failed in one test case and reports the failure. In this case I ran just one check, so you only see one failure here. But you can see the reason it failed was performance. It passed sanity, and it failed because this number here is outside the reference range that we defined.

One of the nice things for HPC centers is that we care a lot about the users, and caring about the users means that whenever a user reports an error, we need to be able to find the reason why. That means we save a lot of logs. At CSCS specifically we have a centralized logging service where we save all CSCS logs, and ReFrame can also send to that log.
So we can send all the performance data of the applications we run daily at CSCS to the centralized log and plot graphs. For instance, as you can see, these are the logs of ReFrame executions of one application called Amber, running on the GPU, and this is the CPU version. You see it fluctuates a bit, and then we can investigate and report why it was lower or why it was higher.

Inside CSCS, our national supercomputing center, we run ReFrame on the three major clusters that we have: Piz Daint, Piz Kesch, and Leone. Piz Daint anyone can access through projects. Piz Kesch is the one where the climate calculations are done, so the simulations that say whether it's going to rain or not are done in our center. And Leone is a private cluster for a customer.

On these three clusters we run what we call production tests, which run daily; maintenance tests, which we run before and after doing an upgrade of the system; and what we call diagnostics: when a node goes bad, we run some tests before bringing it back. Why do we do that? Because ReFrame has shown us really nice things that allow us to probe and improve the quality of service. Sometimes before an upgrade an application runs with a given performance, and after the upgrade the performance drops, or increases. Then we go and investigate the reason, and in the majority of cases we can bring the performance back up to the original. If we didn't have ReFrame to monitor that, the performance would be bad and we wouldn't notice. Only the users would be complaining.
The way we do that is through Jenkins: ReFrame is launched from Jenkins, and the nice thing about Jenkins is that you get all these nice interfaces, so we don't need to log in to the system; we can just look at the web browser. We like it very much because it guarantees the quality of service of our systems and of the applications we support.

But we also want to empower users. At CSCS we have a CI service: you can apply for compute time at CSCS and apply for the CI service, and then we run, let's take preCICE as the example, every day and test how it performs at a real supercomputing center. For that we have integrated ReFrame not only into our CI service but also with public services like Travis, for instance. This way you can develop on your laptop, test with ReFrame on your laptop, make a pull request on GitHub that runs on Travis the same test you wrote on your laptop, and bring that same test to our computing center, so your test crosses boundaries. Internally we use Jenkins, so the usage is the same as for our own tests, which is really nice for us because we can debug in the same interface. This is how it looks internally at CSCS, and this is how it looks on Travis, so we have a nice integration in all these cases.

Just to conclude. I know I speak very fast, and I hope it wasn't too fast. The take-home message: ReFrame is a regression testing framework that allows us to guarantee the quality of our software. It's written in Python 3. It is portable across different computing systems, from your laptop to HPC systems, it can be used at different HPC centers, and it gives you a nice way to see where you have failed, with comprehensive reports.

One thing we are missing in ReFrame is dependencies between tests, so that I can say test A depends on test B, which depends on test C. That is on our
future roadmap. We also support running any command line inside ReFrame, but we only say that we support something if using it is really simple. So today we do run container tests at CSCS, but we don't say we support containers, because the usage still takes too many commands. We are working on a simplified interface where you can say, this is my container, this is what I want to run, and run it, to make everybody's life easy. And we have been asked by one computing center for what's called a benchmarking mode, so that we can stress the system. We don't say we support it yet. This is very specific to supercomputing, where you want to stress a machine to its limits.

We have some partners that have been using ReFrame. As you can see, the majority are supercomputing centers, but some companies are involved too. If you are using ReFrame and you haven't told us, please tell us, because we have many people using it. And we acknowledge the team that has been developing ReFrame at CSCS. As you can see, it's a lot of people, because ReFrame and regression testing are very important for us. So this is a project that is not going to go away in the near future, for sure, because HPC centers do require regression testing. With that, I want to thank you for your attention. Any questions?

So the question was, if I understood, whether you can only write tests in Python. No. I mean, the test itself in ReFrame is in Python, but this application here is an NVIDIA application. You can say, I want to run this script: just put the script there and ReFrame will run it. You still have to wrap it in this interface, yes.

So the comment was basically that adding this step of writing the class may add some complexity, because before you just had your shell script. Yes, this is true. In the beginning of ReFrame, in 2016, we shared ReFrame with operations and they said no, this is very difficult to use,
we would need a PhD in computer science to use it. So we are slowly, incrementally improving the interface to make it as easy as possible. We have some internal ideas, which I don't know if I can share with you, to simplify this using different input methods, but it's not there yet. It's on the roadmap, but not on the near-future roadmap. So yes, today you still have to write some Python; you have this Python layer.

Just one comment on the extra complexity of writing a Python class around your script. Yes, but if you have just a single type of system to test, then fine, go for a shell script. But imagine you have to test on different HPC systems: then the complexity really becomes much higher, and you want to abstract it away. For example, you don't want to have the system logic in your script, even simple things like: has my job finished, and when did it finish? We have examples of people developing software who ended up writing thousands of lines of Bash script just doing things that the framework handles, and they could do the same with twenty lines of Python code. Separating the system part from the logic of the test, that's the key advantage.

Yes, I agree. You have to add this layer, but depending on what you want to do, it can really untie your hands. Any other questions? No? Thank you very much.