Thank you. Can you hear me? Good. Okay. So I'm Valentina Mancinelli, and I'm a software engineer at CERN. For the past two years I've been working on the HammerCloud service, which I'm going to describe today. I chose a very long title. I'll give you an introduction to what CERN is and what the WLCG is, for those who are not familiar with them, and then I'll describe what HammerCloud is, what it does, and how it's used by the experiments to run stress and functional tests on the Worldwide LHC Computing Grid resources. So what is CERN? CERN is a cool place, literally the coolest, we can say: parts of the LHC are cooled down to 1.9 Kelvin, among the coldest temperatures measured in the universe. CERN stands for the European Organization for Nuclear Research; the acronym works better in French. It is composed of 21 member states, and the Czech Republic is one of them. CERN is also a physics laboratory just outside Geneva, on the border between Switzerland and France, and its mission is to study the fundamental laws of physics. There are many experiments carried out at CERN, but probably the most famous are the ones connected to the LHC, the Large Hadron Collider. The LHC is the biggest collider in the world: it has a 27-kilometer circumference, and you can see its outline on the map, in the countryside just outside Geneva. This is another picture from the presentation. The LHC is located 100 meters below the ground, and inside it bunches of particles are accelerated up to 0.999 times the speed of light, to the point where they perform more than 10,000 revolutions around the 27-kilometer ring every second. I forgot to mention that the LHC is used to study high-energy physics. There are two beam pipes around the LHC, so that the bunches of particles travel in opposite directions.
And there are four points where these bunches can meet and collide, and at these four points is where the detectors of the four experiments are located. These are pictures of the four detectors. The experiments are ALICE, ATLAS, CMS, and LHCb. To give an indication of the size, I hope you can see this is a person standing in front of the ATLAS detector, which is the biggest of the four. It's bigger than a four-storey building and couldn't fit in this room. The four experiments focus on different aspects of the physics that happens during the collisions, so they're different, but basically they're big cameras that take snapshots of what happened during a collision. They're constructed as layers of sensors, and these sensors measure different things, like the passage of a particle through the sensor or the energy of the particle. The processing of the data collected by the detectors can be very much simplified like this. First, a collision happens and the snapshots are collected by the detector. Not all the events are kept: there are filters that remove the uninteresting events, for example those at low energies that have already been studied in the past. The filters are both hardware and software. What comes out is, for example, something like this. This is a visual representation of the events of a collision. This one is actually very important, because it was one of the events presented by the CMS experiment as evidence for the production of a Higgs boson. After this, the data is distributed all around the world so that physicists can access it and process it. Processing means, for example, changing the format, reducing the size of the data, running statistical analyses on it, and producing plots.
And if from the statistical analysis you're lucky and find something, then you write an article about it. This distribution and processing all around the world is performed thanks to the WLCG, the Worldwide LHC Computing Grid. The WLCG is composed of more than 170 data centers spread across 42 countries; the only continent without a data center is Antarctica. It has a tier structure, and the Tier-0 is CERN, which acquires the data from the LHC, stores it for long-term storage, and distributes it all over the world. Altogether, the data centers around the world provide on the order of 300 petabytes of disk, 200 petabytes of tape, and 300,000 cores. But all these resources come from a very heterogeneous environment: there are high-throughput and high-performance resources, opportunistic resources, cloud resources, and also different kinds of storage systems. The objective of the WLCG is to give physicists around the world transparent access to the data, so they can access and process it without caring about all the underlying differences and complexity. There is a community of more than 10,000 physicist users who run their analyses, producing around 2 million jobs every day. So, as I said, the complexity and heterogeneity of the system are transparent to the physicists; what they care about is that their data analysis is fast, performant, and works. But with such a complex system, you can understand that it's hard to monitor performance, or to make changes and evaluate the performance afterwards. It was with this in mind that HammerCloud was created: to give system administrators a tool to run tests, for example when provisioning new data centers. HammerCloud is a service that provides the functionality of running tests on demand, with the possibility of generating load on a site.
For example, you may decide to run a large-scale stress test, but it also supports constant site validation, automatically generating tests for continuous testing. An important point of HammerCloud is that to test a site we use the real workloads of the experiments: we submit jobs, and these jobs are the same workloads that the physicists submit to process their data. The WLCG uses other monitoring tools to test more basic functionalities of the system, for example to test reading and writing from storage. But using the real workloads of the experiments tests the whole stack, from job submission to the job landing on the worker node to accessing the data, so it tests everything. HammerCloud also provides real-time information about the results of the tests, with the efficiency, the success rate, and the performance of the jobs submitted within the tests. It's used by three experiments at CERN with different use cases, and on average we manage around 100 tests per day and produce around 100,000 jobs per day, which is more or less 2% of the total submitted jobs. If you look at the actual CPU consumption, though, it's a lot less, because these jobs are usually shorter and just test functionality and performance. Now some concepts. We use templates to give users the possibility to define a class of tests; comparing with object-oriented programming, a template is exactly the definition of a class of tests. Templates allow users to define what kind of jobs they want to run, so that they can run the same jobs on every site, be sure that what they submit is identical everywhere, and properly measure the performance of the whole stack on different sites and different architectures. A test is, let's say, an instance of a template.
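The template-as-class, test-as-instance idea can be sketched in a few lines. This is a minimal illustration, not HammerCloud's actual data model; all field names here are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

# Hypothetical sketch: a template is the "class" of a test,
# a test is an "instance" of a template with a start and end time.

@dataclass
class TestTemplate:
    name: str
    category: str            # "functional" or "stress"
    payload: str             # which experiment workload the jobs run
    target_sites: List[str]  # sites to submit to
    jobs_per_site: int       # desired load per site

@dataclass
class Test:
    template: TestTemplate
    start: datetime
    end: datetime

    def is_running(self, now: datetime) -> bool:
        # A test submits jobs only between its start and end times.
        return self.start <= now <= self.end
```

Because every test derives its job definition from the template, two tests of the same template submit identical jobs, which is what makes cross-site performance comparisons meaningful.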
It has a start time and an end time, and it's what HammerCloud actually runs: it derives its definition and configuration from the template, and during its run it submits jobs. As I said, the jobs are the experiment payloads that are submitted to the target site and run there. By job results we don't mean the actual output of the jobs, which would normally be the output of the data processing. What we care about is the performance of the jobs: we gather metrics about the execution of these jobs, and that's what we store and use for the test reports. To give a very simple view of the architecture: we have a web interface, built with Django, where users can configure templates and tests, request new tests, and see plots and statistics about the test results. Then there is the backend, the HammerCloud testing infrastructure, which takes care of scheduling the tests, generating the jobs from the user configuration, and submitting them to the grid resources. In the diagram, the blue clouds are the sites on the grid and the blocks are the single worker nodes inside a site. To submit to the WLCG resources, since we use the same workflow as the experiments, we submit through each experiment's workload management system: PanDA for ATLAS, CRAB for CMS, and DIRAC for LHCb. These workload management systems are the ones that take care of scheduling the jobs, submitting them, and landing them on the actual worker nodes. To do so, we use a piece of software developed in collaboration with CERN, Ganga, which gives a common interface over the different clients for the experiment-specific workload management systems.
So yes, we use Ganga to submit the jobs, and HammerCloud monitors them, takes care of submitting more to create the load on the site that the user requested, gets the job-result metrics back from the workload management system, stores them in our database, and then creates summaries and test reports to show to the user. In particular, here you can see how a user configures a template. They choose the category, functional or stress. The difference is that if it's functional and active, HammerCloud will take care of submitting new tests continuously, so once the user creates the template they don't have to do anything else, and tests run for as long as the template is active. If it's a stress test, the user configures the class of the template and then, with another page that I'll show you later, requests new tests from it. The user also chooses what kind of jobs to submit, as we said, so they know exactly what they are submitting, then selects the sites to target with this template and how much load they want to generate on those sites. Also, HammerCloud is aware of the topology of the grid as seen by the experiments, since we pull this information from the experiments' grid topology information services. After this, for a stress test, the user just selects which template to use, adds a start time and an end time, and HammerCloud schedules and runs the new test. As I said, we have report pages for each test showing how much load is on the resources, the evolution of the success rate, and, perhaps more importantly, plots of the performance: here, for example, how many events per second were processed, and the CPU percentage. These are the metrics we gather from the jobs. We provide information not only at the test level, but also in more general views.
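Turning the raw job results into the summary numbers a test report shows (success rate, average performance) can be sketched as below. The field names (`status`, `cpu_eff`) are illustrative assumptions, not HammerCloud's schema.

```python
# Hedged sketch: aggregate per-job metrics into a test-level summary.
# Field names are assumptions for illustration.

def summarize(job_results):
    """job_results: list of dicts like {"status": "completed", "cpu_eff": 0.8}."""
    total = len(job_results)
    done = [j for j in job_results if j["status"] == "completed"]
    success_rate = len(done) / total if total else 0.0
    avg_cpu_eff = sum(j["cpu_eff"] for j in done) / len(done) if done else 0.0
    return {"jobs": total,
            "success_rate": success_rate,
            "avg_cpu_eff": avg_cpu_eff}
```

Note that performance averages are computed over completed jobs only: a failed job's CPU efficiency says nothing about the site's processing speed.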
For example, here on top there are the results for one specific site, for a specific set of templates that matter to the experts of this site, so the results over time of this particular template; and below, a view of the whole grid, using colors to show at a glance which sites are more performant and which less, so that, for example, a cloud admin or site admin can easily spot issues. HammerCloud has been used extensively by the experiments over the years and has become an essential tool for testing and automation. It's been very useful especially for the experiments' computing operations: it helps reduce the operational effort and is used in everyday operations, to generate reports for site performance monitoring or, as I showed, to provide views for constant site-efficiency monitoring. But it's also used for taking automatic actions. For example, in ATLAS, HammerCloud uses a set of very specific, well-defined templates and uses their results to blacklist or recover sites. This plot shows the activity of HammerCloud blacklisting ATLAS sites over the years; it's binned monthly. These are all the decisions HammerCloud took based on the test results. This is very important, because before, this was the job of operators, who had to monitor the jobs across the whole WLCG, check which ones were failing, debug the reason why, for example whether it was a misconfiguration by the user, and then blacklist the site and contact the site admins. All of this is now done automatically with HammerCloud and its results. But it's also used for commissioning new systems and technologies.
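The auto-exclusion logic described above can be sketched as a small state transition. The threshold and rules here are illustrative assumptions, not the actual ATLAS policy.

```python
# Hedged sketch of a blacklist/recover policy driven by test results.
# The 0.8 threshold is an invented example value.

def decide(site_state, recent_success_rate, threshold=0.8):
    """site_state: "online" or "blacklisted".
    recent_success_rate: success rate of the latest functional test jobs."""
    if site_state == "online" and recent_success_rate < threshold:
        return "blacklisted"   # stop sending real user jobs to the site
    if site_state == "blacklisted" and recent_success_rate >= threshold:
        return "online"        # test jobs succeed again: recover the site
    return site_state
```

The key design point, as in the talk, is that test jobs keep flowing to blacklisted sites, so a site that has been fixed is recovered automatically without an operator in the loop.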
For example, it's used for testing changes in the workload management systems: when introducing or changing components, the developers can request tests, configuring the jobs so that they exercise, for example, a particular version of the workload management system and its components. It was also used during the commissioning of the Wigner data center, when CERN was extending the resources of its data center: to compare the performance of the Meyrin and Wigner data centers, tests were configured, for example, to read data from one data center or the other and the performance was compared. It's also been used to run scale tests on the new CMS workload management system, CRAB3, in 2014: when getting ready for Run 2, CMS upgraded their workload management system to CRAB3 and used HammerCloud to generate load on the system. So here we weren't targeting sites; we were targeting the whole system, basically. In the picture you see the simultaneously running jobs; they managed to reach the goal they were targeting of 20,000 jobs. It's all red, but if you look closely, the dark red are the jobs submitted by HammerCloud and the bright red are the ones submitted by users. This was very useful for them, because they could generate the desired load on the system without having to, for example, organize a campaign asking users to switch to the new system for scale testing. They could decide in which time period to do it and how to increase the load; HammerCloud did this for them.
So, I showed you how HammerCloud is important and how it's used for operations and testing in the WLCG. In the past two years we focused on strengthening the infrastructure. We moved to the CERN Agile Infrastructure, and we're very happy users of it: it provides the CERN private cloud, powered by OpenStack, uses Puppet for configuration management of the hosts, and Foreman to map hosts to hostgroups and sub-hostgroups. We use environments to separate the production clusters from the QA and dev clusters, and hostgroups and sub-hostgroups to identify the different kinds of machines we have, more than 20 virtual machines running Scientific Linux 6. So we separate, for example, the web servers from the submission hosts, and the different configurations used for each experiment. Also, with the CERN Agile Infrastructure come for free the tools developed by CERN IT for monitoring, so we don't have to develop our own monitoring for our infrastructure; it comes with the machines deployed there. We have host monitoring for all the machines of the infrastructure, a framework to develop custom sensors on the hosts in Python, ways to configure notifications, and service-level monitoring reports, all with a nice Kibana interface. So yes, as I said, we try to improve our infrastructure, but of course there's always room for improvement. We're trying to improve our operations, especially the QA process, by increasing our use of unit tests, and we're thinking about introducing continuous integration, which we're not using now, probably with Jenkins.
We're also reviewing the HammerCloud data store and planning to move the monitoring of the test results to the ELK stack (Elasticsearch, Logstash, Kibana). But there are not only plans for the infrastructure; there are also plans for extending the functionality of HammerCloud. As I said, until now it has been used for commissioning new sites, to test the infrastructure of a new site and evaluate its performance. CERN is now planning how to extend its own infrastructure, and one of the possible routes being taken into consideration is to provision resources from commercial cloud providers. We would like to provide a tool to test these resources: first of all, without exposing the experiments directly to infrastructure issues, which can instead be discovered and fixed with these kinds of tests; and also to have a tool to compare the performance of different providers. To do this, looking at the infrastructure again, we want to avoid going through the workload management systems, and we're thinking about using HTCondor to submit payloads directly onto the provider's resources. The submitted job can be, for example, a wrapped experiment payload that runs without the workload management system, so we can test the resources with exactly what the experiments will eventually run on them, but also benchmark tools for performance evaluation and comparison. That was all; it was faster than I thought. Thank you for your attention. Any questions? [Question inaudible] Yes, it's open source, but at the moment, since it was focused on submitting jobs through the workload management systems, so jobs that were, for example, analysis jobs for ATLAS or CMS, it was built around that idea.
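As an aside, the direct-submission plan mentioned in the talk, wrapping an experiment payload and describing it for HTCondor instead of going through a workload management system, could look roughly like this. The wrapper script name and file layout are illustrative assumptions, not an existing tool.

```python
# Illustrative sketch: build a minimal HTCondor submit description
# for a wrapped experiment payload. "payload_wrapper.sh" is a
# hypothetical script that unpacks and runs the experiment workload.

def condor_submit_description(wrapper="payload_wrapper.sh", n_jobs=1):
    """Return the text of a minimal HTCondor submit description."""
    lines = [
        f"executable = {wrapper}",
        "output = job.$(Process).out",   # one output file per queued job
        "error  = job.$(Process).err",
        "log    = test.log",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines)
```

Such a description would then be handed to `condor_submit`; the point is that the same mechanism can ship either a wrapped experiment payload or a benchmark tool to a commercial cloud resource.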
So that's what it generates, especially because it was using only Ganga as the submission backend: it creates job configurations for Ganga. But since we're now planning to move to HTCondor, we're planning to make this much more generic: a way to generate, from the template configuration, an executable that can be shipped to a resource. Yes? [Question inaudible] Yes, that's right. We don't have continuous integration. We use the development cluster and the QA cluster, but what we do is develop, then push the commits and build; we have a build system, but the way we promote things to QA and then to production is more manual. So that's why one of the desired improvements would really improve the process. Is that enough? Okay. There was another question, yes. [Question inaudible] So, we don't really care about the output of the job. What we're checking is not that the output is correct, but that the job could perform correctly. We're checking that all the components in the stack were working correctly, so that the job could access the data, run the experiment's software framework, write the data, and put the output data into the catalog. Beyond that, it's pretty much trusted that if you submit a job that asks to do all this, the output should always be the same, but we don't check it. Yes, of course: I mean, when you submit you're actually using CPU, so you're paying for what you're doing.
But the phase in which we use it is when the integration team or the site administrators want to be sure that the system, the resources, and the architecture work. So this wouldn't be used when you can already run real jobs, but before: you want to be sure before you put these resources inside the grid, before having the real physicist users submit to those resources and then having jobs fail because of issues that we could have spotted earlier with these tests. So of course it costs money, but it's also important for keeping your users happy. Yes, I will check to be sure which one, and then we can see later. Yes, usually, yes. [Question: while the sites are in production?] Not really; for the moment it wasn't used like that. For the continuous checks we don't really stress the site: we just use small, simple jobs that check the full stack, the full workflow, but don't take a lot of CPU and don't stress the system, because the site is in production and we don't want the tests to stress it. Since the site is in production, we just want to make sure that if things are working they keep working, and if things are not working we spot the problem: the site gets notified and can act on it, and it goes into test mode so the site can perform their intervention. HammerCloud keeps sending test jobs, and when the jobs are successful again, the site comes back online. [Question about CRAB3] That was just for the performance of the whole workload management system with CRAB3, like I showed you before with the scale test. Here it was mainly used as a stress-test tool.
So, to generate a lot of load; and then on the CRAB3 side they were measuring the performance with their own tools, and HammerCloud was basically used to stress-test it. In general, when making changes, what the developers do is request a new test. For example, when making changes to a component of the workload management system, of PanDA, say the data mover or those kinds of components, they request a test specific to that part, rather than a performance test. [Moderator] As a matter of time, if you want to keep talking about this, we can do it afterwards. Okay. Thank you.