It's me, Kostas Koumantaros from GRNET, with my colleagues Themis and Kostas Kagelidis. We have our colleagues from SRCE, Emir and Daniel, and also our colleague from France, Cyril. As I said before, the idea is that we're going to do a short presentation — well, a sort of short presentation — on how ARGO operates and how it works, and then demonstrate this in a live demo. So with this, I think it's wise to start. So Themis, would you take the floor, please, and start the presentation? But should we say that for questions, you can add your questions to Slido? Yes, we do have a Slido session open and there are quite a few questions there. If you have any questions, it's easier for us if you use Slido for them. And there will be time, since we can take questions either at certain stages or at the end of the session. Yes, there are some demo sessions where we're going to see the different components of the ARGO monitoring service, and during that time we can have a small discussion about how you see yourself using this service and how we use it to compute what we do. So, I'm Themis Zamani from GRNET and I will present the ARGO monitoring service; my colleagues are going to do the live demo of some of our components. ARGO monitoring is based on the user experience, and what we want to do is compute the status, the availability and the reliability of the services. Users and researchers all around the world have access to a number of services. Usually most of the services are up, but sometimes, although everything looks okay, the users start complaining and we can't understand what the problem was. Usually it's one of the internal functions of the service that is not responding — 500 errors and problems that all service owners have faced — and what that means is that the service remains unavailable longer than expected.
What ARGO is trying to do is emulate the user behavior, try to find out all these problems, and monitor the services to provide real-time status reports, availability and reliability reports, and real-time alerts. Based on this idea, let's start the training with a simple example of what we usually do and how people reach us. Let's say we received this email: "Hi, ARGO team. I'm new to EOSC-hub and I have a new service that I want to monitor. What should I do? What is the process I should follow? The service I want you to monitor is a wiki we have. It is in a high availability mode." When we receive an email like that, the response we usually send is: "Thank you for contacting us. Before we start, it would be great if we had the URLs of the service you're referring to, and we will start monitoring your service. Apart from some main checks that the service is healthy, we would also like a description of the main actions your users follow, so as to help you create some checks from the user perspective." And we request that the owner open a ticket in the helpdesk. Then the service owner replies with an email that says: "The service uses the endpoint wiki1.eosc-hub.eu as the first instance, and the second instance is wiki2.eosc-hub.eu. Some of the main things a user can do are to log in and to create a page." Based on this email and this information, we start to monitor the service. The wiki service consists, as the service owner told us, of wiki1.eosc-hub.eu and another endpoint, wiki2.eosc-hub.eu. So before we start, we must all share the main terminology we use in the monitoring service. The first term is the service: the name of the specific service being monitored — here, the wiki service. Then the hostname, which is the address of the host being monitored — wiki1.eosc-hub.eu. Then the service type — the type of the service, which here we're going to call wiki.
Each service type can have a defined set of metrics, which are tests we run in order to compute the status of a service endpoint. And finally, the service endpoint, which is the combination of the service type and the hostname. For this example, for wiki1: a service of type wiki listening on port 443 on the host wiki1.eosc-hub.eu is the service endpoint. So let's start monitoring wiki1.eosc-hub.eu. From the monitoring perspective, we discussed it with the service owner and he actually requested four metrics. Two of them were regular checks: the certificate validity, because wiki1 runs over HTTPS, and the HTTP check that the URL is responding. From the user perspective, we added two more metrics: the login function and the create-page function, as they were described in the email and as they were developed in the probe. These metrics run several times during the day. In one of these runs, for the certificate validity we got OK, for the login we got CRITICAL as the status result, the HTTP check was OK, and the create-page functionality was also OK. How is all this depicted in the monitored item wiki1? In order to say that wiki1.eosc-hub.eu is working properly, all of these metrics should have OK status. Based on the metric results we saw earlier, the login was in a CRITICAL state for three hours, from 12:00 to 15:00. This CRITICAL status is propagated to the monitored item, and we can say that wiki1 was in critical mode for three consecutive hours. How is the availability now computed based on this information? The service availability is based on two numbers. The first one is the amount of time the service was up — for us, during this one day, it was 21 hours. The second number is the given period we want to check — for one day, 24 hours. So the availability for this monitored item is 87.5%.
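The availability computation just described is simple enough to sketch in a few lines of Python (an illustrative sketch, not ARGO's actual code):

```python
def availability(up_hours: float, period_hours: float) -> float:
    """Availability (%) = time the service was up / total period examined."""
    return 100.0 * up_hours / period_hours

# wiki1 was CRITICAL for 3 of the 24 hours of the day, so up for 21 hours:
print(availability(up_hours=21, period_hours=24))  # 87.5
```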
Let's see the reliability, which is almost the same here — in fact the same here, because no scheduled downtime was defined. The reliability is based on the period the service was up and the period it was supposed to be up during the day. That means that if a downtime had been declared from 12:00 to 15:00, the number of hours the service was expected to be up would have been 24 minus 3, i.e. 21 hours, and the reliability of the service would be 100%. The other instance of the service, the monitored item wiki2, didn't have any problem. It was working with all its functionalities OK. For this monitored item, the availability is 100% and the reliability is also 100%. Let's see now how we compute the availability, the reliability and the status for the service as a whole. The service, as described in the email, consists of two endpoints, wiki1 and wiki2, and is in high availability mode. As we saw in the previous slides, one of wiki1's functionalities had a problem during the day, so this is reflected in the instance and wiki1 was CRITICAL for three hours. So for that time, wiki1 was critical. To find out the status of the wiki service, we said that it's in high availability mode, so if either one of the two instances is working, we can say that the wiki service is working and the status is OK. Since wiki2 was working properly for the whole day, we can say that the wiki service was working properly for the whole day. That's how we compute the values for a service with multiple endpoints. So how is this computed? When we say that we want to monitor something, you're probably thinking about a Nagios instance or another system like it.
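The difference between the two numbers can be sketched like this (illustrative only): reliability divides by the time the service was supposed to be up, i.e. the period minus any declared downtime.

```python
def reliability(up_hours: float, period_hours: float,
                downtime_hours: float = 0.0) -> float:
    """Reliability (%) = uptime / (period minus scheduled downtime)."""
    return 100.0 * up_hours / (period_hours - downtime_hours)

# No downtime declared: reliability equals availability (21/24 = 87.5%).
print(reliability(21, 24))                    # 87.5
# The 3 failed hours declared as downtime: 21 / (24 - 3) = 100%.
print(reliability(21, 24, downtime_hours=3))  # 100.0
```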
But in ARGO monitoring, the monitoring is based on multiple components, because all this information is structured and used by different components that interoperate. The first information we use to start monitoring is the topology. Let me say a few things about the topology of the monitored infrastructure. The topology is information mainly about the monitored services: the service types they're running, the service endpoints of the services, and the way they're organized into groups of sites and groups of services. With it we can model different types of infrastructure architectures. Finally, we also need the information about the service actors: who are the owners of the service and who are the administrators of the service. Suppose that you own site1, which offers two services, service1 and service2. Service1 is a compute service, so we can say that the service type is compute and the endpoint is service1.eu. Service2 is an analytics service, so its service type is analytics and its endpoint is service2.com. As the site owner, you decided that this site would become a member of a bigger group by complying with a number of requirements, and now we have a new group that has multiple sites. At the same time, a new project was created and decided to gather all these groups of sites into a higher level of hierarchy, creating one more level. So now the topology of this infrastructure is a project that has groups of sites, and each site has service types with one or more service endpoints. This is an example of the topology of an infrastructure. As topology sources, we support a number of different tools, like GOCDB, which is the grid configuration database from EGI, the DPMT, which is the data project management tool, and simple XML files with a predefined format.
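The hierarchy just described — project, group of sites, site, service endpoints — could be modeled as in the following sketch; the field names are illustrative, not the actual GOCDB or ARGO topology schema:

```python
topology = {
    "project": "PROJECT1",
    "groups": [{
        "name": "GROUP1",
        "sites": [{
            "name": "SITE1",
            "endpoints": [
                {"service_type": "compute",   "hostname": "service1.eu"},
                {"service_type": "analytics", "hostname": "service2.com"},
            ],
        }],
    }],
}

# A service endpoint is the (service_type, hostname) combination:
endpoints = [(e["service_type"], e["hostname"])
             for group in topology["groups"]
             for site in group["sites"]
             for e in site["endpoints"]]
print(endpoints)
```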
For this training, we have deployed a GOCDB training instance to show you how the information is stored and what the topology and its hierarchy look like. So, Emir? — Hi, can you hear me? — Yep. — Excellent. So, as Themis said, we are going to show you briefly how GOCDB works as the topology database for this whole demo. I'll just briefly demonstrate; I'm not going to go through the whole of GOCDB. In this first part, I'm also going to demonstrate another core service, the Check-in AAI. In order to access this instance, I'm going to use my personal national identity account. Just give me a second here — hopefully I don't print my password so everybody can see. There's a bunch of things about transferring your attributes, but you can see I use my national identity provider to access this instance. The same will apply for the second system we'll show. GOCDB itself enables you to define projects, NGIs, sites and so forth. For the purposes of this exercise, we have already prepared several sites, which Themis described. Let's say we pick the site IN2P3 in France. GOCDB allows you to define all kinds of attributes which are necessary in order to manage the infrastructure, and in this particular exercise I'm just going to add one service. So I go and click Add Service. Now, as Themis said, there are two very important things here for us. One is the service type; as you will see later on, the service type tells us how we monitor this particular service. In this case, let's say we will add one web portal. The second thing that defines a service endpoint for us is the hostname, so let me type one in. I guess this works — let's hope. These two things are very important for us.
Another thing you can also use is the service URL, where you can put additional information, like the exact address, or if you're using some unusual port, things like that — this one we also use. Another thing which is important here: you have to say "yes, I want to be monitored". Here you can define an address where we can send you notifications — more about that later on. These other fields I'll just skip for now; this one is mandatory. And one more is about the notifications: I want to receive notifications. And I click Add Service. And — success, it's always good — you have your service. So here you can see you added a new service endpoint, and it is of service type web portal. We didn't provide any specific URL, but we expect to see this guy further down the line. That would be it for the GOCDB part. I'll hand back over to Themis. — Does anyone want to ask anything about the topology before we move on from GOCDB? There's a question from Manuel: is the monitoring information publicly available to the general public, or only to the service provider? — Yes, but right now we are in GOCDB, that's why I'm asking before we move on; we'll see this when we're in the web UI part. — Okay. Anything else about GOCDB? Okay. One moment to share my screen again. So now that we have the topology of the infrastructure, we need another source of truth, which is POEM: the metrics and the profiles. POEM started as the main component we used from the very beginning of ARGO, and it holds the information about service types, metrics, metric configuration and probes. You can create as many profiles as you want, and these profiles instruct the monitoring instances what kind of tests to execute for a given service. In the last two years POEM has evolved, and it is going to be the main entry point of ARGO, and especially our one-stop shop.
From there, as the infrastructure owner, you manage all the main information about the metrics, the probes and the profiles you want to use. It supports the metrics, the repos, the metric profiles, the aggregation profiles, the operations profiles, and report profiles that are a work in progress. In order for POEM to start organizing this information, it needs the service type information from the topology tool you are going to use — for this training session, GOCDB. POEM has a plug-in to connect to the topology tool. If you want to better understand how POEM works, there is a documentation site with detailed information about all the possible actions you can perform with POEM now that it has changed. Let's start with the first information we need in order to organize the metrics, the probes and our profiles: the repos, the packages and the probes. You all know what a YUM repository is — it's our warehouse of software. We need them in order to install the probes on our monitoring engines. Then the packages, which enable quick and easy software installation: a package is a collection of items — scripts, libraries, files, manifests, et cetera — and it's how we install a piece of software on our monitoring engine. And finally, the probes, which are used to check the service; every probe has a list of metrics that we need. So, this is a view from POEM; we're going to show you the demo. POEM has a list of probes that are pre-installed as a library, probes that are used for services already checked in other infrastructures. You can use them: you can search in our library and find the probes you want. If you cannot find the probe, there is documentation with clear guidelines about how to create and develop your own probe with the metrics you want. We are there to guide you and support you.
There is a process that we usually follow to create and develop a probe. First of all, we start by discussing what you want to check as a service owner — a discussion with the representatives and developers of each service in order to agree on the set of monitoring metrics you want for your service. Second, the developers of the service start developing the checks; the development life cycle includes the coding of the probe, documentation, testing and packaging. We have clear guidelines for all the steps and for what you can do to easily create your own probe. As soon as the probe is ready, we can start monitoring. We have a testing instance where we test your probe for a period of time — say, two weeks to one month — to check that everything is monitored properly. The life cycle is based on the following repetitive steps: first, guidelines from the service owners are created on how to use the probe, and the monitoring team makes the necessary operations; we test it in the test instance and verify that everything is working properly; then the probe is moved from the test instance to the production one and is used in our reports to start monitoring the services. Let's see what a metric is. You create your probes, and each probe has a list of metrics for your service. A metric is a simple piece of code that checks a specific functionality of a given service. For example, the Nagios Exchange Portal-WebCheck is a metric that checks that the site is responding — that the HTTP endpoint of the site is responding correctly. The hr.srce.CertLifetime is another metric that checks the validity of a certificate and says how many days are left until your certificate expires. For a specific service, we can list as many metrics as we want, so as to be sure that the functionality of this service is checked by the monitoring engine. For example, for our wiki example, we can use both of them.
We said that we're going to use both of them: the hr.srce.CertLifetime for the certificate validity, and for the HTTP check, the Portal-WebCheck. Here you can see a screenshot from POEM of how a metric is defined. And here are some facts: our library now has more than 110 probes and more than 350 metrics from 16 different repos, so most of the well-known metrics and probes are already included. Now that we know exactly which metrics we want to use for each service, we can start creating the profiles. Services and their associated metrics are grouped into profiles; these profiles then instruct the monitoring instances what kind of tests to execute for each service. So let's say that we want to create a critical metric profile that has two services: the EOSC wiki that we used in the example, and the ARGO Web UI. For the EOSC wiki we have two metrics, the CertLifetime and the Portal-WebCheck. For the ARGO Web UI we have the metric Argo-Web-AR, to check that it has results for availability and reliability, and the metric Argo-Web-Status, to check that we have results for status. This is a profile, actually: two services with two metrics each. This is a view from POEM. Now that we have the metrics and the metric profiles, we can continue with the aggregation profiles. We said that we have a topology in our infrastructure, and we have to define a profile of how these monitored items in our hierarchy are going to be grouped. So we have a project with tools and infrastructure: the tools are the wiki and the Web UI, and the infrastructure contains compute and archive services. In the aggregation profile, we must specify the operations between all these different levels of the topology. For example, for the ARGO Web UI, we have three endpoints.
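A metric profile like the one above boils down to a mapping from service type to metric names. A minimal sketch (field layout and metric names follow the spoken example, not the exact POEM/Web API schema):

```python
metric_profile = {
    "name": "critical",
    "services": [
        {"service": "eosc.wiki",
         "metrics": ["hr.srce.CertLifetime", "Portal-WebCheck"]},
        {"service": "argo.webui",
         "metrics": ["Argo-Web-AR", "Argo-Web-Status"]},
    ],
}

def metrics_for(profile: dict, service_type: str) -> list:
    """Return the metrics the profile assigns to a given service type."""
    for entry in profile["services"]:
        if entry["service"] == service_type:
            return entry["metrics"]
    return []

print(metrics_for(metric_profile, "eosc.wiki"))
```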
And we say that if one of these endpoints is working properly, then the ARGO Web UI is working properly; and for the tools, we need both the wiki and the Web UI to work in order to say that the tools are functioning properly. This is the aggregation profile, where you define how monitored items are grouped and form hierarchies. And finally, we have the operations profile: how two different statuses are combined. In principle, this is the profile where we define how AND and OR operations are performed between status values. We have a default operations profile which says that when you combine OK and CRITICAL, the computed result is CRITICAL; when you combine OK and WARNING, the computed result is WARNING; and when you combine OK and OK, the computed result is OK. And now that we have all of these defined in POEM, let's have a demo and see how POEM works. Emir? — So let me start sharing again. Let me mute this one. Can you see my screen now? OK, so this is POEM. Again, as in the case of GOCDB, I'm going to use the AAI proxy to log in with my national identity. Again — remember me — just continue. And I've reached the POEM interface. Themis already covered what it does, so I'll just show you the live demo, and then we'll post you the link so you can browse around by yourself. We start from the probes. In my first demo, we showed you what the world that you want to monitor looks like: it's split into three sites, and every site has several service endpoints, which are defined as a service type and a hostname. So now we know what we want to monitor. This is a slightly more difficult part, where we define how we actually get to monitor these things. We will start from the probes. As you've heard, we already have a bunch of probes. You've also heard that we mentioned Nagios, and Themis mentioned the guidelines that we defined.
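The default operations profile described here (OK+CRITICAL → CRITICAL, OK+WARNING → WARNING, OK+OK → OK) amounts to "the worse status wins" for AND, and "the better status wins" for OR. An illustrative sketch, limited to the three statuses mentioned:

```python
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}

def combine_and(a: str, b: str) -> str:
    """AND: the worse (more severe) of the two statuses wins."""
    return a if SEVERITY[a] >= SEVERITY[b] else b

def combine_or(a: str, b: str) -> str:
    """OR: the better (less severe) of the two statuses wins."""
    return a if SEVERITY[a] <= SEVERITY[b] else b

print(combine_and("OK", "CRITICAL"))  # CRITICAL
print(combine_and("OK", "WARNING"))   # WARNING
print(combine_or("OK", "CRITICAL"))   # OK — enough for a high-availability pair
```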
So basically these are Nagios probes that we use; they follow the Nagios API. As long as you have something that can generate Nagios-friendly statuses — which is basically four different exit states — out of any piece of code, we can plug you in. Now, this view shows you all these 100-something probes that we have. In this case, we will use the very simple check_http. This is one of the standard plugins available in the EPEL repository; it's widely used — the check_http probe. I'll just click here so you can see what kind of information you can get out of it. You get the version that we configure for you, you get a link to the repository of this particular probe, and you get a link to the documentation, so you can go and see what different metrics this particular probe can provide for you. And then, finally — this is an important part — you get to see in which metrics it is used. We call them metric templates here, because you use metrics when you define your profiles and when you define how you want to monitor your services, but we also have a library of all metrics, where they are called metric templates. Here you can see all the different metrics in which this particular probe is used. The difference between them is basically that the parameters are slightly different — there's a set of things you can tune when you're using the probe, and that's what distinguishes each particular metric. So when you start using it, you will just come here and pick all the probes that you need; and if something is missing, you'll develop it yourself — with our assistance, of course — and we will add it to this repository. Going a step forward: as I said, there's a repository — a library — of all the metrics that you can use as a tenant. There's a higher number mentioned there. My link is a bit slow, it seems; otherwise this just runs.
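The "four different exit states" refer to the Nagios plugin convention: print one status line and exit with code 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). A minimal illustrative probe in that style (not one of the real ARGO probes):

```python
#!/usr/bin/env python3
"""Minimal Nagios-style probe: print one status line, exit 0/1/2/3."""
import sys
import urllib.request

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_http(url: str, timeout: int = 30):
    """Return (exit_code, status_line) for a plain HTTP(S) check."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status == 200:
                return OK, f"OK - {url} answered with 200"
            return WARNING, f"WARNING - {url} answered with {resp.status}"
    except Exception as exc:
        return CRITICAL, f"CRITICAL - {url} unreachable: {exc}"

if __name__ == "__main__" and len(sys.argv) > 1:
    code, line = check_http(sys.argv[1])
    print(line)     # Nagios reads the first line of output...
    sys.exit(code)  # ...and the exit code decides the status
```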
Right now, the monitoring engine supports CentOS 6 and CentOS 7 packages, so let's pick CentOS 7. For example, we mentioned the certificate lifetime — the certificate validity test. Here you basically select the metrics that you want to use, so let's try to add this one. You select any metric from the library that you want to use as a tenant and just click Import. It will say something — probably we already imported it. And we move forward to the list of metrics. Here you see the metrics that you can use in the profiles — that you can use to monitor your services. Right now, we added the cert validity. The cert validity probe basically goes and checks whether the certificate on your HTTPS, or any TLS-speaking service, is still valid. We can see that it's using this particular package. There's all kinds of other information, but all of this comes prepared for you, so you don't have to do anything here. There is one thing that you can tune, and that is the config part. One parameter is the interval: the interval tells ARGO how often this service should be checked. By default it's 240 minutes; we do not have that much time for this demo, so we will just change this to 1, so every minute it will go and ask how much time is left on this particular certificate. Sorry, I got lost there for a bit. Now, max check attempts — these are internals for Nagios: when Nagios spots an error, it can change the check frequency. In this particular case, we will just say max check attempts is 1, so the frequency is always going to be 1 minute. Finally, there's the timeout: how much time you want to give the service to reply. You don't want to put 60, because 60 seconds is the same as the interval, so you just put 30. This one is in seconds; the retry and check intervals are in minutes. Finally, I just click Save. Yes, this. So I tuned the metric a bit, and here I can see all the other ones.
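For reference, the tunables just mentioned boil down to something like the following (the key names approximate what the demo shows; note the mixed units — the timeout is in seconds, the intervals in minutes):

```python
# Illustrative metric configuration, as tuned in the demo:
config = {
    "interval": 1,           # minutes between regular checks (default is 240)
    "maxCheckAttempts": 1,   # checks before an error state is confirmed
    "retryInterval": 1,      # minutes between rechecks after an error
    "timeout": 30,           # SECONDS the service gets to reply
}

# The timeout should stay below the check interval (here 30 s < 1 min):
assert config["timeout"] < config["interval"] * 60
```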
I'm not going to go through all of them; instead I'm going to move on to profiles. As Themis said, the profiles allow you to define how you want to monitor different service types. You can have as many profiles as you want here; of course, we just have one for this demo — we don't have that much time. You'll see this part is much simpler than all the others, because you have already defined the metrics, the probes, and so forth. Here, you just say: I want to use these metrics against these service types. So here, for example, I can say I want to check my web portal service type. Again, you have auto-complete here, so you don't have to remember what the service types look like. And the same goes here: we're going to use the cert validity. I'm basically saying: for all the endpoints that are of type web portal, also check whether their certificate is good. So — Save. Yes. Sorry. And that's it for the metric profile. This brings me to my last point, and that's the aggregation profile. This is the complicated hierarchy that Themis showed, and this is really a rich part of our system: it allows you to aggregate statuses of things in different manners. This part gets a little bit more complicated, but only because it does all these cool things. The first thing is that, as you've seen, individual service types will have multiple metrics on them. So when defining the status of an individual service of a given service type, you can do an AND or an OR between the individual metrics. You could say that all the metrics I'm running against the web portal have to fail in order for me to say that this is a failure — or at least one of them. This was described in the example: in this case, we check whether the endpoint speaks HTTP, and we check whether it has a valid certificate.
So we say both have to be fulfilled in order for this particular service type to be valid — I put an AND here. Then I define different groups; this was also shown in the example. Here I say that individual sites will provide two groups of, let's say, capabilities: they provide a portal, but they also provide IGL, which is a specific type of service. In these groups, we can define multiple service types that belong to a group. For all of you who come from the EGI world: if this were a compute capability, here you would see CREAM-CE, ARC-CE, HTCondor-CE; and a storage capability would contain WebDAV, SRM, GridFTP, and so forth. Here we only have a few. Then, again, you say: what's the logical operation between individual endpoints of a given service type? This was Themis's example: if you have two endpoints of type web portal, should I do an AND or an OR? If it's a high availability setup, you do an OR, because you just need one of them to work; if they provide different things, then you put an AND here as well. And then, again, when you calculate the availability of your whole site, you can choose whether to do an AND or an OR between the individual groups. And that's pretty much it. Any questions? Sorry, this was very fast. We will give you a link so you can browse around by yourself, and there's also documentation. Yes — Kostas added the links to the chat. Any questions? You should all be able to log in with Check-in AAI, and you will have read-only access. So, any questions? Can I proceed with the next component? OK, since there are no questions, let's continue. Can you see my screen? I hope so. OK: now we have the topology of the infrastructure, we have what we want to check — the metric profile — how monitored items are grouped — the aggregation profile — and how the operations are performed.
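The whole AND/OR hierarchy — metrics within an endpoint, endpoints within a service type, service types within groups, groups within a site — can be sketched as one small recursive fold (purely illustrative, not the compute engine's implementation):

```python
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}

def evaluate(node):
    """Leaves are status strings; inner nodes fold children with AND/OR."""
    if isinstance(node, str):
        return node
    statuses = [evaluate(child) for child in node["children"]]
    pick = max if node["operation"] == "AND" else min  # AND: worst wins
    return pick(statuses, key=lambda s: SEVERITY[s])

site = {
    "operation": "AND",          # all capability groups must work
    "children": [
        {"operation": "OR",      # high-availability portal endpoints
         "children": ["CRITICAL", "OK"]},
        {"operation": "AND",     # everything in this group must pass
         "children": ["OK", "OK"]},
    ],
}
print(evaluate(site))  # OK — one portal endpoint is down, but the OR absorbs it
```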
We can create our first report and start monitoring the infrastructure. The next component in the monitoring service is the monitoring engine. We usually have two engines for all our projects, one in Greece and one in Croatia, to support high availability. In our example, we have to check wiki1 and wiki2, so the monitoring engines start running the checks for these two services. The checks run from the one monitoring engine in Greece and also from the one in Croatia, at different intervals. One moment. How are these monitoring engines configured? It's not manual work; as I have already told you, everything comes from the other components. There is an auto-configuration that usually runs every half an hour — this is something we can configure; for the training, we set it to every two minutes. The auto-configuration runs and gets the topology — which services to check — from the topology tool, and the metrics and profiles — what to check for each service — from POEM. Every time a change happens, it is automatically reflected in both monitoring engines. So we don't have to do any manual work; we auto-configure both of our monitoring engines. This is where we create our... — Can you please unplug and replug your mic? Your voice is starting to get distorted. — Can you hear me now? — Much better, thank you. — It's my Mac and my microphone. So, now that the monitoring engine is configured, it starts producing the metric results. We need these metric results to start computing our status results and availability — to start the computations, let's say. The monitoring engines send the metric results to the compute engine to start the computation, and we use another component, the ARGO Messaging Service, to do that.
We have a plugin, which is another ARGO component, the AMS publisher, which acts as a bridge from the monitoring instance to the messaging service. It is a piece of software that runs on every monitoring instance, and it is responsible for forming messages in predefined schemas and dispatching them; these messages wrap up the results of the tests. All of this metric data is saved in an AVRO schema. We can see an example here: this is the schema we use for our metric data, and this is what the monitoring engine sends to the compute engine via the messaging service. I will try to show you some real data. I have opened Postman. We're using the Web API — which we'll show you later on — as the basic component that stores all the information, all our data. I can show you the latest data we get from the monitoring engine. I send the request, and I get this data back. You can see that the metric data has information about the endpoint group — CNRS — the service, which is the web portal, the endpoint of the service, the metric (the check that ran), the exact timestamp, the status (the result of the check), and a small summary. This is a list of metric data we get from the monitoring engine. So we have the data in a predefined format, and we can easily use it in our other components. Before we start computations, we have created a small health check for all our components: we have set up a number of checks to be sure that everything is working properly. Does data flow through AMS? Have we switched on the computation? Is data correctly deployed in HDFS? These are some of the questions we ask. I won't go into deeper detail, but I put them in the presentation for everyone who wants to look at them after the presentation on their own. Now that we have the topology, the metrics, and the data, we can start the computation.
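A metric-data message of the kind shown in the Postman demo looks roughly like the following (field names approximate what the demo displays, not the exact AVRO schema; the hostname is a placeholder):

```python
import json

# One metric result as it travels from a monitoring engine through the
# messaging service to the compute engine:
metric_result = {
    "endpoint_group": "CNRS",
    "service": "web portal",
    "hostname": "service1.eu",          # placeholder endpoint
    "metric": "cert-validity",          # the check that ran
    "timestamp": "2021-03-10T12:00:00Z",
    "status": "OK",
    "summary": "SSL certificate valid for another 30 days",
}

# In practice the record is AVRO-encoded; JSON is shown here for readability:
message = json.dumps(metric_result)
print(message)
```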
The ARGO compute engine is the main component of ARGO monitoring and is responsible for computing the status, availability, and reliability of all the services, using the metric results from the monitoring engines, the information about the topology of the infrastructure and about scheduled downtimes from the topology tool, and, optionally, information about the importance of each entity in the infrastructure, which we call weights. Using the metric data delivered to it, the compute engine flattens out the metric results and computes the service availability and reliability figures. Results are stored in a fast, reliable, distributed data store, and the computation runs on top of HDFS. Computations can be in batch and streaming form: we have batch and streaming jobs running all the time for the status and availability/reliability reports. The computation platform gives us the ability to easily scale from small and simple tenants to very large and complex ones. This is how our compute engine is configured. As I've already told you, we have the ARGO Web API as our main API for connecting the other services and storing all the data. From the ARGO Web API we get information about the reports and the profiles: metric, operations, aggregation, and thresholds. We store them in the compute engine and start executing the jobs. We execute both batch and streaming jobs, because some jobs need real-time results. These jobs run on Flink nodes, and we also get the data from the HDFS nodes. The results of our jobs, both batch and streaming, are stored in our data store, which is MongoDB. This is exactly what needs to be configured in the compute engine, but it's hidden from the user of our service; I just wanted you to see the type of information we use in the compute engine.
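A core piece of the status computation is combining metric results into higher-level statuses according to the operations profile. A minimal sketch of the AND/OR idea, assuming a simple severity ordering (the real profile and its operation tables are configurable, so this is illustrative only):

```python
# Sketch of combining statuses with an operations profile: AND yields the
# worst status (all metrics must be healthy), OR yields the best (one healthy
# instance is enough, e.g. for the high-availability wiki). The severity
# ordering below is an assumption, not the real configurable profile.

SEVERITY = ["OK", "WARNING", "UNKNOWN", "MISSING", "CRITICAL"]

def combine(op, statuses):
    ranked = sorted(statuses, key=SEVERITY.index)
    return ranked[-1] if op == "AND" else ranked[0]

# Endpoint status: all of its metrics combined with AND.
endpoint = combine("AND", ["OK", "CRITICAL", "OK"])
# Service status: its two HA endpoints combined with OR.
service = combine("OR", [endpoint, "OK"])
```

With this kind of rule, one failing wiki endpoint makes that endpoint CRITICAL, but the wiki service as a whole stays OK as long as the second endpoint answers.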
After the compute engine, there is the Web API, which is the core API of the monitoring service. It is used to connect the different components and exchange information between them. It can manage all the profiles we showed in POEM and manage the reports. It is used by the compute engine, the monitoring engine, and POEM, and through it you can browse status and availability/reliability reports. I have some results here: A/R result examples from the API, depicted in the UI in this way, and some status result examples from the Web API here, for the endpoints we declared in this training; the status is depicted in this way in the Web UI. The Web API supports different types of users, and we can create as many types of users as we want. Some of our users want access to their own data, so we give them view access to display their own data wherever they want. I have an example of this. For the training data we created a view user, "viewer"; here he is. You can see that for the ARGO service, for example, the availability and reliability is 100%. It's again 100% for the web portal, because we have declared in the topology that GRNET has a web portal, and for the site we can also get information, which is again 100% for the GRNET site. So I created a user that has view permissions on the site GRNET and its services; you can get the availability and reliability of the site and of all its services, and of course the status of the services. You can get this information and display it in your own service, to say "my availability for this period is 100%" or "the status of my service now is OK", and show information about the monitoring of your service. This is the information from the Web API, and now we can go to the Web UI, where Cyril will show you its different capabilities.
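Behind the 100% figures shown above sits the availability/reliability arithmetic. A sketch of it, under commonly used definitions that are an assumption here rather than the exact compute-engine code: UNKNOWN time is excluded from both figures, and scheduled downtime is additionally excluded from reliability:

```python
# Sketch of the A/R arithmetic (assumed definitions, not the exact engine
# code): availability ignores UNKNOWN time; reliability additionally ignores
# scheduled downtime. Times are minutes per status over the report period.

def ar(minutes_by_status, downtime=0):
    total = sum(minutes_by_status.values())
    up = minutes_by_status.get("OK", 0) + minutes_by_status.get("WARNING", 0)
    unknown = minutes_by_status.get("UNKNOWN", 0)
    availability = 100.0 * up / (total - unknown)
    reliability = 100.0 * up / (total - unknown - downtime)
    return round(availability, 2), round(reliability, 2)

# A day (1440 minutes) that was OK except for 72 CRITICAL minutes.
day = {"OK": 1368, "CRITICAL": 72}
a, r = ar(day)              # both 95.0 when the outage was unscheduled
a2, r2 = ar(day, downtime=72)  # reliability recovers if it was a downtime
```

This is why the UI can show a site with less-than-perfect availability but 100% reliability: the failed minutes fell inside a declared downtime.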
The Web UI is currently used by EGI and the European Open Science Cloud, where most of the services are available. Cyril? OK. Is it OK? Can you see the screen? Yes, Cyril. OK. So for this session we have prepared a demonstration and connected the UI to the training tenant. Here is the landing page. First, you can see information related to the topology: you can find what was shown by e-mail previously, that is, the three NGIs which have been declared, the three sites, and the two types of services. For all the endpoints which are monitored, you can retrieve different information, like the availability and reliability over the last 30 days. Here we have just a subset of information, because the tenant is new. You have the global availability and reliability over these last 30 days, and here you can find the last 500 checks with a summary by status; in this example we have 7 CRITICAL and the rest OK. This information is also available in a table with a more accurate level of detail: you can retrieve the output from each check, and there are different types of filters, so you can get a better idea of the last checks. On the last table you can see any downtimes which are declared in the topology database; in this case we have no downtimes. This is the summary for the global topology, but on the left you can also have the same dashboard for a sub-level of detail, for the sites: in this case, for example, go to GRNET and you have the same view, only for GRNET. Now I come back to the previous page. This is the basic page when you first arrive on the interface; then, in the menu, you have different possibilities. The dashboard is the landing page, and from there you can go to the availability/reliability page. This page is a summary of the availability values over the last 5 months, with a summary by NGI. As for the red and green ones, there are
threshold values defined by the tenant: if a value is under 80%, it's red. So depending on the thresholds declared for the infrastructure, you will see red or green colors, depending on the level of availability and reliability. For the last 4 months we have no data, because it's a new tenant, as I was saying previously. You can go through the charts, the same information but displayed a different way, and on each table you have the possibility to copy the data, to export it, and to have it also as PDF. Then you can go to a different level of detail: you click on the name of one NGI and it takes you to the sites, so it's exactly the same type of information, but at the site level, and you can access the chart as well. Then you can go to the different services, sorry, and finally to the endpoints. What is new in these last two pages is that you also have a link to the status page, so if I click here I will go directly to the corresponding statuses on the status page. I think that's all for now for this part, so now we will go to the status page. It shows, I would say, the same behavior: we are displaying the statuses at the NGI level, with different colors corresponding to the different possible statuses; for example, for CRITICAL we have a red color and for OK a green one, and in this specific case we have the MISSING status, because the tenant is just new. Then you can click on the bar to go to the next level of detail, and the same further down: you reach the endpoint, then the metric level, and at the last level you have the detail of the metric, with the output and, in case of failure, the reason, a summary of the failure for example. So, going back to the main page, I think I've explained everything; you also have the possibility to slide one day into the past with the little icon on the left. If I can interrupt here: as you can see, the middle row now has NB3 CNRS with a bit of missing data for the previous
hours, but at the far end you will see that the color has changed. This is the new endpoint that Emir added, and the new metric that Emir changed and added to all the sites, so you can see it changed from MISSING to red, and then to green when it finally came through. Then there is another possibility in the menu, the custom report, which is basically custom access to the different information. Instead of going to the main level of granularity, which is NGI in our case, you can directly select a site or a group, depending on your topology, and then you can select what you want to see: for example, availability and reliability with daily values, availability and reliability with monthly values, or statuses. Then you can select a given period, using the predefined periods, or you can add a custom range if you want. So I will select daily values, for example for site NB3 CNRS... OK, I will try another one, because there is no data for this one since it's quite new. The last 7 days, I think, will be better. OK, that's better now. Basically, that's the same information as shown previously; the only difference is that we have a value per day, and you can also see additional information, like the downtime and unknown-status percentages. In that case you can still copy and export the information, you still have the charts, and you can drill down to the details and get information at the next granularity. If I do the same for a custom report with status, you will have the same information, and still the same process: you can click and go to the next level of detail. Then, on the left, you also have a menu with profile details. You can find different information there, like the topology, the same as on the main landing page but presented a little bit differently, a description of the tenant, and you can retrieve the metric profile, that is, the different metrics associated with the services which are registered, and the aggregation profile, the way we are
computing the different metrics, with the AND/OR operations we apply on them. So that's all for the results you can find in the UI. Then you have different documentation: the first one is the documentation of the UI, where you can find information related to the different pages presented; there is also an external link to the ARGO documentation, which is the more global documentation; and then there are the terms of use of the different components of ARGO. That's all for me. Perfect, and I will share my screen for the last time. Our last component is the notification service. Whether you are aware of it or not, if there is a problem with your service, an alert should be sent. We analyze the monitoring results and send alerts based on a set of rules. You can see an example of an e-mail alert here, with the site Budapest which was CRITICAL; you can see the endpoints affected and the status of all the different endpoints of Budapest. These are real-time status events that are the basis of the alerts; they are generated in the compute engine based on a set of rules that we have defined. You can register (opt in) to the alerts by setting the notify flag in the topology tool, per site or per service. The latest change is to consolidate the alerts and send fewer e-mails with richer information. So what we actually did, instead of sending just an e-mail saying that your site is critical, is try to inform the owner of the site what the problems are and why the site is critical: the site Budapest became critical at this time, the endpoint affected is the SRM endpoint, and the metric that created the problem was SRM-Put. There is also more information about the problem, with the summary, "CRITICAL: file was not copied to SRM", and the whole message returned from the check. Here you can see that we have the summary and the whole message with the exact problem, tied to the SRM endpoint and to the SRM-Put metric. In the alert status summary, we notify that the site is affected due to a service
endpoint and a metric, like I showed you in the previous image. What's the status of the rest of the site's services and endpoints? Can we provide a summary? Of course; that's the extra information that we want to provide. You are the owner of the site Budapest and you have all these endpoints; now you have a problem at the SRM endpoint grid113, while all the other endpoints, the site BDII and the CREAM CE endpoints, are healthy, and you can see in one e-mail the summary of all the endpoints you have in your site. An example of a metric summary is the status of the metrics: you have an endpoint with multiple metrics, let's say, for example, this endpoint with all these different SRM metrics, and you can see what's going on with all your metrics, the endpoint and the metrics that we use. So that's what we did with all the alerts: we tried to create a summary and a view with just one e-mail. If you are a site owner, you get the information that all, or some, of your endpoints are healthy and some are not, and if you are an endpoint owner, you see the problem with particular metrics of your endpoint. That's all for me on the notification service; we can have a discussion at the end, and the slides have links to all of our documentation sites. So if you have any questions we can start; we have 20 minutes to discuss whatever you want. Do you think I should go to the Slido? Just one moment. While Themis is looking for the page to display the results from Slido, you can go to the Slido link I posted in the beginning and start filling in the poll. I have a question for Themis. So we do have a question with regard to where the e-mails go. The e-mails go where the topology tells us to; it depends on the endpoint and on the site contact points. I can show it in GOCDB. Yes, please. Let me share. I'm going to take over.
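The consolidation idea described above, one richer e-mail per site instead of one e-mail per status change, can be sketched as follows; the event tuples, wording, and hostnames are illustrative, not the real notification-service format:

```python
from collections import defaultdict

# Sketch of alert consolidation: group status events by site and build one
# summary body per site. All structures and names here are illustrative.

def consolidate(events):
    """events: (site, endpoint, metric, status, summary) -> e-mail body per site."""
    by_site = defaultdict(list)
    for site, endpoint, metric, status, summary in events:
        by_site[site].append(f"  {endpoint} / {metric}: {status} ({summary})")
    return {site: f"Site {site} status summary:\n" + "\n".join(lines)
            for site, lines in by_site.items()}

mails = consolidate([
    ("BUDAPEST", "grid113.example.eu", "SRM-Put", "CRITICAL",
     "File was not copied to SRM"),
    ("BUDAPEST", "bdii.example.eu", "BDII-Check", "OK", "entries found"),
])
```

The site owner then receives a single message showing both the failing SRM metric and the healthy endpoints, instead of piecing the picture together from separate alerts.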
To answer the question, I think it's best to show you. I ran through this really quickly, but you have a bunch of sites; I added the service endpoint here, and here you can see the contact e-mail. In this particular case, if you said that you want to receive notifications, so for example here in GOCDB you say "I want to receive notifications", then notifications will be sent to this address. In addition, in certain cases you can define an additional e-mail for individual service endpoints, and again, for individual service endpoints you can say whether you want to receive notifications or not. This gives you different levels of information: if you receive notifications at the service level, you get a notification that this specific service has an issue, which metric it is affected by, and the result of the probe assigned to this metric. If you get a notification at the site level, it gives you the current status of all of the services or endpoints you have on your site when one of them fails. Does that answer your question? He said thanks on Slido: "It answers my question. Thank you." While Themis is now showing the Slido, can you please go in and fill in the first question; we'd like to understand what your role is, or which category you belong to, from the main ones. Sorry? Is the monitoring information publicly available to the public in general, or only to the service provider? It's up to the tenant. The UI is fully customizable and each tenant can choose; the default we have now for all our tenants is that everything is fully public, due to our policy of following open access. But there is no problem adding a restriction with AAI or something similar? We've done that before, but it was chosen as a policy to have all the results open. I don't see anybody voting, so I think that's the next question. Which is? If you could please answer the poll. That's what Costas is saying, right? Yeah.
We would like to know the category you belong to, and whether you would like to use the service. Themis will go through the questions one by one; please fill in the first one. We'll give it one minute, and then we'll move to the next one, because I think not all of them are live at the same time. No problem. It's my first time using Slido, so I don't know how it works. Me neither. The link is in the chat, right? Yeah, I can post it again if you want, because it's a bit tricky to get there. I think we can move on to the next question. Yeah. Yes. I'll wait for that. No? Maybe? Why maybe? Who said maybe? I want to understand: why not? What functionality do you miss? I don't know. So, how do I raise my hand? Sorry, yeah, it was me, "maybe". I already filled in the poll before the presentation, but the demo was very nice. I think it's very good functionality, and it provides functionality (I'm representing Radio Astronomy) that we certainly partly lack. We have a lot of sysadmin monitoring, not this service-level monitoring, and given the demo, we would certainly share it with others. The "maybe" is just because, well, it's the first time I really had a good look, and we'll have to consider it and share it with other people before deciding. We would be more than happy, if you want, to share more information or do a sort of demo for you. OK, much appreciated. Yeah, now that we have the training part... We have set up a training instance, so we can easily create a demo for your services, and whoever wants to actually play around with it can contact us; we can give you access to the demo instance so that you can see how it works. Moving on to the next question, this one seems to be a little bit difficult, but we'd like to understand how you would use ARGO, and you can see that service providers and sysadmins are the ones that seem to be most interested in it.
Now, from what we demonstrated, it also has value for funding agencies or, let's say, service owners and higher-level management people, so they can understand exactly what the quality of the offered services is and, if people have complaints, why. We try to gather all this information and present it in nice reports every month, to show exactly the quality of the services. Let's go to the more open, and more interesting, question: please provide your feedback on ARGO and its functionality. What feature do you miss? What would you like to see more of? "It's monitoring services." "It's easy to use." "Interesting service and offer." I don't see any feedback about a feature that might be missed, because we cover all of that. Does anybody want to comment live or add some more info? Interesting question: I'm asked, as a service owner interested in using ARGO, what are the first steps? It's an interesting question, but it doesn't have a simple answer; it's a little bit tricky. If you're part of one of the big infrastructures, EGI, EUDAT, or EOSC-hub in general, or affiliated with them, then it's quite easy: you just register with their topology providers, which are GOCDB and DPMT, and you will start to be monitored. If you are not, and you want to appear on your own as a different tenant, then we need to discuss how we can support you on this. In general, this is an option that is going to be supported in the future, but it's not for free; we need to come to some kind of agreement, depending on the size of your infrastructure, and understand what needs to be done. So the initial answer is: open a ticket to us on the EOSC-hub helpdesk and we'll see how we can assist you. Does that answer your question, Amanda? No problem. OK, moving on: do we have any other questions that you would like answered?
So if we don't have anything else, then I have only one more poll for you, which is really valuable for us, not so much for you: I want you to give us your feedback, thinking back on the presentation and the demo we did. Thanks to those who already answered, and thanks to the others who provided feedback in the comments. You will find the presentation online; I will add our contact details and where you can find more information about ARGO, and I will upload the presentation to the EOSC-hub event site. Thank you. Thank you all. That will be all from us. If you have any more questions, feel free to contact us whenever you want, and the same if you want to play with the demo that we have created. Bye-bye. Thank you.