…and the session about the achievements of the XDC project. Let me also thank the organizers of this event for hosting us and for hosting this hour and a half, in which we will try to present some of the achievements of the project. The chair of the session will be Giacinto Donvito, whom you probably already know; he is also the technical coordinator of the eXtreme DataCloud project. So I would like to leave the floor to Giacinto for a short introduction to the session and a presentation of the speakers, and then I will be back to briefly introduce the eXtreme DataCloud project. So Giacinto, if you want to say something.

Yeah, good afternoon, everybody. This is Giacinto speaking. Again, welcome all to this XDC session. This session will mainly be focused on letting you see the services and components we worked on deeply during the lifetime of the project. The project ended at the end of April, and we developed eXtreme DataCloud solutions for different communities. We will try to show the outcome of the project during the demos you will see, and we will also try to provide a bit more technical detail on how those services work, thanks to the presentations. We hope to make this an interactive session; even if, being remote, that can be a bit difficult, please just raise your hand to ask a question, or use the chat, and I will try to pass the questions on, or give the right to speak to the people who want to step in and ask questions. So feel free to ask whatever could be useful for your use cases, your environment, your organization. We are also trying to gather feedback from you that helps us understand how this fits your requirements as a user community or service provider. So please try to be interactive. During the session I will post in the chat window, a few times, the link to the Slido, where you can interact with us through the questions we have already prepared for you, and this will obviously help us get more in contact with your feedback. So, to move on quickly, I will pass the floor to Daniele, who will present the project as a whole. Daniele?

Yes, thank you, Giacinto. Let me share my presentation; it is also uploaded to the agenda. Okay, I think you are seeing my slides. This is a very brief introduction to the project, just to give a little background for the next presentations. I promise I will speak for about 10 minutes, not more. XDC is a software development and integration project, and the focus is on quality of service and policy-driven data management. The approach was to use already existing, production-quality data management services; you can see the logos of those services on this slide. What we did in XDC was to add to this service toolbox the missing functionality requested by the research communities represented in the project consortium. XDC is an INDIGO-DataCloud follow-up project, so we inherited many technologies that were already developed in that project, and we also use technologies provided by the partners of the project. Our approach is to release in our internal software repository, but of course we want to, and we will, push back our developments into the upstream repositories. This is done in order to guarantee sustainability after the end of the project. And in fact the project, as Giacinto already anticipated, has already ended; it ended last month, at the end of April.
Okay, five user communities have driven the developments of XDC: CTA, the European XFEL, WLCG, LifeWatch, and ECRIN for what concerns medical research. We also kept an eye on the long tail of science, trying to provide easy-to-use web interfaces to some of our developed services, for instance the Orchestrator dashboard and all the web interfaces provided by the Onedata system. As I said, the focus is on policy-driven data management, so orchestration of the data management, to orchestrate the data lifecycle on the distributed infrastructure. Pre-processing during ingestion is another topic addressed by the project, as is metadata management: we had a lot of requirements concerning metadata management, at least from three of the five user communities represented in the consortium, CTA, ECRIN and LifeWatch, all of which provided requirements for metadata management. Then data management based on storage events, and I will say something more about this later on, and then smart caching and sensitive data handling. Okay, I leave these slides for reference, just to say again that the project has just ended. It was a small project, only eight partners from seven countries, with three million euros of total budget that we received from the European Commission under the EINFRA-21 call of the Horizon 2020 framework. The main achievements of the project are contained in two project releases: the first one, XDC-1, was released last year, and XDC-2 was released in March. There are a lot of technical highlights for both releases; you can check the release notes for both of them. Ten components were released in each release. As I said, we used our internal repositories, but all the developments will be pushed back to the upstream repositories, so if you are already using the services developed in XDC, you will get our developments as soon as they reach the upstream repositories; it is just a matter of waiting some weeks. Okay, I would like to stick to just a few topics and technical highlights addressed by these releases. First of all, the new-generation user authentication system that was added to almost all the XDC components. I have stolen this slide from my colleague Andrea Ceccanti, who presented it at the CHEP conference a couple of years ago. It depicts the transition from the older system, based on X.509 certificates, towards a more modern system based on OpenID Connect token authentication. This was implemented in almost all the components of our architecture, also respecting the whole workflow chain for all the components involved in the orchestration chain, and the implementation is based on the INDIGO IAM solution. Okay, as I said, the orchestration workflow for the data lifecycle is one of the main topics of the project, and we worked with a two-fold approach. What we call the approach from the top is where the user interacts directly with the orchestration components, the INDIGO Orchestrator and Rucio in our case: the user can inject data management policies into these components by creating TOSCA templates, which are injected into the Orchestrator, and the Orchestrator is attached to Rucio, which enforces the actual policies. The second approach is from the bottom: the users interact directly with the storage systems, and the storage systems are able to raise storage event notifications.
These event notifications are injected into a message bus and reach the orchestration components, the Orchestrator and Rucio, which apply the policies that had to be previously uploaded by the users. The reference implementation of this was done in the dCache storage system and also in Onedata. And let me add here that the INDIGO Orchestrator is a PaaS orchestrator, so it is a component that is also able to orchestrate computing resources; this is the link between the computing orchestration and the data management in our architecture. It is able to perform what I called at the beginning pre-processing during ingestion, thanks to the data-aware scheduling it can perform. A few words about the XDC caching solutions. We worked again along two streams: one based on the XRootD protocol, exploiting the XCache system from the XRootD developers, and the other based on the HTTP protocol. In the latter case we exploited the Nginx caching system, which was extended to support both X.509 and OpenID Connect token authentication. The use case for these was the creation of federated, distributed caches, for instance at the national level, and also the inclusion of a diskless site into a distributed infrastructure, for instance a site created on a public cloud provider. What we delivered in this case are mainly recipes to deploy the caches and the distributed caches; we did a few developments here, mainly on the Nginx plugins, but what we release are recipes based on Kubernetes and containers to deploy the caches. We worked a lot on Onedata, which is a key component for our project. The first part of the project was devoted to improving the performance, usability and scalability of the system, and then many new functionalities were added; Łukasz will present some of these new functionalities later on today. They implemented the QoS-based mechanism for automatic data placement, the introduction of harvesters for large-scale metadata harvesting, and the introduction of an Elasticsearch engine fed with the metadata changes, which was exploited to support the ECRIN use case; our colleague Sergei will show the metadata repository that was created for the ECRIN community. So a lot of work was done inside Onedata to support metadata handling. The last slide is about QoS transitions again, and in particular the QoS transition system that was introduced in the EOS storage system, the storage system used at CERN, but not only there; CERN is the main developer of this system. We added a new QoS management interface based on the CDMI protocol, inherited from the INDIGO-DataCloud project, and below this CDMI interface the EOS team added support for bulk QoS changes. One limitation of the CDMI protocol is that the transition can only be done file by file, one file at a time, and the big piece of work here was done in order to support bulk QoS transitions: a transition engine was developed and included in the EOS system. I do not have a conclusion slide, but I think that, more or less, I managed to stay on time. I just leave you with the contact details for the project, and from our website you can download the eXtreme DataCloud service catalogue, which provides more details on the developments that have been done within XDC on these already existing, very well known, production-quality components that are widely used in the current European, and not only European, infrastructures. So that is all from my side, and thank you for listening. I stop sharing.
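To picture the CDMI-based QoS interface just described, the following is a rough sketch of how a client might list the QoS classes a storage advertises and request a transition for a file. The hostname, paths and the exact field names are assumptions based on the CDMI capability model as reused in the INDIGO/XDC QoS extension, not details taken from the talk.

```python
import requests

# Illustrative endpoint and object path -- assumptions, not taken from the talk.
CDMI_URL = "https://eos.example.org:8443"
HEADERS = {
    "X-CDMI-Specification-Version": "1.1.1",
    "Authorization": "Bearer <oidc-token>",      # token issued e.g. by INDIGO IAM
}

# 1. List the QoS classes (capabilities) the storage advertises.
caps = requests.get(f"{CDMI_URL}/cdmi_capabilities/dataobject/",
                    headers=HEADERS).json()
print(caps.get("children"))        # e.g. ["disk", "tape", "disk+tape"] (illustrative)

# 2. Inspect the current QoS class of one file.
obj = requests.get(f"{CDMI_URL}/eos/space/user/file.root",
                   headers={**HEADERS, "Accept": "application/cdmi-object"}).json()
print(obj.get("capabilitiesURI"))  # capabilities URI currently associated with the object

# 3. Request a transition by pointing the object at another capabilities URI.
#    (Field name assumed from the CDMI capability model; one request per file,
#     which is exactly the per-file limitation the EOS bulk engine works around.)
requests.put(
    f"{CDMI_URL}/eos/space/user/file.root",
    headers={**HEADERS, "Content-Type": "application/cdmi-object"},
    json={"capabilitiesURI": "/cdmi_capabilities/dataobject/tape"},
)
```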
I don't know if you already have questions; otherwise I leave the floor to Giacinto to introduce the next speaker.

Thank you, Daniele. I guess there are no questions, or at least no hand was raised and no question came through the chat. So let's go to the first demo of the round. This is mostly an offline demo: Marica will present the integration work done in order to make the PaaS orchestration from INDIGO and the Rucio data orchestration, from CERN and the LHC experiments, work together. This was one of the last achievements of the XDC project. Marica, I guess you are ready to share the slides, so I will give you the floor.

Well, I think I am already sharing. I hope I am sharing. Yes, but we only see a small window with the slide, and the other windows are shared too, so maybe just increase the size... Yeah, that's perfect. It's okay. Okay, thank you.

Hello everyone. I am going to present a few slides about policy-driven data management through the integration of the PaaS Orchestrator with Rucio. I will go through this outline: first of all I will briefly recap the PaaS orchestration architecture and the main functionalities, then I will present the integration with Rucio, and I will focus on the pre-processing at data ingestion scenario. The PaaS Orchestrator is based on the developments carried out during the European Horizon 2020 INDIGO-DataCloud project, which started in 2015 and ended in 2017, whereas advanced features and important enhancements have been implemented in the framework of the DEEP Hybrid DataCloud project, especially for the part that concerns the exploitation of specialized resources for deep learning workloads, and of the eXtreme DataCloud project for what concerns the data orchestration functionalities. These two projects have just reached their end, but further improvements are still being added in the framework of the EOSC-hub project, which is still running. In particular, I will talk about some of the developments carried out in the XDC project. The orchestration system allows the coordination of the provisioning of complex virtualized compute and storage resources on different cloud management frameworks, both private clouds such as OpenStack and OpenNebula and public providers such as Amazon, Azure, Google Cloud and so on. Moreover, the Orchestrator is able to coordinate the deployment of dockerized long-running services and batch-like jobs on top of Mesos clusters, and recently it has also been extended in order to deal with the integration of HPC sites. The PaaS orchestration system implements an abstraction layer featuring advanced federation and scheduling capabilities that ensure transparent access to these heterogeneous IaaS environments and the selection of the best resource providers, based on the user requirements expressed in the TOSCA language and on other criteria like the user's service level agreements, the services' availability, some monitoring metrics, and the data location. This slide shows the high-level architecture of the PaaS Orchestrator, which is the core component of the INDIGO PaaS layer. As you can see, the architecture is modular and consists of different plugins for managing the interactions and the integration with different compute and storage services. The orange circle is used to highlight the enhancements implemented in the XDC project: in particular, we have extended the data placement connectors that are used for the data-aware scheduling.
The Orchestrator is able to submit the processing job to one of the compute centers that hold the data specified by the user, enabling processing near the data. And we have developed completely from scratch the data management connector that is used to steer the data management system implemented by Rucio. In the next slide I will talk about Rucio, just a few words to say that Rucio is the data management system initially developed at CERN for the ATLAS experiment. It is able to manage large amounts of data on geographically distributed storage systems, and it implements declarative data management: the user says what they want and Rucio figures out how to do it. For example, I can tell Rucio that I want three copies of my data on three different sites and one copy on tape, and Rucio will be able to orchestrate the data movement and the replication of the data. So we have integrated Rucio with the Orchestrator in order to extend the Orchestrator's capabilities: the Orchestrator was initially an orchestrator of compute services, and we integrated it with Rucio in order to extend its functionalities also in terms of data orchestration. In this diagram I have sketched the main workflow for the pre-processing at data ingestion scenario. The user submits their workflow request in a TOSCA template, including some important information like the storage space to watch for incoming data, the application to be run on these incoming data, and the replication rule that must be enforced on the incoming or pre-processed data, which will be taken in charge by Rucio. The storage system holding the watched storage notifies the presence of new data by sending a message to the XDC message queue; this is a new component that has been included in the overall system architecture. The INDIGO Orchestrator listens on this message bus and receives the notifications from the storage services, and the ingested data are registered into Rucio, including the replication rule specified by the user. Then the Orchestrator selects the best compute site to perform the requested processing; to do so, it collects information from different sources concerning the service level agreements signed by the users with the different sites, the monitoring metrics, the storage endpoints and so on. At this point, the Orchestrator triggers a data movement through Rucio in order to copy the data to the selected compute center, and then it gets notified of the completion of the data transfer, again by listening on the Rucio message queue. Finally, the Orchestrator triggers the processing job by submitting the request to the computing clusters available at the site; in our case it is a Mesos cluster with the Chronos framework, which is able to manage dockerized batch-like jobs. Then, as soon as the job output is produced, its availability is notified again to the interested parties, in particular Rucio, via the XDC message bus, and finally the data generated by the processing step are automatically registered into Rucio. After that, Rucio can take care that the policies requested by the user are actually applied. For completeness' sake, here I have put a diagram showing the high-level architecture of the system: at the bottom you can see the different storage systems that send a notification to the message bus when a new storage event happens, for example when some new data are available. The same message bus is used by the Orchestrator and Rucio.
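As a concrete illustration of the declarative style described above ("three copies of my data on three different sites and one copy on tape"), a rule of that kind could be expressed roughly as follows with the Rucio Python client. The scope, dataset name and RSE expressions are invented for the example; the real rules in the XDC workflow are injected by the Orchestrator on the user's behalf rather than typed by hand.

```python
from rucio.client import Client  # Rucio Python client

client = Client()  # picks up the local Rucio configuration and credentials

# Example dataset identifier -- scope and name are invented for illustration.
dids = [{"scope": "user.jdoe", "name": "raw_2020_run42"}]

# Three replicas on sites matching a disk-type RSE expression...
client.add_replication_rule(dids=dids, copies=3, rse_expression="type=DISK")

# ...plus one copy on tape. Rucio works out which transfers are needed
# (via FTS, as mentioned later) to satisfy both rules.
client.add_replication_rule(dids=dids, copies=1, rse_expression="type=TAPE")
```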
The Orchestrator is able to consume the messages produced by the storage services and by Rucio itself. The replication rules are submitted to Rucio by the Orchestrator on behalf of the user. Under the hood, Rucio orchestrates the data movement through FTS, via an FTS plugin. Finally, I would like to underline that the whole stack uses OAuth 2.0 tokens issued by INDIGO IAM in order to manage the authentication and the authorization flows among the different services. Okay, this is my presentation; of course, if there are any questions, I will be happy to answer.

Okay, thank you, Marica. I don't see a question either on the Slido or in the chat. Is there any question specifically on this demo? I guess not. So we can move to the next one in the agenda, which is from ECRIN: Sergei, who will present the work done together with the Onedata team in order to provide solutions for the medical communities supported by ECRIN, in terms of providing metadata information about publications. Sergei, are you ready to share? Yeah, okay, I see your screen, and the slides are okay. Thank you.

Yeah. Okay, so hello everyone and welcome again to the eXtreme DataCloud workshop. Today I am happy to present the metadata repository for clinical study objects, for ECRIN, and not only for ECRIN but for the whole medical community. I would like to start from the problems which different clinical trial scientists may face, and why we decided to develop a metadata repository in general. The first slide lists the different problems. The first main problem is that clinical research studies and the data objects belonging to them are often scattered around: usually the information about clinical trials may be stored in different websites, publications, repositories, registries and databases, and this creates huge problems for scientists trying to find the necessary information across all this variety of data sources. The next big problem is that the mechanisms for gaining access vary between different places and different data objects. Some data sources provide APIs that define the data access, but some of them don't provide anything, and that creates a huge problem because scientists and researchers need to scrape the web pages to extract the necessary metadata information. There is also no agreed discovery metadata schema implemented and used for discovery; this is another big problem because, while the information and the metadata are extracted to be analyzed, they should be standardized, but each data source stores and provides the data in different formats and schemas. Those three problems may be summarized into one big problem: the findability of clinical studies and related data objects is a difficult and time-consuming process. Here I would also like to make a small comment about studies and data objects, about what a study and a data object are. The study is the main, core information about the clinical trial, which includes the title, the authors, the main dates of the clinical trial, the study type, the study status and all the core information of the clinical trial. The data objects are the files, a kind of extension of this core information, and may be represented by documents, web pages, XML files, spreadsheets and so on. So the data objects, I could say, complete the information about the clinical trials.
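To make the study versus data-object distinction concrete, a record pair in the repository could look roughly like the following. The field names and values here are illustrative only; the actual MDR JSON schema is documented on the ECRIN metadata repository wiki mentioned later in the session.

```python
# Illustrative only -- not the real MDR schema (see the ECRIN MDR wiki for that).
study = {
    "id": "NCT01234567",                      # trial registry ID (invented)
    "display_title": "Example cardiovascular outcome trial",
    "study_type": "Interventional",
    "study_status": "Completed",
    "linked_data_objects": ["obj-0001", "obj-0002"],
}

data_objects = [
    {
        "id": "obj-0001",
        "object_type": "Trial registry entry",
        "access_type": "Public",              # shown as a green light in the portal
        "url": "https://clinicaltrials.gov/ct2/show/NCT01234567",
    },
    {
        "id": "obj-0002",
        "object_type": "Journal article",
        "access_type": "Public",
        "url": "https://pubmed.ncbi.nlm.nih.gov/00000000/",   # placeholder link
    },
]
```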
So, as you may understand from the previous slide, we decided to maximize the discoverability of all the clinical trials and related data objects, to put all the metadata about them into a single system called the metadata repository, and to provide the links to the different data sources where the information about the clinical studies and data objects is available. By developing the metadata repository we decided to support the F, Findability, principle of the FAIR principles. We developed the metadata repository within the XDC project, exploiting mainly the Onedata system. Here we provide just the metadata, not the data itself. The main functionality, the search engine and the filtering functionality, is provided by our partners from Italy, INFN, and the Onedata solution implements the metadata collection and transport from multiple Oneproviders to the central Onezone service and provides the metadata management system. On this slide you can see the main architecture of the metadata repository, or rather not exactly the architecture but a general overview. At the top level of this schema you can see that we start from collecting and processing each data source separately and individually. After we finish processing and analyzing the original data, we extract this information and put it into individual databases. Then the ETL processes start; these ETL processes mainly include procedures such as standardization and de-duplication. After that, all these records become available in the core database containing studies and data objects; all the studies and data objects are linked to each other and put into the single system. Later on, we convert all these records into JSON format and push all the JSON files to the Onedata system, first of all to a Oneprovider, then to data spaces, and finally into the Onezone. At the final step of the whole system, all these files and metadata become available through the web portal. Just a few words about the data sources: for now we work mainly with four data sources, ClinicalTrials.gov, PubMed, BioLINCC and YODA. As you can see in the second column of this table, each data source provides a different extraction method, which I mentioned as a big problem for scientists. For example, ClinicalTrials.gov provides XML files, which we have to process manually, and PubMed provides an API or XML files as well. BioLINCC and YODA, for example, do not provide any API or XML files, so we have to scrape the content from the web pages. The total number of JSON files is about 800,000, and this is the status at the end of March 2020; by now the total number of JSON files is more than one million and includes several new data sources. Now it is time for the demonstration. First of all, to go to the metadata repository you have to type this address, mdr.org, into your browser; after that you will be redirected to the metadata repository. On this page you can see the metadata repository in general, a single-page web application. At the top of the page you can see the title of the metadata repository and the selection and search fields for the studies. The left panel includes the filters for studies and data objects.
In the bottom part of the page you can find the data sources and contributing organizations information, the disclaimer, and all the contact details. We also include help information, which describes all the mechanisms and procedures for searching here. Now I would like to demonstrate the functionality of the web portal, and I would like to start from the selection of a specific study. By default you can use different search modes to find a specific study. For example, if you select a specific study you will see the availability of different study ID types: you can find the study by the trial registry ID, by the funder's ID, the sponsor's ID and so on. So if I type here the study ID, which is unique for each clinical trial, and then click Find, I will see that the system has found the right clinical trial. After that I can clear the results and move to the next way of finding things. For example, I can type here that the title should contain COVID-19, which is quite relevant now, and search. The system found more than 1,000 records; it found 102,025 studies, and on the main page of the web portal only 1,000 of the records are loaded. You can click on each study and see the information about it, with the necessary information about the data objects, including the links and references to external sources. The green light here means that this data object, this link, is publicly available, so you can follow the link and find the necessary information. Also, if I use the filter section, I can, for example, deselect everything, and we will not see anything, or select only the necessary types of studies and see the results right after selecting the options. Now I clear the results again and try another search: we will try to find a specific cardiovascular clinical trial called OmniHeart. I found two clinical trials related to the OmniHeart study. As you can see, each clinical trial here contains the related data objects with useful links, and by clicking on them you will be redirected to the relevant page with the information or with the document. For example, you can see that there are several journal articles here; this one contains the trial registry entry and journal articles as well, and I can click on this link. Yeah, after that I was redirected to the PubMed source with the information and data objects related to the OmniHeart clinical trial. So I will go back to the presentation. When the metadata repository was developed and running as a demonstrator, we decided to evaluate usability and user satisfaction. First of all, we developed several questionnaires and protocols to ask people, especially from the medical scientific communities, whether they found the metadata repository really useful. The reports were published on the 7th of April, and the main answer was yes: they were really happy to use the metadata repository, they found it really useful and user friendly, and they confirmed that they found all the necessary information in it. We do not want to stop here, and we would like to continue the development and improvement of the metadata repository; we have spread this activity across two main projects, the EOSC-hub project and EOSC-Life.
But the activities that we will carry out within those two projects are quite different. Within the EOSC-hub project we will improve the portal, upgrade the metadata ingestion processes, develop Elasticsearch-based APIs, extend the metadata repository functionality and collect data on user actions and feedback. The main focus in the EOSC-Life project will be on the revision of the metadata model, the extension of the list of data sources, the preparation of an API supporting data access, and integration with the other EOSC-Life work packages, like the AI work package. Also, something which is very relevant right now: the metadata repository is helping with the COVID-19 crisis. We put the metadata repository in production for the ECRIN task force on COVID-19, which is accessible at the web page shown here. The metadata repository has also been linked as a related resource on the European COVID-19 Data Portal, it has been included in the recommendations of the FDA COVID-19 guidelines, which are under development and will be available at this link, and it has been proposed for inclusion in the Infectious Diseases Data Observatory. That's all from my side; thank you for your attention. If you have any questions, I will be happy to answer.

Thank you. There is a question from Tony. Hi, Tony. A very interesting one indeed, which is about the role of Onedata in this particular use case. Maybe, since we also have Łukasz connected, who will give a demonstration right after this, he could go deeper into the Onedata functions used for this use case.

Yeah, I can answer this: Onedata is a key component for the metadata repository here and, as Daniele mentioned before, Onedata in general is a key component for the whole eXtreme DataCloud activity. But for ECRIN especially, we use the Onedata system to store our metadata in JSON format, and the portal which I demonstrated here is created within the Onedata system. So yes, the metadata repository portal is publicly available, but it is created within the Onedata system; it is a kind of plugin, a kind of extension, of the Onedata system.

I will give a very short demo about what is going on behind the scenes of this project, I mean of this demo, of the MDR, a bit later on. Okay, so let's wait for that demo; soon after there will be a bit more information. There is also another question from Xeroe, I hope I pronounce the name correctly. The question is: you say you need to scrape YODA; weren't you able to use high roads for this?

As far as I know, we explored the YODA data source and found that the only practical way to extract the data was by scraping, because, although there is also the possibility to download records from YODA in CSV format, the information included in that CSV file is quite limited. If we scrape the web pages, we can extract more information, including the links to the necessary data objects, which are represented by documents and files. Thank you.

Is there any further comment or question? I guess not; I don't see any other feedback, neither from Slido nor... And please. Yes, please. Also, we have the ECRIN metadata repository wiki website, where you can find all the necessary information about the metadata repository, including the metadata standards we are using for our metadata schemas.
There you can find the JSON schema examples with the data descriptions, and the information about the individual data sources and the controlled terminology. That's all.

Okay, thank you. So, guessing there are no further questions for Sergei, I will leave the floor to Łukasz. I guess most of you already know Łukasz from the Onedata team. He will highlight the most interesting topics that came up within XDC on the Onedata side, and we will also see a very nice demo from Łukasz. So, Łukasz, I already see your slides shared, so I guess you are ready.

Thank you. I hope you can hear me well. So, good afternoon, this is Łukasz speaking. I will give you a very short overview and summary of the current status, of how the Onedata platform looks after the XDC project, which heavily supported our work. Just to summarize what Onedata is and what the ambitions of our platform are: we are a cross-cloud data processing platform, which delivers to the end user unified data access and processing abilities at large scale, covering access to hybrid resources in terms of different clouds as well as hybrid storages, different types of storage backend technologies. All of this should allow the user to easily access the data, manage them during processing, and deliver the data in a uniform way. We also extend our functionality not only to the data but to metadata management as well, so we deliver a consistent data and metadata management platform, which allows us to build a data discovery infrastructure based on distributed data, feeding the metadata changes into Elasticsearch; that was the subject of Sergei's presentation just before mine. So we were delivering data from several data sources into a centralized Elasticsearch, in an eventually consistent manner, and on top of that the discovery portal is built. Okay. I will give you, after a few internal slides, a very short demo of a few things which have changed in our platform in release 20.02, which is publicly available now. There will be four topics. The first is data management, just a very short introduction to what we are, for those who do not know what the Onedata platform is; and then the three completely new things whose development was supported by the XDC project: one of them is quality of service, the second is the new advanced token mechanism, and the third is data discovery. Okay, so I will switch to the live demo now and show you my screen. This is my demo infrastructure. For those who do not know us, one very important remark, because I get this question very often: this is a world you can build by yourself. This demo Onedata deployment is just a deployment of our component software stack under my domain authority. It is not like the Dropbox approach, where one central service is needed for the whole world; there can be completely different deployments, one of them is EGI DataHub, and there are completely different environments as well. So we do not need to be the ecosystem provider for you. But okay, coming back to my demo.
This is the entry page of my demo world, and it is connected to several external IdPs which allow me to enter and which are tightly integrated with our system's authentication, so I can log into the system, for instance, using EGI Check-in, which is one of the EOSC IdP systems, and it delivers me access to my world. We have a view of the data providers, meaning the storages and places where my data are currently managed by the platform itself, in distributed places, a distributed world: we have something in Kraków, in Italy and in Portugal. The identity authority also brought me some information about who I am and about my membership in specific groups; it was inherited automatically from EGI Check-in, and this is the place where the whole hierarchy is, and we can see my membership change depending on my authorization. I have very limited authorization here: I cannot see the whole structure of the tree when I come from EGI, but only my parts and the groups I am a member of. The groups form a big hierarchy of the EGI SSO organization, but these are the groups I am a member of; each of those groups might have some subgroups, but I do not have special rights in most of those groups to see the members. It depends. Everything is tightly connected with the single sign-on, but this is a data management platform, and it is fully distributed and based on a P2P principle, so we use the concept of a data collection, which in our system is called a space, a data space. This brings the idea of different spaces: I have access to a few data spaces, which might be shared among other users. These data spaces are currently shared only with me, but just for demonstration purposes, as you can see, I can share this particular data space with all of the demo users, so if you log into the system you will see it. Of course this brings some complications, as I can see who might actually see my data space, because those people are already members of the all-users group; if you log into the system, this user base list will keep growing. Okay, this is the first level of authorization, which gives us special control over what the roles of each group member, of each individual user, are in our system, so we can define precisely what my high-level role in that collection is. Of course, each of the collections might have a different pattern of access control, shaped by user and group membership, which gives me full freedom in sharing it among other groups of users collaborating ad hoc. But the principal, the major thing in our system, is where a particular data collection might be stored, and this is the P2P approach in our system: it gives me access to, and an overview of, the distributed infrastructure supporting a particular data collection. In this case, the data collection is named demo_XDC, and this particular collection is supported in three different locations. I have a different collection which is supported by two of the providers I have access to; of course this could be completely detached, it could be completely different providers with a different software stack, and these joined collections do not need to be connected at all.
It is a complete global mesh, with no central point of anything except the entry page, which is called Onezone. Of course, if you have data, you want to see the data, and we also provide a graphical web browser for the data collections; it is some sort of virtual file system. But everything I will be doing is possible not only through the graphical user interface but also through the REST API and a POSIX virtual file system, so I can mount my virtual file collections, access them through POSIX, do the processing on this collection with Python algorithms and so on, transferring things across different locations. If you look at this collection now, at demo_XDC, you are seeing a new graphical interface, which is also a result of the XDC project and which I probably have not presented to anybody yet. This is a new integrated data navigator, which brings in the idea of the locality of my data. Now I am connected to the Bari site; I can switch between the locations like tabs in my system, this one is actually connected to Kraków, and now I am back to Bari. But the idea of our system is, of course, delivering a global namespace: wherever you go, you should technically see exactly the same thing, with some delay compensation, because it is based on eventual consistency. So if I have a distributed file system like this, with a data collection like this, I can manage the data locality. This is what differentiates our system from the others, because we have the locality of data embedded. In this case the data come from an existing NFS collection, which was automatically registered into the Onedata world at Bari, and I have a Ceph storage in Kraków and an S3 storage at Lisbon. All of those technologies behind the scenes are completely hidden from the users. So I can tell the system: replicate this here. And in a few seconds you will see the replication is there; if I switch to Kraków I will see it faster, because there will be no delay in communication. So it is a matter of seconds until the data are replicated, depending on the size, and the delay before the first blocks arrive is very, very low. So no matter where I access my data, we will deliver them: if I access the data from Kraków, and now I am connected to Kraków, and as you see here the data are not there, I can still read this data and it will be delivered to me on the fly, which means there is a delivery and data caching approach. The data are now cached in Kraków because the replication happened. Of course, this opens a lot of opportunities, because we allow you to define jobs so that you can replicate whole directories and so on. But now I am coming to a new mechanism, which we introduced during the XDC project, which is quality of service. This mechanism allows me to define the quality of service of a specific file based on rules, and it is inherited from and inspired by Rucio. So you have rules built from special tags and logical operators, which allow me to say where the data should be and what the expected number of replicas and the rules of replication are. It applies to everything I create in a particular place, and it makes most sense on individual directories. So I will define it here.
I will say that country equals Italy, which reflects the current status; basically, this is the place where the files already are. It means the quality of service rule will be verified very quickly, because no replication is needed: the data are currently distributed in Italy. As you see, it is now green because everything is fine. But I can add another rule, and there can be many rules: country equals Poland, or a storage type. I could have a complex expression here, with brackets and logical operators, unions, intersections and so on. I will force that everything I have in this directory is properly replicated to Kraków, and as you see the replication happens very quickly, because the data distribution kicks in: everything now starts being replicated behind the scenes to Kraków, because that was my intention. So, whatever I create in my distributed ecosystem, the system tries to fulfil my quality of service rules. This gives a lot of freedom in terms of complex data management in a distributed system, because you can attach multiple quality of service expectations at different levels: at the level of the entire collection, or even at the level of a single directory or a single file. The most important thing is that we can ask the system, at any level, what the current fulfilment of the quality of service expectations is at this very moment. It is green if everything below is green; we use a quite sophisticated algorithm, so for very large collections you can ask about a particular subdirectory, and the system will tell you whether that particular subdirectory is already fulfilled, even though the whole quality of service expectation is not fulfilled yet, because that takes a lot of time. Of course, what I did requested some data transfer operations; as you can see here, there was only very small traffic because the files were transferred in a behind-the-scenes manner, and this is where you can observe the traffic generated by the quality of service.

The second new thing is that we redesigned the entire access token mechanism in our platform; we brought in a new mechanism to build new access tokens. As you know, when using our system you use tokens for authentication delegation: you can connect, a Oneclient can mount your virtual file system, and then you need to present yourself using a token. It could be an internal token from our system or an external token from an external IdP, but our internal tokens give you a lot of flexibility in terms of precisely limiting the token, because these tokens do not need to be very powerful; a plain token is like a total delegation of your identity. So we introduced a lot of things: you can define the region where the token can be used, for instance a particular continent or a particular country, or even an AS number from BGP. So if I know that INFN IPs are connected to the BGP AS number 137, that is my bet.
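Before moving on to tokens: the QoS expressions demonstrated a moment ago (country = Italy, country = Poland, storage type, combined with logical operators) could, roughly, be attached through a Oneprovider REST call along the following lines. This is an illustrative sketch; the endpoint, field names and the file identifier are assumptions for the example rather than details copied from the demo environment.

```python
import requests

# Assumed Oneprovider endpoint and identifiers -- illustrative only.
PROVIDER = "https://oneprovider.example.org"
TOKEN = "<onedata-access-token>"
FILE_ID = "<file-id-of-the-demo_XDC-directory>"

# First requirement: one replica on storage tagged with country=Italy
# (the place where the data already are, so it is fulfilled immediately).
requests.post(f"{PROVIDER}/api/v3/oneprovider/qos_requirements",
              headers={"X-Auth-Token": TOKEN},
              json={"fileId": FILE_ID, "replicasNum": 1,
                    "expression": "country=Italy"})

# Second requirement: force an additional replica in Poland; the system starts
# replicating behind the scenes until the expectation is fulfilled (shown green).
requests.post(f"{PROVIDER}/api/v3/oneprovider/qos_requirements",
              headers={"X-Auth-Token": TOKEN},
              json={"fileId": FILE_ID, "replicasNum": 1,
                    "expression": "country=Poland"})
```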
I prefer that to using just IP masks for access; IP masks are of course possible too. We can also specify for what purpose the token may be used, which limits the token, for example, to Oneclient, so it is not allowed to use the REST APIs and cannot delete or mess up your account if the token is leaked. But the most interesting caveats are in the data part here, because we can specify what the power of the token is in terms of data access. We can say that this token can only be used for read-only access to the data, so it blocks any writes; the rights I have in the system are overridden by the token, and this gives me read-only access only. Plus, I can dedicate the token to some specific paths, let's say brain scans or some other path, the demo path and so on, or even an individual object ID in the system. A token created this way is very limited and can be used only within these criteria, these limitations. I prepared such a token beforehand, a different, limited one, to demonstrate quickly during the demo: it is a token which is limited to use only from Italy and Poland, from AS number 137, which is INFN, restricted to the brain-scan and demo paths, and read-only. And these are my connected examples. This is my screen; I will switch to the demo. I mounted the Onedata virtual file system twice: once with my full token, so I can see everything that is inside my ecosystem, exactly the same as what you have in the graphical interface; and once with the limited token, in a folder called limited space. This is the token which limits me to read-only access and to only the demo and brain-scan paths from a single collection, and now, when I enter it, as you see, I can only see a single collection instead of all my collections as before. When I enter this data collection, I do not even see all the directories and files; the view is limited to these path prefixes. So when I go into the demo directory, I see the files, and when I want to cat the files, that works fine, but when I want to remove the files, it is permission denied, because it is read-only. This gives a lot of extra flexibility in creating tokens, and makes the processing and the delegation of tokens much more secure and controlled. Of course the expiration time is also important, and if I decide that a token is not to be trusted any more, I can just go there, revoke it, and save; as of that moment, within a few seconds depending on the situation, the token will be rejected and cannot be used at all. So I just revoke the token if I find out that something is probably wrong.

The next thing I wanted to mention, as a final, very quick point, and to address the question from Tony, is a bit about data discovery and the integration with Elasticsearch. Behind the scenes it looks like this picture: what was presented before was only this embedded discovery portal, which is part of our Onezone. This embedded discovery portal is connected to Elasticsearch, but Elasticsearch is fed by the metadata gathered from several collections; the collections are gathered into data spaces, which are supported by several providers, and the data collections were pumped in from the external system.
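The caveat-based tokens demonstrated above (country, AS number, path, read-only, interface restrictions) could be requested along roughly the following lines from the Onezone REST API. The endpoint and the exact caveat payload format are assumptions sketched from the caveat types named in the demo, not copied from it.

```python
import requests

# Assumed Onezone endpoint -- illustrative only.
ONEZONE = "https://onezone.example.org"
FULL_TOKEN = "<full-power-access-token>"

token_request = {
    "name": "limited-demo-token",
    "type": {"accessToken": {}},
    "caveats": [
        {"type": "geo.country", "filter": "whitelist", "list": ["IT", "PL"]},
        {"type": "asn", "whitelist": [137]},       # AS number mentioned in the talk
        {"type": "data.readonly"},                 # any write attempt -> permission denied
        {"type": "data.path", "whitelist": ["<base64 path to /demo>",
                                            "<base64 path to /brain_scans>"]},
    ],
}
resp = requests.post(f"{ONEZONE}/api/v3/onezone/user/tokens/named",
                     headers={"X-Auth-Token": FULL_TOKEN}, json=token_request)
print(resp.json().get("token"))   # the restricted token, e.g. for a read-only mount
```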
So we have distributed data management that covers metadata as well, and there is a component we call data harvesters, which, at the level of the zone, eventually gathers the metadata from the external systems and feeds it to Elasticsearch to be used by data discovery portals. The important part is that we guarantee that the metadata are not lost, and we also guarantee consistency, even though the complexity of the distributed environment and of the metadata flowing along this path is enormous, and at large scale; we solved a lot of problems to keep consistency between the data and metadata management systems and to deliver everything into Elasticsearch. And it can be one Elasticsearch or multiple Elasticsearch instances, depending on the configuration, so you have the flexibility, in the zone, to define different harvesters which pump metadata from different data spaces into different Elasticsearch instances. That is where I would like to conclude my presentation, so as not to run over my time. So this is the time for questions, if you are interested; that was a very quick summary of what we did during XDC and how Onedata looks now. Thank you very much.

Thank you, Łukasz, for this nice live demo. I don't see questions in the chat; I am just cross-checking on the Slido. I don't see any specific question for Łukasz on the Slido either. Is there anyone who would like to add comments or questions, either on Łukasz's demo or for the other presenters? Okay, there is a question, I guess for Łukasz first, and then we can also answer as the project as a whole: are the quality of service rules based on metadata?

This is a very good question; I was not very precise in my presentation. The answer is that the quality of service rules are based on metadata, but metadata attached to the storages. If you have a distributed system like ours, what you have is a bunch of providers; each of those providers has different backend storage systems, and a provider can have multiple storages behind each of those dots. In our system, if you look at the cluster configuration, for instance for Bari, you can see on my screen that there is metadata; one moment, I need to connect there. There is a metadata matrix specifying the expressions that may be used in the quality of service rules for that storage in Onedata. In this particular case, in Bari, we have only one storage, which is POSIX; it is a POSIX storage imported from an external file system, as I told you. And there are quality of service parameters: the storage type is POSIX, it is external (true), and it is located in country Italy. Of course, the language of expressions can grow; it is a matter of the vocabulary defined among the players using Onedata in a single environment, a single ecosystem. We have ideas for rules on data latency, data durability aspects and so on, so you can express: I want things in Italy on storage with high durability, or I want things in a specific Polish data centre on storage with high throughput, because I will be processing the data intensively in the next days. So yes, this is metadata, but connected to the storage and ecosystem configuration. That is the answer, and it gives you flexibility, based on complex rule expressions that you can put into the system, to make the whole thing work as you want.
This is inspired, as I told you, by the Rucio work as well, so it is quite similar. In the case of Lisbon, for instance, we have a few storages there: one of them is an S3 storage, located in Portugal, and the second one is S3 as well. So basically you can have complex combinations of these storages: you have providers, the providers have multiple storages, and on top of that these storages support your spaces, and thanks to that you can use expressions which are complex.

Thank you, Łukasz. I would add to what Łukasz already said that obviously the same behaviour is, more or less, achievable with the use of Rucio, as Łukasz was quickly recapping, and this means that whether you have the full Onedata solution or you are exploiting just a few of the XDC services, you can get similar behaviour, orchestrating the data based on a specific set of metadata and on rules built on that set of metadata. Still on the topic of metadata, there is a question from Marc Portier. I will try to recap it: do any components in XDC work towards extracting and publishing semantics added to the data or metadata? More specifically, no elements from the W3C vision on the Semantic Web or LOD, like LDF or LDP, seem to be in the scope of the project; are those foreseen in future versions? Łukasz, I know there are some things about metadata and formats that you can show.

Yes; in our system, and I think I am still sharing the screen, what you have is not exactly just data: you may have a data object plus metadata, and we currently support three levels of metadata: key-values, so a list of key-value pairs; JSON, which is an object-like structured thing; and RDF, as XML, attached to the file. So you have the files, the data or data objects, plus a bunch of metadata connected to them, and the metadata gives you extra features for data discovery, for data processing, and for accessing and managing this part of the ecosystem. What else we do with metadata which might be interesting: we support a thing called sharing, data sharing. This is not actually part of the XDC project, it was part of EGI-Engage, but it is connected to all these things. You can share a collection, or part of a collection, which gives you a publicly available directory, and this can then later be published using a DOI, if you have a proper agreement with a DOI or PID system connected to your environment. In my demo I don't, but, for instance, the EGI DataHub from EGI-Engage has handle services connected there. So you can have a public environment, completely available to users around the world, and then you publish data and, as well, metadata. Plus, while you are publishing this, while we are minting the DOI, there is a process of publishing the information, including the metadata connected to the collection, which is later also exposed through the OAI-PMH interface for data harvesting and data discovery. So we deliver you a whole ecosystem where you manage your data for processing, manage the metadata for data discovery, internal or external, and also publish the data results in the form of open data, which is again connected with data and metadata. So this probably only partially answers the question; we do not cover 100% of these aspects yet.

Yes, just to add a small comment on this.
Obviously, in the project there are also several communities that manage the metadata space by themselves, keeping the metadata part in dedicated services based on the community's own expertise, and we are well able to onboard and use those solutions and let them just use or leverage our services for the data management and the data access, etc. So it really depends on the needs of your community and the way you decide to handle those two aspects together, the data and the metadata. There are also examples from the CTA community, which is using Onedata to leverage the ability to automatically watch for new data, look at the metadata and pre-process those data, in such a way that the metadata are automatically registered into the Onedata solution by default as the new data appear. So it is flexible enough that, in the end, it really depends on your real requirements as a user community, and we can try to fit the right solution to your requirements.

I see another question, from Karl Friedrich: do you have any thoughts on how to keep registered authorized users up to date, for example when they change affiliation, etc.?

This is a very, very good question. This is tightly integrated with the IdP, so it works like this: when I log into the system, we gather, through the extended attribute protocols of OpenID Connect, a bunch of extended attributes which tell us what the current group membership of the user is. If I am kicked out of some of those groups, the membership will automatically be removed from our system during the next token verification and authentication. Usually it takes several minutes, of course it is not instant, but we renew this information from the external system, and the next time you will no longer be a member of that particular group; for instance, if you cannot access a group any more, those groups will disappear from my list. I have a gigantic bunch of groups here because I am a member of many, many groups in EGI; I do that on purpose, but it shows the tight integration of our system with the external IdPs. This is crucial for data management and access control, because I can use these groups for different levels of membership control as well as for access control rights: we have permissions here based on access control lists, which allow me to use these groups, or individual users, to define very precise, NFSv4-ACL-style permissions on a particular file or a particular data collection. So this is multi-level: the first level is what I can do in the data collection as a whole, as I showed with the membership, and then individual files as well. And this is tightly connected to what I gather from the external IdP: it flows through the group memberships here, and the groups, or individual users, can then be used for precise data management access control.

Yeah, if I may add a bit about the authentication and authorization part as a platform-wide solution: we use INDIGO IAM for most of the services. In IAM you can easily link different IdP entries to the same identity, so you can join your Google account together with your home IdP. This gives you the possibility to map different IdP logins to the same real identity, and then you can use that same identity on several services and instances, as you are seeing in this demo.
We are also using IAM as an IdP for this Onedata instance, for example; this is already doable for many other services in XDC and beyond, so this will also help us dig into this very interesting problem indeed.

Yeah, so while you were explaining the integration with the IAM IdP, I just showed you a login with a different IdP. With that different IdP I have a different identity, and this different identity is a member of a different subset of the groups, and these groups have limited access. If you look at my screen now, I only see a single data collection, and I am a member of this data collection mainly because I am a member of all-users; this is why I see this group, this data collection. I am a different Łukasz Dutka than before, and I see that part because I am a member of the other collection through other groups, and this is why I am seeing it and the others are not there. So this is tightly connected to the IdPs, to the different IdPs. We support INDIGO IAM as a first-class citizen, inherited from INDIGO, but we also support EGI Check-in from EGI-Engage and a few others, including the ELIXIR IdP, with the extended attribute system.

Okay, sure. Mário David is asking where the Oneprovider in this demo is hosted. Yes, this is a private deployment, unfortunately not at LIP, but I would like to get some resources from LIP for making my demos there; so no, it is not there. Okay, so next time, Mário, please, some VMs for Łukasz, borrowing a little from the other project. Can I make a comment? Yes, please. So indeed we have had a Oneprovider at Lisbon for one and a half months for the CdataNet project, but I was not able to instantiate it and there were some errors; this is one thing. So indeed we have a Oneprovider at LIP, in the INCD. So it's not that we don't want to have a Oneprovider; nobody asked beyond that. No, no, no. If you had any trouble, of course we will help you. The demo is on the newest release, 20.02; it is really fresh, and I am using a demo environment controlled just by myself. There is no production data behind any of this demo infrastructure, it is just for distributed demonstration purposes. But of course for production we will be happy to help you deploy these things at production scale.

Okay, so thank you all. Unfortunately we have reached our time limit, so in these last minutes I would like to ask you again to fill in our poll on Slido, and please, if you leave us your contact, we will get back to you, trying to answer your questions or provide any kind of feedback, in order to get you more connected to the XDC project. We already see a few questions and comments that are very useful for us, but please keep providing as many comments and as much feedback as you can; we will appreciate it. Daniele, if you want to add something at the end.

I wanted to advertise the poll and survey again, but you already did it. Just let me thank again all the participants in this call and also the organizers of the entire EOSC-hub week for hosting us and giving us the possibility to show some of our services and XDC developments. And thank you all again. I think we can now close the session just in time. Indeed. Rob, how does it work, should we close the meeting or should you do it? Okay, that's fine; we can move on to your next session now. Meanwhile I'll push everyone to the waiting room again to prepare for the next session. Thank you. Thank you. See you in the afternoon.