 from INFN. I'm chairing this session about the EOSC-hub technical workshop. This is a long-standing activity started last year in June with the first technical workshop we had in Amsterdam, physically, at that time, luckily. This time, unfortunately, we are not able to run this in person, so we decided to have a different scope for this second round of the workshop, and the main objective today is to try to get all of you on board with the activity we have done in these first months, trying to provide you with some details on the more advanced technical specifications that we have already provided; they are already public for you to read and eventually comment on. The agenda is very packed today, so I will give the floor very soon to our first speakers. The idea is that for each presentation you may obviously ask questions, but we will hopefully leave at the end of the session a kind of interactive slot for questions and answers on the overall work that we have done. So if you don't have any specific comment or question, I will pass the floor very quickly to Marica and then Enol Fernández, who are the first ones in our agenda today. They will talk to us about the cloud compute and orchestration technical specifications. Enol comes from EGI, I guess that most of you know him already, and Marica comes from INFN and has been working on orchestration and cloud computing for years. So please, Marica and Enol, the floor is yours. Hi everyone, so this is Enol, I hope you can hear me fine. I'm sharing my screen with the presentation. Can you see the slides? Perfect, Enol, and I can hear you very well. Okay, thank you very much. So as I was saying, this is a shared presentation between myself, Enol Fernández, working for the EGI Foundation, and Marica, who is working for INFN, and we will cover two of the TCOM areas, which are cloud computing and the orchestrator.
So I have this first slide to put things into perspective. We have the cloud compute, containers and orchestration technical area that covers the lower layers of this diagram, and here we have three specifications. The first one is the Infrastructure as a Service virtual machine management. Then we have the Infrastructure as a Service container management, so that's the second one. We have, ideally, a set of providers delivering this kind of functionality over a distributed infrastructure. And on top we have the third technical specification of this area, which is the Infrastructure as a Service orchestration, which deals with the management of these providers in a joint, coordinated way. And then on top we have the area that Marica is leading, which is the PaaS solutions technical area, where we have the Platform as a Service orchestration. So I will go into the first three and then Marica will take over and explain her area. I will go quite quickly because the time we have is quite limited; it's a lot of information, but I tried to summarize it as much as possible, and I hope that if you have any questions we can discuss them later. So starting with the first one, the cloud Infrastructure as a Service virtual machine management. This is basically providing access to computing resources as virtual machines via an API. So it's a cloud service that delivers virtual machines, and in conjunction you normally have block storage, networking and this kind of auxiliary things that help you to have persistency and connectivity for the machines. So here we have a scheme of what we expect.
We have a system that has an Infrastructure as a Service API with which you manage different VMs that get started from a set of virtual machine images and get associated with some block storage. Normally you have a virtual machine management user that is the one creating this infrastructure, and at the other end some platform or end users that will interact with whatever these people have deployed here. Focusing on the interoperability for this specification, we have three big areas. The first one is having API-based access. We prefer open or standard APIs. There are several APIs here, proprietary and open, and the idea is that the more open you are, the better. A clear example: OpenStack has a quite open API, so that's preferred. Whatever APIs are used by the system should be supported by the orchestrators in the higher layer. Actually, that means almost anything is supported, because we have quite good support for both commercial, proprietary APIs and open APIs in this layer, so almost anything is allowed. One thing to note: having just a graphical dashboard is not enough for interoperability; you really need to have API access to the system, otherwise it's not interoperable with others. Then, on the AAI, of course, you need to be compliant with the EOSC-hub AAI. That basically means, in this case, being compliant with OpenID Connect and following all the recommendations that the guidelines have. And then we have a third area, which is federation. Here we expect that the system provides virtual machine image management and allows synchronizing images from the external world, so you can have the same virtual machine images everywhere. We also expect that you are able to expose the usage of the infrastructure to the EOSC-hub accounting, following a set of standard records that we have for exposing this information.
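To illustrate the API-based access discussed above, here is a minimal sketch of the request body an OpenStack-style "create server" call accepts. The UUIDs are placeholders, not values from the specification; a real client would also need an authentication token.

```python
import json

# Minimal OpenStack Compute (Nova) "create server" request body.
# The image, flavor and network UUIDs below are placeholders.
request_body = {
    "server": {
        "name": "analysis-vm-01",
        "imageRef": "IMAGE-UUID-PLACEHOLDER",
        "flavorRef": "FLAVOR-UUID-PLACEHOLDER",
        "networks": [{"uuid": "NETWORK-UUID-PLACEHOLDER"}],
    }
}

# This body would be POSTed to /v2.1/servers with an auth token header;
# here we only serialize it to show the wire format.
payload = json.dumps(request_body, indent=2)
print(payload)
```

The point is interoperability: because the request is a documented JSON structure over HTTP rather than clicks in a dashboard, higher-layer orchestrators can drive the provider programmatically.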
And we have implementations of this kind of server system in EOSC-hub, mainly OpenStack and OpenNebula, which are the main ones installed here. Moving to the container management. Here what we expect is an API-based management of applications composed of containers. We expect that you have some kind of orchestration of the containers, and we have several systems here: the main one would be Kubernetes, but Docker Swarm or Mesos would be equivalent. The idea is that this system is able to manage resources on an Infrastructure as a Service to execute your container application as you specified it. On the interoperability side, there are several implementations of this kind of features, all of them mutually incompatible, so there are no clear guidelines on what's the right API to use here. One thing is clear: Kubernetes is becoming the de facto standard, so that's the thing that everyone is implementing, and it's useful to take into consideration. Just a few more comments in this category: the services should support the OCI image and runtime specs — that's the Open Container Initiative, which has the specifications for the images and for how to run the containers. The service should support the EOSC-hub AAI as before, and if the service can manage the underlying Infrastructure as a Service, it should support the APIs of that layer, mainly OpenStack. Going to the last one, the cloud infrastructure orchestration. In this case, what we have is a system that is able to manage Infrastructure as a Service resources as code. You have an infrastructure description, you pass it to the orchestrator, and this tool will manage the resources on the different cloud providers. Very quickly on the interoperability — I changed the format a bit here — we don't have any common APIs available; it's similar to the container case, there are several systems and no clear winner. But for the infrastructure description there is a clear standard, which is TOSCA.
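Since Kubernetes is called out above as the de facto standard for container management, here is a minimal sketch of the kind of declarative specification it accepts, built as a plain Python dict. The container image and names are arbitrary examples, not part of the specification.

```python
import json

# Minimal Kubernetes Pod manifest as a Python dict; the container
# image is an arbitrary example.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "demo-pod", "labels": {"app": "demo"}},
    "spec": {
        "containers": [
            {
                "name": "main",
                "image": "nginx:1.25",
                "ports": [{"containerPort": 80}],
            }
        ]
    },
}

# A client would submit this (as JSON or YAML) to the Kubernetes API
# server under /api/v1/namespaces/<ns>/pods; here we just serialize it.
print(json.dumps(pod_manifest, indent=2))
```

Note that the image reference (`nginx:1.25`) follows the OCI image conventions mentioned in the presentation, which is what makes the same container portable across Kubernetes, Swarm or Mesos backends.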
Of course, we would like to have everyone compliant with the AAI and, for the APIs towards the lower layer, OpenStack is the recommended one, but others are very much welcome. So now I pass the lead to Marica. Marica, I think I can keep sharing, if you don't mind, and you just tell me to go on. Yes, thank you very much. Okay, so the PaaS orchestration layer is an abstraction layer at a higher level, as Enol already anticipated, on top of which we can build user interfaces and other client interfaces that are provider-agnostic. The PaaS orchestration is in fact able to abstract the IaaS environments and the underlying IaaS resources, leveraging the automatic discovery of the capabilities and service availability of cloud compute and storage providers. So it enables seamless access to different computing environments, both private clouds, like OpenStack, OpenNebula, et cetera, and public clouds like Amazon, Azure, Google, and also HPC sites. It features advanced scheduling capabilities — I mean smart scheduling, taking into account the data location when dispatching the deployment requests, or special hardware requirements like GPUs, FPGAs, InfiniBand, and so on. It also provides some advanced mechanisms for addressing deployment failures, rescheduling the deployment to another site. It also supports complex workflows involving data orchestration and movement, providing policy-driven data management, handling quality of service for data replicas, and so on. And it also allows managing the provisioning, configuration and scaling of cloud resources, supporting hybrid deployments and network orchestration, in order to allow the seamless connection between different geographically distributed sites, creating, for example, overlay networks.
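To make the TOSCA infrastructure description mentioned above concrete, here is a minimal hand-written sketch of a template. The node type and property names follow the OASIS TOSCA Simple Profile in YAML; the sizes and the node name are arbitrary examples.

```yaml
tosca_definitions_version: tosca_simple_yaml_1_0

description: Minimal example - a single compute node

topology_template:
  node_templates:
    my_server:
      type: tosca.nodes.Compute
      capabilities:
        host:
          properties:
            num_cpus: 2
            mem_size: 4 GB
        os:
          properties:
            type: linux
            distribution: ubuntu
  outputs:
    server_ip:
      value: { get_attribute: [ my_server, private_address ] }
```

An orchestrator receiving such a template is free to satisfy it on any provider it supports, which is exactly the provider-agnostic behaviour the specification aims for.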
And finally, it also provides on-demand application deployment, allowing the user to start long-running services in Docker containers or to run processing jobs, both on container orchestrators and on HPC sites. So we can move to the next slide, where we sketch the envisaged reference architecture of this technical specification. We envisage an architecture that is basically composed of different plugins. We have some core components: the API server that manages the input requests, a workflow engine that manages the complex deployment workflows, and the message bus that integrates the different plugins and the interactions between the components of this architecture. Then there are the plugins, which are the most important part, as they provide the integration with the different IaaS environments. In this reference architecture we have reported four different kinds of plugins. The plugins for connecting to the cloud infrastructures and cloud management frameworks, using the cloud-native interfaces, like those for OpenStack, OpenNebula, and the public clouds. The container orchestration interfaces, so that the PaaS layer can provide an abstraction layer on top of the container orchestrators that Enol presented before, for example Kubernetes, Mesos, and so on. We have the HPC connectors for integrating HPC environments, and here we have mentioned, for example, the QCG APIs and the SLURM APIs. And finally, we have the storage services plugins, for the data management services and the data orchestration services. We have put some reference components, for example Onedata, Dynafed and dCache for the data management services, and Rucio and FTS, which are at a higher level because they provide data orchestration functionalities. And, of course, the PaaS solutions also have to integrate with some federation services that are highlighted on the left part of this diagram.
These are basically the AAI, the monitoring, the information system and the marketplace. The AAI provides the authorization and authentication part for the whole stack, while the monitoring, the information system and the marketplace can provide useful information to the PaaS components in order to implement the scheduling and federation capabilities that are exposed by this technical specification. So, moving to the next slide. Concerning the standards, we have identified three main standards and APIs. TOSCA, the Topology and Orchestration Specification for Cloud Applications, defines an interoperable description of services and applications hosted in the cloud and elsewhere; we are using this also to describe the jobs and Docker containers running on top of orchestrators like Mesos and Kubernetes, and since TOSCA can be easily extended, it can be a point of strength for interoperability. Then OAuth 2.0 for the authentication and authorization framework, and REST for implementing the APIs. Currently there is no official standard for the PaaS orchestration APIs, and for this we are proposing as a reference API the one implemented in the INDIGO PaaS Orchestrator. In this slide you can find the link to the documentation of these REST APIs, which basically provide the functions and methods for managing TOSCA deployments: how to deploy user deployments, how to create or update a deployment. As an example solution, we have reported the PaaS Orchestrator from the INDIGO-DataCloud project, with links to more information. This was the last slide, thank you very much. Okay, so thank you both, Enol and Marica. There are quite a few questions in the chat. Trying to summarize: there was one about Kubernetes, but Enol already answered it during the presentation, so that's fine, together with the questions related to the production level of this PaaS — and this is indeed already at production level, as Marica mentioned.
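To give a feel for the TOSCA-deployment REST API described above, here is a rough sketch of a deployment request a client might build. The field names, endpoint path and wrapper structure are illustrative assumptions for a TOSCA-based orchestrator, not a verbatim copy of the INDIGO PaaS Orchestrator contract; consult the linked API documentation for the real one.

```python
import json

# A TOSCA template (trivial placeholder content) plus input parameters,
# wrapped the way a TOSCA-based orchestrator REST API typically expects
# them. Field names here are illustrative assumptions.
tosca_template = """\
tosca_definitions_version: tosca_simple_yaml_1_0
topology_template:
  node_templates:
    server:
      type: tosca.nodes.Compute
"""

deployment_request = {
    "template": tosca_template,
    "parameters": {"number_cpus": 2},
}

# The client would POST this JSON to something like
#   https://orchestrator.example.org/deployments
# with an OAuth 2.0 bearer token in the Authorization header.
body = json.dumps(deployment_request)
print(body[:60])
```

This matches the three standards named in the talk: TOSCA for the payload, REST for the transport, and OAuth 2.0 for the credential carried in the header.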
There is a question from Pavel about the integration with the marketplace: we are mentioning integration with the marketplace, so can you provide a bit more detail? Yes, Marica, I guess. Yes, the idea is that the PaaS orchestration can leverage some information coming from the marketplace. For example, if the orders and the SLAs, the service level agreements between the users and the sites, are exposed through some APIs of the marketplace, this information can be used by the orchestration tool in order to schedule the user request to the right site. This is something that is under discussion in the EOSC-hub project, and we hope to come to some conclusion soon. Okay, so this is still work in progress. About HTC integration: how does the solution you showed fit with DIRAC? I guess that we have some integration with HTC and HPC clusters; this work is carried out in the DEEP project. Yes. Yeah, please. Yeah, we have a proof of concept inside the PaaS Orchestrator that has been implemented in the framework of the DEEP Hybrid-DataCloud project. We managed to use the same PaaS interfaces to send HPC jobs to an HPC site, using a sort of gateway so that the job is forwarded to the batch system at the HPC site. Okay, I guess you are breaking up a little in audio, but anyway, this is, I would say, a different solution that somehow provides a similar function to the DIRAC case, but at another level — I mean, much more oriented to Docker deployments and cloud-like interaction with HPC. Okay, there is a question about the data integration: how does the PaaS Orchestrator fit, or could it fit in the future, with iRODS? I don't know if Marica is still online, but anyway, the general approach is plugin-oriented, so we already have some plugins in this orchestrator that support some solutions, Onedata and the others.
So basically, a development effort to integrate a new plugin for iRODS could indeed fill the gap; that's a good point, as soon as we see use cases that need it. Yeah, sorry, I'm online. I didn't get the question because my connection today is unstable, unfortunately. But the idea is that we have an open architecture that can be easily extended through plugins: as soon as there is a REST API specification, it can be easily integrated with the orchestration layer. Just a comment from my side: iRODS is very technology-specific, so you can mitigate this via plugins that implement a specific interface, but from an architecture point of view we have to come back to the standards, to see what is best supported via the open standards, one way or the other. So I would focus more on the standards than on the technology-specific interfaces. Okay, that's a good comment. And still related to standards, there is a comment, I guess this time about the interface on top of the orchestrator, about the possibility to investigate Hydra CG as a machine-discoverable standard for doing this. Obviously, thanks for all these pieces of feedback; this could be part of the work that we have to carry out in the future. That's exactly the reason why we are showing this to the public: to gather comments and feedback and try to improve the standards approach we are following and the compatibility with the other services. Yeah, sure. And for this, maybe it would be useful to collect these comments; I will try to do this for all the sessions. It's quite hard because, fortunately, there are a lot of comments — that's a good thing, though it obviously makes my life a bit harder, but I like it. Okay. There is a question about Kubernetes: Kubernetes is a node in this stack; however, it makes more sense that providers run Kubernetes on bare metal and offer Kubernetes as the orchestrator, so why not make it part of the orchestration layer as well? Okay, this is a very technical question.
In principle, we do this for Mesos already: we are able to leverage what's already there and just deploy the Docker containers on top of the already available container orchestrator. But yes, that's a good point, and this is somehow already covered in some other cases. Yeah, because the idea is that the user should submit to the orchestration layer a request for deploying a Docker container, and then the orchestrator should select the best Docker orchestrator at the lower level, which can be a Kubernetes cluster, a Mesos cluster or something else. So the user shouldn't take care of the underlying technology: they should simply describe the request in a TOSCA template, so in a standard way, and the PaaS orchestration layer takes care of the underlying technology, translating the TOSCA request into the specific JSON request for the underlying container orchestrator, be it Kubernetes or Mesos. Okay, we have to go quite fast here because there are other presentations, so I will try to recap the other comments very quickly. This is not only a reference architecture — a comment from Mark — this is indeed already a production implementation. It's not a central one, but it could be dedicated to a specific organization or community, at production level or for testing: the deployment model is very much up to the users' specifics, and it is not centrally hosted somewhere, but it could be operated for a research infrastructure or whatever, so the deployment model is quite flexible. I agree with the comments related to iRODS: it is a technology, not a standard. iRODS is a widely used technology, but it is specific, as all the other technologies are. So that's a good point: we can add it too, but like the other technologies.
Okay, so I will very quickly go to the next point in the agenda, which is HTC and HPC by Ignacio Blanquer. Ignacio, I guess that you are online? Okay. Yes, perfect. Let me share my screen — sorry, because I have to give permission to Zoom, I didn't realize this before. Sorry, now I'm back; I had to restart, so I think now I will be able to do it. One second, sorry for that. It seems that I am not able to share the screen; it says that the host has to enable this feature for me. Yes, Andrea, if you can enable screen sharing. Sorry for that, I didn't realize that in this instance, or in this version, I couldn't do it. Yes, now I can. I hope that you can see my screen; sorry for the delay. So basically, what I would like to comment here very briefly is the work that we have done in the context of the high throughput computing and high performance computing technical committee. First, let me outline what I want to present: first, how we gathered the requirements and their sources, and then the results we have produced so far, which basically are three technical specification documents addressing two main problems: the multi-tenant job submission, including containerized workloads, and the on-demand HTC/HPC clusters. So first, the specifications have been prepared according to the feedback that we have received from our user communities. We started having some discussions with them, exchanging documents, and we asked them to explicitly provide a set of requirements. We clearly identified communities expressing the need for high throughput computing; few explicitly mention HPC, but from the information we can deduce that there are HPC requirements within the HTC ones as well.
Basically, we grouped the requirements provided by the communities and targeted at HTC/HPC as: the need for the deployment of specific clusters for data analytics, the integration of job submission with cloud computing resources and HTC resources, and the management of workloads that are based on containers, as well as additional things like the support of Jupyter notebooks, reliability and availability. So, I would like to point out that we have studied two major macro-features: one dedicated to multi-tenant platforms, which is the capability of submitting HTC/HPC workloads on those infrastructures, and the second, the deployment of single-tenant customized infrastructures, which are clusters for the researchers. On the left side, our main concern was the transparent management of the allocation of resources, because we have existing resources, which are the HTC ones, and new resources that are deployed on the fly, which are based on cloud computing, plus the management of software dependencies. Similarly, in the part dedicated to on-demand clusters, we focus also on the elasticity and the customization, including the support for specific hardware.
Then I will try to focus a bit on the specifications that we have prepared. With respect to the multi-tenant job submission, the effort has been on minimizing the transition from the execution of local workloads, on local computers, to the distributed environments. We identified needs at two levels: at the level of the services that implement this part — resource matchmaking, application delivery, management and reproducibility — and also with respect to the whole ecosystem, in a coherent integration of the authentication and the data access. We explained this in two technical specification documents that are available there, and we would be very grateful if you could read and criticize them. Very briefly, with respect to specifications and standards, for this kind of services we identified four blocks: those related to the submission of jobs in existing, typical high throughput computing services, like those related to computing elements or to batch systems, which basically deal with the specification and management of jobs; those related to the authentication, as also explained before by Enol and Marica; those related to the cloud services; and those for the containers, in which we have clearly two main points — the Docker-like ecosystem and its variants, and the other component, related to container management, that deals with user-level execution, like Singularity, and its specification. So, very briefly, I think this is a very well-known architecture: our candidate implementation for this is the EGI Workload Manager, DIRAC for EGI, where the main point has been to ensure the management of cloud resources on the fly when dealing with these heterogeneous backends. At this point, what is also interesting to consider, and this is something that is quite demanded by the users, is the management
of containers in these workloads, which points to a couple of important issues: the management of this kind of workloads on the existing backends for processing, which requires some changes, and also the need to manage the creation of customized images, as I will show later. So, very briefly, the highlights of the integration that we depict in these technical specifications for this case refer to four points. Authentication and authorization: basically, to have coherent authentication mechanisms and to deal with long-living jobs, which sometimes require the renewal of credentials. The information systems: not only for discovering the HTC endpoints, but also for having information about the health and the status of the capacity. Managing the different types of job scheduling: the native, let's say batch, systems that are deployed on the fly on the cloud services, and also the management of the computing-element-like endpoints that are available. And data management, which requires important capabilities for specifying the data distribution. What is a little more interesting, or a little more novel, in this part is the consideration of container services, because it poses a higher complexity in the case of existing infrastructures: the support of containers granted by the site administration is typically limited to those technologies that enable running containers in user space, due to the complexity of multi-tenancy and the risk of privilege escalation that containers carry. Another important point, in the technology selected by the project, is the use of a container management system like udocker, which permits the
user to use containers even if the support is not given by the provider. This also poses the complexity of managing container images that are tailored for the backends, because we know that there are some limitations in the use of advanced hardware devices, like networking or accelerators, and potentially we identified the interest of having mechanisms to automatically create those containers, based on specifications, to ensure that we get the most appropriate base images, and also to extend the job submission system to deal with these tasks. Very briefly, the second part, which I think also addresses some of the comments that we have seen in the previous presentation, is the capability of deploying HTC/HPC clusters on demand. It's similar to Amazon EKS, but extended not only to Kubernetes but also to different workload managers. Sorry, I think that someone forgot to mute. So this complements the previous use case, as the environment in this case is isolated and single-tenant, and the important thing is that it could ideally be self-managed, in terms of adjusting the capacity to the actual workload, and tailored, again, to the backends. Again, we have one specific technical document in the same folder, and we would be very happy if you have a look. Very briefly, we add one specification that has already been mentioned by Marica and Enol, related to the description of the clusters, using the OASIS TOSCA to describe the topology of the cluster and the associated software dependencies. And this is the very high-level architecture view, in which you clearly identify the two roles: the person that deploys the cluster, and the person, who could be the same, that uses the cluster through the job management system.
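The user-space container execution with udocker mentioned above boils down to a short command sequence. The sketch below only composes the typical commands as strings, without executing anything; the image and container names are arbitrary examples.

```python
# Typical udocker command sequence for running a container entirely in
# user space, without root privileges or a daemon. The image and the
# container name below are arbitrary examples.
image = "ubuntu:22.04"
container = "myjob"

commands = [
    f"udocker pull {image}",                       # fetch the image layers
    f"udocker create --name={container} {image}",  # extract into a local container
    f"udocker run {container} /bin/echo hello",    # execute in user space
]

for cmd in commands:
    print(cmd)
```

Because none of these steps needs elevated privileges, a job can carry them out on a shared HTC worker node even where the site does not install container support itself — which is exactly the gap described in the talk.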
In this case it's important to note that, if elasticity is provided, there should be some logic and some capability of acting on behalf of the user for extending and shrinking the cluster within the infrastructure. The final thing is the integration highlights on this point. We have one candidate implementation already in the EGI Applications on Demand, which is the EC3 portal, which manages this and could be extended, and also an extension in the ENES framework that is called ECAS. In this case we need to integrate with the AAI, not only to have a coherent management of the authentication, but also, which is very important, to provide the capability of shrinking and enlarging the clusters on behalf of the user during the long lifetime of the clusters. The interaction with the information system is especially critical, because cluster and cloud providers may not have all the base images that could be needed, or could have different flavors, and that is something that should be dynamically retrieved from the information system to avoid failures; and of course there is the integration with the compute cloud services. Just to end, I will say that we will be very happy to get more feedback on the documents, either through JIRA, or through Confluence if you have access to it, or by directly emailing me. That's all from my side, thanks. Thank you. I see a lot of activity in the chat, but if I'm not wrong there are no specific questions on this topic at the moment, so I will ask everyone: is there any specific question that could be answered online? I guess not, and I don't see raised hands or questions in this case, so everything is clear. Thank you again. Next in the list are Mark and Heinrich; this should be the last one, as anticipated at the beginning. So who will start? I will start — let's see where I can share my screen, I think it should be this one.
Okay, perfect, we can hear you perfectly and see the slides, so that's perfect, thank you Mark. I will go full screen so everyone can see my slides. Yeah, it's still visible, that's perfect. Okay, I share this section with Heinrich, and within this section we present definitions about building blocks: I will talk about the digital repository, and Heinrich about metadata and metadata aggregation. So I will start with the digital repository. The way of defining a reference architecture is to define a building block, to describe it, which types of functionality are provided by such a building block, and which API standards should be used. Within a data repository, many components and much functionality come together, but to define it: a digital repository is an infrastructure component that is able to store, manage and curate digital objects and return their bitstreams when a request is issued. This definition is based on information provided by the RDA Data Foundation and Terminology working group, so it is a broadly discussed definition of what a digital repository is. If you look at the other definitions of how to handle the data, they also follow from this definition: data is organized in digital objects; a digital object is represented by a bitstream, is referenced and identified by a persistent identifier, and has properties that are described by metadata. Digital objects can also be aggregated into digital collections, and a digital collection is in principle a complex digital object, which again is identified by a PID and described by metadata. And to define metadata: metadata contains descriptive, contextual and provenance assertions about the properties of a digital object and/or a digital collection. Below you can find the reference information about these terminologies which has been developed within the RDA. But if you put this into a diagram, then you have many different components
All of these are connected, or their functionality provided, by the digital repository. So this diagram provides the high-level architecture. A digital repository provides an interface for humans but also for machines, so that you can access data through APIs and not just through a website. For a digital object you should be able to upload and download data, and you should provide metadata descriptions. Persistent identifiers should be registered for the data objects; a collection can have PIDs for the landing page but also for the individual objects which are uploaded. Of course you need an AAI interface, because users should be able to upload data; open-access data can be downloaded without logging in, but not all data is open access, so people should also be able to log in to the system to get access to the data when needed. In general, the data and metadata stored within a digital repository are made available to all kinds of search engines, and therefore a digital repository frequently provides an interface where metadata aggregators and search engines can harvest the metadata to feed their own interfaces; this will be described by Heinrich. A digital repository itself mostly also provides a search interface, so that you can find information within the repository. Then there are other types of functionality related to a digital repository: how do you describe data, what kind of metadata is to be used? This is frequently community-specific and should be flexible, but then how is the metadata described when it is harvested? There are different aspects to this. You also have to think about licensing: what kind of licence should you attach to data? That strongly depends on what kind of data it is, but also on what kind of access is provided to the data.
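The harvesting interface mentioned above can be illustrated with a minimal parser for an OAI-PMH ListRecords response. The sample response below is fabricated and simplified for the example (a real reply wraps the Dublin Core fields in an `oai_dc:dc` container and includes resumption tokens); the parsing itself uses the real OAI-PMH and Dublin Core namespaces.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH responses and Dublin Core payloads.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def parse_list_records(xml_text):
    """Return a list of {identifier, title} dicts from a ListRecords reply."""
    root = ET.fromstring(xml_text)
    out = []
    for rec in root.iter(OAI + "record"):
        ident = rec.find(OAI + "header/" + OAI + "identifier")
        title = rec.find(".//" + DC + "title")
        out.append({"identifier": getattr(ident, "text", None),
                    "title": getattr(title, "text", None)})
    return out

# Fabricated, simplified sample response for illustration.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.org:42</identifier></header>
      <metadata>
        <dc:title xmlns:dc="http://purl.org/dc/elements/1.1/">A test dataset</dc:title>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""
```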
The access is strongly defined by what kind of licence is used for the data, and you can do this differently for the metadata and for the data itself. In general, for digital repositories, data should also be curated, and this is not always a technical issue; it is more a process: how do you manage the ingest of the data, how do you curate data over the longer term, what are the steps to be done? Depending on the level of curation you can also think about certification of the service, and there are multiple levels of certification for certifying your data repository, to provide trust. If you look at this, this is the whole list of features. I notice that I have not yet mentioned metrics, and I think that is also important for a digital repository: metrics should be collected, not about the usage of the resources being provided, but mostly about how the data presented within the repository is being used, downloaded or viewed. If you go to standards, there is a large number of standards, and I think one of the processes we have to go through is to see which are the best standards to use; some are more common than others. For example, for upload and download, the SWORD standard is very common and is supported by different technologies; DOIP is an upcoming standard which is being made available but is not yet much supported within technologies, although there are different developments. But most of the standards currently provided for interacting with data repositories are very technology-specific, and then it depends on which technology you choose which APIs are provided. For the metadata descriptions, this is very much defined by the communities, so mostly digital repositories have a flexible way of defining metadata; as we discussed, the question is what are the minimum
metadata templates to be provided, and that is more about making data harvestable by search engines, for example OpenAIRE, or maybe Google. It is about adopting those standards but also giving the flexibility for communities to define their own extensions to the metadata. Next to the harvesting, and defining what minimum information you should make available through it, we also have different protocols for harvesting. OAI-PMH is one common protocol which is widely used, but it is also sometimes considered an old protocol; ResourceSync is a newer protocol which was standardized a number of years ago; and you also have community-specific protocols, such as the OGC ones coming from the geospatial community. For persistent identifiers: there are many different PIDs, and different PIDs can be used for different types of information. You can have DOIs for referencing the landing page or the data collection, Handle PIDs for referencing the individual objects, and ORCID iDs for individuals. There are many different PID systems available, but because a digital repository depends on using persistent identifiers, it should rather follow the guidelines set by the persistent-identifier building block than define PID guidelines within this description. For licensing, there are different licensing models; Creative Commons is one common family of licences which is frequently used for publications but also for data, and if you work with software there are many open-source code licences. It depends on the kind of data, but it will help to lead the users of the digital repository to select the best licence for their data, so some flexibility would be advised, and there are models for doing this. For the AAI, it is more defined by the AAI building block.
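The relation between the PID types just mentioned can be shown with a small helper: a DOI is essentially a Handle whose prefix starts with `10.`, and both resolve over HTTP through well-known resolvers. The resolver base URLs are real; the PIDs used in the test are made up.

```python
# Illustrative helper relating DOIs and Handles: both have the form
# prefix/suffix and resolve through public HTTP resolvers.

def resolver_url(pid: str) -> str:
    """Build a resolver URL for a DOI or a plain Handle."""
    prefix, _, suffix = pid.partition("/")
    if not suffix:
        raise ValueError("a PID has the form prefix/suffix")
    if prefix.startswith("10."):
        return "https://doi.org/" + pid        # DOI prefixes start with 10.
    return "https://hdl.handle.net/" + pid     # any other Handle prefix
```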
At the moment there is a big endeavour to define the EOSC AAI, and if you run a digital repository it is about looking at the EOSC AAI for the standards to use to integrate your service, for example to allow federated authentication on your system. For trust certification, there are at the moment three major certification schemes, which differ in how heavy the process is to go through. CoreTrustSeal is a very common certification scheme which is currently widely adopted, but it is also seen as the lowest entry level for providing a trust seal, while the other two schemes have stronger requirements for getting certified. For the metrics, at the moment there is a lot of discussion about what kind of metrics should be collected from digital repositories, also within the RDA. If you develop a digital repository, or use technologies for setting one up, the point is that those metrics are implemented following agreed guidelines, for example from the Open Science framework, because in this way you know that you collect the right information to show the usage of the data stored within the repository. Going through the process of looking at this, I had contact with different people and different sources of information, and I saw that there are already many different initiatives looking at defining what a digital repository would be, how you describe it, and what a functional digital repository is. One is by COAR, which has published a report on the Next Generation Repository. I have looked at this documentation; it describes very similar capabilities for the digital repository, even some more advanced ones, but it does not at the moment define the standards and APIs to use to integrate or set up the digital repository. But I think, from the perspective of defining an architecture, you
have to relate also to those types of communities which have been working in the same area, to define what a digital repository is, how you describe it, and what functionality and capabilities it should provide. Another group which has done very similar things is the GEDE group, the Group of European Data Experts in RDA, where there has also been a discussion about what a digital repository is, how you describe it, and what functionality it should provide. I think if we define an architecture for data repositories, we have to discuss and work with those groups, bring them in or collaborate with them, to define together what a repository is, what the architecture is, and which capabilities and standards are to be used. And then I will hand over to Heinrich. Hello, can you hear me well? Yes. Great. So, I am Heinrich from the German Climate Computing Centre (DKRZ) in Hamburg, and I will present here the area of metadata management and data discovery. Within this area we identified three macro-functionalities. The first is data discovery and access; this is the main aim, of course, addressed to the end users: enabling researchers to search for data resources and then access and use them. To achieve this you of course need some procedure to aggregate metadata; this is the metadata cataloguing and indexing, as I name it here. It is addressed to the data providers, to make it possible for data providers to publish metadata in a central metadata catalogue. Additionally, it is good to have something like an annotation service; this is the possibility for end users to annotate datasets, which means you can link datasets to semantic tags, to concepts in ontologies, or add a comment. This as well supports the overall aim that you
can exchange information and findings with your colleagues and other researchers. This is as well addressed to the data providers, or to ontology providers, so that they publish their ontologies and datasets can then link to them. Okay, the detailed technical specifications you can find here on this link. Okay, next slide, Mark. Here is the high-level architecture; it is a little bit complex because these three macro-functionalities are shown together. In the centre, at the bottom right, is the metadata catalogue. The metadata are harvested, through the metadata cataloguing process, from the metadata and data repositories of the communities, and different services and protocols are used here; a standard such as OAI-PMH is preferred. The central service in EOSC-hub which is doing this is B2FIND. I already read the comments and questions in the chat: yes, the approach, at least in B2FIND as the most generic metadata service, is that we support in principle any protocol, and in principle we harvest any community-specific metadata schema. The hard work is then the metadata mapping and the metadata curation, and finally indexing the metadata records in the search portal; of course you then have to map them onto the unified, target metadata schema. As far as possible we try to use, along the FAIR principles, controlled vocabularies. For example, in B2FIND, as an interdisciplinary metadata service and catalogue, we have a categorization for research disciplines, and there are specific initiatives to improve this controlled vocabulary for research disciplines; this is very important. Okay, again to the main goal, data discovery and access: along the FAIR principles this should of course be accessible by human beings, but as well by machines.
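The mapping step just described, the "hard work" of turning community-specific records into records in the unified target schema with controlled vocabularies, might look like this in a much-simplified sketch. The crosswalk, field names and discipline codes below are all invented for illustration and are not B2FIND's actual mapping tables.

```python
# Hypothetical crosswalk: community field -> unified (DataCite-like) field.
CROSSWALK = {
    "dc:title": "title",
    "dc:creator": "creator",
    "discipline_code": "discipline",
}

# Hypothetical controlled vocabulary for research disciplines.
DISCIPLINES = {"3.5": "Earth System Science"}

def map_record(community_record: dict) -> dict:
    """Map a community-specific record onto the unified schema."""
    unified = {}
    for src, value in community_record.items():
        target = CROSSWALK.get(src)
        if target is None:
            continue                    # drop fields we cannot map
        if target == "discipline":
            # Curate: translate the code via the controlled vocabulary.
            value = DISCIPLINES.get(value, "Other")
        unified[target] = value
    return unified
```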
So, as well as a graphical search user interface, we should provide the possibility to search and access data via a command-line interface. There are different standards and tools for data search: different search APIs, such as Elasticsearch; standards for indexing, such as Lucene and Solr; and there are as well search repositories and portals that are based on linked data, so SPARQL plays an important role here. Additionally you can have an annotation service, at the top left here, which means that you can link metadata sets to concepts in ontologies. This is based on linked-data techniques as well; often the annotation records are formatted, or realized, as JSON-LD, or in Turtle, or anything else, and the model used here is the W3C Web Annotation data model. Okay, next slide, please. Yes, here are listed some examples of standards. For data discovery, as already stressed by Mark, a good metadata description is important, and this is based on standardized metadata schemas and on controlled vocabularies. The big challenge in this business is that the controlled vocabularies are often domain-specific and community-specific; we want to support the specific things, and the work, the challenge, is then to map them onto a common basis, so that you can provide all these facets and fields in the central search portal. As for search APIs, as said, there are a lot of standards, tools and software. There is Elasticsearch, which likely a lot of you know; CKAN is a platform and search-portal software, used for example in B2FIND; Lucene and Solr are mostly used for indexing metadata; and there are a lot of other search APIs. Very important here are the guidelines: addressed to the researchers, you of course need good and simple-to-understand search guides for the discovery and search portals.
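An annotation record following the W3C Web Annotation data model, serialized as JSON-LD as described above, could look like the sketch below: it links a dataset (the target) to a semantic tag and a comment (the bodies). The target PID and concept URI are invented for the example; the `@context` URI and the model's `body`/`target` structure are from the W3C model.

```python
import json

def make_annotation(target_pid: str, concept_uri: str, comment: str) -> str:
    """Build a minimal W3C-style annotation record as JSON-LD text."""
    record = {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "body": [
            # Semantic tag: link to a concept in an ontology.
            {"type": "SpecificResource", "source": concept_uri,
             "purpose": "tagging"},
            # Free-text comment for other researchers.
            {"type": "TextualBody", "value": comment,
             "purpose": "commenting"},
        ],
        "target": target_pid,
    }
    return json.dumps(record, indent=2)
```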
For the metadata catalogue, as already a little discussed in the chat as I can see, there are a lot of standards and protocols. The old one, mostly used and still used in B2FIND, is OAI-PMH, because it is stable and widely used. ResourceSync is the modern version, the successor, and I think it is on the B2FIND to-do list that we as well support harvesting via ResourceSync. But we also support other protocols and APIs: some data providers have specific JSON APIs, so we can harvest from those; we can harvest from GeoNetwork catalogues via OGC CSW; and, yes, this is more work in progress, we want to be able to harvest and collect metadata from triple stores via SPARQL as well. And as I said, the heart of the work is the metadata mapping and the metadata curation: really mapping the community-specific records onto the centrally defined metadata schema. Ours is based on the DataCite schema; there are a lot of others, and of course the chat was discussing the catalogue question: you know Dublin Core as the old format, and of course schema.org, already mentioned by Mark. For indexing the metadata records in the search portal, mostly similar indexing software is used, but as well other software like CKAN. And, as Google already does a lot to collect metadata via schema.org, schema.org is, I would say, both a metadata schema and a technology that a lot of portals use; it is very interesting. And here, for addressing the data providers, the interoperability guidelines are especially important: there it is really described what I said in the last minutes, which protocols should be used and which standards the different services are supporting.
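Once community records have been mapped onto common fields, the portal facets mentioned above are essentially value counts over those shared fields. A toy sketch, with invented records:

```python
from collections import Counter

def facet_counts(records, field):
    """Count values of a shared field across harvested records."""
    return Counter(r[field] for r in records if field in r)
```

Records missing the field (e.g. not yet curated) simply do not contribute to that facet, which mirrors how sparse mappings show up in a real portal.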
In bold I list here some examples of discovery and metadata services. Within the scope of EOSC-hub I mention here, of course, especially B2FIND, which is within the scope of EOSC-hub the central indexer and metadata service. But we have a strong collaboration with OpenAIRE; I think you know OpenAIRE, which cares more about open data, and for example, as work in progress, OpenAIRE now also harvests metadata from B2FIND and applies its services to improve the quality of the metadata. And we work together with DataCite: at the moment B2FIND uses their metadata schema, and we also work together with DataCite in the direction of digital object identifiers. And, as Mark stressed in one sentence in his presentation on data repositories, the concept of PIDs is very central here: they are used, of course, to link to the data resources. And finally the annotation service: as said, this is based on the W3C Web Annotation data model, and the formats, the realizations, are often JSON-LD or Turtle and other formats. I am not an expert here; this is done by another colleague, who is the provider of the annotation service. And again, always in this context, good interoperability guidelines are very important, to show the end users, and in this case the providers of technologies, how to use the standards. Sorry, I am just stepping in: we need to close very quickly because there is still Łukasz's presentation. Yes, I close with the last slide; can you switch, Mark? Again, I will not go into detail here: there are several initiatives, in the RDA there are a lot of metadata-related interest groups, and I
did not mention the GO FAIR discovery implementation network, because I just ran out of time. Okay, thanks; it has been too long. Thank you both. I see Mark is already answering the questions in the chat, and because we are very late with the schedule, unfortunately I have to skip those questions and give the floor directly to Łukasz, who will present the last part of the agenda, about the data-processing specification. Good morning, can you hear me well? Yes, Łukasz, the sound is fine and you are sharing the screen; I can see it. My presentation is going to be short, so please stay tuned; it is not going to take very long. The talk is about the data platform for processing. From the high-level point of view of the architecture review, we are talking about the thing which can help us to transparently and efficiently manage complex data processing in a distributed environment. In practice this is this yellow box we have here, which connects the processing power we might have with the data resources. But of course we are talking about EOSC and distributed environments, so in practice we need to combine a very large, complex ecosystem all together to deliver some sort of uniform data input into the data processing. I took the liberty of taking a picture of the data repositories from Mark and Heinrich's slides, because in practice, if we are talking about a processing platform, it has to be tightly connected with the existing data repositories, which allow us to take the source and input data files and, at the end, to place the results somewhere as well. Of course, the maturity of the repositories Mark was talking about might vary; it depends on the collection we work with, and it does not mean that we have precise, curated metadata in all the repositories. But in practice, from the high-level view of the overall
architecture, this is the essence: the data-processing platform is the layer which combines the whole ecosystem of the data and enables us to process the data in an efficient model. There are many other aspects to how we currently achieve this high-level picture in applications, but the macro-features are more or less as follows. First of all, we need ways to integrate the existing distributed repositories, the digital or data repositories, into the data-processing platform. Then there is the problem of how to access the data-processing platform from the computational infrastructure, meaning how we read the data on the worker nodes from the platform. There are different approaches: HTTP-based APIs, typical in the API world, and the more traditional POSIX way, which is compatible with more data-processing applications, including legacy applications. A very important part is as well how we manage input and output data: input is obviously part of a data-processing platform, but we also need to do something with the results, and the data-processing platform should often facilitate that problem as well. Around the data-processing part we are also talking about long-distance data-transfer problems, which we face in these environments because the results are not always kept, and the inputs cannot always be, in the same place. Of course we would love to have the case where we transfer the computation close to the data, close to the data source, so that we are in the same data centre; that is optimal, but it is not always possible, especially when we are talking about scientific repeatability of experiments, the processing of results coming from other experiments; that is usually
impossible, or at least very difficult. There are several other aspects, like the quality of service people expect based on data location, and access-control rights, because all those things touch the data, so we need to take the rights, the control and the security aspects very seriously. And if we are talking about processing, there is a very important part we cannot forget, and which we remember as a real pain from the grid times: how to delegate my identity to the processing nodes, while the processing might take a long time. Most solutions are very nicely connected to web interfaces, but if you go to batch processing, long-running processing, things lasting weeks, the problems of delegating AAI authority can be important and tricky, and this aspect has to be taken on by the data-processing platform as well. And of course there are the obvious aspects of scalability and decentralization, because we need a way of taking the heat off the repositories while we process the data, and the processing environments have to cope with global scalability; these aspects are crucial. Last but not least, there is a key element: we are working with heterogeneity of the infrastructure and the storage. While we are deploying the processing infrastructure, and the broad vision is to do that even ad hoc, we meet problems with the heterogeneity of the storage technology behind it, and with the heterogeneity of the technology delivering the virtual infrastructure; we are talking about cloud and HPC, which is not very uniform, and all of these aspects are very problematic. The expectation, and this is a crucial part, concerns the applications.
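The access-control point above can be illustrated with a minimal group-based ACL check, of the kind a platform might run before a worker node is allowed to read a dataset. The group names, action names and ACL layout are invented for the example.

```python
def can_access(user_groups, acl, action="read"):
    """Return True if any of the user's groups is granted the action.

    acl maps an action name to the set of groups allowed to perform it.
    """
    allowed = acl.get(action, set())
    return bool(set(user_groups) & set(allowed))
```

In a real deployment the "user_groups" would come from the delegated AAI credential presented by the job, which is exactly why the delegation problem discussed here matters.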
They should not have to be changed much for the processing if we want to make this happen; otherwise the cost of porting the application would be too much. As examples of relevant solutions and standards, just gathering the key elements of this part of the picture: we need to address the protocols for the integration of data repositories. In the current deployments we have WebDAV and GridFTP dominant, but there is also POSIX, which is very local to the data centres and requires some sort of gateway interface, which might be provided by GridFTP or another software-stack solution, whatever it may be. This is a key element, but the second part is how we read the data: again, most applications, most scientific workloads, prefer POSIX, but there are of course also solutions coming for S3, WebDAV and other things. All of these are combined together in this environment, and since we are talking about a distributed environment, we need to preserve the ability to access different protocols in different environments, as chosen by the application. And the key part is how we move the data between locations and how we manage all of that. There are a few solutions: old-fashioned ones like rsync and FTP, typical and common in HPC centres for very small data collections, for small inputs. But if we are talking about large things, it is not so trivial: we are talking about petabytes of data, and transferring them takes a long time, so we need to be helped by the other solutions available on the market. And, while Mark and Heinrich were talking about metadata in terms of the digital repositories, here we are facing a different type of metadata.
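For the long-distance transfer problem just described, one basic ingredient is moving large files in chunks and verifying them end to end, so that a transfer running for hours fails loudly instead of silently producing corrupt input data. The sketch below works on local files only; real tools in this space (GridFTP and friends) add much more, such as parallel streams and restartable transfers. The chunk size is illustrative.

```python
import hashlib

def copy_with_checksum(src_path, dst_path, chunk_size=1 << 20):
    """Copy src to dst in chunks; return the SHA-256 digest of the data.

    The caller can compare the digest against one computed at the source
    side to verify the transfer end to end.
    """
    digest = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            dst.write(chunk)
    return digest.hexdigest()
```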
And different problems arise here, because many applications require processing driven by metadata attached to the data, meaning that they schedule and orchestrate part of the solution, or of the processing, based on the attached metadata. This is not the metadata for discovery of the entire data collection, which was Heinrich's topic; it is about the individual data files. This is supported by a few solutions at the moment, but it is getting more and more important for the communities, especially in AI-oriented applications. The other parts are obvious, but one of the things I would like to stop on for a second is AAI delegation in the process. I mentioned this is a big pain: when we have an application running our computational job, we need to present our identity in the loop, to demonstrate that we have the right to read and process the respective data files. That is said in a light way, but the process of delegation is complex, because most solutions are based on short-term delegation, and there are a lot of aspects about trust: how long we can trust the certificate or the token being delegated in the long run, because otherwise we might be in trouble if someone hijacks our tokens from the processing infrastructure. And of course we remember the grid proxy certificates, which are one of the most commonly used solutions in this area at the moment, but there are also other approaches coming on the market which try to simplify the whole process. Anyway, this process is not very easy for the end users, because it requires some understanding of how things work behind the scenes, and this is a crucial and very important problem to be sorted out by the processing platform in the future.
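The delegation pain just described — long-running jobs outliving short-lived credentials — is commonly handled by refreshing the delegated token before it expires. A sketch follows; the `refresh` callback stands in for a real token service, and all names, times and the safety margin are hypothetical.

```python
import time

class DelegatedToken:
    """Hold a short-lived delegated token and refresh it before expiry."""

    def __init__(self, token, expires_at, refresh, margin=60):
        self.token = token
        self.expires_at = expires_at      # epoch seconds
        self._refresh = refresh           # callable -> (new_token, new_expiry)
        self._margin = margin             # refresh this many seconds early

    def get(self, now=None):
        """Return a valid token, refreshing if expiry is within the margin."""
        now = time.time() if now is None else now
        if now >= self.expires_at - self._margin:
            self.token, self.expires_at = self._refresh()
        return self.token
```

A job would call `get()` before every data access, so the credential it presents is never closer than `margin` seconds to expiry; the hard part in practice is deciding how much trust to place in the long-lived refresh capability itself, which is exactly the trust question raised in the talk.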
challenges I we found during the analysis of the problems several aspects of the the problem of deployments automatic deployments and federation and security aspects in the environment because the broad vision is like that we have like a data collection somewhere several data collections several data repositories so digital repositories and then we would like to do the processing on top of that ad hoc in some sort of some clouds where we are getting current the computational power of course the matter of optimization of deployment or the matter of the movement is a second level problem the first level is how to deploy the processing platform automatically in the processing environment and in that we are talking basically something like software defined storage because for what we get from the virtual environment where we can get the VMs and the cloud environment we basically get usually the just the VMs CPUs and low devices on top of that we should tear tear tear up for something automatically for the the processing processing infrastructure but data processing infrastructure but the problem is like these clouds even though they all of them are using open stacks and many of them using open stacks of course they are different they they deliver the problems of minor things which are blocking problems network configuration storage configuration aspects and all these little things are making the automatic deployments very very difficult and challenging and especially if you go a step further even if we overcome this there are forming problems then we go to the other aspects like if we want to make everything secure then the problem is how to automatically allocate certificates the issue certificates because everything should have like a SSI green certificate everything has to be properly the all the communication channels should be properly encrypted and how to delegate the connection to the IDPs how to connect the IDPs while we the instances has to be registered in 
These are challenges which are very difficult to overcome by the platforms that deal with data processing and try to address the challenge of deploying the whole infrastructure on the spot in certain clouds. Of course scaling and performance are also very important to understand, because behind the scenes it sometimes requires understanding of the actual storage in the specific cloud resources: sometimes the block devices are independent, physically independent drives, and sometimes the block devices are virtual, coming from a centralized Ceph or centralized block-I/O devices. The other problem is that, as Marica and other people during this session showed, there are different ways of processing data: we have batch jobs, Mesos, Kubernetes, and HPC, and all these things together bring a lot of heterogeneity into the technology. But if we are talking about a data-processing platform itself, it is very good if we combine all these things together with the platform as a service, because of the complexity of the whole thing; those things usually work well together. In practice, if you do not have an ideal solution for data-processing platforms, in the sense of some sort of universal one-click-button deployment that gives us the processing solution, it should be tightly connected with the platform-as-a-service and orchestration aspects as well. Of course this is for the larger data-processing infrastructures; in smaller cases there are scenarios where it is easy for data-processing platforms to be deployed separately and then, on top of that, to build smaller sets of virtual machines which connect and individually process the data according to the user needs. And very important is the strong integration with the data repositories,
because otherwise we have nothing to process, and no place where we can deposit the results of the processing. Of course most people think: okay, we have a storage collection, I have a home directory, I have a collection directory in my data centre; but in practice, if you take a high-level view, you need to treat all those data sources as some sort of data repository, all connected together, building the whole ecosystem; otherwise we have too many boundaries in the whole area. So that is the quick summary of where we are with data-processing platforms and what the situation looks like. I hope I was not talking too long, and I am afraid I do not have the nice recipe for which way to go to solve all the problems, because the complexity of the situation varies a lot depending on the application type. Thank you for your time; if you want to study the topic further, please ask. Thank you, Łukasz, thank you very much for this very interesting presentation. Unfortunately we are ten minutes over the scheduled time, so we are very late, I would say. I do not see a specific question for this session in the chat window; let me cross-check Slido very quickly. I am just refreshing the page to be sure I get the latest snapshot. It is not working, sorry for that. Okay, I see no other specific question from Slido. There are a few polls filled in, so I would ask you to fill in the polls again. Unfortunately there is no room for discussion. Yes, there is indeed a question about where to find the specifications, and that was exactly my last point in the agenda. We are collecting together all the materials you have seen in this presentation, plus others; at the moment they are advanced drafts, I would say: they are already internally reviewed, and people are actively working on them, but they are not final, nor
officially approved. I have pasted the link into the chat; it is a public wiki page from the official EOSC wiki, where you have all the already-drafted documents. There are still many others under preparation that will be published at the same link in the next weeks, so you can look at those documents, again not as final versions but as working drafts that you can provide input to. Please just write to the authors of the documents and keep me in copy, since I am following these activities as part of the TCOM. We will obviously also put the same link into the official agenda of this session, so that afterwards you can reach the page and stay informed about these documents. Unfortunately I have to close here. Thank you all for joining, and thank you for the interesting feedback and participation. For any further question or comment, please write directly an email to me or to the presenters. And thank you for organizing this session. I guess that here you will see the next session; I don't know, Andrea, do you want to add a few words? Yes, hello, and thank you everybody. Here on the slide you can find the lobby and rooms for breakout session number two, so if you want to join a different room than the Service Provider Forum one, I strongly encourage you to drop off this call and join one of the other links that you can now see on the slide. That's it; we will be back later. Thank you.