The affiliation is indeed The Language Archive of the Max Planck Institute for Psycholinguistics. But the basis for my other infrastructure work, my experience there, is a number of European projects which will come back later in the presentation, but I want to name them already now so that you see I am not completely objective here. First of all there is CLARIN, which is the ESFRI project for bringing linguistic resources to the humanities. Then there is DASISH, which is what they call an ESFRI cluster project: it tries to identify common services for the social sciences and humanities ESFRI projects. And then there is EUDAT, which is what this presentation is about.

My intention is to give you some overview of what EUDAT is. It is a very big project attacking very different problems. I will concentrate a bit on what we call, in EUDAT, the joint metadata domain, and then I have a slide on what EUDAT means, or what kind of consequences it has, for interoperability issues with the communities.

So, EUDAT key facts. We started in October 2011 and it is a three-year project, which means we have already had our midterm evaluation. And although the evaluators were quite content with the progress of the project, they had some sharp observations, which we take to heart. The total budget is about 16 million euros, of which only 9 million comes from the EU. That means that the partners themselves are putting a quite substantial amount of money into it. The objectives, in keywords, are that we have to deliver a cost-efficient, high-quality, collaborative data infrastructure, and it should be flexible, sustainable, and useful across geographical and disciplinary boundaries. That is quite an ambition.

Our consortium is a big one, partly consisting of general compute and data centers, like SARA and CSC, but also of a lot of what I would call community centers, like my own institute, the Max Planck Institute for Psycholinguistics, and for instance CERN, which is a big data and compute center, but directed towards high-energy physics. This makes for a lot of interesting discussion between the partners; I will not try to hide that fact from you, but we are getting on.

So it is about delivering data management services to the communities. Now, there are a lot of communities represented within EUDAT, but a few of them are more strongly represented than others. These we call the core communities: CLARIN, which is about linguistic resources; EPOS, which is about seismology; LifeWatch, about biodiversity; ENES, climate modelling; and VPH, the Virtual Physiological Human. These communities also represent different foci, or needs. For instance, ENES deals with very big data sets, of terabytes. In CLARIN the data sets are much smaller, but much more complicated in nature, in the sense that there are far more complex relationships between the data sets and within the data sets. And there is EPOS, for instance, which deals with the question of how we can store the high-bandwidth data streams that come from transducers, from measuring equipment, and how we can make them persistent in a data infrastructure.

Now, within EUDAT we have what we call core service areas, and these are divided between community-oriented services and what we call enabling services.
The community-oriented services are things that we want to provide directly to the communities, like simple data access and upload, long-term preservation, shared workspaces for collaborative scientific research, execution and workflow, and joint metadata and data visibility. The enabling services are much more the oil that makes the whole thing function. There you find things like persistent identifier services, based on EPIC, the European Persistent Identifier Consortium, and DataCite, of which you have just seen a presentation. We want to use federated AAI services, so federated identity management; network services, as high-speed connections need to be enabled between the data centers; and monitoring and accounting. Now, obviously some of the services will be community-specific, some of the services can be shared by some communities, and others are likely to be common to all of us.

We have identified about six interesting service cases, as we call them, which we are trying to build services for. There is the safe replication service case, which is about allowing communities to replicate data to selected data centers, depending on the policy that is connected to that data set, and to do this in a robust, reliable and highly available way, with the underlying idea that there is safety in numbers: lots of copies keep stuff safe, right? There is the dynamic replication case, which is targeted at moving data sets to those data centers or compute centers that offer high-performance computing facilities. There is metadata: we want a joint metadata domain for all the data that is stored within EUDAT data centers and that is registered within EUDAT, because, let me emphasize this, EUDAT is not about doing things for all types of data. The data must be registered; without registration, it does not exist, as far as we are concerned. We will come to the metadata case later on. Then we have the research data store, or simple store as we call it, a function that will help researchers of the participating communities to upload and store their data, with proper metadata, in a safe way. We also need an authentication and authorization infrastructure. There we go for FIM, federated identity management, but that is as yet one of the least developed service cases in EUDAT. And persistent identifiers: a highly available, effective PID system that can be used within the communities and within EUDAT itself.

Here you see how these service cases relate to one another. Safe replication, data staging and the simple store all deliver metadata to the metadata catalog, bound together by AAI, which should connect to all of these cases, and by persistent identifiers, which are what I call the oil that makes the whole system work.

This shows a bit of how data is pushed between community centers and data centers. We imagine there is data available at the community centers. This data gets pushed to the EUDAT data centers, where it is again pushed to specific facilities, for either long-term archiving or high-performance computing. The binding element that keeps everything together is the EUDAT PID service: the data is administered through the persistent identifier system, so that we know where the data is at a certain moment in time. There are, of course, registries that keep track of all the different centers and their capabilities, so we have a EUDAT center registry, and we are harvesting metadata from the communities and putting that into a EUDAT metadata service.
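As an aside on how such PID administration looks in practice: since the EPIC PIDs are based on the Handle System, a client can in principle find out where a registered object currently lives by resolving its handle through the public Handle System REST API. A minimal sketch, assuming the handle record carries a URL field pointing at a current replica; the handle prefix and suffix here are invented for illustration:

```python
import json
import urllib.request

def resolve_handle(handle):
    """Resolve a handle via the public Handle System REST API
    and return the URL(s) stored in its record."""
    api = "https://hdl.handle.net/api/handles/" + handle
    with urllib.request.urlopen(api) as response:
        record = json.load(response)
    # A handle record is a list of typed values; entries of type
    # "URL" point at a current location of the object.
    return [v["data"]["value"]
            for v in record.get("values", [])
            if v.get("type") == "URL"]

# Hypothetical EPIC-style handle, invented for illustration:
print(resolve_handle("11858/00-EXAMPLE-0000-0000-0001-1"))
```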
And we have the simple store, and data uploaded to the EUDAT simple store is then pushed, for safety of course, into EUDAT space, so it will be replicated over different EUDAT data centers.

Now, this should have given you somewhat of an overview of what EUDAT is about. I now want to concentrate on something which might also interest you, which is the joint metadata catalog. This is a picture of how the different EUDAT nodes, or centers, relate to this metadata store. We envisage that we will develop a number of prototypes using the metadata that is currently out there at the communities. This inspires us to address, for the moment, only the XML-formatted metadata that is out there with the communities, harvested using OAI-PMH. ENES and CLARIN, so the climatology and the linguistics communities, seem to have lots of metadata already online: community-specific metadata, not related directly to publications, but more descriptive metadata that is of interest to the communities themselves. That is what we are interested in. And we will only do reasonably simple mapping of the different metadata schemas that are involved here. So we will be getting metadata, of course, from the data that is available within EUDAT space, so data that is put into EUDAT for safe replication, and also from the simple store: data and metadata that has been uploaded directly by researchers. And also, because we know there is a lot of metadata available from non-EUDAT communities, as I would call them, communities that do not play a direct role within EUDAT: if they have good-quality metadata, why lock them out? We are more than happy to harvest them if they have something useful. And the idea is to store all of that in the metadata store.

So that is the vision that we had, but of course things are, let us say, more challenging than that. What we also see is that we have to address things from the user's perspective, and there we think the following points are important. For instance, if we want to make this joint metadata domain useful for the communities, it is important to answer the question: what terminology can these communities use when they want to search this catalog? Will it be some kind of Dublin Core, a simple terminology, or can we make things so that the terminology can be switched depending on the background, the profile, of the user who is actually going to use this joint metadata catalog? That is a question we have to answer. What will be the common browsable and searchable dimensions? I do not think it is useful to have a catalog that shows all the different metadata elements that occur in the different metadata records we will harvest, so we have to come to some core that is interesting to all of the communities. Another important thing is how the metadata will be presented: what visualization options are available? And most important: what added value will this EUDAT metadata catalog have compared to the metadata catalogs that are already out there? DataCite has shown a nice one. If we are doing something different, what added value do we have? Compared to the community-specific joint metadata catalogs, such as we have within CLARIN, an important plus is of course that we deal with cross-community metadata here. So that is an added value.
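To make the harvesting side concrete: OAI-PMH is a plain HTTP protocol, so a minimal harvester only needs to issue ListRecords requests and follow resumption tokens until the list is exhausted. A sketch in Python; the endpoint URL is invented for illustration, and a real EUDAT harvester would of course request the community-specific metadata formats rather than plain Dublin Core:

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(base_url, metadata_prefix="oai_dc"):
    """Yield all <record> elements from an OAI-PMH endpoint,
    following resumptionTokens until the list is exhausted."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            root = ET.parse(response).getroot()
        for record in root.iter(OAI_NS + "record"):
            yield record
        token = root.find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no more pages
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

# Hypothetical community endpoint, for illustration only:
for rec in harvest("http://repository.example.org/oai"):
    print(rec.findtext(".//" + OAI_NS + "identifier"))
```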
It has already been mentioned today that the challenge is also how we can increase metadata quality, because I have personally had the experience, within CLARIN, that if you start harvesting metadata from the different sub-communities there, you are sometimes astonished by the bad quality you get. So anything we can offer as a possibility for increasing this metadata quality is important. Now, as a joint metadata catalog, it is unlikely that you have the resources to attract specialists to deal with every one of those communities. What we can do is offer the infrastructure, for instance for a commenting facility. Originally we had planned that commenting facility so that users can say: well, this data set is very useful, or it is not useful at all. But of course you can also use it to allow users to deliver comments like: this metadata is wrong, because this is not, for instance, a woman speaking, it is a man. Such errors are made, and if possible they should be corrected.

Okay, direct visibility of the data uploaded in the simple store: that is also important. If, starting from the metadata, from finding the metadata, you offer direct visualizations of the underlying resources, that is a big plus, so good visualizations are important. Linking to the original metadata, the metadata as it was stored by the communities, is also an important plus. A harmonized data citation facility, that was also already mentioned, I think, and export of the metadata as linked open data are considered added value as well. Now, no matter what kind of added value you deliver, you should also consider that there are what we call emerging communities, which do not have any really usable metadata catalog out there yet. For them, such a joint metadata catalog, even a shallow one, in the sense that you do not do things with what I call deep, community-specific metadata, could already be a usable catalog, because they do not have anything in place themselves yet.

Okay, now some more technical pictures. The metadata catalog: what have we foreseen there? What we have foreseen is that we will use OAI-PMH to harvest the different communities, so we have a harvester, and we will be using CKAN. I should perhaps explain that recently we did an evaluation of some of the systems out there that could be used as the basis for a joint metadata catalog. We evaluated CKAN, we evaluated D-NET, and we evaluated a homegrown solution at DKRZ, which is an institute for climate computing. From that evaluation it has come out that, for the moment, for the first prototypes, we will be using CKAN. Now, I have to emphasize that this does not mean we will use CKAN forever; it is just a preliminary choice for the first prototypes, because next to, let us say, working on harvesting metadata and showing it in prototypes, we have a kind of parallel track of constantly evaluating interesting technologies that are out there. For instance, our US colleagues in DataONE use something called Mercury, and we will be evaluating that too, because it seems to be working very well. But for the moment, we will use CKAN. Anyway, in CKAN...
CKAN can deal with different metadata schemas, but it more or less requires you to develop different adapters, one adapter for every specific metadata schema. So if you have two schemas, you need an adapter for schema A and an adapter for schema B. This adapter first flattens possibly hierarchical metadata schemas, and then it filters out those metadata elements that you want to show in CKAN. So you do a data reduction here, as in the sketch after this paragraph.

Why do we do that? It perhaps does not appear very high-tech to you. Well, we do it because we need fast results based on community practices. We certainly do not want to develop a new metadata schema, the schema that will put an end to all other metadata schemas; you have heard those stories before, I am sure. And we want to use the metadata that is already out there, using existing and proven technology if possible. If it is not out there, then we will develop something, but if it is out there, we want to use it.

Now, I already said that in this setup CKAN will only deal with limited, filtered-out metadata. So if we used only CKAN, lots of information would be lost and we could not offer other facilities, like, for instance, full metadata content search. If we want to do that, we will need to store the metadata as it is delivered by the communities in some kind of big pot of metadata. For the moment we think we will use an XML database for that, and we can do content search on it. That does not mean it is suitable for fast browsing and fast searching; for that, I think CKAN is the better solution. So we will put all the metadata, as it is harvested from the communities, into a big pot, and we will use the filtered metadata for faceted browsing and search using CKAN.

Now, this is not the whole story, as you might expect, because one of the core communities, CLARIN, already uses CMDI, component metadata, as its metadata infrastructure, and there the effect is that users can specify metadata schemas as they see fit, to describe any new type of resource. This has already resulted, for the moment, in more than 100 different metadata schemas, and if we want to fit this into this architecture, we can of course create a new adapter by hand for every new metadata schema, but that scales badly. For the first prototype we can pull it off by only taking into account the largest set of metadata records in CLARIN, which is the set transformed from what we call the old existing IMDI records, but that is not something that will scale into the future.

So, how to deal with that? And now I have to apologize for the slide; this is what happens when you still want to edit your slides until the last minute, so ignore what you see in the upper right corner. What we want there is some kind of automatic adapter generator that can generate an adapter as it is needed. So when, for instance, a new metadata schema comes along, we want to generate a new adapter for it. For that, we need information about the schema. Within CLARIN, fortunately, we think we already have that information available: available in a CLARIN-specific metadata schema registry and in a CLARIN-specific metadata concept registry. All the metadata concepts used within the different CLARIN schemas are required to have links to a concept definition, and there are relations between these different concepts.
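To illustrate what such an adapter amounts to, here is a minimal sketch: it flattens a hierarchical XML record into dotted-path keys and then keeps only a whitelisted subset of fields for the catalog. The community schema, its element names, and the whitelist are all invented for illustration; a real CKAN adapter would additionally map the surviving fields onto CKAN dataset fields.

```python
import xml.etree.ElementTree as ET

# Core fields we want to expose for faceted browsing (illustrative choice):
WHITELIST = {"title", "creator.name", "language", "resourceType"}

def flatten(element, prefix=""):
    """Flatten a hierarchical XML record into dotted-path/value pairs."""
    pairs = {}
    for child in element:
        tag = child.tag.split("}")[-1]  # strip any namespace
        path = f"{prefix}.{tag}" if prefix else tag
        if len(child):                  # has sub-elements: recurse
            pairs.update(flatten(child, path))
        elif child.text and child.text.strip():
            pairs[path] = child.text.strip()
    return pairs

def adapt(record_xml):
    """Reduce a full community record to the catalog's core fields."""
    flat = flatten(ET.fromstring(record_xml))
    return {k: v for k, v in flat.items() if k in WHITELIST}

# A made-up record in a hypothetical community schema:
record = """
<resource>
  <title>Corpus of spoken Dutch, session 042</title>
  <creator><name>J. Janssen</name></creator>
  <language>nld</language>
  <technical><bitrate>256</bitrate></technical>
</resource>"""
print(adapt(record))
# -> the deep technical detail (bitrate) is filtered out;
#    title, creator.name and language survive for faceted search.
```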
Now, I will not go into that any further, because that would require a very CMDI-specific presentation. I only want to say that for CLARIN, we think we can generate these adapters automatically. This will of course also be a nice test of this kind of pragmatic ontology, as we call it. It is not a full-fledged ontology; it is a pragmatic ontology: you only store the knowledge, the aspects, that you need, so concepts and relations between the concepts, in different registries. And we want to see if this approach can be applied in general, to all the communities that play a role in EUDAT and whose metadata we want to harvest. How much time do I still have left? A couple of minutes. Okay, fine. But we will have to see: if we use that CLARIN approach, it also requires that we analyze the metadata schemas as they are used by the other communities, and also extract the metadata concepts and relations into those registries.

Now, interoperability. Of course this workshop is about interoperability, and I thought, well, what kind of interoperability issues are there within EUDAT that have consequences for the communities and that I want to mention here? First, of course, I wanted to protect myself against discussion about interoperability, for, let us say, the purposes of this presentation. This is a very narrow definition of interoperability, from the IEEE glossary: the ability of two or more systems or components to exchange information and to use the information that has been exchanged. Okay, with that definition, I think we should look at the following consequences for EUDAT and for community practices.

First of all, in the data replication use case we use iRODS. iRODS is data replication middleware, a follow-up of other software that had already been developed for some years in the United States. I think it is the only reliable stuff out there if you really want serious and reliable shipping of big quantities of data between centers. It allows you to specify a set of policies associated with every data set. You might, for instance, set a policy for a data set saying: I only want my data set replicated to data centers that adhere to this level of security, or: I do not want it put on Dutch centers, for instance. That type of policy you can associate with your data sets, and then the machinery behind it will make sure that your policies are respected; the sketch after this paragraph illustrates the idea. Now, iRODS, like I said, is the only mature and acceptable stuff out there at the moment, but if you accept that, then you also accept that if you want to link into this grid of community and data centers, you need to do an installation of this software, which means there is a danger of technology lock-in. This is one of the points that was also noted by the reviewers, and they made a note that we should try to be more generic. Well, we will take that to heart and we will see if we can do that, but it also requires that we find alternatives, right?

Okay, another thing that I think has consequences for the communities is that it is currently being discussed within EUDAT, within the AAI task force, that we will probably use certificates as the kernel of the security in the communications between the EUDAT centers.
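Purely to illustrate the idea of policy-driven replication, and emphatically not iRODS rule syntax: a sketch of how a per-dataset policy could constrain which centers receive replicas. All names, fields, the security scale and the center attributes here are invented.

```python
from dataclasses import dataclass

@dataclass
class Center:
    name: str
    country: str
    security_level: int  # higher = stricter (invented scale)

@dataclass
class Policy:
    min_security: int
    excluded_countries: set
    copies: int

def select_targets(centers, policy):
    """Pick replication targets that satisfy the data set's policy."""
    eligible = [c for c in centers
                if c.security_level >= policy.min_security
                and c.country not in policy.excluded_countries]
    if len(eligible) < policy.copies:
        raise RuntimeError("not enough eligible centers for this policy")
    return eligible[:policy.copies]

# Invented centers and an invented policy:
centers = [
    Center("SARA", "NL", 3),
    Center("CSC", "FI", 3),
    Center("RZG", "DE", 3),
    Center("DKRZ", "DE", 2),
]
# "At least security level 3, never on Dutch centers, two copies":
policy = Policy(min_security=3, excluded_countries={"NL"}, copies=2)
print([c.name for c in select_targets(centers, policy)])
# -> ['CSC', 'RZG']
```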
Of course, we will make sure that federated identity management, which I think is much more to the liking of the communities and enables them much more to do their thing, is interoperable with that, but the fact that X.509 certificates are being used between the data centers and the community centers does impose some requirements. But anyway, it is all certified, it is all, shall we say, standardized; we cannot complain about a lack of standardization there.

Then there is the use of a specific persistent identifier system, with some added information items. In EUDAT we have chosen EPIC and DataCite, both based on Handle System technology. Within EUDAT we do not make any assumptions about the level of granularity, so one community's resource might be another community's data set. We make no assumptions there, which means that we have to be able to issue persistent identifiers for reasonably small data objects. And indeed, the Handle System and EPIC make no assumptions about the size of the data object. This does mean that for anything that has to do with data citation, communities would have to be prepared to use a handle, or a DOI if you want, as the basis for citation. Okay, for metadata, the use of OAI-PMH, well, that is more or less accepted, I think, but perhaps we do need to develop a few EUDAT-standardized things for metadata, such as these registries for metadata schemas and metadata concepts. Overall, I think this is a limited set of interoperability issues. Other technology choices, such as, for instance, formats and other things, are not in the domain of EUDAT; we leave those within the domain of the participating communities. So I think there are limited consequences for the communities here.

Last slide: interoperability of ideas, if I may finish with that. There are many communities active within data management. Either they are developing ideas, or they are more practical, developing tools and infrastructure, or they might be doing both. Now, to make ideas come together, or at least to get informed about the differences between communities, we need to talk, right? This workshop is, for instance, a forum where we do that, but there is also the very important Research Data Alliance initiative that has been launched. It is supported by the EU, supported by our US colleagues, and supported by the Australians; in Europe, support is channeled through the iCORDI project. I hope it has also already been mentioned yesterday, I am not sure. It is supposed to be bottom-up and community-driven, but not only community-driven. I mean, people need some guidance, so I am happy to say that there is also a place for what we call the data scientists in all this. Right, we will have our big workshop in Gothenburg soon. If you like, I can give you more information about this. We would like to invite as many people as possible to participate there. It is of course understandable that you will be there with your specific hats, with the technology you want to push, but it would of course also be very good if you could take other people's, other groups', opinions into account when you participate there.

And there I would like to end. Thank you.