This is not about CRM; it is in the context of CRM. It is about solving a problem we have been pondering for many years. I pondered it working as a programmer, and now I'm an academic, a traditional academic in the humanities in Cologne. So now I don't have to solve the problems anymore, but sometimes it's fun to come back and think about how things should be done. So we'll describe a workflow where grey literature documenting excavations is translated into some sort of TEI with CRM connected to it. And I will try to describe why we do this, and also talk about possible automatic decisions, because there are quite many. So I'll talk a little bit about Iranian cultural heritage and the management of that. I will show the previous work this is based on. I will show the pipeline, talk about visualization and possible automatization, and we actually have results, even if it's quite early in the process. So the point here is that a number of different organizations and universities are engaged in this work, as elsewhere, but there is not really any computer-based digital connection between the different collections and systems. That connection is the long-term goal, and I will say a little bit more about what I mean by that. So we have a system where the National Library is responsible for legal deposit of documents, so the grey literature will be there, along with other publications, and it is catalogued in a traditional library OPAC based on MARC. Fine. The National Archive is similar; actually it is more or less the same institution as the library, and they also use a MARC-based OPAC. Then you have the Cultural Heritage, Handicrafts and Tourism Organization. They also have a lot of excavation projects, and they also have a similar kind of MARC-based software to take care of the documentation.
So that means that all these cultural heritage organizations, in cooperation with the Archaeological Institute of Tehran University and other academic university institutions, have a fully functional system: you can use a search interface to a library or archive database, get access to information, and read it. And if that system didn't work, nobody would have used it, because this is what people have been doing, and to a large extent still do. It's functional. The point is we want to have something better, and in order to have something better it's not enough just to classify the documents. We have to dive into them and work on the information within the documents, because that's where we can do information integration using the computer in the deeper sense. We do information integration already, of course: we read and think and connect and make notes, or whatever we do. So we have a reasonably long history, which is reasonably important. Of course, all countries have that to some extent, but I don't have to make the point that this is an area with a tremendous amount of heritage. And we want to make it more accessible, both for people in the country and people internationally. So we want to improve a functioning system, and the focus for now is on grey literature. Then of course, if we manage to do this, it can be extended to museum information and other things. So we have a number of excavation reports, and these are published in the traditional way, usually as grey publications, sometimes in larger print runs, and what we are doing is based on previous work on accession catalogues in Norway in the 1990s. So it's old technology; well, it's old methodology. The idea was to treat museum catalogues as first-class textual objects. So instead of reading them and extracting just the information about the objects, we treated them as text.
The point was to then encode the information in these documents at, in that case, a reasonably detailed level, and each SGML, later XML, element had an ID connected to it. So everything could be referred to and addressed using a unique identifier. Then information was extracted to a database, and that included the ID, so that anything in the database, which was based on this published textual information, could be traced back to the online versions. So the database would link to its sources, which you can call scholarly reproducibility. We also did similar things on grey literature and on archival material, but that was a bit too big. We only scanned facsimiles and did a reasonably deep classification, but no text extraction, no OCR (OCR didn't work at the time on this kind of material) and no typing in. So the methodology was slightly different, but the idea was again to have this reproducible information available. So this is a slide some of you have seen a few times. The idea is that you need the bibliographical record, obviously, so you know what you're talking about and in what context; then you create a facsimile; then you create a text with XML markup, structural but also more detailed; and then that semantic markup is connected to the museum database. So you have a connection between the XML markup and some sort of data model, and in our case, in this proof of concept, we will just use CRM classes and properties. In a real running system this will probably be an implemented database where you then have the semantics available. So what we now see is that OCR works, but we need a lot of manual investment in extracting the concepts. We can automatize more and more, and that might make this process economically feasible. It's very hard to do all of this manually, and there's a reason to be bored by it. But we have to start manually to get somewhere. We have to understand what we do in the manual processes before we automatize them.
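The ID-based traceability described above can be sketched in a few lines. This is a minimal illustration, not the project's actual schema: the element names, IDs, and record fields are invented for the example. The point it shows is that every value extracted to the database keeps the unique identifier of the markup element it came from, so each record can be traced back to the source text.

```python
# Sketch of ID-based traceability: every markup element carries a unique
# identifier, and every extracted record keeps that identifier, so database
# content can be traced back to the marked-up text. Names are invented.
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

catalog = ET.fromstring("""
<text>
  <object xml:id="obj-0042">
    <material xml:id="mat-0042a">bronze</material>
    <findspot xml:id="spot-0042a">Trench B, layer 3</findspot>
  </object>
</text>
""")

records = []
for obj in catalog.iter("object"):
    for field in obj:
        records.append({
            "object": obj.get(XML_ID),    # which object the fact belongs to
            "field": field.tag,           # database column
            "value": field.text,          # extracted value
            "source": field.get(XML_ID),  # link back to the marked-up text
        })

for r in records:
    print(r)
```

Because the `source` field survives into the database, any query result can link straight back to the exact element in the published text it was extracted from.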
It is, in a way, a methodological basis. So it will give access to the information in two contexts: in the context of the graph database, of the CRM model, of the ontology, call it what you want, and in the context of the original textual document. And that's the point: to link those two contexts to get scholarly reproducibility. Of course, twenty years have gone by. We will not do this based on TEI alone; we will use CRM in a reasonably direct way, and we have XML tools and RDF triple stores to take care of the data. We used other things before, and of course visualization technology is a totally different story. I mean, you can now do things in a week that you couldn't do in half a year back then. So in order to put all this together, we mark up in Word using colour coding in Word documents, export to .docx, and do an XML-to-XML transformation to TEI-XML. Then we have been using WissKI, we'll come back to that, and then we just export it in some sort of RDF serialization. And then standard graph technology for visualization; if you're interested in any of this, we can talk about it. So we have a number of reports that have been looked into by the consortium, obviously, and the actual tests were done on parts of the first one. What we have done in this first round of the project is to work on English texts, because there is a lot of grey literature published in English, and it makes a few things along the pipeline easier. There are some more practical problems if and when this project really takes off and we start working on the Persian literature, but it's not impossible. The OCR works; ABBYY, at least, has Persian implemented. There are some issues we have to solve, that's the whole thing. And of course some of the language technology we might use will be more developed for English. So what do we do? Extracting concepts from the reports manually: read them, mark them up.
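The first step of the pipeline, turning colour-coded runs from the Word export into TEI referring strings, can be sketched roughly like this. This is a simplified illustration under stated assumptions: the colour-to-category mapping, the `rs` element usage, and the ID scheme are placeholders, and the actual reading of the .docx runs is left out.

```python
# Minimal sketch of the colour-code step: text runs extracted from the Word
# document (text plus highlight colour) become TEI referring strings (<rs>).
# The colour-to-category mapping and the ID scheme are assumptions.
from xml.sax.saxutils import escape

COLOR_MAP = {"yellow": "object", "green": "property"}  # assumed convention

def runs_to_tei(runs):
    """Turn (text, colour) runs into a TEI <p> with <rs> referring strings."""
    out, counter = [], 0
    for text, color in runs:
        kind = COLOR_MAP.get(color)
        if kind is None:
            out.append(escape(text))  # uncoded text passes through unchanged
        else:
            counter += 1
            out.append('<rs type="%s" xml:id="rs-%03d">%s</rs>'
                       % (kind, counter, escape(text)))
    return "<p>" + "".join(out) + "</p>"

runs = [("The ", None), ("bronze figurine", "yellow"),
        (" was found in ", None), ("layer 3", "green"), (".", None)]
print(runs_to_tei(runs))
```

The conversion really is this basic in spirit: the markup just records what was highlighted and with which colour, and everything semantic comes later, in the mapping to CRM.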
What is happening in the events being described in the text? These are mapped to CRM, and there is a link between the source text and this extraction, and in this way we are also trying to evaluate how CRM and the CRM family actually work in this Iranian context, which we'll come back to. Some basic things: as in all translation, there are issues, which make some of the concepts hard to use and understand when you do the translation, and also when you apply it. One of her interesting findings is also that the use of these concepts differs in English reports depending on the mother tongue of the author. This is not statistical, it's just an impression, but it's a fairly interesting impression, and this is between English and German native speakers. So German native speakers write English excavation reports differently from English native speakers. So, talking about speed of translation, the problem with translations of things like data standards is that they are always dated. Good. There's a steep learning curve, and there are some other issues. It would have been good to have more time; I'm not going to point fingers at anybody except myself. It's quite complicated to actually analyse the text, map it, and find the right level of how much information to extract, and it's a lot of mechanical work. So this is a reasonably detailed mapping. It's basically two colour codes, one for objects and one for properties. And this is then converted to TEI, just using referring strings in TEI with IDs, so that it is linked to a similar thing we do, in this case first in WissKI. So the conversion is very basic; the timing is very basic. We're just doing this to see whether we can make this work and whether it makes sense in the context of the process of getting there. The CRM model is then serialized and visualized.
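The next step, linearizing the colour-coded referring strings into CRM-style statements, might look something like the sketch below. The base URI, the choice of CRM classes per colour code, and the documentation property are illustrative assumptions, not the project's real mapping; the point is that each triple keeps the `xml:id` linking it back to the TEI source.

```python
# Sketch of turning TEI referring strings into CRM-style triples while
# keeping the link back to the TEI source. The namespace and the class
# chosen per category are placeholders, not the project's real mapping.
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
BASE = "http://example.org/excavation/"  # assumed namespace

tei = ET.fromstring(
    '<p>The <rs type="object" xml:id="rs-001">bronze figurine</rs> '
    'was found in <rs type="place" xml:id="rs-002">Trench B</rs>.</p>'
)

triples = []
for rs in tei.iter("rs"):
    uri = BASE + rs.get(XML_ID)
    # CRM class picked per colour-code category (illustrative choice)
    cls = {"object": "crm:E22_Human-Made_Object",
           "place": "crm:E53_Place"}[rs.get("type")]
    triples.append((uri, "rdf:type", cls))
    triples.append((uri, "rdfs:label", rs.text))
    # back-link to the exact passage in the TEI document
    triples.append((uri, "crm:P70i_is_documented_in",
                    BASE + "tei#" + rs.get(XML_ID)))

for t in triples:
    print(t)
```

In a running system this would feed an RDF triple store rather than printed tuples, but the traceability mechanism, one `P70i is documented in` link per extracted entity, is the same.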
So, WissKI; we are also working on using Coconut, an Iranian cultural heritage commercial system, which might also be usable, because they are currently working a lot on implementing CRM in that system as well. So this is the first draft of the visualization: text and graph, and of course the idea, which we'll hopefully be able to show reasonably soon, is to have active links, so that when you click on something in the text, the graph will move, and you can click on something in the graph. And this is very much work in progress. This will really, we hope, give professional users much better access to their own information, even to texts they wrote themselves. But it also means the possibility of international connections to objects kept in museums in other countries, which is part of why we are doing these things. This is just a small attempt, just to make the point that of course you could put the next excavation report into a named entity recognizer, but it doesn't mean you find anything relevant except for some basic things. On the other hand, automatization is possible, not as plug and play for the whole process, where you put in an archive and get fantastic models out, but you can speed up manual information extraction, and you can get pretty high gains with some sort of self-learning system. Rule-based, maybe deep learning, I don't know exactly; it might be better just to implement it very simply. But the point is: whenever you mark up a string as something, you can search through the rest of the documents, search through all of the documents, and automatically mark up that string. Just that, which is very simple, and then you just have to go through and accept the suggestions. These kinds of very basic tool sets can help, and then of course they can be developed into something which works more on the context. So this is something that would be extremely interesting to work on also in this specific context, and not least in the context of Persian material. So we have clarified the pipeline.
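The "mark once, suggest everywhere" idea described above really is this simple at its core. A minimal sketch, with invented example phrases: every string already marked up by hand is searched for in the remaining text, and each hit becomes a suggestion for a human to accept or reject.

```python
# Tiny sketch of markup propagation: strings already marked up by hand are
# searched for across the documents, and each hit becomes a suggestion.
# The phrases and categories here are invented examples.
import re

known = {"bronze figurine": "object", "Trench B": "place"}  # hand markup so far

def suggest(text, known):
    """Return sorted (start, end, phrase, category) tuples for known strings."""
    hits = []
    for phrase, category in known.items():
        for m in re.finditer(re.escape(phrase), text):
            hits.append((m.start(), m.end(), phrase, category))
    return sorted(hits)

doc = "Another bronze figurine appeared near Trench B in 2019."
for start, end, phrase, category in suggest(doc, known):
    print(f"{start:3d}-{end:3d}  {category:<7} {phrase}")
```

A real tool would of course need to handle inflection, overlapping matches, and ambiguous strings, and could grow towards context-sensitive rules or learned models, but even this exact-match version turns hours of repeated marking into a review-and-accept pass.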
We have identified where automatization will be really useful in this kind of work. We have identified some translation issues which will be important for future work on texts in Persian. We do have some issues. Working in Word is of course not ideal technically, but it works in practice, and the conversion is really easy. But we don't have comparable tools for working with graphs, I'm afraid. There were quite a few issues working with WissKI, and we have a few technical people around; this is not easy for an institution without programmers, at least for the time being. So how do we make this class and property connection easy in a tool that works for somebody who is not really deep into the technology, nor necessarily into information management, but who knows something about the material? Yeah, it could be a better, reloaded version of WissKI, it could be based on Cognos, it could be something else. This is one of the things we need to look into. We need to finish the visualization, clarify where we can automatize, streamline this, and get the production process working better. We need to test more and do more research on language and cultural issues, but of course a production system needs significant funding. And of course, once we start thinking about a production system, we have to think about all these important but not always exciting things like long-term preservation. So the idea here is that we're still working on a proof of concept, but I think some version of this, which will look quite different from what we saw here, is doable and can become a production pipeline. And that's the point.