I'm from The Open Group; I'm The Open Group's Director for Interoperability, and my role is to support our work on a number of topics, including semantic interoperability. That's the reason for my being here and for The Open Group's participation in this project. I'm going to talk about the need for a data classification system, what data classification is about, what we have already, and then get to some conclusions.

So, looking at the big picture, we've heard a lot about how data is the new oil, that there's some kind of untold mineral wealth that you can extract, and that it will be the basis of a whole new economy of productive services and products. But actually, just as extracting and refining minerals requires some very hard work, so does providing and consuming data. That has been a recurring theme of the Share-PSI project, and in this workshop, as in the other workshops, we've had some very good discussions. That slide shows a selection of comments from the preceding workshops, highlighting the fact that although public sector information may be openly available, quite a lot of hard work is required to make use of it, or even to provide it. It's not just a case of going out and finding a big nugget of gold just lying there; it's not like that.

And the very starting point is the problem of understanding the value of what you have. For those of you who are not geologists, by the way, the mineral shown in this picture is not a valuable gold ore. It can be hard to understand the value of what is in the ground, and it can equally be hard to understand the value of the raw data that is provided by public administrations. So if you are a product or service supplier wanting to make products and services based on public information, it's not easy to work out what's there and what you can do with it.

But classification is a good starting point for analysis. If you go back hundreds of years, people had classification systems for minerals and chemicals, and they had some interesting names for these things, and it was all very good; but now we have a much more efficient and modern way of doing it, based of course on the periodic table. So there we have a solid and established classification method which has been the basis for a massive amount not only of scientific progress but also of commercial exploitation of natural resources.

My thesis is (I may not be a professor, but I can still have a thesis) that we need a data classification system if we are to get the economic benefits that we should be getting from data in general and public sector information in particular. That data classification system needs to enable analysis, it needs to aid our understanding and use of the information, and it needs to make interoperability possible between independently developed pieces of information, so that at one level you can understand whether this piece of information produced by this administration is the same as that piece of information produced by another administration somewhere else, and at another level you can combine pieces of information. It would be nice if you could do this by automatic means through some kind of composition, but even to do it manually you need a classification system to know what you are working with.

So what is data classification about, shall we say? Here are some considerations. We want to describe the units of data to enable analysis, understanding and integration.
We are talking about units of data, so maybe we are talking about data elements; there is a definition of data element, and maybe we can classify data elements in the same way as the periodic table classifies the natural elements. OK, this is a slide which I can either spend five minutes on or pass over very quickly. You need to set the context of an element: is it the whole poem, or is it some of the things within that poem? The other big conclusion to draw from this slide, and the reason why it is in there, is that there are some things that theoretical analysis is not good at, and one of those is extracting meaning from poetry.

Having made that point, hopefully we can also look at structured data. There has been a lot of talk about what structured data is, what unstructured data is, what semi-structured data is, and I would go further than just saying that structured data is what you find in relational databases, or in spreadsheets, which are like relational databases. I would say a triple store can hold structured data. Messages with defined formats can carry structured data. APIs too: the information that you pass across APIs in parameters, or get back in JSON or whatever, is structured data as well.

So for the next consideration I'm going to put in a little example, and because the Internet of Things has been a big theme here I thought I would bring in an Internet of Things example. It also gives me a chance to give a plug for the bIoTope project, which is about building an IoT open innovation ecosystem for connected smart objects in smart cities. It starts next year and is going to be an interesting project. The example is a very brief one which looks at the use of smart-city public data by a user-developed application; bIoTope will be working on much more significant examples, this is just a simple one.

So suppose you have a store that wants to analyse how the customer experience influences customers' purchasing habits. There are various factors in that, like the way the store is laid out and the behaviour of the salespeople, but another factor might be outside temperature. In the context of a music store, perhaps when it's cold outside people buy different kinds of music; you could draw that kind of conclusion. That example will probably have some kind of application data model. They're looking at a customer visit: date, time, what the customer's thoughts were, what promotions the customer took advantage of, what the store format was like, outside temperature, maybe other things. And as you can see, there's more in that data model, even though this is a very simplified example.

Now, the smart-city temperature information which they want is actually available. In Amsterdam, I believe, for example, they are issuing citizens with kits which will measure temperature, the concentration of various gases in the air, and so on. We assume this city has done so too, so the outside temperature is available as public information, and it is stored using a rather different data model. That data model captures the readings that have been taken: what type of reading it is (is it temperature, is it a gas concentration?), what its value is, and so on. And that data model is completely different from the data model of the application. The data model of the application talks about temperature; this one talks about a reading, and one of those reading types could be temperature.
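Purely to make the example concrete, here is a minimal sketch in Python of what the two data models might look like; the class and field names are my own illustrative assumptions, not taken from any actual store application or smart-city schema.

```python
from dataclasses import dataclass
from datetime import datetime

# The store's application data model: one record per customer visit.
# Outside temperature is just one attribute of the visit.
@dataclass
class CustomerVisit:
    visit_time: datetime
    customer_feedback: str        # the customer's thoughts about the visit
    promotions_used: list[str]    # promotions the customer took advantage of
    store_format: str             # how the store was laid out
    outside_temperature_c: float  # the value the application wants from city data

# The city's data model: one generic record per sensor reading.
# Temperature is not a field of its own, just one possible reading type.
@dataclass
class SensorReading:
    sensor_id: str
    reading_time: datetime
    reading_type: str             # e.g. "temperature", "co2_concentration"
    value: float
    unit: str
```

The application model has no notion of a generic "reading", and the city model has no field called temperature; that mismatch is exactly the problem discussed next.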
They use different terms and assume different information structures, but you can't tell the people who produced those models "you produced the wrong model" or "you should be producing the same model as the other guy", because each model is actually right for its application. The administration doesn't want a separate table for temperature, another for CO2 concentration and so on; it wants one table to keep all those things in, so that when another kind of measurement is added it's just a case of adding another code to the list, not of adding a new table. So you are integrating different data models, both of which are right.

This is also an example of what people call the long tail, by which they mean one of the tails of the probability distribution. The point being that some applications are heavily used (those are the ones in the middle, and they can justify major development effort), while others are not used very much at all and are only viable if they need minimal effort. This is that kind of application: the people producing it do not want to do a lot of work engaging in discussions on what information is available. If it's there they'll use it; if it isn't they'll do something else.

So what do we have already to help us address these problems? We have vocabularies, and we have grammatical structures to put them together. The example terms in that slide actually came from Dublin Core: the properties title and creator. Dublin Core is a major vocabulary that's available. The table headings in that one came from the Person part of the ISA Core Vocabularies, which we've talked about; that's available. We heard about INSPIRE; unfortunately I wasn't able to be in that session, but I believe that the INSPIRE project has a whole set of vocabularies that enable you to give spatial information. There are also schemes such as the United Nations Standard Products and Services Code, which is a very comprehensive code list by which you can identify different products and the different kinds of companies that manufacture those products. If you look at the medical field, there is a very extensive body of terminology covering all kinds of diseases and all kinds of treatments, and that's available. And there are vocabularies for public services: we've heard about DCAT and the DCAT application profile, in particular the Italian application profile, and we've just heard about the vocabularies used in Greece. So there is a whole set of vocabularies out there.

So how do you put the words in those vocabularies together? You need a basic grammar so that collections of terms can be interpreted. People are actually quite good at making sense of random inconsistencies, but machines especially need a structure, a grammar, for these things. Natural language is not what we should be using: it's good for expressing thoughts and emotions, it's not specialised for data description, and it's never used consistently. Relational data modelling is well established and widely used, and it is specifically designed for data description. There are international standards that describe it, particularly ISO/IEC 11179, so you could almost say, well, what are we arguing about? What could be wrong with that? It does need to be applied not only in a relational database environment but also in other environments, but that's possible using generalisations of concepts like object class and property.
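To make that idea concrete, here is a minimal sketch, again in Python and purely illustrative, of describing a data element as an object class paired with a property in the ISO/IEC 11179 style. The names used here ("Environment", "airTemperature", the field names) are my own assumptions, not terms taken from Dublin Core, the ISA Core Vocabularies, or any Open Group specification.

```python
from dataclasses import dataclass

# A data element concept in the ISO/IEC 11179 style: an object class paired
# with a property. Here both are plain strings; in practice they would be
# identifiers drawn from published vocabularies.
@dataclass(frozen=True)
class DataElementConcept:
    object_class: str  # the kind of thing being described
    prop: str          # the characteristic of it that is recorded

# The store application's field, described against the common classification.
store_field = {
    "name": "outside_temperature_c",
    "concept": DataElementConcept("Environment", "airTemperature"),
    "unit": "Cel",
}

# The city's generic reading type, described against the same classification.
city_reading_type = {
    "code": "temperature",
    "concept": DataElementConcept("Environment", "airTemperature"),
    "unit": "Cel",
}

# Because the two descriptions share a data element concept, software (or a
# person) can tell that the store's field and the city's readings carry the
# same information, even though the surrounding data models are different.
assert store_field["concept"] == city_reading_type["concept"]
```

The point of the sketch is only that, once both sides are classified against the same object class and property, they can be matched without either side having to adopt the other's data model.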
RDF, again, is something we have; it's an established W3C standard, and I tend to assume that RDF also includes RDF Schema and OWL. It has defined machine-interpretable representations, which is important for machine processing, and we heard, for example, that there is a body of sophisticated software based on these standards which could potentially be taken advantage of. But it takes a bit of thought to work out how you are going to apply it to describing data: it's easy to see how you apply it to describing things in the world, or resources on the web, but structured data such as table rows needs a little more thought.

So, data is the new oil; I think that is a sensible thing to say, but the full exploitation of its potential requires a data classification system. That system should enable the use of the existing vocabularies that have been developed, not just one of them but whichever ones are appropriate to your case; it should provide a basic grammar for data descriptions; it should be consistent with relational database usage, because that's how most of the data in the world is structured; it should be able to accommodate other data representation approaches; and it should use RDF to facilitate the process. So I will stop there, and I hope that you all actually got some of that even though I went through it at huge speed, and please ask some questions.

Thank you for the presentation. Actually, I was very much intrigued by the use of the term grammar, and I would like to understand a bit better, because I didn't really get what you would consider a basic grammar for data descriptions.

OK, so that's one of the things I glossed over. We have a grammar in natural languages which has nouns, verbs and adjectives, but for data descriptions these are not really appropriate. Object class and property are much more appropriate grammatical constructs if you are putting terms together to describe data. So instead of saying "this is a noun, this is a verb", you say "this is an object class and this is a property". That's the kind of thing you need to do.

I'm just interested to delve a little bit more into "data is the new oil". This is something we've heard as far back as Francis Maude saying it's the new raw material of the industrial revolution. I'm wondering if there's a contradiction in coming from The Open Group and describing data as a discrete natural resource that someone uses and then depletes.

One of my colleagues in The Open Group has a great saying, which is that a simile is like a leaky bucket: you can only take it so far. So yes, data is the new oil, but maybe a difference is that use does not deplete data. So maybe it's more like the new water, which is recirculated.

I'm just wanting to clarify in what sense you take it to be like the new oil.

In the sense that it can be used by companies and enterprises as a resource, as the basis of products and services, and in the sense that the use by enterprises of data as a resource will generate growth in the economy.

Just to follow up there, if that's all right: why is it then reasonable to think of data as something that has to provide growth? This image of oil burning in fields and humans making war over it is really strong; that's why I'm a bit uneasy about the term right now.

OK, so hopefully I will address that in my final slide, but yes, there are bad connotations about oil, and OK, I showed a picture of a gold mine, but that's not really all that much better.
There have been wars fought over that, and I'm afraid to say that there probably have been, or will be, wars fought over data. That's something The Open Group would certainly not approve of, but there we go. So what is The Open Group doing to address all this?

OK, so here's a slightly more ecologically friendly image, and the point of it is that we're developing the Open Data Element Framework. Frameworks can actually occur naturally, as structures that enable productivity, and what we are developing is an index, and a method for using it, for the classification of data elements. That will meet the requirements that I set out on the conclusions slide. It's based on the UDEF, the Universal Data Element Framework, which The Open Group published and maintained for some years, but there are some differences.

One difference is that whereas the UDEF was intended to have definitions for everything, so that you never needed anything else, we have come to the conclusion that that is not the sensible thing to do. So it has a concept of plug-ins, so that all those different vocabularies that exist, the ones I described, can be plugged into the framework and used. A lot of those vocabularies are specialist vocabularies; it takes an expert body of specialists who know their subject to develop them properly. The way we should be working is to encourage those people to develop those vocabularies and to find a way of using them within a common framework, and that's what we believe we are doing.

Another difference is that whereas object classes and properties are, you might say, the basic grammatical constructs that come out of ISO/IEC 11179 and also RDF, we also have a concept of role, which is kind of like an object class; you might consider it a kind of object class. Some things, such as person, are fundamental, whereas for other things, if you look at a lot of enterprise data models, you'll see a customer entity. Now, customer is not a fundamental object class: I belong to the customer entity of whatever companies I buy products from, but not of others. So customer is a role, not a fundamental object class in the sense that person is. That is in fact an important distinction when you come to think about interoperability between data developed in the contexts of different applications, because whereas many of the properties and the core object classes will translate between those applications, the roles typically won't. By understanding that distinction you get a better ability to see how those things can interoperate.

So the Open Data Element Framework we are developing will, I say, meet those requirements. It's currently in technical review. Those of you who have been involved in standards work know that reviews can be funny things: it may come out looking completely different from the way it looks now, or not come out at all, but if it comes out looking the way it does now it will meet those requirements and it will have those differences from the UDEF.

It looks like a very interesting piece of work. What do you foresee once it is published? Will it be going out to general community consultation? Will it be open for everyone to use?

It will be open for everyone to use, as I think pretty much all Open Group standards are. If you have comments on what I've said and you think something is fundamentally different or needs changing, then please tell me now, or please send me an email, because we won't have a public consultation period as such.
The Open Group's process doesn't have public consultations; it has member consultations, and it is now in Open Group member consultation. The rationale for that is that the standards we produce represent the consensus of our members, and that's what we're trying to establish.

Is membership open to everyone?

It is. It is not free, but it is open to any organisation, though not to individuals. So thank you very much for your attention.