Hello and welcome everyone. My name is Eric Fransen. We would like to thank you for joining us today for this webinar, a production of Dataversity with our speaker Dave McComb of Semantic Arts. Today Dave will be discussing Agile Enterprise Ontology. Just a few quick points to get us started here. Due to the large number of people that we expect in these sessions, you will be muted during the webinar. We will be collecting questions in the Q&A box in the bottom right-hand corner of your screen. You'll find that below the chat window. Please use the Q&A box for questions for the speaker as opposed to chat. As always, we will send a follow-up email to all of the registrants within two business days. That email will contain links to the slides, the recording of this session, and any additional information that may come up during the webinar. A few quick words about our speaker Dave McComb. Dave McComb has over 30 years of experience with enterprise-level systems and enterprise architecture. He has built enterprise ontologies for over a dozen major enterprises, and he has one of the best radio voices in the biz. Everybody, please help me welcome Dave McComb. Dave. Thanks, Eric. I don't know what I'm going to do with this radio voice thing, but I'll see what we can do. And I'm the presenter now? You are. We see your slides. Great. Thank you, everyone. I'm guessing, just by the number of people that signed up, that there must be two sorts of audiences in this talk. The first audience, I'm going to suggest, are not ontologists: the people that were somehow curious about this and want to know what it is. Is this really possible? And we are going to do some of that. Some of the early part of this presentation is just sort of what is an ontology, et cetera. But very briefly, because mostly this is going to be our experience report.
What we found in 10 years of doing this, I think, is mostly going to be of interest to people who already are ontologists and want to raise their game a bit. But if you're not an ontologist, stick with it. Some of this will make sense. And what we're really trying to do here is just bring up some insights: that if you do build an ontology, and if you build it in certain ways, it's going to be far easier to extend it in place after it's been in use. But at the same time, I think most uses of it want a model that's stable even while you're changing it. That seems to be kind of a paradox, but we're going to get into that. And one that's considerably easier for other people to understand, because what we've been finding is if people don't understand your model, it doesn't get implemented and used. So that's kind of the overall where we're going. As I suggested, we've been doing this for 15 years, almost exclusively for about 10 years, for a lot of large companies. We've built a lot of ontologies. And a lot of what I'm going to talk about today is just: what have we found in the process of doing that? But for those of you who aren't ontologists or haven't come from this background, I am going to spend, I guess, about five or six slides on just what is this and why do you want one. At some level, an enterprise ontology is sort of like an enterprise data model. As a first-order approximation, that's pretty much true. But there's several things that distinguish it, most of which come from the fact that this kind of model is built with semantic technology. The first thing, and it's not directly obvious from the technology, but it is possible with this technology to build a model that's comprehensive, meaning it includes either all of an enterprise or whatever large subset you're working with, but at the same time is much, much simpler than the data models we used to build when we tried to do this. Weirdly, and it sounds like a contradiction, it can be less abstract.
You'd think if you had a high-level, simple model of a business that it would, by nature, have to be abstract, but it actually can be less abstract and less ambiguous. We have grown up in this world where the prevailing wisdom is that you create conceptual models and then transform them into logical models and then transform them into physical models and then you build systems on them. In this kind of technology, the model you build at some level appears to be like a conceptual model, except you can just implement it directly. You can just start implementing it without going through all these transforms and mapping and all that kind of stuff. Furthermore, when I say you can evolve it in place, I mean we can change things without doing data conversions. Most changes in this kind of environment are additive. Obviously not all are, but there are even techniques for managing non-additive change in place. Finally, I think what is bringing a lot of people to this is you can take the structured information you already have in your existing databases, the unstructured data we find all over the place, and all the semi-structured stuff, whether it's big data, JSON, or open data, whatever, and bring them all together. That's kind of what an enterprise ontology is. I'm only going to spend just a few slides on how it does that, because I really do want to spend most of the time on what we've learned, but for those of you new to this, I want to give you at least a sense for what makes this possible. It's probably one of the most magic things, and it took me quite a while of working with this for it to dawn on me how key and how pivotal this is: this thing called a URI, the Uniform Resource Identifier. At first approximation, it looks a lot like a URL, and for all purposes other than thinking that you might have to go resolve it somewhere, it sort of works like a URL.
But what's kind of magic about it is it gives you an identifier that is globally unique, because, as you can see in this example here, we're hitching a ride on the domain name system. In this case, it's a fictitious one for the Q-SIF organization, assigning numbers, in this case to companies. But anybody who uses that number is guaranteed that it refers to the same thing no matter what database it's in, what table or column, which is not the way database identifiers work in a traditional system, where you have to know the column and the database and the table and all that kind of stuff. So it's a huge advantage just from that real simple thing. And unlike more traditional GUIDs, which are also globally unique, these are resolvable: if you see a URI like this, you have a hope of figuring out what it means. That's the first area of distinction. The next one is that everything in semantics is expressed in what we call a triple. Triples, as you might guess, have three parts. The thing on the left is called a subject. The little arc in the middle, which is directional, is called a predicate. The thing on the right is called an object. It makes a little tiny sentence. We call these assertions. And all information can be reduced to these kinds of assertions. Not only raw data, like Dave owns a pickup truck or whatever it is, but also metadata. Everything comes down to this one format. Everything in this is a URI, so it identifies me, it identifies what it means to own something, it identifies this other object. There's one other variation of this, where only the thing on the right can be a literal, but everything else is URI, URI, URI, or URI, URI, literal. That's it. That's all the structure you need to know, because that's all the structure there is.
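As a rough sketch of that shape, here are a couple of triples as plain Python tuples standing in for RDF statements. The URIs and the `owns` and `modelYear` properties are invented for illustration, not taken from the talk:

```python
# A triple is just (subject, predicate, object).
# Subjects and predicates are always URIs; the object is either
# another URI or, in the one allowed variation, a literal.
triples = [
    ("http://example.com/id/person/dave",       # subject: identifies Dave
     "http://example.com/ont/owns",             # predicate: the ownership property
     "http://example.com/id/vehicle/truck42"),  # object: a URI
    ("http://example.com/id/vehicle/truck42",
     "http://example.com/ont/modelYear",
     2012),                                     # object: a literal
]

# Every assertion, data or metadata, fits this one three-part format.
assert all(len(t) == 3 for t in triples)
```

In a real system these would be RDF triples in a triple store, but the point is the same: there is only one structure.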
Now, that gives you an interesting side product: it enables you not only to build graphs, and people are starting now to talk a lot about graph databases, but it allows for the self-assembly of a graph. So, for instance, we find a bunch of triples. In this case, that financial instrument was identified by this corporation. When the system detects that two URIs are the same, it can just snap them together. This is doing the equivalent of what a human would do in a relational database by writing a join. But in this case, you don't need to have any knowledge of the table structure, the table names, the column names, this key equals that key, any of that kind of stuff, because it's all done by the system for you, and it constructs huge amounts of data into these dynamic graphs. That's pretty much the third thing that's interestingly different. The fourth thing is we have classes and properties and things like that, but unlike in a traditional environment, where we subjectively design them and narratively describe them, as much as possible here we try to create formal definitions for classes. A formal definition is not one with a tuxedo on or anything like that. A formal definition is one where a machine can process it and make some sense out of it and do something useful with it, but so can a human. We'll have one or two small examples of this further downstream, but for those of you who are curious, it does allow for a whole new level of rigor in design. And then finally, because everything's a triple here, not just the low-level assertions but things like membership in classes, we can introduce new classes for instances that already exist. In this case, I had this imaginary family tree, and we inferred that one of these people was in the class of all people. So sometimes people talk about schema on read and schema on write, and this is essentially schema later and class membership later.
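A minimal sketch of that self-assembly idea, again in plain Python. The `ex:` names and the `issuedBy` property are made up for illustration; a triple store does this matching at scale:

```python
# Triples from two independent sources "snap together" wherever they
# share a URI -- no foreign keys or knowledge of table structure needed.
source_a = [("ex:instrument7", "ex:issuedBy", "ex:acmeCorp")]
source_b = [("ex:acmeCorp", "ex:hasName", "Acme Corporation")]

graph = source_a + source_b  # merging two sources is just concatenation

def describe(node, triples):
    """Follow every predicate out of `node` -- an automatic join."""
    return {(p, o) for s, p, o in triples if s == node}

# Hop from the instrument to its issuer, then to the issuer's name,
# without knowing anything about how either source was structured:
issuer = next(o for s, p, o in graph
              if s == "ex:instrument7" and p == "ex:issuedBy")
print(describe(issuer, graph))  # {('ex:hasName', 'Acme Corporation')}
```

The "join" falls out of URI identity alone, which is the contrast with the hand-written relational join the talk describes.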
And later, if you had this same data and wrote some inference and detected that not only is that person John a person, he's also a parent, we could further deduce that he's a grandparent, et cetera, et cetera. This is what I mean by we can not only evolve the data in place: this one instance is simultaneously a member of many, many classes, which is pretty much the norm in semantic technology and pretty much anything but the norm in any other kind of technology. So those are the five most key distinctions that we've got going into this. What does that buy for you, and why would you want to have one of these things? I think one of the main things it provides is a common denominator. At one level, a common denominator just between the many, many applications and databases you've already implemented; they're all different. You know, we've worked with a lot of large organizations. But beyond that, it provides a way to bring in semi-structured data from the web or from XML or anything like that: through scripting and programming, you can go into those things and find and extract not only triples but triples that align with or correspond to the triples you harvested out of your relational database. Furthermore, pretty much the entire social media world now is a graph. People talk about the social graph and the knowledge graph and all those kinds of things. They are really graphs that are harvestable. You can turn them into triples and graphs. And there are a lot of techniques for taking completely unstructured text, finding the subjects, the predicates, and the objects in the text, and constructing triples out of that. And then putting those together, federating them, making sense out of this vast amount of difference that we currently deal with. So there's a lot of use cases that spin off from this: systems integration, but also systems building, and many other things like that. So that's kind of the motivation.
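To make that inference step concrete, here is a toy sketch in plain Python. John, Mary, and Sue are invented names, and a real system would state these rules as OWL axioms and let a reasoner compute membership rather than using set comprehensions:

```python
# Raw data: hasChild facts, stated as simple (parent, child) pairs.
has_child = {("John", "Mary"), ("Mary", "Sue")}

# "Schema later": derive class membership from data already in place.
# A parent is anyone with a child; a grandparent is anyone whose
# child also has a child.
parents = {p for p, c in has_child}
grandparents = {g for g, p in has_child
                  for p2, c in has_child if p == p2}

# The same instance, John, is simultaneously a member of several
# classes, none of which had to exist when the data was written.
assert "John" in parents
assert "John" in grandparents
assert "Mary" not in grandparents
```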
When we say these things can evolve in place, I think Tinkertoys is a great analogy. You build some stuff, and there are still more places you can add things in and keep adding. You don't have to stop and restructure and refactor all the time. So that's kind of the preamble. I'm going to say a little bit about what we were doing initially, and a lot of this talk is about what we learned in the process. We used to build enterprise ontologies, and it typically would take four to six months, and we might come up with somewhere on the order of 800 concepts. And when I say it's considerably simpler than what we were dealing with, it is. I mean, typically this is a map of a set of systems that collectively have tens of thousands or usually hundreds of thousands of attributes, and it's covering most of those. Obviously, all applications sometimes have some really peculiar local things that don't need to be shared, but for the most part we were shooting for that level of sharing. And we were quite proud that you could do that in that amount of time and get down to that simple a number of concepts. But what we discovered was, even at that level, you know, from a consulting organization it's hard to sell, but probably more importantly, even as simple as that was, it was still hard for other people to understand. There's 800 relatively complex concepts, and it wasn't getting nearly the amount of uptake and implementation that we were hoping for. So collectively we came to this conclusion: if these things are going to be adopted in the future, they're going to have to be far more agile than they currently are. So, armed with that little bit of information, and in a parallel track, we've been working on something we call GIST. And I'm not going to try and sell you GIST, because it's free, but I am going to tell you a few things about it. GIST is what we call a minimalist upper-level ontology.
What that means is a small number of concepts that you can use as starting points, concepts that have been pretty well vetted and thought through over a long period of time. If you're interested, you can go to that website and download GIST. It's free; it's Creative Commons licensing. You just have to have some attribution there. But the reason I want to bring it up: when we first started using GIST in our work, we'd go design something based on GIST. And this is a view in Protégé; this is part of its editor pane. If you haven't seen it, this is just the top of a tree, a bunch of classes. And if you look at this on the left, you can see that about half of the top-level classes have the word GIST in front of them, which meant that we used them directly from our upper ontology, and half of them were from a consumer products materials management ontology. We thought that was pretty good. We at least got some bonus of reusability out of GIST at that level. But as we did more and more projects and we kept working on our methodology, and we're going to talk a fair bit about the methodology in this talk, evolving how we do this, we were also in parallel evolving GIST. And on one of our more recent projects, which was also materials management, although not consumer goods so much, but a similar kind of scope, we didn't even notice it until I opened it up in Protégé and had a look at it: at the top level, you can't even tell who the client is or that it's a product ontology. You have to get down a little bit deeper into it. And we see we have product references and accessories and auxiliary and all the kinds of stuff you would expect to have, and a glimpse here as to what formal definitions look like.
I'm not going to get into detail about how that works, but it was kind of interesting that over the course of four or five years, both the top-level ontology and the methodology evolved to a point where we could pretty confidently say that most things in most domains we've been in now are derivable from these shared concepts. So that's going along. We noticed that our clients didn't want to spend four to six months; they wanted to get something their staff could understand, wanted something that was going to move them toward integration more rapidly, et cetera. And in a word, what they want is agile. However, there's a real problem with agile at some level. Agile is attempting to solve an individual problem and will iterate its way toward solving that problem. And in some ways, that's exactly how we got in the dilemma we're in now. We didn't do it as well as if we had done it all agile, but large clients have typically implemented thousands of individual applications. These are large multi-user applications. So every project they go after is solving another narrower and narrower problem, and it's not cohering to any whole or creating integration as it goes. It's almost the opposite that's going on. So we have to be aware that agile can lead you down an incorrect path if you're not careful. But we figured out what people really want. When they say agile, they would like to get to the future one incremental step at a time. In other words, people are tired of the Big Bang, give-me-the-future-in-one-giant-swoop projects that don't work. So instead we have to figure out how we can get stepwise from where we are to a better integrated future. And that's what we're going to concentrate on.
We started proposing this idea that it should be possible to build an enterprise ontology in something like seven weeks, then extend it in some very narrow, specific area, and then do something with it in another chunk of time, depending on what the "do something" is. And if you do that in an environment where everything you do is not only extending the enterprise ontology but also the architecture that you use to implement systems, then over time you'll just get better and better at incorporating some of these new ideas and gradually integrating in existing systems, building new systems that are pre-integrated, et cetera, et cetera. So that's kind of the vision. That's what we've been doing on our last several projects. But it's not quite as simple as just saying so. There are some tricks to building something rapidly that's fairly stable, because if you just arbitrarily built yet another application ontology, I think you would find yourself refactoring and redoing the work that had gone before a fair bit of the time. So let's talk about what we learned, what we think works and doesn't work. One of the more basic things is kind of a methodological issue: how do you go about finding out what to put in the enterprise ontology? And I'd say there are four main ways. We've done some work with all four. I have my opinions about what works the best and what doesn't, but they're all tempting at some level, I guess I should say. One is to go into the databases you have and harvest all that information. One is to work from standards, one is brainstorming essentially, and the other one we call postulate and test. We're going to go through each of these one at a time. Probably the most tempting is: we know we've got to live with all these thousands of legacy systems we have, so why not just dive into them, turn them into ontologies, and reduce that down as if you were reducing down a sauce or something.
And it's tempting, it's technically possible, and you can make some headway with this. We used to think that the large firms had hundreds and hundreds of systems and hundreds of thousands of attributes, but we've been doing some work with another firm, a company called Global IDs, who have software where you can go out and mine all this and find out what's really out there. And it's scary. At a lot of companies, it's not unusual to have millions of different attributes across all these different systems. And you can boil them up, but our observation is that in the act of doing that, you still end up with a lot of confusion. People implemented a lot of things, and you can't tell from what's left behind what is similar and what is different. What was this supposed to mean? And it's an easy temptation to kind of go into agreement with these thousands and millions of decisions that were made previously, and we're not finding it terribly productive. We are going to have to come back to this stuff, because it exists, but we'll talk in a minute about how we go about that. The next thought that comes up, and it's also a good one and it's tempting, is to say: we work in an industry, we have standards, and we should base our ontology on the industry standards. And certainly when you're communicating with other players in your industry, having these standards is important; you have to do this. But we haven't found it to be a very good place to start your ontology work. In the first place, most standards only cover a small percentage of what you need to run your business. There's a whole bunch of internal stuff that just has to happen that standards bodies are not very interested in. Most companies exist in many industries, and therefore many standards apply to them, and there's a whole lot of reasons that it's not a great place to start. It's something to come back to.
The other two things that we've noticed in attempting to do this: most of these standards organizations don't have very rigorous definitions of what things are, and often, not always but often, they're also quite complex, which makes them harder to work with. For quite some time our preferred method, I guess you'd call it facilitated discovery, down in the lower left-hand corner there, was a matter of just getting a bunch of subject matter experts in a room and, in a facilitated fashion, causing them to uncover some of the important pieces of what they know and how it relates and what it means and things like that. By guiding those conversations, and through their knowledge, it's possible to really distill down the essential and the true, if you will. I still think this is a good method. We use it a lot, but we've been moving more toward something I call postulate and test, and these slides aren't terribly convincing about what the overall approach is, but here's the idea. What happens with facilitated discovery, and we've been participating in some consortia where we're doing facilitated discovery across participants from many different companies, is that people's expertise very much has come from their use of particular systems, and systems and applications sort of give people a vocabulary and a way of thinking of things, and it causes additional duplication and things not being in sync. So what we're now saying is, in any given domain, we need to learn enough that we can postulate the core things that just must be true. If it's this kind of a domain or this kind of industry, then certain things just have to be true. So do a tiny amount of design around that, and then go out and get some of the data, not trying to harvest everything that's there, but looking at, well, what is there? Conform it to the hypothesis, and then test that and say, was our hypothesis correct?
Is there significant other information that's important, or distinctions that aren't being made or structured, or anything like that? Not only has this, the last few times we've done it, appeared to be a much more rapid way to do things, it converges on smaller and simpler models that are still expressive and complete. So if I were to summarize those four methodologies and say what's similar or different: the timeframe on harvesting from your existing data is years if you do it by hand or do it analytically, months if you have good software and tooling and things like that. The standards-based thing is really a matter of just how long it takes you to thoroughly understand what the standard is talking about, and frankly, that's typically months. You may have people that already believe they understand it, but surprisingly, trying to convert that to something that's really formal kind of exposes what is and isn't understood. These facilitated sessions, you know, we've done them; they take months typically. And what we're now getting into is things that literally take weeks to do. The other things that are different are just the number of concepts you end up with and what you have to deal with. We're really looking at trying to get down to a few hundred of the most important things. Scope mostly depends on the method. Interestingly, in the harvest scenario, you're limited to what's already been implemented, because you can't really harvest from things you're going to do in the future. Whereas, you know, in a facilitated session, you can ask people about what they aspire to and where they're going and what's important and all that kind of stuff, so the scopes can be quite different.
But given all that, what we now believe is a preferred sequence: do this postulate and test methodology first, and then, for more detail, or for an individual application, or for things that didn't fall out from the postulation process, do some facilitated discovery. And then, very often, as in the last couple of projects, there have been standards that were worth aligning to. And only then, when you really have your own house, or your centralized house, in order, is it time to dive back in and see, of all the stuff we've implemented over the last several decades, how does it align with or compare with what we've invented or designed here. So, some of the other lessons. That was kind of a methodology lesson; more of a principle, I guess, that we discovered after building these things and trying them out on people, is this idea of reducing people's cognitive load. The cognitive load is how many things you have to learn, what you have to have in your head in order to be able to use this model. It's the same as when you come to work in a new industry: how many things you have to learn to be competent, and what do these terms mean, and what are all the little idioms and things like that. Related to that, when you build an ontology, when someone goes to use an ontology, either to build a system or do integration or whatever use case they're going to do, we have this concept that we call commitment. In other words, when you include all the axioms from an ontology, you are committing; you're saying that our use of those terms agrees with all those axioms. It's a higher commitment than merely saying it's structurally similar or it sort of looks the same. The fewer things there are, the easier and the more confident somebody can be when they say, yeah, I'm understanding, I'm in agreement, I can commit. The fewer the number of concepts you have to share between any two applications, the less cognitive load there is in doing that actual sharing.
So, that leads to a corollary here we call economizing on properties. So, property: when we had the triples before, we said there's a subject, a predicate, and an object. The predicate is the implementation of, or one instantiation of, a property that was defined somewhere else. So, when I had that little trivial example, it said Dave owns a truck. Somewhere we had invented, or someone else invented, and we committed to, the idea of ownership. And when we say Dave owns the truck, we're agreeing in the same way as whoever coined that term. Now, in a large enterprise, if you go look, you will find billions of properties, because every column on every table, or every element or attribute in every XML document, or whatever you've got that's different, is another property. And it sort of means something, but it really only means something to the designers and users of that application; it's not well coordinated or organized beyond the bounds of that. And to ask somebody to learn millions of properties in order to use something is a bit much. And most of those distinctions are arbitrary or local or convenience or whatever. So, we've discovered that because, for the most part, you can't define properties in terms of other properties, there's a few exceptions to that, but at this first level let's just say that when somebody posits the property of ownership, we have to learn it before we can use it, properties are probably the most important thing to economize on. We want to have ontologies that have very few properties, and when I say very few, it's typically hundreds, but it's not tens of thousands or millions, so it's a huge difference there. One of the other things: when people come to this technology, either from object orientation or somewhere else, there's a strong desire to make lots of subclasses of things. You know, the class of person, and patient will be a subclass of person, that kind of thing.
One of our principles here is to use that concept sparingly. I'll talk about that a little bit in a few minutes. The desire to use subclasses a lot probably came either from an object-oriented background, which is where I got it from, or, for many people, from taxonomy design, where you build these deep hierarchies of concepts. Either way, there's a strong temptation to want to assert subclasses. We're going to suggest that these things that seem subtly different actually make a big difference. Imagine you decided to make up this concept in the lower left-hand corner here: that organizations and people are subtypes of party, like party to a contract. Sounds completely reasonable. In that world, in the lower left-hand side, you have to learn three concepts. You have to know what an organization is. You have to know what a person is. Weirdly, you also have to know what a party is, because it's possible in this scheme to be a party and not be either of those two things. Just because organizations are parties and people are parties still doesn't mean that you know what a party is and what the kinds of parties there could be are. Whereas if you really lean into the semantic-style design, you might define party as being either an organization or a person, and that's it. It can't be anything else. In that case, you only have to learn two things, because once you know what an organization is and once you know what a person is, you know what a party is. There's no further ambiguity. There's no other kind of party that could be something else that hasn't been identified. It may not seem like a big deal, but we've been noticing a lot that it does reduce the number of things you have to learn. Related to that is how to go about creating formal definitions.
We have this concept we call the distinctionary as kind of a verbal way of starting the process of creating a good definition. We say that a distinctionary is a glossary, because it's a collection of terms and definitions; if you will, that's what a glossary is. But it's distinct from other glossaries in that every time you create a distinctionary definition, first you say what it is, what it is most like, and then how it is different from its other siblings. That's the old Aristotle genus and species and differentiae and all that kind of stuff. How that actually plays out is in the difference we draw between what is a definition and what is a description. I should have put the left and right the other way around, but when you create a description of something, what you're really doing, if you think about it, is saying what properties instances of this class have, like in object orientation, or what columns something in a table has. But that assumes you already know what the type is. So once you know what the type is, here's what else you know about it. And anytime you add descriptive classes, you've added cognitive load, because somebody has to know how an individual item becomes a member of that class and therefore acquires these properties, if you will. Whereas a definition, different from a description, is also supplying the criteria for membership in a class. So if you construct these things correctly, and if your audience knows what the criteria mean, then they pretty much have to agree with the definition. Now, they may not like the label you put on it; those are the arguments you get into all the time, but that's a second argument. The real issue is, if you agree with the criteria, you agree that there is such a definition.
So again with the subclass difference: if we define a patient to be a subclass of person, we've said a patient is someone who's necessarily received medical treatment. Sounds reasonable; at least it's a start. Versus a different, definitional way of doing this: a patient is a person who has received medical treatment. That is the definition of what it means to be a patient. Now let's take a look at how those differences play out. On the left, if you already know that somebody's a patient, that's good: then you know they're a person and you know they've received some treatment, though you don't know what treatment they've received. That's the way ontologies work. But it doesn't say anything about how you found out that somebody was a patient; the system can't help you. Somebody outside the system had to have assigned someone to that class. Whereas over on this side, no matter how you figure out somebody's a patient, whether someone on the outside has assigned it or you've inferred it, you know all the same stuff as on the left. But additionally, if you merely knew that someone was a person and that they'd received treatment, then you can conclude that they are a patient. And the system is going to infer the subclass here: from that definition, the system knows that a patient is a subclass of person, and it doesn't rely on you or the designer asserting that. So that's kind of our distinction between descriptions and definitions.
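A toy sketch of that contrast in plain Python. The names are invented, and a real system would state the definition as an OWL class axiom and let a reasoner compute membership, but the asymmetry comes through even with sets:

```python
# Description style: membership is asserted from outside the system.
asserted_patients = {"alice"}  # someone just told us; we can't check why

# Definition style: "a patient is a person who has received medical
# treatment" -- the criteria themselves determine membership.
persons = {"alice", "bob", "carol"}
received_treatment = {"alice", "dave"}  # dave isn't known to be a person

patients = persons & received_treatment  # computed, not hand-assigned

assert patients == {"alice"}
# And the subclass relationship falls out for free:
# every patient is, by construction, a person.
assert patients <= persons
```

With the description, nothing prevents the asserted set from drifting out of sync with the facts; with the definition, membership and the subclass axiom are both derived from the data.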
One other thing we've found quite helpful is what we call the sketch. When you design ontologies they do sometimes get complex, and we've found that stepping back every once in a while and creating a picture of the ontology that will fit on what we call a placemat, an 11-by-17 piece of paper, improves not only your ability to explain it to somebody else but your own understanding, and it helps the building process. So when we say a sketch: a sketch is not an ontology. It's not formal; it's just boxes and arrows and labels and things hanging around. But it allows you to sit down, have a conversation, and say: what do these things mean, what are the key things here? Yes, there will be many more details, fine-grained distinctions that we need, but at some level there's something like this that holds pretty much every organization together, or at least every one we've seen so far. Now, if we're going to have an agile system, and in particular one where we can change the definitions of things and change the structure of systems already in place, we'd better do the same thing the agile developers did: we'd better have a way to detect errors rapidly, because anything that's easy to change is easier to break, so you should have some mechanism there. There are a couple of techniques we use that I think are worth reporting on; there are many, many more that could be done, but a couple I think are pretty important we call high-level disjoints and A-box unit tests. I don't know, not a great name; we'll think of something. So when you build an ontology, and I said it was a formal model that you can load, oh, by the way, I didn't mention this, but all those standards I mentioned, the triples and everything else, are open W3C standards. They've been ratified for years, and there are tons of tools out there. I think it's great: we were building a system for a client, and all the data and all the queries and everything were being built in one
environment; we just picked them up, dropped them into the other environment, and continued right on. It was just like you'd hope it would be. Anyway, one of the many tools is called a tableau reasoner. A tableau reasoner will take a look at all the assertions you've made, the formal definitions of your classes as well as the individual instances you have access to, and will grind through them looking for logical inconsistencies. These are not the cardinality violations and things like that a traditional system will look for, which is good too; this is looking for logic errors. And it's amazing how often just putting a property in backwards will show up as a logic error it detects. But we've noticed that, if you drill down, for a reasoner to find a logic error there has to be some disjointness somewhere. I'll show you what disjointness is here in a minute. If you remember Venn diagrams, and that's a good way to think of classes in an ontology: some of those triangles are members of sets, some are members of more than one set, the overlapping ones as well as the larger, higher-level one. Now let's say we had another set of sets. Being disjoint, in semantics, means that two sets can have no members in common. Because the two highest-level classes are disjoint, they can't have any members in common, which means any subclasses of them also cannot have members in common. And it turns out that the more disjoints you have in your system, the more likely you are to find logical errors. So for instance, at a very high level you should say things like: no person is also a geographic region, and no geographic region is also a person. If you say at the high level that they're disjoint, then anything that's a subclass of either is also going to be disjoint, as is any individual, and you can know that Chevy Chase the comedian is not the
same as Chevy Chase, the town in Maryland, which otherwise you might get confused about; they would have to have two different URIs, if you've followed along from the beginning of the talk. Another thing that's great for catching errors, and I'm going to introduce two or three other terms here very rapidly: the A-box is a term in semantics that refers to the assertional box, and it means, logically, the low-level triples we make when making assertions about individuals, so "Dave owns the truck" would be an A-box-level assertion, whereas "a patient is a person" is a T-box, or terminological, assertion. Now, they're all done with triples, and they can all be mixed and matched in the same ontology, but every once in a while you have to keep them straight. It turns out that at the A-box level we can write unit-test queries. As an example, the query language for ontologies is called SPARQL, another standard, and it works by matching these little triple patterns: the time card should be typed as a time card, it should have an actual start time and an actual end time, and the end should not be before the start; that's what matters. Anyway, you get the idea: you can write these, and a query like this will just run, give you a true or false answer, and find whether you've introduced logical errors into your system. Another thing we've discovered, and this one has been hugely useful, we call practical modeling. There's a huge temptation to put a lot of detail into your ontology that isn't doing anything other than making taxonomic distinctions, slight subtyping distinctions. You can set those aside, put them in a different area, if you will, under a different governance process, because they don't affect the overall structure of the model. We're looking for the things that form the shape, if you will, the dark area in this picture, and then allow the taxonomies to form the fine-grained distinctions. Committees can get together and decide how many genders do
we want to support, things like that, that don't really affect the overall shape of the model; we want to move those out. What we've noticed is there's a huge temptation for ontologists to put everything they know into the model, but that's what makes an ontology big and unwieldy and hard to understand. A lot of the distinctions should be pushed out to the side, where normal folks can get together and not have to debate the merits of the ontology, and everything they do doesn't destabilize that core. There's a thought process about what should go in the taxonomy versus the ontology: if there are likely to be instances of something, like instances of patient, then patient shouldn't be in the taxonomy; but if things are really just kind of tags, and most of them don't have other meaningful properties, then they're good candidates for the taxonomy. So when you go down this path and look at other ontologies, I want to bring up a couple of examples: we would suggest being more like the one on the left and less like the one on the right. The one on the left is an ontology called GeoNames. There are triples out there representing some 10 million different geographic facts and names: the population of Paris, France, its elevation, how many square meters a particular lake is, all kinds of stuff, tons of it. And yet there are only 19 classes in the ontology and 33 properties; they've done a great job of economizing on properties, like we suggested. And they have a little taxonomy of what they call feature codes, so they don't bother trying to distinguish in their main ontology among a river, a stream, a creek, an estuary, and all those kinds of things; they just have feature codes people can tag them with. Whereas SNOMED, which is in the medical domain, has taken the opposite tack. Yes, medicine is complex, but so is geography, and SNOMED ends up with 303,000 classes. They did do a good job of
economizing on properties, which is kind of incredible, because any traditional design of that size would have millions of properties, and they have 152; that's great. But it must be that of those 300,000 classes, almost all are making taxonomic distinctions; there aren't that many distinct things going on in medicine. I want to talk a little bit about modularity. I'm going to move kind of quickly, because I do want to leave some time for questions, but you'll have these materials shortly. Modularity is a two-edged sword: it allows groups to work independently, and it allows groups to screw things up. Just a handful of things to think about if you're going to divide things into modules. If you inherit somebody else's modules, don't introduce disjoints between the things you inherited; you're essentially changing the definition of something you inherited. Pretty much all of these guidelines amount to: don't change the definition of something you inherited. But it is okay to derive definitions from things you inherited; in fact, use those classes, combine them with properties, and make new distinctions. Also, when you get into this whole modularity and namespace thing, and this comes up a lot, many of the ontologies you're going to use have their own namespaces. You know, FOAF, friend of a friend, has a namespace; if you go get data from Wikipedia, it's called DBpedia, and it has one too. So you've already got to get used to using other people's namespaces. When you do this, though, there's a temptation, every time you make a module, a little bag of ontology facts, to give it a new namespace. That kind of overdoes it. It means that, excuse me, once you've done that, it's very hard to move things from one ontology to another if that means renaming them, because anything that referred to them in the past now has to be updated. But there's a plus: people can work independently. What we recommend is to give some thought to how you're going to govern this thing in the future: how many governance committees, and who gets
to make up new names; that's going to tell you what the breadth of your namespacing should be. It's not about the size of your ontologies; the namespace should be the broadest scope that any naming authority can get their heads around. So for instance, if you had a governance group for CRM stuff and they made up the concept of an account, an account being somebody you're going after, and you only had one namespace, and the finance group said, oh yeah, we know what an account is, it's what you charge your expenses to, that creates a problem. The answer, if those two aren't in one group and can't agree on things, is that they should have their own namespaces: the CRM people's accounts and the financial accounts are now different things. However, on the flip side, don't overdo it. You know, if you had one CRM group, and in every ontology they made they coined new things for prospects and leads and customers and clients and all that, that hides from you the fact that many of those things are the same, or shared, or overlapping. If you can have one governance group over all that, you're probably better off having one namespace over all that. And finally, OWL, the modeling language I mentioned, is all about meaning: how to describe classes, membership in classes, and inference. People want it to be about structure as well, but it's not about structure. There's a new standard just coming along right now called RDF Shapes, and it is about structure. So if you find yourself living in an ontology trying to answer the question of which properties quote-unquote go on this class, then you've collapsed these two ideas together. Really, the ontology and the modeling in OWL should be about coining new terms, saying what they mean, and rules for inference; and it's something like RDF Shapes that should be stating cardinality constraints and which properties really go on a class. There's a little example on the slide; the standards are out there, and we're pretty excited about it. I think it's really going to help that
separation-of-concerns idea. So finally: we think it's necessary to unwind all the complexity that's been created, and luckily it's now possible to create these enterprise ontologies in a fairly short amount of time, and there's a whole bunch of techniques, which we went over; you can go over them again when you get the slide deck. I think I'll turn it over to Eric and answer questions. I've got a little contact information up here, some places you can go for more material. Eric, do you want to pull up the questions? We've only got a few minutes left; we'll do a few. Sounds good, we will do a few. Let's go ahead and dive right in. The first one is: how does an enterprise ontology differ from master data for an enterprise? Is it basically the master data for an organization? Yeah, good question. We have helped people build ontologies for master data. Typically, when people build master data systems with current technology, they have a traditional table-style design for each of the master entities, you know, customers, vendors, and things like that, and then decide what properties they want on them. I actually think you'd get a better, more flexible answer if you did the master data management with an ontology. So they're related, but they're not exactly the same. Also, for one client who was building about two dozen master data management designs simultaneously, because these models are computer-readable, we read in all their PowerDesigner files, turned them into an ontology, rationalized them, and turned them around again, and just helped them integrate those a little bit. Want to do another question? Sure thing: how are predicates found in unstructured data?
Oh, that's a good one. The questioner probably guessed that the subjects and objects are relatively easy to find: for subjects and objects you can use a technique called named entity recognition, which is pretty well established, where you parse through unstructured text, find things that look like names or places or dates or whatever, and map them back to something you already know about. Relationship extraction is harder. It typically takes more powerful tools and learning in a particular domain, but it's the same idea. Like when we were working with Lexis: the software would read until it found something that looked like a lawyer, then go check that it really was a lawyer, and then look in the near vicinity of that for recognized sentence fragments that say this lawyer represented something, and so on. Then you take those things and map them back, because every document will use a different verb or a different phrase, if you will, but eventually you want to figure out that this lawyer was representing this client, or was affiliated with this law firm, and so on. So it's a matter of a lot of machine learning and training and things like that, but that's essentially how you do it. The more regular your corpus of information, and legal documents have some regularity to them, the easier that is. But that's the basic concept. And the next question: what is your definition of an agile system?
An agile system. Up here on the slide we've got this thing we call the data-centric manifesto, trying to help people get to where they can make changes incrementally to systems in place. One of the telltale signs that that's not possible: if you have an existing system that doesn't do what you want it to, there's some feature that's missing, the 95-percent knee-jerk reaction right now is to go out and write some requirements, build or buy another system, and do a data conversion from the data you had to the data you want. That's my definition of not agile. Agile is when you find you need a new feature in a system, and you can add that feature to that system in a credible amount of time without doing a data conversion. Okay, let's try to squeeze a couple more in here: can two disjoint classes be used to predict one another? Oh yeah, that's a great question. So back in that example where I had an organization and a person: if they were disjoint, and in that slide I didn't say they were, but in gist we usually do say that no people are organizations and vice versa. I then said we had this concept of a party, and we harvested data out of our existing system and found customers, and we said, oh, that was a party to the sales transaction. But frankly, because of the nature of our business, we forgot to ask that party whether they were a person or an organization. So now you've got a party, and you know it's an organization or a person, but you don't know which one it is. As soon as you find one fact that would kick them out of either of those classes, like if you find something where you now know with certainty they're not an organization, it's immediately inferred that they are a person. So yes, you can use disjoints to help you with inferences; it's a very insightful question. Cool. Unfortunately, we do have a few more questions, but we're just not going to have time to get to them. Thanks so much, Dave, for a great presentation today. I do want to let all of you on the air know about an
opportunity to meet Dave face to face later this summer at the 2015 Smart Data conference, taking place in San Jose, California, August 18 through 20. Dave will be presenting on building an enterprise ontology in less than 90 days, so a little bit of an extension of what he was presenting here today. Also, to remind everyone, we will be posting the recorded webinar and the slides from today to dataversity.net within two business days, and I will send out a follow-up email to all registrants to let you know how to access that material. Thank you again for attending today's webinar. I hope you have a great day, and thanks again, Dave McComb. Great, thanks for all your attention; appreciate it. Take care.