Today I want to talk about search, the importance of search, and I dare say that search is not everything, but without search everything is nothing. I also want to talk about integrating views rather than integrating things. This is something Ben and I were discussing yesterday: we have a wonderful structure on the wiki, and now we rack our brains about how to access it. What kind of dashboard do we set up, what kind of entry page, what is the main page? With search, all of that becomes superfluous, because there is one single interface that 99% of people use, and that's Google, for everything. We book flights through the Google search box nowadays.

So let's look at the business case. We want to make hotline history easily available to hotline agents. That is a concrete business case from one of my customers, and we'll see what the challenges are. The use case: a customer calls the hotline or contacts it by email, explains a problem, and quotes an error code, just to keep things simple for now. The hotline agent answers the question, and in the end we also want to expose a search interface to our knowledge base so that customers can search for themselves.

Now let's look at the business domain ontology. I mentioned Mermaid yesterday; this is the Mermaid online editor. We won't go into it here, but you can really alter what you see on the right just by playing around with what you have on the left. I highly recommend using it twice a day, for whatever you want to do in life. Good. Very briefly: we have topic types, as we saw yesterday. Let's say our SMW manages FAQs and troubleshooting articles; that is one resource silo. Then we have a ticketing system that manages tickets, OTRS for example. And we have a code repository, GitHub or something like it for a company, where the error code information is stored.
With my concrete customer, the solution to error code handling is actually not contained in documentation; it lives in the code, as comments. So the solution to a problem is tied together from several topic types: you need the ticket and the error code to figure out what the problem is, and you need the FAQ and the troubleshooting article to actually solve it. And as you can see, these come from different resource silos. Now you are tempted to say: we have to get the error codes and the tickets into the wiki, we need to integrate. I don't think that is a particularly good idea. Let me briefly recap where we are: we are now looking at the search part. Yesterday we talked about how knowledge workers expose their knowledge in articles; today we are talking about search.

Very briefly, what is search supposed to do? A refresher. We have a topic article in our Semantic MediaWiki that, for example, describes easy system cloning, in English, and we have a search request, "Datensicherung", which is the German term for backing up data, and of course system cloning would be one solution to that. What a search engine does is take the features of our article, that is, the solution to the problem, and match them against the search signals. By string comparison, "Datensicherung" has nothing to do with "system cloning", yet they both address the same thing. So what you actually want the search engine to do is: give me system backup stuff. That does not mean the articles providing a solution necessarily contain the words "backup" and "system"; they can be about something else, in a different language, but you still want them to be found. Because at its core a search engine, or rather Elasticsearch, which is an information retrieval system, is spectacularly simple.
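The gap between a German query like "Datensicherung" and an English feature like "system cloning" can be bridged by mapping surface strings to shared concepts. Here is a minimal sketch of that idea in Python; the concept table is invented for illustration, and this is not how Elasticsearch implements it internally (there you would use synonym filters and analyzers at index and query time):

```python
# Minimal sketch: map surface strings (any language) to shared concepts,
# then match query signals against document features on the concept level.
# The concept table below is invented for illustration.
CONCEPTS = {
    "datensicherung": "backup",
    "system backup": "backup",
    "system cloning": "backup",
}

def to_concepts(text):
    """Return the set of known concepts mentioned in a text (case-insensitive)."""
    t = text.lower()
    return {concept for phrase, concept in CONCEPTS.items() if phrase in t}

def matches(query, article_features):
    """True if query and article share at least one concept."""
    return bool(to_concepts(query) & to_concepts(article_features))

# A plain string comparison would fail here; the concept layer matches them.
print(matches("Datensicherung", "Easy System Cloning"))  # True
```

Note that lower-casing already handles the "System" vs "system" problem mentioned next; the concept table handles the cross-language case.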
It just matches character strings, and if you write "System" with a capital letter in the search query, it won't match a feature that is "system" in lower case. The way to bridge this gap is with match concepts: you have to take into account that things might be in multiple languages; you have synonyms, antonyms, hypernyms and hyponyms; you can tie in previous searches, so now we are talking about signals; and you have explicit or synthetic features of a topic. We talked about that yesterday, and you will see what I mean by it right away.

Now, what's the problem with the resource silos? You may have noticed my talk is named "Enterprise knowledge management including SMW", not "with SMW". Two years ago SMW was front and center for me. Now it has moved to the bottom right, and search has moved front and center. Why? Because now we're facing this. This is a typical customer setup you will encounter. This is Sabine's website; I just used it as an example. Let's say her company runs a website, file systems, email accounts, a wiki, and code repositories. You want to make sure the search engine covers all of that, because if we introduce a Semantic MediaWiki and end up with yet another silo search, we won't make a lot of friends.

So the challenge is to mold all these resource silos into one single index mapping design, and these are the classes. About my terminology: actually, that's an old graph; this should say "resource silo". So these are resource silos. What we want to end up with is entities. An entity is a piece of knowledge, a piece of information that is typed and useful in itself. You will know about entity-relationship diagrams: knowledge can be organized semantically into entities and their relationships. So we're talking about subjects, or topics, and their properties.
Now, importantly, we have one step between the resource silo and the entity, and that's the resource. Why? Because in most cases a Semantic MediaWiki page represents one entity, as we saw yesterday. Even the ontology I recommended yesterday models one page as one entity, but that need not be so. You could say one knowledge entity is a section within a page; then this class, which is a code class, would have to break each single page up into multiple entities. Most of the time one GitHub repository file would represent one entity. With email messages, however, the message itself could be one entity and each attachment could be another entity. So one email message, which is a resource coming out of a Gmail or other email account (this is inheritance), could contain multiple entities. For file systems and websites we use sitemaps that you tie in from wherever you want, and the system does a type analysis through Tika, Apache Tika, I'm not sure who knows it. Tika is able to detect the MIME types of roughly 2,000 document formats and automatically selects the right parser to extract the content.

So I would provide a core functionality that does all that, plus customized functionality; you'll see what I mean by that later. Then you submit everything to the same index, which is queried by code you can deploy anywhere. Now we'll see two interface examples, and this is the only code you need to install the interface anywhere you want, in anything that can handle HTML. What we're talking about is, for example, this. I'm sorry, there's a lag. This is my website, and, semantically, there's no design optimization here yet, but what you can see when I type is that it extracts properties and property values, and we're talking about the top ten property values. So you can deploy those four HTML lines wherever you want, and it's tied in here.
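The pipeline just described, resource silos producing resources, each resource broken up into one or more entities that all land in the same index, can be sketched as a small connector pattern. All class and field names here are my own invention for illustration, not the actual implementation:

```python
# Sketch of the silo -> resource -> entity pipeline described above.
# Names are illustrative, not the actual implementation.
class Connector:
    """A resource silo connector yields entity documents from each resource."""
    def entities(self, resource):
        raise NotImplementedError

class SmwConnector(Connector):
    # In most cases one SMW page is exactly one entity.
    def entities(self, page):
        return [{"resource": page["name"], "type": "smw_page",
                 "content": page["wikitext"]}]

class EmailConnector(Connector):
    # One email message can contain multiple entities:
    # the message body plus one entity per attachment.
    def entities(self, message):
        docs = [{"resource": message["subject"], "type": "email_body",
                 "content": message["body"]}]
        for att in message.get("attachments", []):
            docs.append({"resource": message["subject"],
                         "type": "email_attachment", "content": att})
        return docs

def index_all(connector, resources):
    """Everything ends up in the same index, whatever the silo."""
    index = []
    for r in resources:
        index.extend(connector.entities(r))
    return index

mail = {"subject": "Q3 report", "body": "see attached", "attachments": ["q3.pdf"]}
print(len(index_all(EmailConnector(), [mail])))  # 2 documents: body + attachment
```

The point of the intermediate resource step is exactly this: the connector decides how many entities a resource yields, and the index never has to know.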
So it is exactly the same interface everywhere. What's nice is that when the customer tells you "we've got an intranet here, another intranet there, and something else as well", you don't have to tell them that people must navigate to some URL to search for things. You just deploy the interface wherever the customer wants it. One more thing: this is obviously Google's design, because you make a lot of people very happy when they see Google's design. I actually had people asking me, "why did you integrate Google search into our website?" And I said: no, it just looks like Google search, but it is not Google search.

Now, talking about semantics, what I want to expose: for example, this is a topic on my wiki, a recipe, "system cloning", we saw that before. Whether all of this is necessary is another question. This is the topic type, which is redundant with this, but I just use it here for now. Then you've got keywords, and these are extracted annotations. These are links that you can use to build facets. This is actually the connector for drill-down search, because now we could say: give me all the recipes that are provisioned with a certain Ansible role. I have no idea yet how we are going to design this, but the Lego parts are in place to start with.

Now, I'd like to have full control over that, so let me show you where the bits and pieces go. When you create a search engine you need an index mapping. That means you have a Semantic MediaWiki page with certain metadata, and that metadata needs to be mapped to the index. The first level is the resource level: I want each topic to know where it came from. So the resource has a certain type, "SMW page", it has a name, which is the page name, and it has a resource URL, where it came from.
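Since the design of the drill-down search is explicitly still open, here is only a plausible shape such a request could take: a filtered Elasticsearch query plus a terms aggregation over the annotations, so the facet counts can drive the UI. All field names and values below are assumptions consistent with the mapping discussed in this talk, not a finished design:

```python
# Hypothetical Elasticsearch request body for a faceted drill-down:
# "give me all recipes provisioned with a certain Ansible role".
# Field names and values are assumptions, not the actual schema.
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"entity_type": "recipe"}},
                # Drill-down on one extracted annotation (facet value):
                {"nested": {
                    "path": "entity_annotations",
                    "query": {"bool": {"filter": [
                        {"term": {"entity_annotations.predicate": "provisioned_by"}},
                        {"term": {"entity_annotations.value": "ansible-role-backup"}},
                    ]}},
                }},
            ]
        }
    },
    # Facet counts for building the drill-down UI:
    "aggs": {
        "by_annotation": {
            "nested": {"path": "entity_annotations"},
            "aggs": {"values": {"terms": {"field": "entity_annotations.value",
                                          "size": 10}}},
        }
    },
}
print(sorted(query))  # ['aggs', 'query']
```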
Sorry, I forgot to mention one thing: what you see here is one document that covers one entity. We're talking about this recipe. Below the resource level you go down to the entity level, and you've got the entity type and the entity name, which happens to be identical to the resource name, but only because the SMW page is the entity in this case; that needn't be so. Then you have the entity title. Remember, yesterday I told you that I had a redundant, strictly unnecessary annotation in my ontology, and I said I was doing work for Elasticsearch. You could of course merge these fields, but I like pushing work as far down into the code as possible, to keep the upper interface layers thin and performant. Then you've got the entity keywords, the entity content, which is obviously what most people will end up looking at, and then, highly important, the entity annotations we mentioned yesterday.

As you can see, these annotations, for example "includes Rotterdam harbour tour", are not a fixed document field, because we want this to be flexible enough that any property that wasn't declared by some administrator up front can still go in here. Your user comes up with a new property, and it can go straight into the index. The subject would be either the page name or a sub-object page name, in case you're indexing sub-objects. Then you have the annotation predicate URL; in the case of Semantic MediaWiki that would be the property page, but only when you're dealing with the Semantic MediaWiki resource silo. If you use the file system, it would probably be different, because there is no semantic layer inherent to that resource silo. Then you have the object URL, and you have the HTML tag: for "includes Rotterdam harbour tour", for example, I store the entire HTML tag here, so that the display stays very simple and my interface code doesn't have to construct it.
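Putting the two levels together, one Elasticsearch document per entity might look like the following. All field names and values are illustrative reconstructions of the mapping just described, not the actual schema:

```python
# Illustrative one-entity document with the resource level, the entity
# level, and the flexible list of annotations described above.
doc = {
    # Resource level: where the entity came from.
    "resource_type": "smw_page",
    "resource_name": "Easy system cloning",
    "resource_url": "https://wiki.example.org/wiki/Easy_system_cloning",
    # Entity level: in this case the page *is* the entity.
    "entity_type": "recipe",
    "entity_name": "Easy system cloning",
    "entity_title": "Easy system cloning",
    "entity_keywords": ["backup", "cloning"],
    "entity_content": "Full text of the page goes here ...",
    # Annotations are a list, not fixed fields, so new user-defined
    # properties can be indexed without changing the mapping.
    "entity_annotations": [
        {
            "subject": "Easy system cloning",
            "predicate_url": "https://wiki.example.org/wiki/Property:Includes",
            "object_url": "https://wiki.example.org/wiki/Harbour_tour",
            "html": '<a href="/wiki/Harbour_tour">harbour tour</a>',
            "value": "harbour tour",
        }
    ],
}
assert doc["resource_name"] == doc["entity_name"]  # identical here, needn't be so
```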
And then you have the simple annotation object value. Here you see an example of an Elasticsearch document, an instance of what I just explained. Remember, yesterday we were talking about a Semantic MediaWiki ontology with "has entity type", "has entity title", and so on. But in a different resource silo that information is of course not structured like that. Take this error code page, for example: this would be a GitHub code file that is parsed for error messages. And here you have a good example of a resource, the entire code file, containing multiple entities; it could deal with hundreds of error messages, so you don't want to index the entire file, but each error message by itself. Then you artificially set these fields of the Elasticsearch document, so that the Elasticsearch server does not need to know whether it is dealing with a Semantic MediaWiki page or with any other type of resource, such as emails, GitHub repositories, or plain text files. That is just to show you how we do the annotations for non-Semantic-MediaWiki content.

And this, again, to sum it up: the idea is to have this semantic search experience, oh, we're actually here, you see it, no, sorry, this is MediaWiki, across the many resource silos a customer typically has in his company, and not only across Semantic MediaWiki. Sorry it got a little complex in the end, but I hope you understood the message. Thank you. Was that more or less clear?

Is it possible, or is it required, to provide any weighting to the silos?

Sorry?
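A connector for such a code file has to break it into many entity documents, one per error message, filling in the same fields artificially. A hedged sketch, with an invented comment convention (`# ERR-<code>: <message>`):

```python
import re

# Sketch: parse one code file into one document per error code, so the
# index never sees "a code file", only uniform entity documents.
# The "# ERR-123: ..." comment convention is invented for this example.
ERR = re.compile(r"#\s*ERR-(\d+):\s*(.+)")

def error_entities(path, text):
    docs = []
    for code, message in ERR.findall(text):
        docs.append({
            "resource_type": "github_file",
            "resource_name": path,
            "entity_type": "error_code",   # set artificially by the parser
            "entity_title": f"ERR-{code}",
            "entity_content": message.strip(),
        })
    return docs

source = """
# ERR-404: resource not found
do_something()
# ERR-500: internal failure, check the logs
"""
docs = error_entities("src/errors.py", source)
print([d["entity_title"] for d in docs])  # ['ERR-404', 'ERR-500']
```

Each resulting document has the same shape as an SMW-derived document, which is exactly why the server doesn't need to know where it came from.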
Is there a requirement, or an ability, to weight silos? Like, if I want to push the answer people find toward the wiki versus, say, a company intranet: if there's a hit in both, is it possible to get the highest-ranked hit to come from a particular source?

Oh yeah, of course, search relevancy. This now goes into search engine design, but for example, you've got signals here. One signal is the fact that this article comes from this website; another is that it has two properties, or that it has the "marketing page" property. So you can say: if someone is looking for business services, then boost search results that have an explicit marketing page. There are faculties at universities that deal with this question of how you create relevant search. You can design the system at query time so that certain results get ranked higher, but still, 75% of the performance of a search engine, or even more, is decided at index time, so it's a give and take. And this is something you have to figure out with the customer: is this ranking good? That's why you cannot have an IT specialist telling you whether a search approach is good; you need the domain specialists to tell you, when they look for, say, "faceting", whether this result belongs on the first page or somewhere else, and how to tune it. A lot of people ask me about machine learning. The problem with machine learning is that you need a lot of data to apply it, and here we're talking about 600 pages, for example; that's not nearly enough.
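Silo weighting of this kind is typically expressed at query time, for instance with per-clause boosts in the Elasticsearch query DSL. The request body below is a plausible shape only; the field names are assumptions consistent with the mapping in this talk, and the actual tuning, as said, is a domain question:

```python
# Hypothetical query-time boosting: hits from the wiki silo and hits
# carrying the "marketing page" property get ranked higher.
# Field names are assumptions, not the actual schema.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"entity_content": "business services"}}],
            "should": [
                # Prefer results from the wiki resource silo ...
                {"term": {"resource_type": {"value": "smw_page", "boost": 2.0}}},
                # ... and results that have an explicit marketing page.
                {"exists": {"field": "marketing_page", "boost": 1.5}},
            ],
        }
    }
}
print("should" in query["query"]["bool"])  # True
```

The `should` clauses don't filter anything out; they only add to the relevance score, which is the "give and take" between index-time and query-time relevancy mentioned above.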
So let's say I work in IT law, and I'm searching: would Elasticsearch bring up suggestions of other content that might be of interest to me, given my role?

Elasticsearch has two boundaries. It requires JSON documents with your content data, organized in whatever way you want, and it exposes a query interface where you can ask for things by providing signals. So you can say: I'm looking for legal articles, and I'm a female worker, and it's 12 o'clock in the afternoon, and I'm in Houston. Elasticsearch then takes your query with those signals, matches it against the features, and comes up with results. But that matching logic is code you have to write.

So it's all pre-coordination? Does that happen in the index, or in the mapping?

Well, there are several layers where you can put this code, and I'm not yet that proficient in tuning all of it. What you see here is just the brickwork, the foundation, so that my back is free to address all these questions, because these are actually philosophical questions, not necessarily technical ones. A big issue, for example: if I change the content of a page, when is it a new page and when is it a new revision? This is something where Cindy explained the "dirty diffs" to me: there can be semantically irrelevant diffs that the system nevertheless picks up as something new. Or take the spaghetti carbonara example: is spaghetti carbonara without cheese a different menu from spaghetti carbonara with cheese, or is it the same menu with a different option?

Although I would say a lot of our users are expecting that. I get what you're saying, that it's philosophical, but they want that technical extra layer of: oh, it brought up these other documents that I wasn't looking for, but they're very helpful.

Yes, of course, but that is something you have to code up yourself. For example, if someone is looking for "faceting", we might also want, I don't know, information on metadata, but the engine itself doesn't know that. It is actually pretty simple-minded; it just matches strings, for the time being. Questions? Good, so we keep that.

Yes, you see here: for this Semantic MediaWiki class I use API functions that fetch each single page, and if that page type happens to be "file", it calls a class that deals with that. You could then add image processing, OCR; if you have pictures you could even add face recognition. This is the most important layer of the entire system, and it needs to be extremely flexible, so you can add different things here, like ERP systems, CRMs, whatever, newsletters.

One little anecdote: I found a solution to my newsletter problem. You know, when we subscribe to newsletters we get 25 newsletters a day; you cannot read all that. So what I do now is I have newsletters@dataspects.com; all the newsletters go in there and are indexed into the same structure. So if, let's say, Elasticsearch had a newsletter yesterday about faceting, and three months down the road I'm looking for that, this search approach will pick it up. I don't have to go to Gmail and do an in-document search. That is a surprisingly useful little side effect I discovered.

What you might be interested in looking into is this. Who has heard of this software? It's pretty nice: it's a simple JAR file, you fire it up, and you can drag anything you have on your computer into that window, and it will scrape out every little bit of text there is. For practical reasons I send everything through Tika. And here, for example, we have to decide whether we use the wikitext, the parsed wikitext, or the entire page. There are lots of things I could tell you now, and I'm tempted, but I won't.

Sorry, my follow-up question for Tika: you said it scrapes out the text; what about other formats, will it do drawings?

That's this guy's work. It's about ten years old, I think, because ten years ago there were many little libraries scattered around the web for every type of document, and he molded them all into one. I mean, do you see the supported formats list? Yes, all the types. This saves you a lot of work, and it's super lightweight. Let's leave it at that. If someone is interested in more, you can always ask me, and I'm more than happy to explain how this is done behind the scenes. This will certainly be my focus for the next 12 months, because I use it every day; we eat our own dog food, right? When I look up stuff and I don't know where I stored it, then I use this. Good, okay.