Good afternoon, ladies and gentlemen. This is quite a different subject from all the previous talks. What I'm going to talk about is basically the features of Apache Solr, which is obviously one of the Apache projects, and which is also an embeddable search engine that we use for our own search engine that plugs into our content management system, which is PHP-based.

So what is it in a nutshell? It's a front end to the enterprise-grade Apache Solr, which is in its turn based on the Lucene libraries, written in Java. And it's a search engine for our eZ Publish content management system, which is written in PHP. It provides basically all the features that you may want from a search engine, like tunable relevancy ranking, sorting, filtering, and drill-down navigation like facets, which is used on more and more sites and which really helps a user to quickly find what he wants. It also provides heuristics for automatic related content. It provides spell checking, highlighting of search keywords, as well as some mechanisms for external content indexing.

Now, Apache Solr itself. It has a very nice architecture, and basically it follows a flat database model, a document/field paradigm that you may also find in databases like CouchDB. That makes it very well suited for structured content, because with structured content you can apply all kinds of heuristics and rules to influence the relevancy, and have filtering and facets, sometimes even much more powerful than what you can get from a database. With respect to text and languages, it also provides language-dependent analysis, both at index time and at query time.

The bindings are very simple.
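
To give an idea of how simple the bindings are, here is a minimal Python sketch of a search request and of reading a JSON response. It is an illustration only, not the eZ Find code: the host, port, path and the `title` field are assumptions, and the response body is a hand-written sample in the general shape of a Solr JSON response rather than real output. No request is actually sent.

```python
import json
from urllib.parse import urlencode

# Assumption: a local Solr instance; host, port and path are placeholders.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_search_url(query, response_format="json", rows=10):
    """Return the GET URL for a simple search request."""
    params = {
        "q": query,             # main query string
        "wt": response_format,  # response writer: xml, json, php, python, ...
        "rows": rows,           # number of hits to return
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


def titles_from_json(body):
    """Pull the title of each hit out of a JSON response body."""
    docs = json.loads(body)["response"]["docs"]
    return [doc.get("title") for doc in docs]


# Hand-written sample (hypothetical data), shaped like a Solr JSON response.
SAMPLE_BODY = json.dumps({
    "response": {
        "numFound": 2,
        "docs": [
            {"id": "1", "title": "Green bottle"},
            {"id": "2", "title": "Blue bottle"},
        ],
    }
})

url = build_search_url("title:bottle")
titles = titles_from_json(SAMPLE_BODY)
```

Switching the `wt` value is all it takes to ask for a different response format, which is why integration from different languages stays cheap.
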
It's a REST interface. Everything goes over HTTP, and it currently provides quite a few response formats, so it's very easy to integrate with your applications written in a varied set of languages. It provides XML, JSON, PHP, Python and so on.

Okay, a bit more into the features now. It provides tunable relevancy ranking. You can do it at index time, which you can regard as a kind of hard-coded way of saying that this type of content is much more important than the other, regardless of what the user wants. Or you can do it at query time, when the user searches, and of course you can provide some interface elements to control this process.

How does it work? As Solr is a document-oriented kind of storage (I wouldn't call it a database, but it's sometimes close to it), it provides boosting on the documents, since some pages may be more important than others, as well as at the field level. A title typically contains the most important words of an article, so you can say: okay, if the matches are in that field, the article will be pushed to the top of the search results.

You can also influence it in a much harder way. You can, as they call it in Solr, elevate predefined pages to the top when certain keywords are entered. This is typically used in e-commerce applications where, for example, when you search for iPod, you definitely get the most important pages related to that.

You can also tune the relevancy ranking in much more interesting ways. You can use customized functions for that, and that allows you to put more weight on, for example, more recent content like news. If you have a large news site which has articles spanning many years, you may say: okay, I still want to have the most recent articles pushed up a bit. Other uses are boosting on geospatial parameters, like proximity searches: if you want a restaurant which serves a nice kind of lobster preparation, well, you can search for it and require that it's not too far away from you, if your application is
designed for that, of course.

Filtering and sorting: it's incredibly powerful. It provides fuzzy, Boolean and all kinds of combinations of these, and it can even be used to speed up database queries.

Facets. The illustration here, I think, says enough. On the right you see a kind of navigation menu which just shows the hits for the query that you entered. So basically, along with the main query, you can specify fields, typically metadata, as well as queries, and these can be used to build the user interface that is required to have very fast navigation to the results you want.

Another nice feature, which works with quite some heuristics, is automatic related content. This is used, by the way, by more and more large news sites like the BBC, and it tries to correlate existing articles with the page that you are viewing. Basically it uses heuristics to perform a query in the background and assembles the results as for a normal search, but without user interaction. All the features from Solr can be used here, like filtering, sorting, and also the facets that I described before. It's really, really powerful. It requires some tuning, though. You need to experiment with it before it really starts to function like you want.

This is a small example. I hope it is more or less visible. Here at the top you have a part of a page, and the back-end search engine, Solr, is used here
to search for other articles that resemble the page that you are looking at, and to provide content that you may be interested in.

Spell checking is provided with two different strategies. You can provide a dictionary yourself with all kinds of terms, or you can use the indexed terms inside the search index. The latter is actually recommended, because if you have a site with proper names or expressions that are not really that common, you will not find them in a normal dictionary. But since they are part of the content, the spell check then also becomes much more relevant and complete. It is also possible to ask for a kind of Google-style best guess, so for example if you look for "green bottle", it will try to put out an alternative phrase entirely. There are many more possibilities here. It's actually endless with Solr.

Language features. It provides a kind of semantic search. Normally this term should be used in a real semantic, language-science context, but since so many commercial search engines abuse it, I don't mind doing it here as well. Solr is multilingual out of the box, but it is not so trivial. I think this is partly due to the fact that most of the developers are from the US, and they do not care too much about other languages, or think that a site should be monolingual. Anyway, what we did is try two approaches.
The first is an implementation with a dedicated metadata field which just gives the language in which the page is indexed. That means that there is a common index for all the languages. It also has some problems with the statistics used to provide the relevancy scores: because you basically dilute terms between languages, the algorithms behind the relevancy don't work so well.

The second approach, which is the one we use, is sharding. It basically creates multiple indexes, one per language, where you can also apply all the analysis tuning per language. One of the typical problems is that all kinds of algorithms, like stemming, really depend on the language. For English it's relatively simple; for Dutch it becomes a bit more complicated; for Finnish it's very, very difficult. Anyway, by using shards you can tune per language, as long as your content is monolingual of course, and then it will work a lot better.

Stemming, which is used to reduce words to a more common form, has as a side effect that it almost magically corrects spelling errors, although spell checking is provided as I explained before. Another nice feature in the analysis steps is that you can activate normalization. For example, Latin-1 characters, the characters you find in French, Eastern European languages or German, can all be normalized to a common form, and the same goes for the query terms that the user enters. So the match is always there, regardless of whether the keyword was spelled with the accent or not. That is another way of catching some of the typical spelling errors introduced by users.

Another nice thing with Solr, which you will not be able to do immediately with the Lucene libraries beneath it, because that's much more elementary, is that there are already mechanisms provided to index content which could be called external, because usually you use Solr as an embedded search server for your application, and it
can do this in various ways. One of the ways is what is called the Data Import Handler. It is a mechanism where you just need to configure a few XML files, and that allows Solr to pull content either from a database or from some kind of XML feed or store. That works out of the box, and it is really important because it covers quite a lot of application requirements.

Other ways to do it? Crawling, why not. What we did was write a custom plug-in which basically uses Nutch, which is another Apache project that provides crawling and extraction of content with the Tika libraries, and just feeds it, in this case, to the Solr index. If you have much more complex things, or you need to be much more in charge, which is typically the case with all kinds of content management systems, then you need to write your own plug-in for the databases that you typically use. This is basically also what we do with eZ Publish.

So, I explained this a little bit: the native features of Solr are already very powerful, or you can write custom plug-ins or use Nutch. It's a really cool project. Some people who use Nutch have also looked at Aperture. There is currently work going on to also provide integration between Aperture and Solr.

On top of all this, it's very scalable and fast, although some people always tell me: hey, this is Java. This must be slow.
No, it's on the server side. When the HotSpot compiler kicks in, it becomes really, really fast. In typical benchmarks that we did, and that other people have done, a normal server can easily serve an index of 10 to 100 million objects with typically five search requests a second. The only thing that it needs is memory, lots of memory.

One of the reasons that it is fast is that it also uses all kinds of internal caching strategies. Typically, when you create a caching application and something is updated, the caches are discarded and rebuilt at the first opportunity that needs them. In Solr it is different: it works proactively. If a cache exists and it's not valid anymore because you added new content to your database, it will recreate the caches in the background until everything is ready, and then switch them live, so they're immediately available for the next request that comes in.

It can easily host multiple sites on one engine, if you think of a content management system, and for the very large requirements on scalability and performance it even has a kind of built-in clustering.

It is secure as well. It is a document-oriented storage system.
You can easily add fields, meta fields, that you can use to implement security, as long as the security model of your application resembles a kind of role-based system. The nice thing is that these security rules are converted to cached filters inside Solr, and that really makes it very, very fast. If you don't have that, you usually run into big problems when security is needed, because you usually need to do some post-processing on your data for every request.

Okay, that was it. For more information, please do visit the Apache Solr site if you have never done so. It's a really cool and fast-emerging project, especially for web content management systems or any data sources that need to be searched and queried. And it is really relevant: for typical websites, if you look at the traffic they get when there is a search function enabled, it easily rises to 30 to 40 percent of all your visits.

If you're interested in our implementation with Solr, which is called eZ Find, you can also find it. We are basically a multinational company with headquarters in Norway. Or, if you have specific questions on Solr or eZ Find or whatever is related to it, you can email me. It's a very short email address.

Okay. Thank you. That was it.