Martin Anderson-Clutz, I'm from Digital Echidna in London, Ontario. Here to talk to you all today about using Drupal to provide federated search results. Is everyone here familiar with the term federated search? Show of hands, maybe? Alright, that's a good number of you. How many are actually using federated search on your own web properties currently? Okay, still a couple of you, and anybody doing that in Drupal right now? Okay, good. This is the Wikipedia definition of search, sorry, federated search, but the short version is that it's the concept of taking a single user query and providing results that are a mixture of various sources. So here's an example of a site that's doing that where it's, I guess you'd call it a bento box style approach, where the results from each data source are actually kept separate so that the user can sort of visually see where each set of results is coming from. So you can see there are actually even different media types being shown to the user as results. But what we're going to be talking about today is mostly this, where the user sees a single set of results that they can then filter down and narrow to find exactly the type of information that they're looking for. So you can see here, as an example, the colored labels are actually showing the different formats that the various results are available in. Some of the use cases for federated search might be, you know, searching across multiple web properties. It might be something like providing results in a variety of different applications, either online or some kind of intranet or other types of application, as well as giving a user the ability to search across collections of different information types. So like the one we saw before, where it was providing results that were manuscripts or images or, you know, different kinds of things. There's a variety of different approaches to federated search. A query federator is one where it just accepts the query and then does kind of a real-time search that goes out to different databases, gets results back, and presents them to the user, either, you know, making some effort to integrate them somehow or, as we saw with the bento box one, keeping them separate. A data lake would collect together all of the remote information and, without doing any kind of transformation, just sort of provide that as kind of a raw mixture of results to the user. A data hub would normalize some of the data, so probably the metadata, and still preserve the native original format of the core data itself, and that's mostly what we're going to be talking about today. And then a more formal data warehouse would really try to integrate and homogenize the data so it's really following a common data standard. So, any questions about federated search before we move on? Okay. So what we're going to be talking about today is really kind of a case study around a project that I worked on where we were asked to replace the Google Search Appliance. Anybody here have experience with the Google Search Appliance? So the Google Search Appliance, for those of you who don't know, is kind of this big yellow box that you drop in your server environment, and it's got the Google algorithm kind of pre-installed on there. If you ever try to tamper with the thing, apparently these loud alarms go off and probably Google people will show up at your premises in short order.
It's intended as a way to use sort of the Google search expertise for your local data. One of the benefits to having that as opposed to a service is that if you have content that's not available to the web at large, for example, you know, intranet-type content, you can still have it crawled through that method. Google gives it kind of a web-accessible, or locally web-accessible, online configuration, kind of a GUI for its configuration. Their documentation says that it can index not only web pages but over 200 different kinds of documents. And there's an existing contrib module for integrating it with Drupal, although I think the Drupal 8 version is not yet stable, and may never be, because Google has officially end-of-lifed the Google Search Appliance. They told all of their customers that as of next year, all of those boxes need to go back home to the mothership. So that's why our client had asked us to help them build a replacement. So in terms of what they were trying to do, they wanted to provide visitors to their website with results not only from the main Drupal website, but they also have a wiki site with lots of technical information. There's an e-store that's really a document repository. Some of the documents require payment, but a lot of them don't, in a variety of formats, so PDFs, Word documents, Excel spreadsheets, and maybe a few others. And then also some sites that are really more Ajax-driven in terms of how a user would navigate through them. So our goals for the replacement were largely a like-for-like replacement of the Google Search Appliance. They wanted to blend results from all the properties; autocomplete suggestions as the user is typing in their query; spelling suggestions and synonyms in terms of how it interprets those query terms; and then the ability to promote results, you know, if somebody types a specific query, then feature this example page at the top of those results. And then, because we were moving to a robust platform, we also talked to them about adding faceted search as a new piece of functionality. So the solution that we proposed for them was really based around Apache Solr, which probably most people in the Drupal community are going to be at least familiar with. A show of hands: how many people have used Solr? So yeah, it's pretty common nowadays. If you're using Pantheon or Acquia, you have free access to add that into your site, and it's, you know, extremely robust, as I'm sure many of you have experienced. And it's of course using the Lucene library for indexing and search. And Lucene is also what's at the heart of other popular search platforms like Elasticsearch or Lucidworks Fusion. It gives you things like faceted navigation, highlighting, geospatial search and, you know, a whole bunch of others, and it's worth noting that you can use plugins to provide even more functionality, depending on, you know, whether you have the server access to do all of that. Typically the configuration for Solr is done through XML configuration files, as opposed to, you know, a GUI interface like the GSA's. But it also has a bunch of APIs, and it does also have an admin interface. And in addition to being popular in the Drupal community, Solr is in use by, as you can see, a variety of household names, some of which are pretty major and handle all kinds of traffic on a daily basis.
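To make those Solr features a little more concrete, here's a minimal sketch of what a raw query against Solr's select handler could look like from Drupal 7 code. The host, core name, and field name are hypothetical, and in practice Search API Solr builds and sends these requests for you; this is just to show where faceting, highlighting, and spellcheck live in the request.

```php
<?php
// Hypothetical sketch: querying a Solr core directly from Drupal 7.
// The host, core name ('federated'), and facet field are assumptions;
// Search API Solr normally constructs this request on your behalf.
$params = array(
  'q' => 'federated search',       // The user's query.
  'wt' => 'json',                  // Ask Solr for JSON back.
  'hl' => 'true',                  // Highlight matched terms.
  'spellcheck' => 'true',          // "Did you mean" suggestions.
  'facet' => 'true',               // Enable faceted navigation.
  'facet.field' => 'content_type', // Facet on a hypothetical field.
);
$url = 'http://localhost:8983/solr/federated/select?' . drupal_http_build_query($params);
$response = drupal_http_request($url);
if ($response->code == 200) {
  $results = drupal_json_decode($response->data);
  // $results['response']['docs'] now holds the matching documents.
}
```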
The piece that we added on to be able to provide the federated search was Apache Nutch, because we needed something that could go out and crawl all of these sites, not all of which were even dynamic. So Apache Nutch is a crawler that was designed to plug into Solr. It doesn't actually do any of the, I'll say, true indexing, but it's a crawler designed to go out and scrape content from websites and then push that into Solr, where it will get indexed. One of the things that was interesting, and we'll get into more detail about this, is that it comes preconfigured as kind of a whole-web crawler. So it seems, at least in its initial configuration, intended to go out and crawl from site to site and sort of crawl and index everything that it can, whereas it feels like a lot of people are using it the way we were intending, which is to crawl a smaller number of client-owned sites and not venture out beyond that. There's a fair amount of configuration around that that we'll talk about. But yes, you can provide seed URLs and also place restrictions on where it should and shouldn't go. Another piece that we added was Apache Tika, which is a powerful library for document parsing and, being part of the Apache suite of products, integrates well with both Solr and Nutch, and it can detect and extract metadata and text from over a thousand different file formats. So definitely an upgrade there. And then in terms of integrating all of that with Drupal itself, we were using Search API and Search API Solr. Basically, I guess the core intent of those is, in integrating with Solr, to take all of your Drupal content and push that over to Solr, where it'll get indexed, and then you can run queries against that and display the results to your users. And typically you would use Views to configure and structure those results as the means for presenting them within your website. But it's extremely configurable, so you really have a lot of freedom in terms of what actually gets sent to your user. And then we also added Search API attachments so that any nodes that have attachments within Drupal will get those indexed as well. There's a couple of different ways to do that. Typically, as an organization, when we're using Tika, we'll have it on the web server where the Drupal installation is, and in some ways that's an easier way to use it, but because of some of the client restrictions in terms of how they wanted things configured, we had it on the Solr server, so we set up Solr to provide access to Tika as kind of a service. But Search API attachments does allow for that. And we also used a module called Sarnia, which is a contrib module. Right now it's only D7. But it allows Drupal to display results that don't originate from Drupal content. So you can use any kind of a Solr index and provide results from that. But because of that, it's really read-only. It doesn't have any mechanism to push content from your Drupal site into the Solr core. And then the other note is that, at least the last time I checked, it was only in use by 191 sites, which a lot of times would be kind of a warning sign in terms of whether or not to use it, but for our case, it really filled the need for what we were trying to do.
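To jump back to the crawl scoping for a second, here's a rough sketch of what pointing Nutch at a fixed set of client-owned sites tends to look like. The file names are Nutch's standard conventions (a seed list you hand to the injector, plus conf/regex-urlfilter.txt); the domains are placeholders, not the client's actual properties.

```
# urls/seed.txt -- the seed URLs handed to the Nutch injector.
# (Placeholder domains standing in for the client's properties.)
https://www.example.com/
https://wiki.example.com/
https://store.example.com/

# conf/regex-urlfilter.txt -- rules are applied top to bottom.
# Skip asset formats we don't want Nutch fetching directly.
-\.(gif|jpg|png|ico|css|js)$
# Allow only the client-owned hosts...
+^https?://(www|wiki|store)\.example\.com/
# ...and reject everything else, so it never wanders onto the open web.
-.
```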
So I wanted to show here kind of an illustration to demonstrate how what we were trying to accomplish doesn't necessarily map exactly to the intended use for either Search API Solr or Sarnia. As I mentioned, typically with Search API and Search API Solr you're using it to push content through to Solr and then run queries, get results back, and display those to the users. So it's a nice, clean sort of two-way communication. Whereas Sarnia is typically more of a one-way communication, in terms of only reading from Solr. So what we had to do was build kind of a hybrid solution, where we were using Search API and Search API Solr in kind of their, I'll say, non-Sarnia configuration to push content to Solr, having Nutch crawl content and push the data it was getting into that same Solr index, and then using Sarnia to pull that out, in kind of its read-only configuration. Does that make sense? Any questions? So, taking this approach, some of the extra work that we had to do was to normalize the Solr field names. Search API Solr has a very particular way that it needs to structure the Solr field names to work properly, and Nutch out of the box uses a very different standard for field names, so we had to make sure that those two were using the same field names, so that we would get enough commonality in the data that we could pull them out as a single result set. I'll show in a moment roughly what that mapping can look like. The other thing is, once you pull all of that back into Drupal through Sarnia, you lose, I'll say, the benefit of starting with Drupal fields. Normally, Search API takes some of the work that you've already done in defining the fields and turns that into kind of intelligent decisions, for example in your view, as to how to present that data; the minute you start using Sarnia, it treats everything as just sort of raw data that it knows nothing about, so in your view you actually have to put extra effort into some of the configuration around how to properly present all of that data. There were places where you have to write hook implementations for things like facets; there's a sketch of that coming up in a moment as well. So again, with Search API Solr, any time you have, as an example, a taxonomy field, it would automatically know to associate the term ID that's in Solr with the actual facet name when it does things like display facets, whereas we had to actually implement custom hooks and write some code to make all of that happen when working through Sarnia with the same data. As I mentioned before, we had to manually configure all of the places where the Nutch crawler should and shouldn't crawl, as well as making sure that it didn't just venture off and start trying to crawl the internet as a whole. And there was some extra, I'll say, code-based configuration, as opposed to what particularly the client web team was used to in terms of web-based management of all of that. From our standpoint it was a plus, because it was easier to manage the configuration between environments, but as I say, it's slightly different from what they had been used to. So I thought I would give you all a bit of a sense of what some of this looks like under the hood, or at least in the Drupal UI, and give you a sense of how it differs from what you may be used to in looking at standard Search API. So here's your Search API index page. Normally you'd see a single server and index, but here, because we're using that hybrid approach, we've got a server and index set up for, I'll say, our standard communication with Solr, where it'll actually push the content to Solr, and then we've got a separate server and index set up for the Sarnia read, which will pull in not only the Drupal-indexed content but also the content that was indexed through Nutch.
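On the field-name normalization mentioned a minute ago: Nutch ships with a mapping file, conf/solrindex-mapping.xml, that renames its native fields on their way into Solr. Something along these lines is how you can line it up with the sort of machine names Search API Solr expects; the destination names below are illustrative examples, not the exact ones we shipped.

```xml
<!-- conf/solrindex-mapping.xml: rename Nutch's native fields as they
     are pushed into Solr so they match the Drupal-side convention.
     The dest names below are illustrative. -->
<mapping>
  <fields>
    <field dest="ss_title" source="title"/>
    <field dest="tm_body" source="content"/>
    <field dest="ss_url" source="url"/>
    <field dest="ds_changed" source="tstamp"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
```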
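And for those facet hooks: with the standard Search API integration, Facet API knows a taxonomy field's raw values are term IDs and maps them to labels for you, whereas through Sarnia everything is opaque data, so you supply that mapping yourself. Here's a minimal sketch of the kind of thing involved, using Facet API's map-callback pattern; the module, facet, and field names are hypothetical.

```php
<?php
/**
 * Implements hook_facetapi_facet_info_alter().
 *
 * Hypothetical sketch: point a Sarnia-backed taxonomy facet at our own
 * map callback, since Sarnia only sees raw term IDs coming out of Solr.
 */
function mymodule_facetapi_facet_info_alter(array &$facet_info, array $searcher_info) {
  if (isset($facet_info['im_field_topic'])) {
    $facet_info['im_field_topic']['map callback'] = 'mymodule_map_topic_terms';
  }
}

/**
 * Map callback: turn raw term IDs from Solr into human-readable labels.
 */
function mymodule_map_topic_terms(array $values) {
  $map = array();
  foreach (taxonomy_term_load_multiple($values) as $tid => $term) {
    $map[$tid] = $term->name;
  }
  return $map;
}
```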
And here's a look at your standard Search API configuration of the fields. You can see it's got the machine names, it's telling you which ones are indexed, and the type. One thing to point out is that the boost values here are really query-time boosts, so any tweaking we would try to do here would sort of not actually make it through to the final implementation, because the read is really coming through Sarnia. So that's one thing to be aware of if you ever decide to venture down this route. The other thing I'll just point out, and it sounds like most of you are used to Solr, but just for anyone who isn't: the field types that are fed into Solr as strings are available as, you know, filters and facets and sort-by type fields, but aren't tokenized or weighted toward relevance, whereas the full-text fields are the opposite: they are tokenized and used toward relevance, but then aren't available as facets. So that's one thing to be aware of. So here's where we start to get into what the Sarnia configuration looks like, and again, anybody who's used to Search API will notice that that far-right tab is something you haven't seen before. It's there for Sarnia, and it actually provides a new row of local tabs below it for managing the interaction that's specific to Sarnia. A lot of them are basically read-only, in the sense that you don't really need to do a lot within them. For example, this one is basically just giving you some of the information about the basic configuration and telling you that it's automatically creating an entity for you. This next one is similar to the Search API interface we saw earlier, but again read-only, because it's basically showing you the data that it's getting from Solr. And the critical thing on this page is that little link at the top left that says Refresh Server Fields, because that's what tells Sarnia to go out to the Solr index and really pull down all of the data that it uses. Once you do that, you get all of those field names, and that's what makes all of those fields available when you go later on to build your view. And what happens when you use Sarnia with Views, which is a little different from what you're probably used to, is that all the fields you're pulling from the Solr index you pull from a single field called Data, and then it has a Solr property in there that you set to define the actual Solr field that you're going to read from. So as opposed to Search API Solr, where you could pick your fields individually as you're adding them, here you would add Data a bunch of times and then configure each one in this interface. You can see here's all your Solr fields; that's where you're telling it which field to use, as opposed to as you're adding the fields. So a little bit counterintuitive, at least for those of you used to Search API in more of the standard configuration. And again, because you're having to tell it about all of the fields and what type of data is in each, it's important to pay attention to the format, or make sure you're using something that's appropriate to the actual information. Any questions so far? So, what are some of the surprises with Nutch? Well, by default, Nutch uses something called the OPIC scoring filter, and again, this goes back to the idea that Nutch is preconfigured for, you know, whole-web crawling. It's meant to emulate one of the core early ideas of Google, which is this idea that, as you crawl, inbound links are treated as votes that should be interpreted as a way to tell which content is more relevant to a topic or not. And so what happens when you use Nutch in its default configuration for more of an intranet crawler, where you're re-indexing the same content on a regular basis, is that this OPIC scoring filter causes the re-crawls to act as additional inbound links every time it re-crawls a page, unless you, you know, totally discard your index every time. And so what a lot of people have found when they use Nutch for intranet crawling is that they have to disable this, because otherwise, over time, all of the crawled pages keep going up and up and up, slowly but steadily being treated as more and more relevant, whereas the Drupal content is unaffected, so gradually your crawled results start pushing out all of your Drupal-indexed content. Fortunately, it's just a single change in one of the Nutch XML configuration files, and then that's dealt with. But that was a surprise.
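For reference, that change is the sort of thing that goes in Nutch's override file, conf/nutch-site.xml: the OPIC scorer is just one entry in the plugin.includes list, so you copy the default list from nutch-default.xml and drop scoring-opic from it. Roughly along these lines, noting that the exact plugin list varies by Nutch version, so treat this as illustrative:

```xml
<!-- conf/nutch-site.xml: override plugin.includes, copied from
     nutch-default.xml, with scoring-opic removed so re-crawls stop
     inflating the scores of already-indexed pages. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```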
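And while we're near configuration, a quick illustration of the string-versus-fulltext distinction from a moment ago, as it shows up in a Solr schema.xml; the field names here are just examples:

```xml
<!-- Illustrative schema.xml fragment. A "string" field is stored as one
     exact token: usable for facets, filters, and sorting, but it carries
     no weight in relevance scoring. A "text" field is tokenized and
     analyzed, so it feeds relevance but isn't suited to faceting. -->
<field name="ss_topic" type="string" indexed="true" stored="true"/>
<field name="tm_body" type="text_general" indexed="true" stored="true"/>
```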
Also, for our client's use case, they had quite a large library of PDFs in particular, but, as I mentioned, other types of content as well, in their document repository. And between that and the number of sites being crawled, they ended up with some really huge index files, which actually filled up the data store they were having Nutch store its indexes on. It actually filled up and ran out of space a few times, so we had to keep pushing that up to accommodate the, you know, several-gigabyte files that were being generated. One other thing to know with Nutch is that, because you're not giving it structured data, it's taking an assembled web page and ingesting that, so it doesn't really know the difference between what we'd call the content of the page and all of the text on the page. When it strips out the text, it's got the page title, it's got, you know, the site name if that's in text in addition to a logo, all of your main navigation; all of the text of the page is really ingested along with what we would typically call the content of the page. At one point in the project we did actually look at forking Nutch so that you could pass in almost regex-type patterns for each site, to help it parse out, you know, the parts that we would consider the body content as opposed to some of that chrome, but ultimately we ended up just going with stock Nutch, more for the sake of maintenance. But anyway, if you use Nutch, that's one limitation to be aware of. There was also some complex logic, particularly with their document repository, in terms of making sure that it was crawling, or properly interpreting, the language, again particularly for the different documents. Because you could have English pages linking to French documents, and sometimes which language someone was getting was based on cookies or session variables, it was sometimes complicated to make sure that we were giving Nutch the proper indication of which language to interpret things as. So that ended up being quite complex. And then also, again because Nutch is configured out of the box as a whole-web crawler, it has default caps on how many links per page it will crawl and on the size of a page, so we had to find and disable those for our purposes. But once we did, it actually worked quite well.
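Those caps are ordinary properties in the same nutch-site.xml override. These are the two standard Nutch properties that correspond to what I described, though the right values depend on your content:

```xml
<!-- conf/nutch-site.xml: lift the whole-web-crawler defaults that were
     truncating our intranet crawl. -1 means unlimited in both cases. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value> <!-- default is 100 outlinks per page -->
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value> <!-- default truncates fetched pages at 64 KB -->
</property>
```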
So, some of the surprises with Sarnia. A couple of times, I think maybe when Solr was offline, or because of various configuration issues, it would occasionally kind of forget the schema. You remember that page we looked at, the second of its local tabs, with all the field lists? If you went in there, it would be completely empty, and from the client's perspective, more critically, if you went to search anything, it would just not give any results. You just had to go to that page, hit Refresh Server Fields, or whatever it was, and it would hit the Solr core again, get all of that information, and everything would get working again. But we had one or two panic calls from the client where they couldn't understand why everything had stopped working. Fortunately, that one was a really easy fix. There were a couple of times where, as we talked about, Sarnia is really not in wide use right now, so I think we were probably trying to use it for things it maybe hadn't even been used for before. So there were places where we actually got sort of critical PHP errors that gave us white screens of death, and we had to patch Sarnia along the way. So definitely some learning there, and we were happy to contribute back to the project. And then the other thing we encountered that was interesting, and I guess an example of what we were just talking about, is that we realized there was actually no way within the search results to display results in the language of the current user, so we ended up having to patch both Sarnia and the Token module so that we could use a token to accomplish that. So, more generally, some notes about crawling. When we're using Search API attachments, it takes the content of those attachments and adds it to the content of the node, so any relevance to a particular subject within the attachment counts towards the node. Whereas with a crawler-based indexer, you don't necessarily have that kind of strict sense of what the relationship of a document to a page is, so those two things are really treated as separate. So there's a difference there in terms of how those things are treated, and I don't necessarily know that that's something you can fix; it's probably just that, if you're using a mixture of those two approaches, you need to understand that there is a difference in how they'll be treated. In our experience, the Nutch crawler really put a heavy load on the server, so we did some things to dial it back a little, but then also worked with the client to make sure it was crawling at night, or at non-peak times for the particular web servers. And we discovered that, because the Drupal data has metadata, things like taxonomies that have been associated, you can do things like faceting, but obviously you're not going to have that same metadata for the crawl results, so any time you apply the facets, you immediately exclude any of the crawled data. And then, as we talked about, some of the content doesn't necessarily lend itself particularly well to crawling, so particularly things that are Ajax-built, and especially things that don't have URLs that can be indexed properly; that isn't really going to be content that you can index well with something like Nutch. So, some lessons learned. As we talked about, faceting across data sources is challenging: you really have to have common metadata that you can use if you want facets to work well in terms of helping users identify common content across data sources; more on one way to mitigate that in a second. Duplication can also be an issue. There were definitely places where they had made Drupal nodes to help people find, for example, content that's on the document repository, but then, once all of that gets into the same search index, you might end up with both of those in a search result, really pointing to the same place. And then relevancy across data sources can be a challenge, because you're using, you know, Solr ingestion through Drupal, which is highly structured, where you have lots of metadata and a lot of sophistication in terms of how you can parse things, versus Nutch, which is, you know, much less structured, and, as I say, ingesting all of the text of the page as opposed to, I'll say, the most meaningful parts of it. The degree to which they're really interpreting all of that content the same way is sort of hard to keep equivalent, and can require a lot of fine-tuning.
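One mitigation for that facet problem, if you go down this road, is to stamp a common field onto documents from both sources, so that at least a "source" facet spans everything. Here's a hedged sketch of the Drupal half, using the D7 search_api_solr document-alter hook; the field name and module name are made up, and the Nutch side could write the same field for crawled documents via its index-static plugin.

```php
<?php
/**
 * Implements hook_search_api_solr_documents_alter().
 *
 * Hypothetical sketch: stamp every Drupal-originated document with a
 * shared "source" field so a facet on it can span both data sources.
 * On the Nutch side, the index-static plugin can write the same field
 * (e.g. index.static = "ss_search_source:wiki") for crawled documents.
 */
function mymodule_search_api_solr_documents_alter(array &$documents, SearchApiIndex $index, array $items) {
  foreach ($documents as $document) {
    $document->setField('ss_search_source', 'drupal');
  }
}
```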
I also thought I would touch on what it would look like to try to do some of this in Drupal 8. So if you look at Search API Solr in Drupal 8, in the configuration, the bottom part here has this checkbox for multi-site compatibility. I think it was in one of the keynote speeches yesterday where they were talking about how, in Drupal 8, Search API Solr is using the Solarium library to interact with Solr, so it provides a much better layer of abstraction in terms of dealing with Solr in a way that isn't specific to Drupal. And so one of the benefits is that it's able to provide kind of that multi-site search capability. It does still, at this stage, expect the standard schema.xml, so it still has a lot of expectations in terms of how the content should be structured in order to be able to use it properly. So it works better for providing federated search across Drupal sites, as opposed to across Drupal and non-Drupal sites, or a variety of things. They did touch on the Search API datasource approach, which is a promising possibility for providing, again, non-Drupal content, but at this stage it's available on GitHub and on Drupal.org, and still quite early, I suppose. So that's really it, and I was going to throw it open to questions now. [Audience question, inaudible] Oh, okay, awesome. So, yeah, I'm really gratified by the direction; I mean, it was a big help for us. Yeah, we submitted some patches, and I think some of them got into a couple of the releases earlier this year, so it was great. Awesome. [Audience question, inaudible] Oh, great. Any other questions, comments? Then I'll let you have the rest of your afternoon, so thanks, everyone.