All right, whoo, DrupalCon! Very happy to be here. I want to quickly thank all the organizers, all the people putting this together. They're amazing; this is an incredible event. I love it. And thank all of you for coming, I appreciate it.

I prepared for you a brand new talk, first time I've ever given it: Considerations of Federated Search. My name is Adam Bergstein, and I go by nerdstein. This is a live-action shot of me at a beach just a couple of days ago, the real thing. I am the Vice President of Engineering at Hook 42. We are an awesome company out of the Bay Area; I live in Pennsylvania personally. We have a great team and we offer a lot of good services. Here's my information: I go by nerdstein on Drupal.org, and nerdstein with a three on Twitter, because someone squatted my brand. Not cool, right? So that's how you can reach me.

About this talk: first we're going to look at some basic concepts of what we're talking about, then we'll review search backends, look at data transmission and how that happens, cover search features to be mindful of when you're evaluating solutions, look at interfaces, and then sum it all up.

Some basic concepts. So how do we define federated search? What is the key thing we're talking about? Information retrieval that allows for simultaneous (that's the keyword in that sentence) search of multiple searchable resources. So what do you really want? You want something like Google, right? I want to type something in, search a bunch of different things at once, pull something up, and have it be relevant. That's going to be our analogy for the day as we walk through this story. But what can your sources be? What are the things you really want to grab in this context?
We're at DrupalCon, right? So we're going to be looking at websites, and it could be anything: raw HTML, something generated with Gatsby, a Drupal site. Heck, it could even be WordPress. I said it.

So what's the real key thing to be mindful of as we're looking at these data sources? What do we need to frame our reference? Well, let's look at availability. That's a key concept: this data needs to be available to access so that we can make it searchable. Data can basically live anywhere, on any site, over here or over there, but we need to make sure it's readily available to be searched.

Another key concept is data formats. We need to be looking at things in a standard, uniform, conventional way, or we're going to get off track. We can't scale if we have to support every unique thing that happens on every single website, so we look for generic tools, practices, and standards we can use across the board so that we're successful.

Another key concept is how often you want that data refreshed. Do you want it daily? Every 15 minutes? And what kind of scale are you looking at? If you're doing that across two or three thousand websites, you start getting into some complication, so it's a good thing to keep in mind as you evaluate.

All right, let's dive right into the next section, which is about search backends. Who's familiar with the term backend? I just want to make sure. Okay, cool. The first thing to understand about a search backend is that you need some uniformity. We've talked about this: you need a shared schema.
If you're going to multiple websites and trying to get elements, you need to understand what it is you're getting from those sources. So you need to agree on a shared definition that is standard across all of those sources, and that definition should have the associated fields, we could call them, and data types. For this field I'm expecting a shorter string, this one is a big block of HTML, maybe this one is an image. These are the different fields you could look to establish, and it's really specific to whatever it is you're searching. Also, don't forget about cardinality, because HTML can have more than one heading on a page, as an example. That's kind of important.

These fields end up getting stored within an index, and the analogy there is a non-relational table. With relational databases you can link things; an index is not that. It's flat, one big epic data table.

Here's a visual I thought might communicate this well. If we're going out and scouring our sources, a page might have a title, an h1, an h2, or a body tag, and those things map to fields within the schema. So we're going to say: I want to get the title, I want to get all the h1s, I want to get the h2s. The field types are on the right side framing that: this one is text, that one's a list of text, and that one might be long text. That gives you an example of how to set up the schema and the fields. Make sense?

Cool. All right, so what are some platforms known for this sort of thing? You have Elasticsearch, which is really up and coming. It's pretty cool.
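To make the schema idea a little more concrete, here's a small sketch in Python. The field names and types mirror the slide example; the `conforms` helper is purely illustrative and not part of any particular search platform's API.

```python
# Illustrative shared schema for a federated index. Field names and
# types follow the slide example; nothing here is tied to a specific
# search platform.
SHARED_SCHEMA = {
    "title":    {"type": "text",      "multivalued": False},  # page title
    "h1":       {"type": "text",      "multivalued": True},   # list of text
    "h2":       {"type": "text",      "multivalued": True},   # list of text
    "contents": {"type": "long_text", "multivalued": False},  # body copy
}

def conforms(doc, schema=SHARED_SCHEMA):
    """True if a document from any source fits the shared definition."""
    for field, value in doc.items():
        if field not in schema:
            return False  # unknown field: this source is off-schema
        if schema[field]["multivalued"] != isinstance(value, list):
            return False  # cardinality mismatch (e.g. h2 must be a list)
    return True
```

Every source, regardless of CMS, would need to produce documents that pass a check like this before they land in the central index.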
Elasticsearch has a lot of advanced features. Solr is the conventional, stable go-to; it's been around a while. And Algolia is more of a platform-as-a-service kind of thing that is really up and coming too, and they're doing some good things very similar to Elasticsearch.

So what are some considerations as you're evaluating what you need for your search backend? I think the most important thing today is APIs and interoperability. If you don't have a good API to work from when you're working with a search backend, you're going to shoot yourself in the foot. It will not scale.

The next thing is features around data types, especially for Elasticsearch. They have what they call semantic data types, like geolocation and IP addresses, and if you have semantic data types, that gives you incredible features right at your fingertips: robust querying, reporting, graphing, and charting through Kibana. That kind of stuff is really amazing. Chris is a super expert in that. So definitely look at the data types as you're going through and evaluating these platforms.

It's also going to have to have all the basics. You still need querying, you still need filtering, and you need functions you can run so you can do advanced calculations and summarize the data you need.

And don't forget about environments. This is the classic thing: you have your local system, you have development servers, and then you have production. You should have the same setup for your search backend. Don't be putting it all on production; that's really risky.

All right, cool. So far so good? Everybody hanging in? Amazing. Data transformation, let's look at that. Okay, who's familiar with the Migrate framework in Drupal? A lot of folks, cool. Same thing: an ETL framework, extract, transform, and load.
We're doing the exact same thing. This is how you can model what the data transmission looks like in any federated search solution, because you have to get the data in there, and we'll talk about how to do that. First we're going to extract all the information from the various sources, and it could come from anywhere. Then we're going to do some sort of transformation; we probably normalize the data somewhat. I've never had a solution that didn't need some transformation: processing each field, making sure it's the correct data type, et cetera. And then we're going to load it into the actual index itself; we're going to store it.

All right, so this is cool. I took Umami, the out-of-the-box initiative in Drupal, and did a little screenshot to try to communicate the idea of the fields and of getting the elements off the page as you're doing this data transformation. You can see I grabbed a recipe, and that's one of the titles. That's the source page and the field I want to get from the page, and I'm putting it into the title of the index itself. Same here: there's an h2 on the page and there's a body, and the h2 goes into my headings field in the index while the body goes into the contents field. So that's how the ETL works. I didn't really cover any of the processing, but that's the general idea: we're going to extract these things from the page and put them where we need them inside of our search index. Makes sense? All right.

Now here's an example of a processor. I have a body field and it's markup; it's got a bunch of stuff in it, punctuation, a whole bunch of HTML tags. But maybe all I want is the actual content.
I want the words, right? If someone's searching on something, I want to make sure I'm getting the right words. So what I might do is grab that body value, and as part of the transformation I'm going to remove the tags, strip them out, and then ship it to the search index. That's the ETL process for one of the fields. Cool. I didn't get the diagrams in there, I apologize.

There are two ways you can actually extract data. The first way is to crawl a site. That's what Google does. You pull data from multiple sources: you define the sources you want and then crawl them in a consistent and uniform way, and that can handle your ETL. As part of the crawling you can do the transformation at the exact same time and store the result as you go.

The other way is to push data to the search index directly from the source. For those familiar with the Drupal side, we have Search API. Search API has the Elasticsearch Connector module and the Apache Solr modules, and it can interact directly with an index, but that index could be receiving data from multiple sources. Each site pushes its data up, so you could have multiple Drupal sites pushing information. That's a good way to frame how you want to perform the data transmission.

I will say that with the crawler, you have to be mindful of the fact that you're doing things in an extremely predictable and rigid way every single time. I would not recommend putting site-specific logic inside a crawler; that would get really hairy really quickly.

We built a crawler using the Scrapy framework, which is in the Python ecosystem, so you have access to all those really cool Python plugins and modules right at your fingertips: lots of natural language processing and data normalization tools, right in your grasp and super easy to use.
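As a sketch of that per-field transform, stripping tags from a body value before loading it into the index, here's roughly what it might look like. This is a naive, stdlib-only illustration of the idea (the kind of thing you might run inside a Scrapy item pipeline); a real pipeline would use a proper HTML parser rather than a regex.

```python
import re
from html import unescape

def strip_tags(markup):
    """Transform step: reduce a body field's HTML to plain words
    before loading it into the search index. A naive sketch; a real
    pipeline would use an HTML parser, not a regex."""
    text = re.sub(r"<[^>]+>", " ", markup)   # drop tags
    text = unescape(text)                    # &amp; -> &
    return " ".join(text.split())            # collapse whitespace

# Extract -> transform -> load for one field:
raw_body = "<p>Try the <em>Umami</em> recipe &amp; enjoy!</p>"
clean = strip_tags(raw_body)   # ready to ship to the contents field
```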
Scrapy is a great framework; we really liked it. You run it, spider your sites, and get all the information you need. You do need to be mindful of a few things, though. Scrapy provides some out-of-the-box extraction logic. For those familiar with CSS selectors, you can grab elements of the page that way, or you can use XPath, something like that. You need some way to query what is on the page: I want the body tag, or I want this heading with this class.

There's some really nice stuff too, especially for Drupal, with the natural language processing, which can work very well with multilingual sites, for example. So language is another big consideration. And Scrapy was cool because it has a whole bunch of plugins for storage. If I wanted to use Elasticsearch, it had a plugin; if I wanted to use Solr, it had a plugin; I could write to a data file and that would work too. It can do a lot of different things.

For the push approach through a source API, I think the way to frame the model is that every source is responsible for performing its own transaction, and it has to conform to what the central index is doing: the data types and the fields. This often means you end up with custom logic per site. You have to write all the processing, and then you have to perform the mapping from the elements on your page to what is in the index itself. You almost have to do what the Migrate tool is doing if you go this route. The benefit is you can really refine, or very specifically tailor, what you set per site, which is advantageous in some cases. And we did talk about this with the Search API and the indexes and processors. Cool, so far so good? We're blazing. All right, search features. Again, let's level set.
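Before we get into search features, the push model described above can be sketched quickly: each source maps its own page elements onto the central schema before shipping a document to the index. The page structure and field names here are invented for illustration, not taken from any particular module.

```python
def build_document(page):
    """Each source performs its own mapping from page elements to the
    central index's fields; custom per-site logic lives here."""
    return {
        "title":    page["title"],
        "headings": page.get("h2", []),   # cardinality: pages can have many
        "contents": page.get("body", ""),
    }

doc = build_document({
    "title": "Deep mediterranean quiche",
    "h2": ["Ingredients", "Directions"],
    "body": "Preheat the oven...",
})
# A real source would now push `doc` to the central index, e.g. via
# Search API's connector or an HTTP POST to the index's document API.
```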
We're trying to do this. Okay, we are trying to do this. That is what people want. I love that image; it's a little blurry, but I think you'll get the point.

Semantics. Semantics is everything when looking at search features. This is the biggest aha moment. When someone searches for something, if they type in the word "fly", do they want a bug, or are they getting on an airplane? If they type "bat", are they talking about an animal, or a device to hit a ball? No one knows. This is very confusing. Search terms can have various meanings; meaning is the key, and the semantics are really valuable to understand in this context.

So how can we extend this a bit? A lot of search features use this idea called tokenizing, where you split up what was searched to try to understand it better, in more context. If you split the input into specific terms, someone might type "fly on a plane", and then you have a specific semantic, or they say "fly on the wall", and that's implying a bug. Having those tokens to split up the different words can help you achieve the semantics we're looking for.

Another thing you can do as part of search is strip out stop words. These are words that just don't add value to the semantics; they don't have any real meaning. If someone types in "of" or "the" or "my", it's not really getting you anywhere. It's not adding any value to what is in the search index; it's like a dead word. So those are called stop words, and because they don't add to the semantics, you can basically strip them out.

The other thing to be mindful of around semantics is the idea of synonyms.
A lot of search backends and platforms have support for synonyms. You might search for "car", but maybe someone else wants "auto" or "vehicle". They're all the same idea, right? So if you're searching for "car", maybe you do want it to also match "auto", or maybe you give the person a suggestion: hey, did you want to search for "auto" as well? Having those kinds of features can be really useful for the experience someone wants when searching.

Stemming is also really cool for driving semantics. Say someone is looking for "study", but they type "studying" or "studies". These mean the same exact thing, right? So you can take the stem of the word and search for that, and there are tools to do that within some of these search platforms. That's really helpful, because then you're driving toward the semantics. You're actually getting what you want from a search even if someone uses a slightly different variation of the word.

Lemmatizing is basically the same idea, except instead of starting with the stem of the word, it handles variations of the same word: "grow", "grew", "grown". In the backend, semantically, they're all the same form of "grow", so you can understand that and search for it.

All right, everybody's favorite: spell checkers. I once mistyped my own name going into Google, and I really like having the "did you mean" suggestion engine. These are also features that sometimes come out of the box with these search backends and platforms, so you might want that. It's a good consideration. I mean, I fat-finger things all the time.
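Those ideas, tokenizing, stop words, and (very naive) stemming, can be sketched in a few lines of Python. Real backends do this with configurable analyzers; the stop-word set and suffix list here are toy assumptions to show the idea, nowhere near a real Porter or Snowball stemmer.

```python
import re

STOP_WORDS = {"of", "the", "my", "a", "on"}  # tiny illustrative list

def analyze(query):
    """Toy analysis chain: tokenize, drop stop words, crude stemming."""
    tokens = re.findall(r"[a-z]+", query.lower())         # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop words
    stemmed = []
    for t in tokens:
        # naive suffix stripping -- real stemmers are far more careful
        for suffix in ("ying", "ies", "ing", "y", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed
```

With this, "studying", "studies", and "study" all reduce to the same stem, and "fly on the wall" keeps only the tokens that carry meaning.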
Okay, so now we're getting into a little bit more of the science here. A lot of these search platforms have specific ways to perform their algorithms, and most of it comes through the querying engine, but there are a lot of things here to be mindful of. The key thing is relevance. If you're crawling multiple sources and storing multiple fields per scrape of a page, putting them into my title field and my body and my headings, you need a way to say: I want these headings to have more weight, more value, more relevance than the general contents, because headers are usually pretty important, especially for screen readers and such. So naturally we need a way to tell the algorithm to make this field a little stronger, give it a heavier weight than this other field over here. The outcome is basically defining the sort order in which matching results are returned; I'm going to get them in a very specific order. If you see the terms "weighting" or "boosting", that's the terminology usually used for this part of the algorithm, and there are a lot of different features that can be part of the algorithm on your site.

The other idea is filters and facets. If you have this index split up by titles and headings and bodies, maybe you want to very specifically filter based on one of those fields. That's a very relevant use case. Someone can do the Google thing and type in "men's Merrell hiking shoes", and it'll get lemmatized and tokenized and try to get exactly what they want. Or maybe someone is shopping on a shoe website and on the left side is a facet. They say: okay, I want men's, boom, I picked that. All right, next,
I'm going to pick hiking, boom, I picked that. Oh, and I see Merrell, boom, I picked that. They want to drill down because they know exactly what they want. That's the idea of filters and facets. So you want to make sure that any search platform you select is definitely capable of doing queries and this kind of feature set, and they primarily are.

All right, are we hanging in? I'm blazing, I'll tell you. This is a very fast talk. All right, search interfaces.

There are really two primary ways to build a search interface, in my experience. The first way is what I would call integrated, and integrated is kind of the same idea as having a site that's pushing data into my index: this approach puts a feature on your website specifically that pulls information from the search index. In this case, that's exactly what Search API does to retrieve information, or Views, the Drupal-specific solution. You can look at that as the CMS, or whatever framework you're using, doing its work, but you're building it specifically for the site. You're building an interface directly into your Drupal site. You still need to be mindful of the fact that you have to be able to generate queries, but you do get the ability to process very specific records, which is kind of cool, so you can really finely tailor the experience you want, site by site, with this approach. It's just like doing custom coding. That's nice and all, but it's not really for everybody, and we're talking about scenarios where we're scanning and crawling thousands of sites. What about a solution where we could have one interface that works on all of those sites, all two or three thousand of them? Let's look at that.
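Putting weighting and facets together, here's the shape of the Elasticsearch-style query a client might send: `multi_match` supports per-field boosts via the `field^boost` syntax, and a facet selection becomes a non-scoring `filter` clause. The field names come from the example schema; the `department` facet is invented for illustration.

```python
import json

# Elasticsearch-style query DSL sketch: boost title and headings over
# general contents, and turn a facet pick into a filter that narrows
# results without affecting relevance scores.
query = {
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": "mens merrell hiking shoes",
                    "fields": ["title^3", "headings^2", "contents"],
                }
            },
            "filter": [
                {"term": {"department": "mens"}},  # facet pick: Men's
            ],
        }
    }
}
payload = json.dumps(query)  # what actually goes over the wire
```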
All right. In that case, you'd want more of a decoupled solution. I've heard a lot about React; I'm sure it's been discussed a good bit. So that's one framework. Vue.js is another. You pick your modern JavaScript framework of choice, the framework du jour, soup of the day, and you build something that is agnostic to the site you're dropping it on. It's rendered client side, so you don't really have to worry about all the site-specific stuff.

What you do need to worry about is how you consume the data: you have to have an endpoint. You can either go directly to your Elasticsearch endpoint, which I don't usually recommend, or you can put some proxy in front as a security layer, which I feel is a best practice. That might be one other server, but we're still talking about something central that can be dropped into any of the websites.

The great thing about this approach is you can still theme it however you want. Every site can still have its own CSS, but you keep the JavaScript the same, render it the same way, and hit the endpoint the same way. All the logic is there and it never changes. The other good thing is it's highly performant, because you're pushing all that logic off to the client. You're not slowing down the backend processes of the site, and you don't really have to worry about anything on the site itself. And you can share the same exact artifact, the same app, across as many sites as you want.

Pretty slick, right? We made it, not bad, about seven minutes. Any questions? Can you come to the mic? That's in my instructions.
Audience: Thank you so much. The architecture you describe here is kind of interesting, and I'm thinking about implementing something very similar. So you're basically saying many Drupal sites, and then you have a backend that's pulling all this information, and then you search on any of the sites and it returns the search results in a specific place, so you're not really going to a third-party server?

Adam: Correct.

Audience: Are there models that have already been built to accomplish this? Is there already a recipe?

Adam: Oh, you mean is there code available to do this, something like that? So to repeat the question: what are the tools that can get this from end to end, versus having to build it yourself? I think, quite honestly, that's really where you're looking at a SaaS solution. Elastic has something called Swiftype where you can just go and pay for it, and it has its own crawler and does these things. In terms of something open source, I'm not a hundred percent sure. Elasticsearch itself is open source, and so is Solr, so you could stand them up and have parts of the solution. But if you're looking at something like Scrapy, it's extremely flexible; it's a framework, so you still need to build in that logic and create your index, and I think that's the piece that's probably going to take a bit of time. Nothing comes to mind that you could just install quickly and run with, but that's certainly something to look at in the future.

Audience: You were talking about Scrapy. Can you tell us about any specific features you've used in Scrapy?

Adam: Oh, I could be here all day. Where do I start? Part of it is not just Scrapy; part of it is the whole Python ecosystem, right?
We used an NLP library that was really phenomenal for language detection and language negotiation between the sources and the contents. We used tons of the string libraries to strip out and process data across the fields we were getting. And Scrapy itself has tons of plugins. It has a CSS selector plugin that we made use of; we did not use the XPath one, but you could if you wanted. It also has plugins for sources, so there are HTML plugins, XML, even Markdown and things like that. We specifically made use of the HTML one, but there's a whole host you can take advantage of. It's really fascinating.

Audience: You said you don't recommend a separate server for Elasticsearch. I was wondering what the considerations are, because we're using the decoupled approach with a reactive search library, and Elasticsearch sits in a SaaS solution on a cloud platform, and it works really well. So I was wondering what the considerations are.

Adam: Considerations like self-hosting versus using SaaS, is that the question?

Audience: Yeah, you said, like, a separate server for Elasticsearch, you don't really recommend that.

Adam: Oh, sorry, maybe I wasn't clear on that. When I'm talking about a separate server, I'm actually just talking about separating it from a source. Sometimes you might pay for, say, a Pantheon host or an Acquia host, and it comes with a Solr index. You need to make sure that that is separate from your Drupal database or your Drupal web server, because you don't want to slow those down. So make sure those are architecturally split up. You want one central server, and that makes sense for federated search, right?
Because if you're federating across two or three thousand websites, you can't park it in one place that's tied to one website. You should really put it somewhere completely separate, in a central location. Yeah, good question. Thank you.

Anything else? No? Thank you so much.