I'm associated with Atanand, a digital agency and Drupal agency based in New Delhi, India, and we are talking about federated search.

The use case for federated search comes up when an organization has multiple sites and wants users to be able to cross-search between them. A simple example could be a finance or banking company that has different sites for mortgages, loans, general banking, insurance, and everything else. Since those sites share a customer base, it is important for visitors to be able to explore content across all of them. Normally, each site has its own search that only covers the content on that site, whereas federated search unifies the search backend for all the sites and lets users cross-search across the different properties. The key benefits in this scenario are that content from all the sites becomes explorable, which gives visitors a better content-discovery or information-discovery experience, and that in turn results in better engagement and conversions.

In terms of building it, at a high level: the enterprise in our case was managing five or six different sites, all built on different platforms, Drupal, Umbraco, WordPress, and so on, and all of them managed by different vendors. The first thought is usually, why not just connect all the content via APIs and let people search? The friction is that all these different technologies and setups are maintained by different vendors, and it takes a lot of effort to coordinate between so many people to build a solution. One of the clear guidelines given to us was that we did not have the bandwidth to communicate with and engage all the teams together, so anything we built had to work quite independently, yet content from all the sites had to be searchable.

This next diagram explains the idea: we had five sites, all indexed in a single search backend, which can be Elasticsearch or Solr, and the different search interfaces on the different sites connect to that single Elasticsearch or Solr server. That is the key high-level idea it works on.

We'll go a little deeper with this diagram, which shows the detailed architecture, and this is where we will spend most of our time. Since we did not have the leverage to connect to all the sites via APIs by talking to the different vendors, we decided to crawl the sites instead. Scrapy is an open-source crawling framework written in Python; it is a pretty stable project and quite popular in the crawling space. We wrote crawlers in Scrapy to crawl the different sites and fetch the content, and we used Redis to manage the queue of URLs being crawled. The next step was to structure the content we were crawling, because when you crawl pages you just get an HTML dump.
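To make that concrete, here is a minimal sketch of what one of those Scrapy crawlers could look like. The spider name, domain, and field names are placeholders, and the Redis-backed queue handling is left out of this sketch.

```python
import scrapy


class SiteSpider(scrapy.Spider):
    """Minimal sketch of a crawler for one of the federated sites (all names are placeholders)."""

    name = "site_two"
    allowed_domains = ["loans.example.com"]
    start_urls = ["https://loans.example.com/"]

    def parse(self, response):
        # Emit the raw HTML dump; structuring happens later in the pipeline.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
            "html": response.text,
        }
        # Follow internal links so the whole site gets crawled.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```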
In certain cases it is important to extract structured information out of those pages. For events, for example, it is important to have the start date, the end date, and the venue, apart from the generic content. So we wrote a generic parser that identified just the body element of each page and extracted that content, and in some cases, such as events, we wrote custom parsers for specific content URLs. We used certain tags to identify, say, that this is an event page coming from site number two, and specific parsers then extracted structured information out of it (I'll show a rough sketch of this a bit later).

All of this ran in a pipeline. The first step was the crawler, with the queue managed by Redis. The second step was parsing: if a site-specific or content-specific parser existed in the system, it was used; otherwise the generic parser just extracted the body of the page and stripped out the HTML.

The next part of the pipeline was a Drupal backend, which was not a public-facing Drupal website. This was mainly because of how well Drupal works out of the box with Apache Solr or Elasticsearch, so it was an easy decision to feed everything coming from the crawlers, along with the structured information from the parsers, into a Drupal site. The structured data and the metadata, the meta descriptions and the page titles, were all stored in Drupal, and from there it was a regular Drupal-to-Elasticsearch connection, so all the content was indexed in Elasticsearch. Drupal was essentially used as middleware storage for the crawled content from the different sites, and Drupal was responsible for sending it to Elasticsearch.

On top of that, we built certain custom functionality in Drupal to give administrators more control over the crawl. If they wanted to force a crawl of a specific site, or if they had added a new event and wanted it reflected immediately, they had a force-crawl feature. They also had granular control over the results: they could flag a page as featured so it was highlighted within the search. All of that granular control and the force-crawl features were provided to administrators through the Drupal interface.

Once all the content was indexed in Elasticsearch, the next part was integrating with the different sites. We knew we did not have the leverage to ask all the different vendors to do custom development to connect to Elasticsearch and rebuild the search on their sites. So we created a React application that integrated with Elasticsearch, and we packaged it as a drop-in snippet: the vendors only needed to create an empty div and place that snippet on their sites, be it Umbraco or WordPress. The JS code, hosted from our end, was responsible for fetching the data from Elasticsearch.
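Coming back to the parsing step, here is a rough Python sketch of the dispatch between a generic body parser and a content-specific event parser. The real parsers were written in PHP, and the URL pattern, CSS selectors, and field names here are assumptions for illustration only.

```python
from bs4 import BeautifulSoup


def parse_generic(url, html):
    """Fallback parser: keep only the text of the <body> element, HTML stripped."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.body
    return {
        "url": url,
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "content": body.get_text(" ", strip=True) if body else "",
    }


def parse_event(url, html):
    """Content-specific parser for event pages (selectors are placeholders)."""
    soup = BeautifulSoup(html, "html.parser")
    doc = parse_generic(url, html)
    doc.update({
        "type": "event",
        "start_date": soup.select_one(".event-start").get_text(strip=True),
        "end_date": soup.select_one(".event-end").get_text(strip=True),
        "venue": soup.select_one(".event-venue").get_text(strip=True),
    })
    return doc


def parse(url, html):
    """Use a content-specific parser when the URL matches, otherwise fall back."""
    if "/events/" in url:  # assumption: event pages live under /events/
        return parse_event(url, html)
    return parse_generic(url, html)
```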
That was about building the search itself. The next part was making it voice-enabled. All the content indexed in Elasticsearch was fed to BERT from Google, which is a question-answering algorithm, and that helped us create an API that gives answers to question-based queries rather than just keyword-based queries. The step after that was connecting it to Amazon Alexa, and since we already had a question-answer API, Amazon Alexa works pretty straightforwardly with that.

So that was the overall architecture. The key challenge we faced was that this is a full crawl of all the sites, and the content needs to stay up to date. The four or five sites together had around 400,000 pages, and crawling all of them was a heavy infrastructure task. Most modern platforms handle the last-updated date in the response headers properly, but there are older platforms that do not respect that standard, so we could not rely on the headers to identify whether a page had changed. The key idea was to reduce that daily full crawl of all the pages, which meant we had to find another way to identify whether a page's content had changed. Since we were already using Redis as a queue for the URLs to be crawled, we also stored a hash of each page's HTML dump there. Every time the crawler fetched a page, it checked whether the hash had changed, and only then proceeded with the further steps in the pipeline (I'll show a rough sketch of this check in a moment). This reduced the overall time of the full crawl to a large extent. That was one of the key challenges, and that was more or less the architecture. Any questions or discussions regarding this?

Sorry, do you have any specific access requirements when people are searching content? All the content on all the sites was publicly available, so in this case access was not a concern. In certain cases it might be a requirement, though not for us, and the good thing about Scrapy is that the framework allows you to log into a system before crawling. So when you write the crawlers in Scrapy, you can actually log into systems and crawl content behind authentication. That is a possibility, but it was not the case for us.

What was the most resource-intensive service? The crawling.
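Coming back to the change-detection tweak mentioned above, here is a minimal sketch of the Redis hash check. The key naming and the choice of hash function are assumptions, not the exact implementation used.

```python
import hashlib

import redis

r = redis.Redis(host="localhost", port=6379, db=0)  # assumption: local Redis instance


def has_page_changed(url, html):
    """Return True only if the page's HTML differs from the hash stored last time.

    The hash of each crawled page is kept in Redis alongside the URL queue, so
    unchanged pages can be skipped instead of re-running the whole pipeline.
    """
    new_hash = hashlib.sha256(html.encode("utf-8")).hexdigest()
    old_hash = r.get(f"page_hash:{url}")
    if old_hash is not None and old_hash.decode("utf-8") == new_hash:
        return False  # content unchanged, skip parsing and re-indexing
    r.set(f"page_hash:{url}", new_hash)
    return True
```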
Storing all of that data in Drupal, since it was 400,000 pages, did we have any issues with MySQL? No, not at all. We were storing all the content in Drupal, but Drupal was never accessed directly by users or anyone else. Whenever content was pushed into Drupal over the REST API, the Drupal Elasticsearch integration automatically triggered the update on Elasticsearch or Solr (I'll show roughly what that push looks like in a moment). That was essentially the only use of Drupal, so we never ran into scalability or performance issues there. The connection between Drupal and the Elasticsearch or Solr index was done with the Search API modules.

A few more important pointers I'd like to bring forward about the stack used here: the crawler, as I said, was written in Python, and the parsers we wrote in plain PHP, because they were just one step in the pipeline and could have been written in Python as well. Being a pipeline structure, there was no real dependency between the pieces.

What were we using in terms of infrastructure? It ran on dedicated servers; that was a client preference, but it could have been done on AWS or any other hosting. The end sites ran on their own infrastructure and only placed our JS snippet on their pages; the complete system itself ran on a dedicated server.

Not at the moment, no. The next step we are working on right now is to make this overall setup open source. We are trying to dockerize it, especially the crawler and the parser mechanism.
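As an illustration of that push into the Drupal middleware, here is a small Python sketch that posts one parsed page to a Drupal REST endpoint. The endpoint path, credentials, content type machine name, and field names are all assumptions; the actual integration may have differed.

```python
import requests

# Assumption: Drupal core REST is enabled for node creation with basic auth;
# the URL, credentials, content type and field names below are placeholders.
DRUPAL_URL = "https://search-backend.example.com"
AUTH = ("crawler", "secret")


def push_to_drupal(doc):
    """Push one parsed page into the Drupal middleware; Search API then re-indexes it."""
    payload = {
        "type": [{"target_id": "crawled_page"}],
        "title": [{"value": doc["title"]}],
        "body": [{"value": doc["content"]}],
        "field_source_url": [{"value": doc["url"]}],
    }
    resp = requests.post(
        f"{DRUPAL_URL}/node?_format=json",
        json=payload,
        auth=AUTH,
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()
    return resp.json()
```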
In terms of development, how did we test all of that during the development phase across the five sites? The testing, yes, that was another challenge. Since the crawlers triggered every 24 hours, debugging an issue essentially meant manually running a full crawl, which in our case took at least 18 hours. So every small debugging task that needed a full crawl used to take 18 hours just to let the crawl run before we could take the next step, and debugging was a fairly painful process. In the beginning of development we kept the set of pages to be crawled really small, so we were able to do full crawls and test the complete lifecycle of all the pipeline steps much quicker. But once it was in production and issues came up, we had to wait for the full crawl to finish, so yes, that was the painful part.

On the question-answering side: Google has released the BERT algorithm, which takes your content and lets you search it in a question format rather than a keyword format. We used that algorithm and created an API over it, so we had the question API working for us. The Amazon Alexa integration is very simple once you have the API ready; the tricky part is getting the API ready. A lot of people try to create their own algorithms using NLP, and we tried that as well, but nothing came out as accurate as BERT.

To give an example from our own website: say your website has a page about Drupal security. With a general keyword search you might search for "important modules for Drupal security". But with all these voice-enabled applications coming in, people are changing their behaviour to asking questions: "What are the important modules for Drupal security?" You might have a 4,000-word article, and a regular keyword search will just give you a link to that article and you have to read through those 4,000 words. The question-answer mechanism not only gives you the link, it also identifies the snippet within the article where the probability of containing the answer is highest. So you get a 400-word snippet paragraph in the result that talks about the module names for Drupal security, rather than being asked to read the 4,000-word article. That is also the snippet Alexa reads out. Essentially, the key is to let users explore the specific part of the content rather than asking them to read the whole thing. That is where BERT helped.
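For a sense of what such a question-answering step can look like, here is a small sketch using the Hugging Face transformers library with a BERT-style extractive QA model. The model name, the example article, and the wrapping code are assumptions, not the exact setup built for this project.

```python
from transformers import pipeline

# A BERT-style extractive question-answering model (model choice is an assumption).
qa = pipeline("question-answering", model="deepset/bert-base-cased-squad2")

article = (
    "Securing a Drupal site involves keeping core and contributed modules up to "
    "date. Modules such as Security Kit, Password Policy and Login Security are "
    "commonly recommended, along with enforcing HTTPS and limiting user permissions."
)

result = qa(
    question="What are the important modules for Drupal security?",
    context=article,
)

# BERT does not generate text; it points at the span of the article most likely
# to contain the answer, which is what a search result or Alexa would surface.
print(result["answer"], result["score"])
```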
Can you go into a little more detail? Not about BERT specifically; sure, that part is an API. So BERT must have interpreted the question and converted it into a set of keywords which you then put to Elasticsearch? Yes. And are the pages also broken down in some way? What helped you identify the section of the article?

BERT helps you identify the section. You feed it the whole article, and when you make a query, it uses NLP to identify the keywords in the question. When we do keyword searches, they only pick up what I would call the contextual words, not the linguistic words like "how" and "why". BERT, on the other hand, identifies what exactly you are asking: whether you are looking for a reason or looking for a name. It looks at the whole question, tries to identify what you are looking for and in what context, and in that context it gives you the specific snippet out of the article, the part where you can find the "why" or the "what".

Last thing. So this was actually a snippet from the article? Yes, from the article. BERT does not generate text; it just identifies the snippet where the probability of containing the answer is maximum. To generate text there are other ways; I think it is from brain.ai, they have released an algorithm that can generate content from what you feed it, but the accuracy is not good enough, and for AI safety reasons they have released only a partial model rather than the full one. Generative text has a lot of repercussions if people misuse it. But this one doesn't generate anything.

All right, thank you. That's it. Thank you.