I will just hand it over to him. He will present us Frontera, a large-scale open source web crawling framework. Welcome, Alex.

Thank you, thanks. Hola, hello to all the participants. A few words about myself. I was born in Yekaterinburg in Russia, roughly in the middle of the country, about one and a half thousand kilometers from Moscow. I worked five years at Yandex. Yandex is the so-called Russian Google, the number one search giant in Russia. I was working in the search quality department and was responsible for the development of social search and question-and-answer search. We had access to the whole Twitter data, so we built our search on Twitter data. Later I moved to the Czech Republic and worked two years at Avast antivirus. It is now probably the most popular one in the world, with about 200 million users, and I was responsible for automatic false-positive resolution and large-scale prediction of malicious download attempts.

So let's go with Frontera. I put this quote here because "crawl frontier" became such a common term in the web crawling community. Basically, crawling works this way: you put in the seeds, the crawler starts to go there, gets some links, and then continues to follow the links it finds. The place where these links are stored before they get fetched is called the frontier. The term comes from shipping. Obviously all the Spanish guys know what a frontera is, but I just realized that "frontera", or "frontier", is not such a commonly used word, especially in countries with no sea. It is the place where all the stuff, people and goods, wait before they go to the land or to the sea.

A few words about motivation, why we decided to build Frontera. We had a client. They came to us and said: we are getting one billion pages per week, and we want you to process these pages and tell us what the biggest hubs are and which pages change frequently. So let's have a look at that: one billion per week, what does it mean? It means about 150 million pages per day, and about one and a half thousand per second. That was quite a lot. Later I will show you that current Scrapy throughput is about one and a half thousand pages per minute, not per second.

Here you see a very important picture. It is an illustration of hyperlink-induced topic search, HITS, from Jon Kleinberg. Hubs are the nodes with a lot of outgoing links, and the bigger nodes are authority sites, which have lots of incoming links. It is similar to scientific publications: if you are cited a lot, it means society gives you support. How do you calculate the hub and authority score on a link graph? That algorithm became history: now every major search system uses this kind of idea to rank pages and to do research with the link graph.
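Since the talk only gestures at how hub and authority scores are computed, here is a minimal, illustrative sketch of the standard HITS iteration on a toy link graph; this is textbook HITS for clarity, not code from Frontera or Yandex:

```python
# Minimal HITS (hubs and authorities) iteration on a toy link graph.
# graph maps each page to the pages it links to.
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

hubs = {page: 1.0 for page in graph}
auths = {page: 1.0 for page in graph}

for _ in range(50):  # fixed number of iterations, enough for a toy example
    # Authority score: sum of hub scores of the pages linking to you.
    auths = {p: sum(hubs[q] for q in graph if p in graph[q]) for p in graph}
    # Hub score: sum of authority scores of the pages you link to.
    hubs = {p: sum(auths[q] for q in graph[p]) for p in graph}
    # Normalize so the scores do not blow up.
    a_norm = sum(auths.values()) or 1.0
    h_norm = sum(hubs.values()) or 1.0
    auths = {p: s / a_norm for p, s in auths.items()}
    hubs = {p: s / h_norm for p, s in hubs.items()}

print(sorted(auths.items(), key=lambda kv: -kv[1]))  # biggest authorities first
```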
Another thing: it's not that Scrapy wasn't suitable for broad crawls, that's not true, but broad crawls with Scrapy were hard, really hard, and nobody was doing them; people favored Apache Nutch instead. We didn't like that, so we wanted to make people able to crawl whatever they want with Scrapy.

There are two modes of execution of Frontera: single-threaded and distributed. Frontera is mostly about what to crawl next and when; it basically guides the crawler about what to do next. Single-threaded mode is for up to about 100 websites. That figure is approximate because it heavily depends on the intensity of your parsing task: your documents can have a lot of links, which in turn gives you more documents, your websites can be less responsive than others, and sometimes spiders do additional post-processing, which is also CPU intensive. It's basically all about CPU. For performant broad crawls, there is the distributed mode.

Here are the main features of the single-threaded version. The main feature, from my point of view, is that it is a real-time Nutch. What does that mean? When you work with Nutch, the first thing you do is put in the seeds, then you run crawling for one cycle, then the whole thing stops and you need to run a command to process what was crawled and generate new links to crawl, a new batch, and then continue crawling. So it is batched and always proceeds in steps. Frontera is the opposite: everything is online, so it never stops. At the end of every batch a new batch is requested and the crawl just continues. Therefore we avoid waiting for the last URLs, which take too long to download; those of you who have experience with web crawling probably know about that problem. Actually, can you raise your hands: who has done broad crawls before? Okay, one person. Okay, who knows about Scrapy? Okay, that's much better.

Another thing is that we have a storage abstraction. Out of the box you have SQLAlchemy and HBase. SQLAlchemy means you can plug in any popular database, you know, MySQL, Postgres, Oracle and so on, or you can implement your own; there is a pretty straightforward interface.

The third thing is that we have a canonical URL resolution abstraction. This is a usually underestimated problem: even if you treat each page as unique content, each page from each website can have many URLs, so it is always a question which one to use. If you find the same content through two URLs and do not pay attention to this, you will end up with duplicates in your database. Here we provide an interface to implement your own canonical URL resolution; it could be different depending on your application.
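To make the canonical URL problem concrete, here is a small, hypothetical heuristic of the kind you might plug into such an interface: prefer an explicit rel="canonical" declared in the page, otherwise fall back to the final URL of the redirect chain (the heuristic the speaker mentions in the Q&A later). This is an illustrative sketch, not Frontera's actual resolver:

```python
import re
from urllib.parse import urljoin

def canonical_url(final_url: str, html: str) -> str:
    """Pick a canonical URL for a fetched page (illustrative heuristic only).

    final_url is where the redirect chain ended up, html is the response body.
    """
    # 1. If the webmaster declared <link rel="canonical" href="...">, trust it.
    #    (Regex assumes rel appears before href -- fine for a sketch.)
    match = re.search(
        r'<link[^>]+rel=["\']canonical["\'][^>]*href=["\']([^"\']+)["\']',
        html, flags=re.IGNORECASE)
    if match:
        return urljoin(final_url, match.group(1))
    # 2. Otherwise use the last URL of the redirect chain, so all aliases
    #    that redirect to the same place collapse into a single key.
    return final_url
```

A real resolver would also normalize query parameters, strip tracking arguments and so on; the point here is just that the logic is application-specific, which is why Frontera leaves it behind an interface.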
The last thing is the Scrapy ecosystem: we have a big community, good documentation, I believe, and it is really easy to customize, mostly because of Python.

So, you benefit from Frontera when you need a crawl frontier with metadata or content storage: you crawl a website, or an intranet, and you want to store the content or the metadata in a database. The second case is when you want to isolate your URL ordering and queueing from the spider. The third case is when you have pretty advanced URL ordering logic for big websites: if a website is so big that there is no way to crawl it fully, you can adjust the crawling logic so that it selects the best pages to crawl.

Here is the architecture of the single-threaded version. Let's go from right to left. You see the database and you see the backend. The backend is responsible for communication with the database, and the backend is also where the model for your ordering and queueing is coded; since that is tightly connected with the type of storage you use, it lives in the backend. Frontera middlewares allow you to modify the contents of requests and responses as you want, so you can add fingerprinting, change the meta fields or scoring fields, or anything else you need. The Frontera API is the API facing outside of the Frontera framework, which can be used by any other process management code or crawler. The crawler is the part that does DNS resolution and fetches content from the web; you can put anything you want here. Obviously we have everything for Scrapy, and we also have an example just to demonstrate that Frontera works well outside of Scrapy. And on the other side is the Internet.

Does this image ring a bell? I put it here because it shows how we are friends with Scrapy. Basically, Frontera is implemented as a set of a custom scheduler and a spider middleware for Scrapy. All of that is pluggable, and Frontera doesn't require Scrapy; it can be used separately. Mostly Scrapy is used for process management and fetching. And of course we are friends forever. The guys from Scrapy, well, not from Scrapy, from Scrapinghub, are always pushing me to integrate it even more, and my task is to stand against that, because I have to think about the community.

So here is a short quick start to try Frontera in single-threaded mode. First you have to install it, then you write a simple spider, maybe 20 lines of code including imports, or you can take the example one. Edit the spider's settings.py and put the scheduler and the Frontera spider middleware there, so Scrapy knows which scheduler to use, and the scheduler will later load all the Frontera stuff. Run the crawl, and that's it; check the database, if you use a database.
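As a rough illustration of those quick-start steps, here is what the spider and the settings changes might look like. The module paths for the scheduler and middlewares follow Frontera's documented Scrapy integration, but treat the exact names (and the placeholder project/module names) as assumptions and check the docs for your version:

```python
# myspider.py -- a tiny Scrapy spider; Frontera decides what gets crawled next.
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com"]  # placeholder seed

    def parse(self, response):
        # Yield every discovered link; the Frontera scheduler queues them.
        for href in response.css("a::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href))
```

```python
# settings.py -- plug Frontera into Scrapy (module paths may differ by version).
SCHEDULER = "frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler"

SPIDER_MIDDLEWARES = {
    "frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware": 1000,
}
DOWNLOADER_MIDDLEWARES = {
    "frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware": 1000,
}

# Frontera's own settings module, where the backend (SQLAlchemy, HBase, ...)
# is chosen; "myproject" is a placeholder.
FRONTERA_SETTINGS = "myproject.frontera_settings"
```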
Here is a list of use cases for the distributed version; it is a completely different story. The single-threaded version is meant for maybe 50 or 100 websites, and you know all of those websites, but when you run a broad crawl, you don't know what you will face. So: you have a set of URLs and you need to revisit them, and by a set I mean hundreds of thousands; you are building your own search engine and you need to get content from somewhere; you are doing research on the web graph, where Frontera can also be useful, and in that case you don't even need to save the content, which makes the work a bit easier. Or you have a topic and you want to crawl the documents about that topic: imagine you want to crawl everything about sports cars. You run Frontera, and after some time you have a lot of documents, much better than Google, because Google will show you only the first few pages, and it is not obvious how to get those pages out of Google anyway. And more general focused crawling tasks, as I mentioned: if you want to search some topic over a big corpus, you will benefit from Frontera.

Here is the architecture of the distributed version. I will just describe the data flow and how all of this operates. You put your seeds into the spiders, then these seeds are passed to the spider log by means of the Kafka transport; the spider log is a Kafka topic. From Kafka they get to the strategy worker and the DB worker. The strategy worker is responsible for all the scoring and for deciding when we have to stop the crawl, that is, when the crawling goal is achieved. The DB worker is responsible for storing URLs, new or old, it doesn't matter, and for producing new batches. The scoring log is where all the scores for URLs are passed to the DB worker. So: seeds go to the strategy worker and the DB worker; the strategy worker sees that these are new URLs that we have to crawl and calculates scores for them; the scores are propagated to the DB worker; the DB worker makes a new batch out of them, which is propagated to the spiders, and the spiders start downloading this batch of URLs. After that we get the content, we actually also do the parsing, and then we send this content by means of the spider log again to the strategy worker and the DB worker. The strategy worker extracts the links, looks at them, and if they are new they need to be scheduled; it calculates scores again and puts them into the scoring log, while the DB worker saves the information about what was downloaded, and so on. So basically we have a closed circle.

Actually, now I'm running out of time. You can put any strategy you want into the strategy worker; both the strategy worker and the DB worker are implemented in Python. Let's go on. Here are the main features of distributed Frontera. We use Kafka as the communication layer. We have a crawling strategy abstraction in the strategy workers, as I mentioned, so you can implement your crawling goal, your ordering and your scoring model as a separate module. It is polite by design, which means you will not get blocked, because each website will be downloaded by at most one spider; this is achieved by means of partitioning in Kafka.

Requirements: you need HBase and Kafka, and Scrapy 0.24 at least. The first two are easier to get by installing Cloudera CDH. You also need a DNS service, because we are doing DNS-intensive stuff, so it's better when your DNS service points to upstream servers of some big provider, for example OpenDNS.

Hardware requirements, quite an interesting slide. Here is how to calculate, from your needs, what hardware you need for Frontera. Typically each spider gives you about 1,200 pages per minute, including parsing, and the ratio of spiders to workers is about four to one. So here is an example: if you have 12 spiders, that will give you roughly 14,400 pages per minute, which means three strategy workers and three DB workers, 18 cores in total, because each worker consumes one core. Plenty of memory would also be nice for the strategy workers.
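That rule of thumb can be turned into a tiny capacity calculator. The constants (about 1,200 pages per minute per spider, a 4:1 spider-to-worker ratio, one core per process) are just the rough figures from this talk, so treat them as assumptions and adjust for your own setup:

```python
import math

# Rough capacity planning using the rules of thumb from the talk:
# ~1,200 pages/min per spider, spiders:workers ~= 4:1, one core per process.
PAGES_PER_SPIDER_PER_MIN = 1200
SPIDERS_PER_WORKER = 4

def plan(target_pages_per_min: int) -> dict:
    spiders = math.ceil(target_pages_per_min / PAGES_PER_SPIDER_PER_MIN)
    strategy_workers = math.ceil(spiders / SPIDERS_PER_WORKER)
    db_workers = math.ceil(spiders / SPIDERS_PER_WORKER)
    return {
        "spiders": spiders,
        "strategy_workers": strategy_workers,
        "db_workers": db_workers,
        # every spider and worker process is assumed to need its own core
        "total_cores": spiders + strategy_workers + db_workers,
    }

# The example from the talk: 12 spiders -> ~14,400 pages/min,
# 3 strategy workers + 3 DB workers, 18 cores total.
print(plan(14400))
```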
Some gotchas; I would rather skip these, if you don't mind, because we are running out of time. And here is a short quick start, but it is not quick at all: prepare HBase and Kafka and install distributed-frontera. I think if you already have HBase and Kafka, you will need two or three hours to get it running from scratch. All the instructions are on the project website; of course we will keep working on this, and at the moment the documentation is not in its best state.

So we made a quick Spanish crawl; I mentioned it just before the presentation. To test the frontier, we just wanted to find out what you guys are doing here in Spain besides playing football, and decided to check out what your biggest websites are. I took all the Spanish content from DMOZ, all the Spanish URLs, and put them in as seeds, with 12 spiders, and ran it for one and a half months. You probably know at least one of the websites at the top. In the end we crawled about 47 million pages. We found that you have at least 22 really big websites, but considering the number of domains found, we should have found much more, I think.

Here are some future plans. We definitely want a revisiting strategy out of the box: if you have performed a crawl, you then probably need to re-crawl it to pick up the changes in the content, and you also want to re-crawl in some order based on how the content is changing. PageRank- and HITS-based orderings: I already told you about HITS, and PageRank is just another link graph algorithm. We want our own URL parsing, like Scrapy has, so I guess we will get that soon. And we will test it on larger scales. Preguntas, preguntas, questions? Anyone?

Okay, so I have a small question. How do you guys work out canonical URLs? Because I think that might get really tricky on some pages. There are a few approaches. Some webmasters provide the canonical URL in the page content itself, so you can take it if it's there; that's the best case. If it is not there, what you can do is analyze the structure, for example if you have a chain of redirects you can take the last one in the chain. Basically it is a set of heuristics; there is no clear-cut decision. The goal of Frontera is to provide an interface for this, and that's it. If you look, you will find out that we are just picking the last one from the redirect chain, which gives us the ability to avoid duplicates.

I have a question. As well, Scrapy has a web-based dashboard; do your spiders work with it too? Actually, they should, yeah, because you can put your own scheduler and spider middleware into those spiders, and that should potentially work. As far as I know, the dashboard creates some rules and then Scrapy uses these rules. Sorry, what rules? Like spider rules, you know. Sorry, I don't remember. The thing is, I'm more dedicated to the crawling side, so honestly I'm not well aware of everything Scrapy is about. Let's talk later; I will just point you to the right guys.

Second question: do you use some asynchronous library? As I know, if you run your application single-threaded, do you use something asynchronous? Yes, we use Twisted, mostly because it helps to defer calls to some functions and it just makes the code more readable. So, yeah, thank you.

Okay, thank you. Maybe one quick last question before we change rooms. Is there anyone? Otherwise Alex will be outside. I can show you something interesting if you don't have questions. Okay, really quick, we have until :45, I think, when the next talk begins. This was done 15 years ago by Andrei Broder and others; they are from Yahoo research. This is the structure of the internet as they see it. In the middle we have a strongly connected component; there are a lot of links, websites highly connected inside. It looks like a butterfly, you know: here are the incoming links into this strongly connected component, and there are the outgoing links, a lot of them. And this butterfly has tendrils.
So it's like a beats like So These tendrils they have outgoing links and some tendrils have only Ingoing links like to the end or to the out and they have tubes You can bypass strongly connected component from in links right to the out links so And actually you also have a disconnected stuff that means There is a pages we will never find If we will just go and try to grow the internet So I wish some day these days Refine respect prove that it is wrong or it is true Okay, perfect. Thank you everyone