OK, hi, everyone. Thanks for coming. I'm very sorry about the technical difficulties; we clearly should have had a bit more time to set up and prepare. And please try not to look ahead too far in the slides. I know it's going to be difficult, but there you go.

OK, so I'm going to talk about web scraping best practices. I originally called this "advanced web scraping" because we're going to touch on a lot of advanced topics, but it's not advanced in the sense that you need to be past the beginner level to understand it. So I changed it to best practices, and I hope that everybody can follow this talk and understand what's going on. If you can't, please just shout or let me know.

So, a bit about me. About eight years ago, I started scraping in anger, and that was around the time we started the Scrapy web scraping framework. Since then, we've been involved in a couple of other projects, including Portia and Frontera. If you don't know what they are, don't worry, I'll get to them later.

So why would you want to scrape? Well, there are lots of good sources of data on the internet, and we come across a lot of companies, universities and research labs of all different sizes who are using web scraping. But getting data from the web is difficult. You can't rely on APIs, and you can't rely on semantic markup. That's where web scraping comes in.

These are some stats. You probably can't read them very well because the text is small, but basically web scraping has been on the increase recently. We've seen that ourselves, and it's also something other companies have been reporting. These stats are from a company called Incapsula that provides anti-bot technology, and it's a sample of their customers, so it's probably not completely representative of the internet as a whole, but it's still very interesting to see. Another thing I can see from this is that smaller websites have a larger percentage of bot traffic, probably because they have fewer users, and that's something to keep in mind: if you write bad bots, they cause more trouble for smaller websites. Smaller websites might have bandwidth limits, for example, and many HTTP libraries don't compress content, so you'll easily go over those limits. Also, of course, doing a bad job means your web scrapers are very hard to maintain. This is a notorious problem, because websites change.

So when I think about web scraping, I like to think of it in two parts. The first is actually getting the content: finding good sources of content and downloading it. And the second is extraction: actually extracting structured data from that downloaded content. I've structured this talk in two parts as well, following that split.

As an example of web scraping, I just said that Scrapinghub gets scraped all the time, and it's not just people testing out Scrapy or our other tools. A couple of weeks ago, we posted a job ad on our website, and the next day it was up on a job listing board. None of us posted it there. So we thought, well, how did that happen? And I think we were probably scraped. So a question for the audience would be to think about how you would write that scraper. I would break it down into: how do I find good sources of content, and how do I extract that data? It turns out that we tweeted about the job, with the hashtag remote working.
So maybe somebody picked it up from Twitter, got retweeted. That would be an easy source of content. And we did use semantic markup, so perhaps they extracted it from that. Writing a scraper that could do this is a relatively easy task; you could do it in a day, maybe. But if you wanted to handle cases where people didn't use semantic markup, or you wanted to find jobs that weren't tweeted about or posted to some other website, then it becomes a much bigger and much more complex task. And I think that highlights the scope of web scraping: from the very easy, fun hacks that don't take very long, to the very ambitious and very difficult projects.

So, moving on to downloading. Yeah, I've got to mention the Python requests library. Probably many people know it; it's a great library for HTTP, and doing simple things is simple, as it should be. But when you start scraping at a bit more scale, you really want to worry about a few other things, like, for example, retrying requests that fail. Certainly, when we started out, you'd run a web scrape and it might take days to finish, and then about three quarters of the way through you'd get a network error, or the website you're scraping would suddenly return 500 Internal Server Error for ten minutes. If you don't have some policy to handle this, it's a huge pain in the ass. So yeah, you want to think about that. In this example, you can see I'm using a session. Well, I don't know if you can see it or not, because it's small, but consider using sessions with requests: sessions handle cookies, and they also use connection keep-alive, so you don't end up repeatedly opening and closing connections to the sites you scrape.

But I would say, as soon as you start crawling, you really want to think about using Scrapy right away. This little example here is not much code. It uses Scrapy's CrawlSpider, which is a common pattern for crawling. Just defining one rule and a start URL is enough to go from the EuroPython website for this conference and follow all the links to speakers, and you just need to fill in some code to parse the speaker details. So it's really not much code, and it solves the problems I was highlighting, like retrying, et cetera. You can also cache the data locally, which is good if you're going to live-demo stuff.

Yeah. So a single crawl like that often turns into crawling multiple websites. At PyCon US in 2014, we did a demo, and it's up on Scrapinghub's GitHub account, called PyCon Speakers, where we actually scraped data from a whole lot of tech conferences. This is a really good example to look at because it shows how a Scrapy project looks when you've got a lot of spiders, and Scrapy provides a lot of facilities for managing that. You can list all the spiders that are there; a spider is the bit of logic that we write for a given website. It also shows best practices in that, with Scrapy, it's easy to put common logic in common places and share it across multiple spiders. When they're crawling the same type of thing, there's a lot of scope for code reuse. So for scraping multiple websites, yeah, Scrapy's a no-brainer.

So, some tips for crawling. Find good sources of links. Some people might not think about using sitemaps; Scrapy actually has a SitemapSpider that makes this very easy and transparent.
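To make those two points a bit more concrete, here are a couple of hedged sketches rather than the actual code from the slides. First, a requests Session with a retry policy mounted on it; the URL is just a placeholder.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# A Session gives you cookie handling and connection keep-alive;
# mounting an HTTPAdapter with a Retry policy handles transient failures.
session = requests.Session()
retries = Retry(total=5, backoff_factor=0.5,
                status_forcelist=[500, 502, 503, 504])
session.mount("http://", HTTPAdapter(max_retries=retries))
session.mount("https://", HTTPAdapter(max_retries=retries))

response = session.get("http://example.com/jobs")  # placeholder URL
response.raise_for_status()
```

And second, roughly what a CrawlSpider like the one just described might look like. The start URL, link pattern and XPath here are assumptions, not the real conference site structure.

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SpeakerSpider(CrawlSpider):
    name = "europython_speakers"
    # Hypothetical start URL; adjust for the real site layout.
    start_urls = ["https://ep2015.europython.eu/en/speakers/"]
    # One rule: follow links that look like speaker pages and parse them.
    rules = [
        Rule(LinkExtractor(allow=r"/speakers/"), callback="parse_speaker"),
    ]
    # custom_settings = {"HTTPCACHE_ENABLED": True}  # cache pages locally, handy for demos

    def parse_speaker(self, response):
        # Fill in whichever speaker details you need here.
        yield {
            "name": response.xpath("//h1/text()").extract_first(),
            "url": response.url,
        }
```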
But sitemaps can often be a much more efficient way to get to content. And that also means, of course, don't follow unnecessary links. You can waste an awful lot of resources, for everybody, following stuff that doesn't need to be followed. Consider crawl order. If you're discovering links on a website, it might make sense to crawl breadth-first and limit the depth you go to. This can help you avoid crawler traps, where maybe you're repeatedly scraping a calendar, for example, and just going through the dates; that's a common example. I used to work at a company that had a search engine, and crawlers every now and then would follow some link into it and then follow all the search facets, trying every permutation and combination of search. This generated huge load, of course.

So, as you decide to scale up. I was talking here about single-website scrapes, which is, I guess, the most common use case, especially for Scrapy. And single-website scrapes can be big, right? I mean, we frequently do maybe hundreds of millions of pages. But at scale, say, for example, you're writing a vertical search engine or a focused crawler, then we're talking maybe tens of billions or even hundreds of billions of discovered URLs. You might crawl a certain set of pages, but the number of URLs you discover on those pages, so the entire state you need to keep in your URL frontier, can be much, much larger. Maintaining all of that is a bit of a headache; it's a lot of data. One common way to do it is to write all that data somewhere and then perform a big batch computation to figure out the next set of unique URLs to crawl, typically using Hadoop or MapReduce. It's a very common thing; Nutch is maybe a good example of that. And then incremental crawling, or continuous crawling actually, is where you're continuously feeding new URLs to your crawlers. This has the advantage that you can respond much more quickly to changes; you don't need to stop the crawl and resume it. And nowadays maybe you want to repeatedly hit some websites, maybe you're following social media or other good sources of links. So it's much more useful, but it's much more complex at the same time, and it's a harder problem to solve.

Maintaining politeness is that little point down at the bottom, but it's really something you want to consider when you're crawling at any scale. Almost anybody can fire up a lot of instances nowadays on EC2, or your favourite cloud platform, and just download loads of pages really quickly without putting much thought into what those pages are, particularly the impact it's going to have on the websites you're crawling. In a larger crawl, where you're crawling from multiple servers, you would typically only crawl a single website from a single server, and that server can then maintain politeness, so you can ensure that whatever your crawling policies are, you don't break them.

So, Frontera. I thought I'd briefly mention it; Alexander Sibiryakov gave a talk on it yesterday. This is a Python project that we're working on that implements this crawl frontier: it maintains all the state about visited URLs and tells you what you should crawl next. There are a few different configurable backends for it, so you can use it embedded in your Scrapy crawl, or you can just use it via an API with your own thing.
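As a quick aside on the crawl-order and politeness tips above: in a plain Scrapy project, a lot of that comes down to settings. A rough sketch, with purely illustrative values.

```python
# settings.py (illustrative values only)

# Crawl breadth-first instead of Scrapy's default depth-first order,
# and cap the depth so you don't wander into crawler traps forever.
DEPTH_LIMIT = 5
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"

# Politeness: limit per-domain concurrency, add a delay, and let
# AutoThrottle adapt the rate to how the site is responding.
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 1.0
AUTOTHROTTLE_ENABLED = True
```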
Back to Frontera: it also implements some more sophisticated revisit policies, so if you want to go back to some pages more often than others, and maybe keep content fresh, it can do that. Alexander particularly talked about doing it at scale, and he's going to be talking about that in the poster session as well, so please come visit.

So, just to summarize quickly what we talked about for downloading: requests is an awesome library for simple cases, but once you start crawling, it's better to move to Scrapy quickly; maybe you'd even want to start there. And if you need to do anything really complicated, or sophisticated, or at scale, consider Frontera.

So, moving on to extraction. Extraction is the second part that I wanted to talk about. Of course, Python is a great language for extracting content, for messing with strings and with data. There are probably a lot of talks at this conference about managing data with Python, but even just the simple built-in features of the language and the standard library make it very easy to play with text content. Regular expressions, of course, are one thing that's built into the standard library, so I should mention something about them. Regular expressions are brilliant for textual content; they work great with things like telephone numbers or postcodes. But if you find yourself ever matching against HTML tags or HTML content, you've probably made a mistake, and there's probably going to be a better way to do it. I see regular expression code like this all the time, and yeah, it works fine, but it's hard to understand and to modify. And often it actually doesn't work fine.

So, other techniques: well, use HTML parsers. We have some great options. This is for when you want to extract based on the structure of HTML pages. Often you will say, OK, this area here, underneath that table, and an HTML parser is absolutely the way to go. On the right-hand side, I just had some examples of HTML parsers: lxml, html5lib, BeautifulSoup, Gumbo, and of course Python has its own built-in HTML parser. I'll talk about them a bit more in a minute, so don't worry if you can't see that. Just as a brief example of what they do: they take some raw HTML, which here looks like text, and create a parse tree. Then you use some technique to navigate that tree; usually these parsers provide some method to navigate the parse tree and extract the bits you're interested in. I don't know if you can see that, so I'll skip through this quickly, but I quite like XPath as a way to do this. In this case, just select all bold tags, or a bold tag under a div, or, you know, the text from the second div tag. It lets you specify rules, and it's really worth learning if you're going to be doing a lot of this.

Here's an example from Scrapy. You don't really need to read it, but basically Scrapy provides a nice way for you to call XPath or CSS selectors on responses. This is definitely the most common way to scrape content from a small set of known websites. I definitely want to mention BeautifulSoup as well. This is a very popular Python library. Maybe in the early days it was a bit slow, but with more recent versions you can use different parser backends, so you can even use BeautifulSoup on top of lxml. The main difference with the example I showed previously is that BeautifulSoup is a pure Python API.
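As a tiny illustration of the difference in flavour (the HTML snippet here is made up): the same bits extracted with lxml and XPath, and then with BeautifulSoup's Python API, running on top of the lxml parser.

```python
import lxml.html
from bs4 import BeautifulSoup

html = "<html><body><div><p>First</p></div><div><b>Second</b></div></body></html>"

# lxml: build a parse tree and query it with XPath expressions.
tree = lxml.html.fromstring(html)
print(tree.xpath("//b/text()"))         # all bold tags         -> ['Second']
print(tree.xpath("//div/b/text()"))     # bold tags under a div -> ['Second']
print(tree.xpath("//div[2]//text()"))   # text from the second div

# BeautifulSoup: same document, navigated with plain Python calls.
soup = BeautifulSoup(html, "lxml")
print([b.get_text() for b in soup.find_all("b")])
print(soup.find_all("div")[1].get_text())
```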
So you can navigate content using Python constructs and Python objects rather than XPath expressions.

The other thing is, of course, you might not need to do this at all. Maybe somebody has already written something to extract what you're looking for; there may be stuff you wouldn't even think of. In the early days, we wrote a loginform module for Scrapy that automatically finds and fills in login forms and logs you into websites. We have a dateparser module that takes textual strings and can build a date object from them. And webpager is another project that we wrote, which looks at an HTML page and will pull out the links that perform pagination.

I was going to live-demo this, but I think we're probably short on time and, yeah, maybe it's not worth tempting fate; we've had enough technical problems already. But Portia is a visual way to build web scrapers. It's applicable in many of the cases I previously mentioned where we would use XPath or BeautifulSoup. Its advantage is that you just visually say, oh, I want to select this element: this is the title, this is the image, this is the text. I was going to demo this by scraping the EuroPython website; maybe if somebody wants to drop by our booth later, I can show you. It's really good, and it can save you a lot of time. However, if you want custom extraction logic, it might not always work, and if you want to use any of the previously mentioned stuff, like automatically extracting dates, that might not be built into Portia yet.

So, scaling up extraction. Portia is great, and it's much quicker for writing extraction for websites, but at some point it becomes pointless again. You might be scraping 20 websites, but what about tens of thousands, or maybe even hundreds of thousands? At this point, you want to look at different techniques. There are some libraries that can extract articles from any page, and they're easy to use. I want to focus quickly on a library called WebStruct that we worked on, which helps with automatically extracting data from HTML pages. In the example, I'm going to use named entity recognition: we want to find elements in the text and assign them to categories.

So we start with annotating web pages: we basically label web pages with examples of what we want to extract. We're going to use a tool called WebAnnotator, but there are others. Here's an example of labelling. In this case, we want to find organization names, so the Old Tea Café is an organization, and we would label it within a sentence, within a page. That format is not so useful for machine learning and for the kind of tools we want to use, so the text is split into tokens. Each token in this case is a word, and we label every single token in the whole page as being either outside of what we're looking for, the beginning of an organization, or inside an organization. Given that encoding, we can apply more standard machine learning algorithms. In our case, we found conditional random fields to be a good way to go about it. An important point is that it takes into account the sequencing of tokens, the sequencing of information.

Then, features. We feed it basically not just the tokens themselves, but features. The features can be things about the token itself, but they can also take into account the surrounding context. This is a very important point.
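To give a feel for that encoding, here is a hand-rolled sketch of the idea (not WebStruct's actual API): each token gets a B/I/O label, and each token is described by a feature dictionary that can also look at its neighbours.

```python
# Hand-labelled example sentence: B- marks the beginning of an organization,
# I- marks tokens inside it, and O marks everything else.
tokens = ["Meet", "me", "at", "the", "Old",   "Tea",   "Cafe",  "tomorrow"]
labels = ["O",    "O",  "O",  "O",   "B-ORG", "I-ORG", "I-ORG", "O"]

def token_features(tokens, i):
    """Features for token i: the token itself plus a little surrounding context."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<START>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

X = [token_features(tokens, i) for i in range(len(tokens))]
y = labels
# Many pages' worth of (X, y) sequences would then be fed to a sequence
# model such as a CRF, for example via python-crfsuite.
```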
We can take into account the surrounding text, or even the HTML elements that a token is embedded in. It can be quite powerful. One way to do it, and this is what we've been doing recently, is to use our WebStruct project. It helps load the annotations that were done previously in WebAnnotator, calls back to Python modules that you write yourself to do the feature extraction, and then it interfaces with something like python-crfsuite to actually perform the extraction.

So, just briefly to summarize: we use slightly different technologies depending on the scale of extraction. HTML parsing and Portia are very good for a single page or a single website, or for multiple websites if we don't have too many. The machine learning approaches are very good if we have a lot of data; we compromise a bit on accuracy, maybe, but that's the nature of it.

I just wanted to briefly mention a sample project we've done recently. Actually, we're still working on it. You might know the Saatchi art gallery; we did a project with them to create content for their Global Gallery Guide. This is an ambitious project to showcase artworks and artists and exhibitions from around the world. It's a fun project, and it's nice to look at artworks all day. Of course, we use Scrapy for the crawling. We deployed it to Scrapy Cloud, which is a Scrapinghub service for running Scrapy crawls. In the crawl, we prioritize the links to follow using machine learning, so we don't waste too many resources on each website we scrape, and once we hit the target web pages, we use webpager, one of the tools I mentioned earlier, to paginate. That's the crawling side. On the extraction side, we use WebStruct, very much like I previously described. One interesting thing that came up: when we were extracting images for artists, we often got them wrong, and we had to classify them based on the image content, using face recognition to see which ones were artists versus artworks. It's working pretty well. This is scraping 11,000 websites and hopefully will continue to increase. And one important thing, of course, is to measure accuracy, to test everything, and to improve incrementally. It's also good not to treat these things too much like a black box, and to try to understand what's going on. Don't make random changes; it tends to not work so well.

So, briefly, we've covered downloading content and we've covered extracting. It seems like we have everything to go and scrape at large scale, but there are still plenty of problems, and we're just going to touch on a few in the last five minutes. Of course, web pages have an irregular structure. This can break your crawl pretty badly, and it happens all the time: superficially, some websites look like they're structured, but it turns out somebody was editing a template in a word processor or something, and there are just loads of variations that kill you. Other times, I don't know, maybe the developers have too much time on their hands and they write a million different kinds of templates. Or you discover halfway through that the website's doing multivariate testing and it looks different the next time you run your crawl. I wish there was a silver bullet or some solution I could offer you for these, but there's not.

Another problem that will come up is sites requiring JavaScript or browser rendering. We have a service called Splash, which is a scriptable browser that presents an HTTP API.
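For instance, a minimal sketch of hitting Splash's render.html endpoint directly over HTTP; the Splash address and target URL here are placeholders.

```python
import requests

# Ask a locally running Splash instance to render the page, wait briefly
# for JavaScript to run, and hand back the resulting HTML.
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "http://example.com/some-js-page", "wait": 0.5},
)
resp.raise_for_status()
html = resp.text  # rendered HTML, ready for your usual extraction code
```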
So this is very useful to integrate with Scrapy and some other services. You can write your scrapers in Python and just have Splash take care of the browser rendering, and you can script extensions in Lua. Selenium is another project: if you start thinking, like, OK, follow this link, type this here, Selenium is a great way to go. Finally, of course, you can look at the web inspector or something and see what's happening. This is maybe the most common approach for Scrapy programmers because it's quite efficient: often there's an API behind the scenes that you can actually use instead.

Proxy management is another thing you might want to consider, because some websites will give you different content depending on where you are. We crawled one website that actually did currency conversion, so I thought I was being very clever by selecting the currency at the start, but it turns out the website did a double conversion and some products were like a cent or two different, so we almost didn't discover it. Also, websites often ban hosting data centres where they've had one or two abusive bots; it could have been somebody else's bot, but this is just part of the nature of scraping from the cloud. So for reliability, and sometimes for speed, you might want to consider proxies. Please don't use open proxies; they sometimes modify content, and it's just not a good idea. Tor, I generally don't like for large-scale content scraping; it's not really what it's intended for. But we've done some things, maybe with government agencies or in the security area, where we really don't want any blowback from the scraping and it really needs to be anonymous. Otherwise, there are plenty of private providers, but they vary in quality.

Finally, the last slide: I just briefly want to mention the ethics of web scraping. I think the most important question to ask yourself is what harm your web scraping is doing, either on the technical side or with the content that you scrape. Are you misusing it? Are you hurting the sites you're getting it from? On the technical side, crawl at a reasonable rate, and it's best practice to identify yourself via the user agent and to respect robots.txt, especially on broad crawls, that's when you visit lots of websites. That's it. We have time for some questions. Thank you.

Wonderful talk, one question. Imagine you have to log into some website, and you use a tool that will generate some fake credentials and stuff, for example you have a profile of a programmer or a farmer or a rock star and so on. Thanks.

OK, so about logging into websites: the tool I mentioned just finds the login box and lets you configure the user ID that you want to use, so it doesn't handle managing multiple accounts. I have seen people do that, but it's not something I've done myself. So, sorry, that's all I can say about it really. Any other questions?

Hi. First of all, thanks for the scraping library. I mean, it's an awesome thing and we're using it on a daily basis.

That's great to hear. Actually, it's these guys in the audience you should be thanking. I don't know why they're all sitting down, but stand up, guys, stand up. These are the contributors, really, here. There are more of them up there, but I don't know why they're being shy. Sorry.

I probably have a few questions, but I'll only ask a couple, I guess. First, I'd like to mention PyQuery. That was an awesome change for us from XPath, and it's one thing we use regularly and it proves...
Yeah, I've heard of it, but I haven't really played with it properly, so I'll check it out. I think there might be scope for including other approaches to extraction.

And one is: did you maybe think about master spiders, or spiders that can... You said that APIs are brittle, but you could still think of web frameworks, and some behave in similar ways, and maybe you could get a way to extract certain information from certain kinds of websites.

Yeah, absolutely. We have a collection of spiders for all the forum engines, for example. It's not individual websites, it's the underlying engine powering them, and that works really well. We're building collections of those kinds of things. One thing about APIs: I didn't really mean to diss APIs in general, they're often quite useful, but in some cases they don't have the content you're after, and in some cases the content lags behind what's on the website. That's been my experience, but definitely, if there is a web API available, you should check it out. It works fine with Scrapy too.

OK, and just one last question, a little bit more technical. Do you have plans for anything to, I don't know, handle throttling, or handle robots.txt, or reschedule 500 errors, or something like that? I know there's an AutoThrottle plugin, but it slows you down significantly on a good website, though it does work for slow websites. Thanks.

You're welcome. Yeah, throttling is an interesting one, and often internally what we do is deploy with AutoThrottle by default and then override it when we know the website can do better. So it is the case, especially when you're crawling a single website or a small set of websites, that it's worth tuning it yourself; it's hard to find good heuristics, and it's definitely something we do all the time when we write individual scrapers. I'd be interested in your thoughts on how you could come up with better heuristics by default; it's definitely a very interesting topic. And retrying: again, Scrapy does retry stuff by default, but you can configure, for example, the HTTP error codes that signify an error you want to be retried, because they're not always consistent across websites.

Thank you.

So, a slight follow-up to the retry thing. You mentioned this briefly during the talk. Do you actually do things like backoffs and jitter and stuff? Because in my job we have very interesting situations with synchronized clients and other fun things that are good to avoid.

Yeah, definitely. And actually I glossed over a lot of details. I said we run in Scrapy Cloud, but that takes care of a lot of the kind of infrastructure that we typically need. And Alexander gave a talk on the crawl frontier, which is about crawling at scale, and there's a lot more that goes into that which happens outside of Scrapy itself. The first thing, of course, that we noticed as soon as we started crawling from EC2 was DNS errors all over the place. There are several technical hurdles that you need to overcome, I think, to do a larger crawl at any scale.

OK, thank you, Shane. Thanks very much.