OK. Good morning. Well, I'm a developer evangelist at Scrapinghub. I'm a Pythonista, and I've been into Django since its early releases. And I also like to reverse engineer stuff.

So, OK, let's check why we need web scraping. First of all, we always like APIs. I mean, if every service offered an API, it would be really awesome. But not every service does, and APIs come with trade-offs: the provider decides how you access the service, and the most interesting parts of those services are normally not available through the API. I can think of a couple of examples. For example, if you check Google Places, they only offer five reviews for each business, and normally you want to get all the reviews for a single business. So the only workaround is to use web scraping.

There is also this semantic web term that came up a couple of years ago. Who of you knows what RDF is, or any of those semantic vocabularies? I mean, nobody uses them. The web is really broken. There were some stats from Opera about five years ago; they were checking how broken the web really is, and they found that the most popular tag was title, not body. So that gives you a sense of how broken the web really is.

So what is web scraping? Well, the main goal of web scraping is to get structured data out of unstructured sources: in this case, web pages. And you may be asking, what kind of things can we do with web scraping? Well, as the last bullet point says, your imagination is the limit. But most of the typical examples are price monitoring, lead generation, and aggregating information. Let's say I want to aggregate job postings, or that kind of information.

And well, if we want to start with web scraping, we need to know HTTP. We need to speak HTTP. Some people think that is obvious, but it's not. All of us know that there are some methods like GET and POST; those are the typical ones, but there are way more methods than those. We also need to know the status codes, like 200, that's OK; 404, that's Not Found; 418, that's the "I'm a teapot" code; and the 5xx ones are the server error codes. And who knows what code 999 is for? Well, that's the code Yahoo responds with when you get blocked by them. We also need to know how to deal with headers and the query string. For example, the Accept-Language header is quite interesting, because it determines in which language you receive the website. The User-Agent header is also quite useful, not just for emulating a real browser, but also because you can emulate being a mobile device, and on many occasions it is way easier to scrape a mobile layout than a desktop one. And we also need to know how to deal with persistence, with cookies.

So if we want to perform a request using Python, well, we just check the standard library (as we remember, Python is batteries included) and there we find urllib2. But if you check its API, well, I don't recommend it to you unless you want to suffer. Instead there is the Python requests library. It's "HTTP for humans": it has a really clean API, I recommend it, and it is as easy to use as this. If you want to perform a request, it is just one line, plus the import.
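To make that concrete, here is a minimal sketch (the URL and the header values are just placeholders, not from the talk's slides):

```python
import requests

# Pretend to be a regular desktop browser and ask for English content.
# These header values are only illustrative placeholders.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Accept-Language": "en-GB,en;q=0.8",
}

response = requests.get("http://example.com/", headers=headers)
print(response.status_code)  # e.g. 200
html = response.text         # the raw HTML of the page
```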
So you perform that request and you get back a big chunk of string. How do you deal with that big chunk of data? Well, many people think about using regular expressions or string manipulation methods. But what they don't know is that HTML is not a regular language. There is a really famous Stack Overflow answer about this; basically, the last line is, "have you tried using an XML parser instead?"

OK, so what HTML parsers do we have available in the Python ecosystem? Well, we have lxml. It's a Pythonic binding for some really fast C libraries, and it's the de facto way to parse HTML, I think. And there is also Beautiful Soup. Beautiful Soup is not a parser itself; it's a wrapper around parsers. For example, it can use the HTML parser from the standard library, it offers a wrapper over lxml, and there is also html5lib, which works really well if you are trying to scrape really broken websites. I recommend that one.

OK, so let's check a full example of how to perform a request and get some data. In this case, I want to get all the talks from this conference. So I perform the request, I parse the response with lxml, and we just run some XPath expressions to get the data. It is really clean, it's really easy; I think everybody understands that piece of code. One thing that I want to say: many people try to avoid learning XPath. Well, if you want to be in the web scraping business, you need to learn XPath. Otherwise you end up with people trying to reinvent XPath. I have seen so many libraries lately, for example in Go, that try to do some kind of ad hoc XPath. It doesn't work.

So, we have this piece of code. Now let's say I want to perform two million requests to Amazon.com or whatever site. How does this piece of code scale? How do we test it, et cetera? It is not that easy. You can say, OK, let's spawn some threads, or let's go asynchronous with an event loop, or whatever. It doesn't work, or it works for a little while until it gets painful. So what I recommend is: switch to Scrapy early on. It has a bit of a learning curve, but it's really worth it.

For those of you who don't know Scrapy yet, you have already heard a bit about it in the earlier talk, and the creators of Scrapy are here in this room. Scrapy is an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way. Of course it's open source, and it has a really healthy community.

OK, let's get to it. Scrapy has an interactive console, an interactive shell. We can just launch it with scrapy shell plus the URL that we want to check. It's a really good tool for trying out XPaths and doing some quick tests, but it's also useful for debugging your spiders. It's as easy as running the command scrapy shell with a URL, and we get an interactive shell where we can play with some objects that are already populated. For example, we can access response.url, we can run an XPath against the response, and we can open the response in the browser or even fetch a new website. So if you are starting with Scrapy, I recommend you launch the Scrapy shell and play with it.

So let's start a Scrapy project. It's as easy as scrapy startproject plus the name of the project, and it creates a project layout. This is very similar to what Django does. And yeah, I mean, we got a lot of ideas from Django; it's a really good project. What I really like about Django is that it offers one way of doing things that works well every time, and Scrapy enforces that too. I really like it.

So OK, let's check what a spider looks like. Not that one; it's more like this. Well, it is just a Python class with some attributes and a callback method. So what is the anatomy of a spider?
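Roughly, a minimal spider of the kind shown on the slide might look like this (a sketch against a hypothetical example.com page; the class name and URLs are placeholders):

```python
import scrapy


class TalksSpider(scrapy.Spider):
    # Mandatory bits: a name, the domains we are allowed to crawl,
    # and the URLs that the first requests will be made to.
    name = "talks"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/1.html"]

    def parse(self, response):
        # Callback: Scrapy calls this with the response for each start URL.
        self.logger.info("Got a response from %s", response.url)
```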
We have some mandatory attributes at the class level, like the name of the spider, the allowed domains and the start URLs. The start URLs become the first requests that get performed. And that parse method is just a callback. So basically, the Scrapy engine performs a request to example.com/1.html, and we get the response back in that callback. In this case, we're just logging that we got a response.

Let's check a more advanced example. In this case, we are not using start URLs but a function, a generator in this case. We are generating three requests, to example.com/1, /2 and /3, and we are pointing them at the parse method as the callback. At the callback we are extracting data from those pages: basically, we run an XPath over some h3 elements and we yield items. We'll get to what items are later, but basically they are like dictionaries; they are the structured output of the scraping, and they're quite useful. And then we go through the links and we yield requests. So from the same callback we can yield either items or requests; it works out of the box. And, well, in the same example: we have recently released Scrapy 1.0, and now you are allowed to not use items at all. I will show what items are, but you can just yield plain Python dictionaries and it works as well.

So, items. Items are just a class with some attributes, and they define what the structured data looks like. They are pretty handy: to validate data, we have the item pipeline, and we can do plenty of stuff with them. What kind of stuff? For example, we have the concept of item loaders. Basically, they populate items (it's a bit like an ORM), and they have input and output processors, pre- and post-processors that are normal Python functions. So imagine that we are scraping, let's say, a date from some website, and we want to format that date into, let's say, ISO format. With item loaders, that is just a function that gets the date and transforms it.

And then we have item exporters. Scrapy has built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends. So it's quite cool: we can just run any spider and the results can go to FTP, to Amazon S3, or to the local file system, out of the box. And for all of you who use Django, there is a thing called DjangoItem: basically, we map an item to the definition of a Django model, and it just works. That's pretty useful.

So what happens under the hood? Well, this is the architecture. Basically, we have a thing called the Scrapy engine that runs on top of Twisted. We have a component called the scheduler that is in charge of scheduling requests, and the requests go to the downloader, which fetches the pages from the internet and feeds the responses back to the spiders. We have different stages and middlewares between all the stages, so we can modify the requests, modify the responses, modify the items; it's pretty pluggable. And then the spiders return either requests or items: the requests go back to the scheduler, and the items go through the item pipeline. So I think it's quite easy to see what the flow is.
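To make the items and item loaders part a bit more concrete before we get to pipelines, here is a small sketch (the TalkItem fields, the date format and the parse_talk helper are invented for illustration):

```python
import datetime

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst


def to_iso(value):
    # Input processor: turn a date like "21 July 2015" into ISO format.
    return datetime.datetime.strptime(value.strip(), "%d %B %Y").date().isoformat()


class TalkItem(scrapy.Item):
    title = scrapy.Field(output_processor=TakeFirst())
    date = scrapy.Field(input_processor=MapCompose(to_iso),
                        output_processor=TakeFirst())


def parse_talk(response):
    # In a real spider this would be the parse() callback.
    loader = ItemLoader(item=TalkItem(), response=response)
    loader.add_xpath("title", "//h1/text()")
    loader.add_xpath("date", "//span[@class='date']/text()")
    return loader.load_item()
```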
So what kind of things can we do in an item pipeline? For one, we can set default values for fields: imagine that we have some fields on an item and we want to set a default value. It's also quite useful for validating scraped data, so we can say which items are not valid. It's also quite useful for checking for duplicates: imagine that some website is really broken and pagination doesn't work properly, so you are getting the same page again and again; well, you don't want to get the same item twice. It is also useful for storing items: imagine that you want to save items to Amazon DynamoDB or any other database out there. You can just write an item pipeline that saves each item to Amazon DynamoDB, for example. And it's also the place to write third-party integrations. So if you have an item with a product description and you want to translate it into another language, you could say, OK, I will integrate with the Google Translate API, translate some fields, and the item gets the translated fields.

We also have middlewares. What are middlewares for? Well, they can process requests and responses. Basically, they are useful for session handling; that's out of the box in Scrapy, it already handles cookies for you. They are also useful for retrying requests: imagine that you get a 500 response or a malformed response from some website. You can say, OK, let's retry this request, and it goes back to the scheduler and gets scheduled again later. We can also modify requests, for example to say: I want to send this request through a specific proxy. And we can also use them for randomizing the user agent, so that each request uses a different user agent.

And well, Scrapy is batteries included. It has logging; since Scrapy 1.0, it's the standard Python logging. It also has a really powerful module for stats collection. It supports testing spiders through something called contracts. And it also offers a telnet console, which is a way to inspect an already running Scrapy process. So you can introspect a running spider and do quite a lot of things, like checking for memory leaks or pausing and resuming the spider. The telnet console is really handy.

For all of you who want to check out an example of a Scrapy project, just go to GitHub, to the Scrapinghub profile, and look for the Python speakers project. It has a lot of spiders for different conferences: we basically scrape the data of all the speakers of these conferences and do some visualization, so we can see how many of them are male or female. It is quite interesting to see how the number of women speakers is growing year by year.

OK, this is the interesting part for all of you who are really into scraping: how to avoid getting banned. Well, there's a handful of quick tips. First of all, rotate your user agent, or use a user agent that simulates a real browser, or even Googlebot. Also, disabling cookies is pretty much mandatory: if you are not accessing protected user data, you can just disable cookies and it works really well; normally these websites try to track you through cookies. Randomized download delays also help: they might be tracking how much time passes between requests, so you can randomize the time between requests and they can't detect a fixed pattern. And also use a pool of rotating IPs. That's the most classic approach: you can buy a bunch of proxies and send all the requests through those proxies.
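As a sketch of how some of those tips map onto a Scrapy project: COOKIES_ENABLED, DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY are standard Scrapy settings, while the rotating user agent middleware, its module path and the agent list below are invented for illustration:

```python
# settings.py
COOKIES_ENABLED = False           # don't let the site track us via cookies
DOWNLOAD_DELAY = 2                # base delay between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True   # vary the delay so there is no fixed pattern

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomUserAgentMiddleware": 400,  # hypothetical path
}
```

```python
# middlewares.py: a tiny downloader middleware that rotates user agents
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Googlebot/2.1 (+http://www.google.com/bot.html)",
]


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Pick a different user agent for every outgoing request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
```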
But there is also Crawlera. This is a product of ours, from Scrapinghub. Basically, we provide you an endpoint, you perform all the requests through it, and we take care of handling bans and blocks, replacing proxies, rotating them. So it's like magic, OK? Let's say I want to perform two million requests to Amazon, and it works.

I've been speaking about scraping mostly in terms of targeted crawls, but in case you want to know how you can approach a broad crawl, there is a library that we have open sourced recently. It's called Frontera. And well, just today my co-worker gave a presentation about it, so please check those slides, check YouTube, or come by our booth and we can discuss how to use Frontera.

So, OK, let's say we have a bunch of spiders and now we want to deploy them somewhere. Well, we have a tool called Scrapyd. Of course it's open source. It provides a web service; everything goes over JSON. Basically it's a service daemon to run Scrapy spiders, so you can deploy your project, schedule new spiders and new jobs, check the status of those jobs, and it works fine. But we also have Scrapy Cloud. It's also from Scrapinghub. It's a commercial platform, but we have a free quota, and basically it's a visual web interface where you can deploy your spiders, schedule them, manage them, monitor them. It's also really useful for QA people. If you are into Scrapy, I recommend you give it a try. We have a free quota, and if you come by our booth, we can provide you a bigger one.

A bit about us. Well, we do tons of open source, starting with Scrapy, and we have recently open sourced Frontera. You can just check our GitHub profile. I'm really proud of our team and of how they approach open source. We are a fully remote, distributed team: 110 people worldwide, fully remote, and we have really great talent out there. So, well, this is the mandatory sales slide. Basically, we do professional services around Scrapy and we have two products, Scrapy Cloud and Crawlera, which I've already told you about. So please, if you're interested, just ask at our booth about them. And well, we're hiring constantly, so if you want to work on Scrapy and write spiders, get in contact with us. It's a nice place to work, a fully remote team. And well, that's all. Gracias, thank you. And I think it's time for Q&A, so if you have any questions... We have a lot of time for questions, no worries, so I'll ask something. Did somebody raise their hand? Oh, there, sorry.

What about JavaScript-intensive websites, with lots of Ajax requests? In my experience I've been using Scrapy, but on top of that some headless browser like Splinter or something like that. Do you have plans to integrate something like that?

There is Splash. You can find it on our GitHub profile. Basically, it's a scriptable headless WebKit engine that offers a JSON API. And we also have another open source project called ScrapyJS that integrates Scrapy with Splash directly, so yeah, you can use Splash to perform all the requests through a WebKit engine.
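As an aside, Splash exposes its rendering through a plain HTTP API, so fetching the JavaScript-rendered HTML of a page can be as simple as this sketch (it assumes a Splash instance running locally on its default port 8050; the target URL is a placeholder):

```python
import requests

# Ask a local Splash instance to render the page and return the HTML
# after the JavaScript has executed.
response = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "http://example.com/", "wait": 0.5},
)
rendered_html = response.text
```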
Any more questions? The previous two talks have both been about HTML. Do you have any solution if, for example, the data I want is in a PDF?

That's a tricky one. Well, I mean, yeah, you can feed Scrapy with any kind of data, really, but I think you need to check one of the Python PDF libraries to deal with PDFs. You can still use all the Scrapy machinery, the pipelines, definitely. But yeah, at the Scrapy level there is no support for PDF; you need to use a third-party library. Thanks.

Not so much a question, but a comment. Of course, one of the big drawbacks of Scrapy is that it's not on Python 3 yet, so I just wanted to mention we'll be doing a sprint on that at the weekend. If anybody's available, please come.

Yeah, I mean, we are holding a Scrapy workshop this Friday. Also, for all of you who want to learn more about Scrapy, we have our booth outside, so please come by and say hi; we have some cool swag as well. And also, we are trying to hold some sprints this weekend about Scrapy, so if you're interested, please tell us. And yeah, we are really open to that. Thanks.