So first of all, I would like to thank the organizers of EuroPython for giving me the chance to stand here and speak in front of you. My talk today is Web Scraping in Python 101. If you're already experienced with web scraping, this may not be the right place for you, because it's a 101.

First, a little introduction about me. I'm Muhammad Yasoob Ullah Khalid. I'm a programmer, a high school student, a blogger, a Pythonista and a tea lover. As for my experience, I'm the creator of the Python Tips blog (freepythontips), I've made a couple of open source programs, and I'm a contributor to youtube-dl, a video downloader which supports more than 200 websites you can download videos from. Finally, I teach programming at my school to my friends, and this is my first ever conference.

So what is this talk going to be about? This talk is about web scraping, the libraries which are available for this job in Python, and which library is better for which job. I will also give an introduction to Scrapy and some of its internals, and I will tell you when and when not to use Scrapy.

So what is web scraping? "Web scraping, web harvesting or web data extraction is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing the low-level Hypertext Transfer Protocol or embedding a fully fledged web browser, such as Internet Explorer or Mozilla Firefox." That was Wikipedia; now let's come to my understanding of web scraping. In simple words, it's the method to extract data from a website that does not have an API, or when we want to extract a lot of data which we cannot get through an API because of rate limiting. Some websites only allow you a specific number of calls to their API, so if you want to extract a lot of data, you cannot do it through the API and you have to turn to web scraping. With web scraping we can extract any data which we can see while surfing the web.

So, the uses of web scraping in real life. There are a lot of use cases: we can extract product information, job postings and internships; we can extract offers and discounts from deal-of-the-day websites; we can extract information from curation websites; we can crawl forums and social websites; we can extract data to make a search engine just like Google or Yahoo; and we can gather weather data. These are just some use cases; there are a lot of other use cases as well.

So, the advantages of web scraping over using an API. First of all, web scraping is not rate limited. It is anonymous: you can scrape through a set of rotating IP addresses, cycling through them as you go.
You can also anonymously access a website through the Tor network. Some websites do not have an API: for example, Wikipedia did not have an API until some years ago, so you could only use web scraping to extract data from it. And some data is not accessible through an API: for example, with YouTube's API you cannot access the direct MP4 URL of a video. And there are many more reasons.

So, the essential parts of web scraping. Web scraping follows a basic workflow. First, you get the website using an HTTP library. Then you parse the HTML document using a parsing library. Finally, you store the results for further usage and analysis. I'll focus more on parsing, because it's the main bottleneck in web scraping.

So, the libraries available for this job in Python. Basically, these are the parsing libraries. First of all, we have BeautifulSoup. Then we have lxml. We also have re, the regular expressions library of Python; it is not really made for web scraping and it's unpopular in that regard, and I'll explain why later. Lastly, we have Scrapy, a fully fledged web-scraping framework created by Pablo Hoffman.

Some HTTP libraries for web scraping: first, we have the requests library. You can simply do requests.get(url) and then .text, and you have the HTML of the page. Then you can use urllib or urllib2: you just do urllib2.urlopen(url) and then .read(). Finally, you can use httplib or httplib2 if you want to go low level. But most of the time the requests library is the best one for this purpose.

Then we have the parsing libraries. First of all, we have BeautifulSoup. It has a really nice API: you simply do BeautifulSoup(html), passing in the HTML document as the argument, and then you can traverse the tree, using simply .title to get the title or .b to get the b tags. Then we have lxml: you simply do lxml.html.fromstring(html), passing in the HTML document as a string, and then you can apply XPath expressions to extract data from the document. Finally, we have regular expressions: you simply do re.findall() or re.search(), passing in the regular expression pattern and your document.

So let's look at them in a little more detail. First of all, BeautifulSoup. It has a beautiful API: you can just use the find and find_all methods, and that's really all you need. It's really easy to use, and it can handle broken markup really easily. A lot of websites do not have proper HTML markup, so if you come across a website which does not have proper markup, you should use BeautifulSoup, because it can handle that. It's purely in Python, but it's really slow, so most people disregard BeautifulSoup in production.

Then we have lxml. The lxml toolkit provides Pythonic bindings for the C libraries libxml2 and libxslt without sacrificing speed. It's a wrapper around those C libraries, so it's really fast, but it's not purely in Python, since it's a binding for C libraries. If you don't have a pure-Python requirement, use lxml. For example, when Google App Engine started, it didn't support lxml in the beginning; they support it now, but back then they supported other libraries, just not lxml. lxml works with all Python versions from 2.4 to 3.3, or should I say 3.4, because that's the latest version right now.
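To make those snippets concrete, here is a minimal sketch of the fetch-and-parse steps described above. It assumes the requests, beautifulsoup4 and lxml packages are installed, and the URL is just a placeholder.

```python
import re

import requests
from bs4 import BeautifulSoup
import lxml.html

# Fetch the page with an HTTP library (requests is usually the most convenient).
html = requests.get("https://example.com").text  # placeholder URL

# BeautifulSoup: friendly API, tolerant of broken markup.
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)                            # the <title> text
print([b.get_text() for b in soup.find_all("b")])   # every <b> tag

# lxml: bindings to libxml2/libxslt, queried here with XPath.
tree = lxml.html.fromstring(html)
print(tree.xpath("//title/text()"))

# re: only sensible for tiny, well-defined snippets of text.
match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
if match:
    print(match.group(1))
```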
Then we have regular expressions, the re library. It's part of the standard library; it's the regex library for Python. It's used only to extract minute amounts of text; parsing an entire HTML document is not really possible with regular expressions. Its unpopularity comes from the fact that it first requires you to learn its symbols, which can be really difficult: the dot, the asterisk, the dollar sign, the caret, \b, \w, \s, \d, and so on. It can become quite complex, because you have to combine all those symbols into a regular expression pattern before you can extract data from the document. However, it is purely baked into Python, by which I mean it's part of the standard library, it's very fast, as I will show in a moment, and it supports every Python version from 2.4 to 3.4.

Now, the comparison of BeautifulSoup, re and lxml. I wrote a simple test to measure the parsing speed of the three libraries. In the test, BeautifulSoup took 1851 milliseconds, lxml took 232 milliseconds and the regex took 7 milliseconds, just to parse the title out of an HTML document. So we can conclude that lxml took roughly 33x more time than re, and BeautifulSoup took roughly 265x more time than re. So if you only want to extract a minute amount of information, you should go with regular expressions.

So what do you do when your scraping needs are high, when you want to scrape millions of web pages every day like Google, when you want to make a broad-scale web scraper, when you want to use something that is thoroughly tested? Is there any solution? We have two solutions: you can either deploy your own custom-made scraper, or you can use a framework like Scrapy. I'll focus on Scrapy because it's a fully fledged, thoroughly tested framework. It's really fast, it's asynchronous so you can make a lot of requests in parallel, it's easy to use, it has everything you need to start scraping, from downloading the HTML to parsing it to storing the results, and it's made in Python.

So how does Scrapy compare to BeautifulSoup or lxml? BeautifulSoup and lxml are libraries for parsing; Scrapy is an application framework for writing web scrapers that crawl websites and extract data from them. In other words, comparing BeautifulSoup or lxml to Scrapy is like comparing Jinja2 to Django (I hope all of you know about Jinja2 and Django). That's what the Scrapy docs say. But the major negative point about Scrapy is that it only supports Python 2.7, not Python 3.x. The main reason is that it is based on the Twisted networking library, and they're already working on getting 3.x support into Twisted. So when Twisted gets 3.x support, Scrapy is on the way.

So when to use Scrapy: when you have to scrape millions of pages, when you want asynchronous support out of the box, when you don't want to reinvent the wheel, and when you're not afraid to learn something new. There's a beautiful quote I ran across recently: "If you are not willing to risk the unusual, you will have to settle for the ordinary", by Jim Rohn.

So, starting out with Scrapy. The workflow in Scrapy is very simple. First, you define a scraper. Then you define the items you are going to extract from the HTML document. Then you define the item pipeline, which is optional; it is just there to store the data. Finally, you run the scraper. I'll just demonstrate the basic building blocks of Scrapy, because I don't have enough time to write a full scraper. In Scrapy a scraper is called a spider, so if you see the term spider, don't worry, it's the same thing.
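Coming back to the speed comparison above, here is a minimal sketch of how such a timing test could be written with timeit. The sample document and iteration count are made up, so the absolute numbers will not match the ones quoted in the talk; only the ordering (re fastest, then lxml, then BeautifulSoup) is the point.

```python
import re
import timeit

from bs4 import BeautifulSoup
import lxml.html

# A stand-in document; in the talk the test ran against a real page.
HTML = "<html><head><title>Hello EuroPython</title></head><body><p>hi</p></body></html>"

def with_bs4():
    return BeautifulSoup(HTML, "html.parser").title.string

def with_lxml():
    return lxml.html.fromstring(HTML).xpath("//title/text()")[0]

def with_re():
    return re.search(r"<title>(.*?)</title>", HTML).group(1)

for name, func in [("BeautifulSoup", with_bs4), ("lxml", with_lxml), ("re", with_re)]:
    # Extract the title 1000 times and report the total time in milliseconds.
    elapsed = timeit.timeit(func, number=1000)
    print("%-13s %8.1f ms" % (name, elapsed * 1000))
```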
So, using the Scrapy command line tool. Scrapy provides a handy command line tool which you can use to generate a basic skeleton of a scraper. Just run scrapy startproject followed by the project name; here it's "tutorial". You'll get the following directory structure: a configuration file (scrapy.cfg) and the project package with items.py, pipelines.py, settings.py and the spiders folder. A project can have multiple spiders.

So what is an item? Items are containers that will be loaded with the scraped data. They work like simple Python dictionaries but provide additional protection against populating undeclared fields, to prevent typos. So you know which data you are going to store and which you are not. Declaring an item class is really simple: just import scrapy and define a class. Here it's the DmozItem, taken from the Scrapy tutorial; I have defined title, link and description fields. scrapy.Field is really simple, and if you want to specify the fields further, you can pass arguments to scrapy.Field.

Now, extracting the data. If you want to test your XPath expressions, you can use Scrapy's handy shell tool; Scrapy ships with one. You simply type scrapy shell followed by the URL on which you want to test your XPath, and Scrapy will open a session for you. Scrapy provides XPath, CSS selectors and regexes to extract data from the HTML document. Extracting the title using XPath is really simple: sel.xpath() with the XPath expression inside, then you simply call .extract(), and that's it. That's how you extract data with Scrapy.

So, writing the first scraper. A spider is a class written by the user to scrape data from a website. Writing one is easy, just follow these steps. First, you subclass scrapy.Spider. You define the start_urls list; this is the list of URLs from which your spider will start crawling. You also define the allowed domains, so that your scraper does not deviate from the required domain. Then you define the parse method in your spider, which is where you decide how to parse the response and store the data. So here's a full spider. First, we write the name of the spider in the class; the name is required to run the spider later on. Then the allowed domains, then the start URLs so that the spider knows where to start scraping. Then there is the parse method: here we loop over the response, store the data in the item's fields, and then yield the items. And that's it.

So, unleash the spider. Now that we have defined a scraper, we can just type scrapy crawl dmoz after getting into the project folder.

So, storing the scraped data. Here we have two choices. First of all, we can use feed exports; they're really simple and store the data based on the fields which you have defined. Secondly, we have the item pipeline, which allows you to customize the way your scraped data is stored. Using feed exports, we simply do scrapy crawl dmoz and add -o, which stands for output, followed by the file where we want to store the scraped data. Item pipelines are a separate topic and will be covered some other time; if you want to read about them, just open the Scrapy docs, they have really good information there.
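Putting the walkthrough above together, here is roughly what the tutorial-style item and spider look like, modelled on the dmoz example from the Scrapy docs of that era (Scrapy 0.2x on Python 2.7); in a real project the item lives in items.py and the spider under spiders/, and the exact XPath expressions depend on the page being scraped.

```python
import scrapy

# items.py -- declaring the fields up front guards against typos later.
class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

# spiders/dmoz_spider.py -- the spider itself.
class DmozSpider(scrapy.Spider):
    name = "dmoz"                    # used on the command line: scrapy crawl dmoz
    allowed_domains = ["dmoz.org"]   # keeps the crawl from wandering off-site
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        # One item per listed entry; the XPaths match the dmoz listing layout.
        for sel in response.xpath("//ul/li"):
            item = DmozItem()
            item["title"] = sel.xpath("a/text()").extract()
            item["link"] = sel.xpath("a/@href").extract()
            item["desc"] = sel.xpath("text()").extract()
            yield item
```

From inside the project folder, `scrapy crawl dmoz` runs it, and `scrapy crawl dmoz -o items.json` writes the items out through a feed export.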
When not to use Scrapy: there are certain points you have to keep in mind. If you just want to make a throwaway script, don't use Scrapy. If you only want to crawl a small number of pages, you don't need Scrapy either, because it is really useful when you want to scrape a lot of pages. If you want to make something simple, don't use Scrapy. And if you want to reinvent the wheel, learn the basics and make a project to compete with Scrapy, go for it.

So what should you use? If you want to make a script that does not have to extract a lot of information, and you're not afraid of learning something new, then use regular expressions; you should use them only if you want to extract a minute amount of information from a web page. If you want to extract a lot of data and do not have a pure-Python library requirement, then use lxml, it's really fast. If you want to extract information from broken markup, then you have to settle for BeautifulSoup. And if you want to scrape a lot of pages and want to use a mature scraping framework, then use Scrapy.

So what do I prefer? Seriously speaking, I prefer regular expressions and Scrapy. I started web scraping with BeautifulSoup, as it was the easiest, and all the Stack Overflow questions had BeautifulSoup as the preferred solution. Then I started using lxml and soon found BeautifulSoup really slow; I already showed you the test where it took a lot more time compared to lxml. Then I used regular expressions for some time and fell in love with them for their speed. And now I use Scrapy only to make large scrapers or when I need to get a lot of data. Once I used Scrapy to scrape 69,000 torrent links from a website.

So now let's talk about youtube-dl. It's a program I contribute to, and it also uses web scraping on the back end. It's a Python script that allows you to download videos and music from various websites like Facebook, YouTube, Vimeo, Dailymotion, Metacafe and almost 300 more websites.

So, that was it. I hope you learned something about web scraping from this talk. It was my first conference, so forgive me for any mistakes. If you want to talk to me, just meet me outside, and if you want to ask something now, don't hesitate and I will try to answer. So, finally, questions.

Thank you very much for a very fast talk. We have plenty of time for questions, so please go ahead.

One challenging thing in using Scrapy is this: let's say there is a change in the HTML or DOM structure of a website. Is there any kind of exception handling we can use to detect a change in the DOM structure of the website, and how do we fall back or handle such a case?

With any scraping library, if the markup changes, your scraper will break and it will not store any data. You will have to change your scraper based on the changed layout of the website. There's usually no other way around it, so you'll have to modify your scraper a bit.

What if the site blocks your IP address?

If a site blocks our IP address, we have some workarounds. First of all, we can use IP rotation, or we can use the Tor network. There are also some professional websites which allow you to scrape through their service; they have IP rotation, so you can buy a lot of IPs and then rotate them. That's the only way you can bypass that.

Yeah, Scrapinghub. Yeah, I've been there, I've used it. Yeah, it's good. I like your logo a lot.
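For the Tor workaround mentioned in the answer above, a minimal sketch of routing requests through a local Tor SOCKS proxy; it assumes Tor is listening on 127.0.0.1:9050 and that the SOCKS extra for requests (requests[socks]) is installed, and the URL is a placeholder.

```python
import requests

# Route both HTTP and HTTPS traffic through the local Tor SOCKS proxy.
# "socks5h" (rather than "socks5") makes DNS resolution happen inside Tor too.
proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

html = requests.get("https://example.com", proxies=proxies, timeout=30).text  # placeholder URL
print(len(html))
```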
A somewhat related question: does Scrapy have any kind of rate limiting support? For example, I don't want to DDoS a site and I don't care much about latency, so I want to rate limit my scraping to one page per second.

Yes, you can do that. There's an option in the configuration file where you can limit how many web pages you want to open in parallel; you can open two pages at a time, or just one. If you open one page at a time, you can also set an option so that your scraper will wait before opening the next page, for example wait for two minutes before the next request. So if you don't want to put a lot of load on the server, you can use those settings.

The only negative point of Scrapy is that it currently doesn't support Python 3.4. Everyone is rushing towards Python 3, but Scrapy still doesn't support it. The Twisted networking library already has about 60 percent support for Python 3, and they are going to reach their milestone in a couple of months, so I hope Scrapy will get there.

Any more questions? Yes, one question: how do you deal with pages that are purely AJAX based, where they render the page with JavaScript? What is your workaround, because Scrapy parses the DOM, right? So what is your suggestion there?

The way I work around that problem is to use the Chrome inspector and inspect the AJAX calls. You can copy those AJAX calls; usually there's an API behind them. You can work out the pattern of the API URLs and then pass those URLs to Scrapy, because those API URLs return the data in HTML form.

Any more questions? No? Then thank you very much again.
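Picking up the rate-limiting question from the Q&A above, a minimal sketch of the relevant throttling settings in a Scrapy project's settings.py; the values are just examples.

```python
# settings.py (excerpt) -- be polite to the target site.

# How many requests Scrapy performs in parallel, overall and per domain.
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Wait this many seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 1.0

# Optionally let Scrapy adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
```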