So, we can start the next session. Welcome back, everyone, for our next talk. Anton is going to be talking about scraping the web, so let's welcome him.

Thank you, Fabio, for the introduction. "Beyond scraping" is the main title, and of course what lies beyond scraping depends on which side you're coming from, and I'm coming from the past. If you look at how scraping was 20 years ago, it was very easy: the way the web was built up for the user could easily be retrieved in an automatic fashion. Nowadays that's not possible anymore. You have JavaScript to make the experience much nicer for the end user, and if the data is presented for the end user but not in a way that lends itself to automated downloading, it can be very hard to get something done.

Before I start with the proper talk, I would like to see some hands. Who has used urllib from the standard library? Who has used requests? Who has used Beautiful Soup, preferably version 4? Who has used Selenium? Slightly fewer, but still. Who has used ZeroMQ? Here it gets interesting. And who has used PyVirtualDisplay? Okay, good, still some people. That's all the exercise you get, unless you want to leave early, of course. The talk is not very technical and you will not see any Python code on the slides, but these are the buzzwords: if you glue all of this together in the proper way, with the right idea behind it, you will be able to scrape current websites, I would say 99% of them, without too much trouble.

Some background on me. Most people only know me as the person who folds the t-shirts at the Python conference. By education I'm a computational linguist. Unfortunately I couldn't do anything with Python during that time, because while I was writing my thesis, Guido was writing the first Python interpreter. After that, or partly during that, I did 3D and 2D computer graphics, and I actually missed an opportunity in 1993 to start using Python: one of the students from the University of Amsterdam who started working for me introduced me to the language, but we already had a C program with two interpreted languages hanging off it, and I didn't want a third one in that program. But I liked Python, and I actually liked it because of the indentation. A lot of people don't understand that when they first look at Python, but I came from using Transputers and Occam 2, which used indentation and folding editors, so that was fine for me. I did some things with Python, and then in 1998 I finally got the opportunity to do something commercial, in Python 1.5.2 on Windows with Tkinter as the graphical user interface. Some people might know me from the C implementation of the ordered dictionary by Foord and Larosa, a very complete ordered dictionary, much more complete than the one in the standard library; I re-implemented that in C back in 2007, and that was my first experience with making Python packages. More recently I picked up a YAML parser that seemed to be kind of dead, the PyYAML parser, and made it into a YAML 1.2-compatible parser. I started that because I found it strange to have a human-readable data format whose parser throws away the comments when you read a file in and write it back out, so it is now a round-tripping parser. It does all kinds of extra things by now, and those are available from PyPI as packages.

So, scraping the web: what is the actual problem? Well, you want to download information from all kinds of websites, but sometimes you also want to change some state.
You want to interact with a website and change its state, not necessarily download data. You already know what is there, but you want to increase your score somewhere, or you want to make sure that somebody knows you visited, although you're actually on holiday lying on the beach and didn't want to start up your browser.

Before I go into detail, let's briefly look at web pages so you know what terminology I use. For me a web page, coarsely, is a tree structure of tags. A tag can have attributes, and a tag can have data. If you look at this small example of an HTML file, the tree structure is shown by the indentation; if you use the debugger in your browser, it often indents the HTML for you so you can actually see the structure. Of course you don't have to write HTML like that, you can write it all on one line, but then it's difficult to see what the structure is. If you look at the a tag there, the second from the bottom, it has three attributes, href, id and class, and it has some data inside it. Depending on which library you use to dig into the HTML, you can also say that that data is associated with the body element it sits in. It sometimes helps to take multiple pieces together, especially if you have things like italics: you might want to just take the enclosing tag and pick up the data from that, and the library often automatically does away with the intermediate tags and just concatenates the data.

A web page maps a URL to some data, and that mapping is often unique, but it doesn't have to be: you might get something different back for the same URL, and we're looking at that right now. The old way of changing what you get back is form data: you submit a form, and depending on how you filled it out you get a different result on the page you go to, although it's the same URL. It also happens that some state kept in a cookie influences what data you get for a specific URL. And nowadays it depends a lot on JavaScript what you actually get: there are websites that have only one URL, it never changes, but you get different data all the time depending on the state in the JavaScript executed on that single page.

A brief interlude. There are different ways of developing software, and I want to touch on that so you understand why I did things the way I did. You can use a complete framework that covers everything you want to do, learn it, and then implement the little part that you need within that framework, using configuration or writing some code, depending on the framework. There are frameworks for web backend development, and there are also more framework-like tools you can use for scraping. The other way is going from the bottom up: taking existing building blocks and gluing them together with your own code. If you develop, like I do, for a customer who is interested in getting results, a framework is not necessarily the best way to go. If the framework does exactly what you need and you don't have to change the framework itself, then you may well be better off with the framework.
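As a minimal sketch of that tree structure in code, here is a made-up HTML snippet, loosely modelled on the example above, parsed with Beautiful Soup 4; the tag names, attributes and URL are invented for illustration:

```python
from bs4 import BeautifulSoup  # Beautiful Soup 4

# A small, invented page: one <a> tag carrying the three attributes mentioned above.
html = """
<html>
  <body>
    <div class="import">
      <a href="https://somesite.com/report.pdf" id="report-link" class="download">
        Download the <i>latest</i> report
      </a>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
a = soup.find("a")
print(a["href"], a["id"], a["class"])      # the attributes of the tag
print(a.get_text(" ", strip=True))         # its data, with the intermediate <i> flattened
```

Note the last line: asking for the text of the a element concatenates the data and quietly does away with the intermediate italics tag, as described above.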
But if you need to dive into the framework and change the 10% of its code that you actually use, you first have to find that 10% and make the changes, and the biggest problem is that after the code has been running for a year without you looking at it, you have completely forgotten how the framework works, so you have a big problem updating your own code and understanding your own changes. If you glue blocks together and the blocks essentially do what you want, you only have to look at your own glue. That's the code you wrote yourself in the first place, and after a year you're much more likely to understand what you did. You might have to start from scratch, and you might well do it the same way. So I'm going to present a way of gluing together the building blocks I showed you earlier, the ones I asked for hands on.

Simple websites are the ones you can access with urllib2 and requests. Sometimes you have to submit form data to actually get to the data you need, and requests in particular helps you do that. If you have only ever used urllib2 to get the data you want, I recommend you look at requests. These libraries do some basic things for you, like following redirects. Handling things like cookies is more complex, and if there is JavaScript on the site, things really get bad: you have to look at what the JavaScript does, work out how to do that by hand, and see whether you can fetch the data the JavaScript fetches with a plain request and then insert it into the page or use it directly.

Cookies are used to keep state, and I mention them specifically because they are often used to preserve your authentication information. Data that is valuable enough to be worth getting off the web might not be available for free, so it's not as if you request a URL and get the data: you may have to log in first and only then proceed to getting the data. As for authentication, there was, and still is, the built-in authentication in your web browser, the rather coarse pop-up window where you enter a username and a password. More often there is a form you have to fill out on a web page, and the information from that form creates a cookie on the backend, which is then used to keep state. Over the last seven years or so, I'm not sure exactly how long, OpenID has come up, which allows you as a web developer to concentrate on getting across the information you want and not write too much login code. It also has an advantage if your website redirects to Yahoo or to Google: you can, if necessary, physically trace the person who logged in, because nowadays Google and Yahoo, when you set up a new account, ask for a telephone number to which they send a PIN code that you have to type in, and in Germany, where I live now, it's not possible to get a telephone account without showing your passport. So some backtracking can be done; it may be for convenience, but it may also be that people want to know you're a real person, or at least that there is a real telephone associated with the person who actually accesses the site.

If a site has JavaScript, then urllib2 and requests are of little use.
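As a minimal sketch of this simple, requests-based approach, assuming a hypothetical site with a plain login form (the URL and form field names are invented):

```python
import requests

session = requests.Session()      # a Session keeps cookies across calls

# Submit the login form; the authentication cookie the backend sets is stored
# in the session and sent along automatically with every later request.
resp = session.post(
    "https://example.com/login",
    data={"username": "me", "password": "secret"},
)
resp.raise_for_status()

# Subsequent requests reuse that cookie; redirects are followed for us.
report_page = session.get("https://example.com/reports?since=2015-07-01")
print(report_page.text[:200])
```

This is as far as urllib2 and requests will take you: none of the JavaScript on the page is executed.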
When JavaScript came up I did do this parsing of what the JavaScript does by hand, but you have to read the JavaScript, and it's often difficult to trace what it actually does. If you compare what you get with urllib2 to what you see in your browser, it is normally different; that is essentially the same as switching off JavaScript in your browser, and doing exactly that is often a good first indication of whether you can easily scrape a website or whether you need more advanced tools to get to the data you want.

What JavaScript does, as you probably all know, is update parts of the HTML tree by requesting additional data from the backend. Why do web developers do that? Primarily because it's a nicer user experience: if you don't have to reload the whole page you get quicker updates, which adds to the nicer experience and reduces the bandwidth you need. From a scraping perspective JavaScript has several downsides. You don't get to the data as easily, and the other big problem is that with JavaScript you essentially don't know when the page is finished. If you do a urllib2 request for a page, it comes back and you know you have all the data. If a page has JavaScript, you have to wait until it's done processing, but it might never be done processing: it might be waiting in a loop, or it might have some channel open for additional data to come from the backend, and you never know when it stops.

So if you can see something in your browser, you can probably use Selenium to start that browser, talk to the browser from Python, and get your material out. You use Selenium just like you would use the mouse: you drive the pages, you click on things if that's necessary, and you fill things out. Selenium was originally used for testing, or at least I originally used it for testing, and that is easy. Why is it easy? Because if you test something, you made the page yourself, and you just have to check that the page actually is what you expect it to be. You already know the structure, which IDs you've used, which classes you've used, and how to get to the particular elements in the HTML tree. The advantage of using Selenium is that there is never a discrepancy between what you see through the Selenium-opened browser and what a normal user sees, because you are actually using a browser; in principle you can get to anything a normal user can get to. A nice additional advantage of Selenium is that, because the browser stays open as long as the program has not exited, for example because you have a sleep loop or you're waiting for some input, you can just start the debugger, the browser's built-in one or Firebug, whatever works for you, and see what the page looks like. But the important thing is that the program has to keep running. As soon as your program stops, Selenium shuts down, it closes your browser, and you will not be able to see what went wrong. Because if something went wrong and you tried to access an element that is not there in the HTML tree, your program might crash, depending of course on how you wrote it, and any useful information you could have gotten from the browser is gone. You would have to start up a browser externally, go to the page, look at the structure: what did I expect? Oh, there's a new element there, they changed the backend, and then try to get at these things again.
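As a minimal sketch of driving a page with Selenium the way a user would, assuming Firefox and a hypothetical login page (the element IDs and URL are invented):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()            # opens a real browser window
driver.get("https://example.com/login")

driver.find_element(By.ID, "username").send_keys("me")
password = driver.find_element(By.ID, "password")
password.clear()                        # clear whatever is already in the field
password.send_keys("secret")
driver.find_element(By.ID, "login-button").click()

# As long as this script keeps running, the browser stays open and you can
# inspect the live page with the browser's own debugger or Firebug.
print(driver.current_url)
```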
So with Selenium you can do a superset of what you can do with urllib2 and requests, because all the JavaScript is handled correctly. There are two main differences, and one of them is this: with urllib or requests you don't open a browser, so you can easily use them from a cron job on a headless server without any problems. That is not possible with Selenium without some extra work: Selenium opens a browser, the browser needs a window, so you need a desktop.

Let's look at some more of the problems, for instance knowing when the data is there. The page loads, the JavaScript starts; the JavaScript has a special hook to wait until the complete page is loaded before it starts executing, but you have no clue when it has stopped executing. Sometimes you just wait five seconds because you know that in the normal situation the things you're interested in will have been loaded by then, but if you have a table of elements, there might be three rows loaded already and you don't know how many there are going to be. Is it done loading or not?

Second interlude. We saw that a web page has a structure, and there are different ways of getting to the particular piece of data on that page that you want to extract. What you want to extract is either data or an attribute value, for example a URL pointing to a PDF file or to another page. Depending on how the web page is built up, you can get at it by using the ID. The ID should be unique, although I've seen several pages, especially ones generated with Microsoft's CMS systems, that reused the same ID on the same page. At that point I decided not to use the ID, because I don't know whether the browser and Beautiful Soup will agree: the browser might take the first occurrence and Beautiful Soup the second, so let's not use that. Depending on how the website is structured you can search by class; if something is styled in a specific way with a specific class, you can get to one item that way, but it is not always the case that classes are not reused in several places on the page you're looking at. You can programmatically walk over the tree: at the top I have html, then body, and then I go down, down, down, but that is not particularly fast. There is also something called XPath, which, if you haven't used it yourself, is more or less a regular expression for getting to a particular piece of data based on tag names and attributes. XPath is not very complicated, but if you don't use it on a daily basis it's kind of hard to remember how to do things. There is a more reusable option that I tend to use, and that's CSS select. It's not as powerful as XPath, I think, but it's powerful enough for my purposes.
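As a minimal sketch of those different ways of locating an element, using Selenium and continuing with the driver from before (the IDs, class names and selectors are invented):

```python
from selenium.webdriver.common.by import By

by_id = driver.find_element(By.ID, "report-link")                      # by (hopefully unique) id
by_class = driver.find_elements(By.CLASS_NAME, "download")             # by class; may match several
by_xpath = driver.find_element(By.XPATH, "//div[@class='import']/a")   # XPath expression
by_css = driver.find_element(By.CSS_SELECTOR, "div.import > a")        # CSS selector
print(by_css.get_attribute("href"))     # e.g. the URL of the PDF behind the link
```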
A CSS selector looks like this, for instance. This one says: get any a element whose URL is an HTTPS URL on somesite.com. The URL may be longer; the caret means the href only has to start with that string. And the a element has to come after a div element that has "import" as a class. There are all kinds of rules like this. I think Selenium might not support this particular kind of selector, but Beautiful Soup, as far as I know, does. CSS selectors let you get to particular elements, and as soon as you're pointing at the a element you can, if you're interested in it, get the URL, which is the href attribute. CSS select has my preference over XPath because I can also use it when I make a website, in the CSS files that determine the look and feel of the site. But like I said, there are restrictions you have to be aware of: both Selenium and Beautiful Soup implement these selectors less completely than your browser does.

So what does a typical Selenium session look like, before we go into how to do it differently? You open a browser and go to a certain URL. You click the login button; we assume you have to authenticate. You wait until the redirect to the OpenID provider's site has been reached, and you provide your credentials. How to provide credentials automatically is of course a whole subject in itself, because you don't want everybody to be able to read your login name and password; one of the simpler approaches, if you're running Linux, is to make a subdirectory inside the .ssh directory, which already has, and is checked for, the restriction that it is accessible only by the owner of the files, and keep your credentials there. You wait until you get back to the requested page, once the OpenID session has notified your website that everything is okay. Then you fill out some search criteria, to restrict things to the new data that has been added since the last time you checked. Then you might get a table or a list of items, you click on one of the references in that table, and then you're finally there: you might be on the final page and extract the data from the HTML, or you find a link, perhaps to a PDF file or some other file.

The main problem with this is that debugging is very time consuming. Every time you log in you have to wait, and we're not talking about seconds. If, in the end, your program doesn't know exactly how to analyze the structure of the last page, the one where you actually retrieve the file reference or the textual data, you have to restart your program and it has to log in all over again. So we're talking about tens of seconds, if not a minute, before you can get to where you want, and if you have a client waiting, it's "oh, your software is not working anymore", which is kind of bad.

So how can we improve on that, so that you don't have to restart Selenium every time? There are probably several ways, but the way I solved it is by going to a client-server architecture, where the server talks with Selenium and my client can just crash, or be restarted, and continue where it left off. The server keeps the Selenium session open, and that keeps the browser open even if the client crashes. To do that you need some protocol for how to set this up; it doesn't have to be very sophisticated.
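Coming back to the selector at the start of this part: a minimal sketch of using it with Beautiful Soup's CSS select support, reusing the somesite.com and "import" values from the example and assuming the HTML was obtained earlier (here simply taken from the driver):

```python
from bs4 import BeautifulSoup

page_html = driver.page_source          # or HTML obtained some other way
soup = BeautifulSoup(page_html, "html.parser")

# <a> elements somewhere under a <div class="import"> whose href starts
# with the given string; the ^= is the "has to start with" part.
for a in soup.select('div.import a[href^="https://somesite.com"]'):
    print(a["href"], a.get_text(" ", strip=True))
```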
You send commands to the server, which are essentially requests, and you get data back from the server to the client for analysis and for knowing what state the website is in, so you can take appropriate action, or rewrite your client program to take a different action. When I originally set this up a couple of years ago, I thought about writing files with increasing numbers in the file names and having the server just watch a directory, but then I looked at ZeroMQ, and it allows you to do this kind of thing pretty easily. Among other things it allows a many-to-one connection between many clients and one server, and it also lets you have multiple threads within your client and still have one server. With ZeroMQ it is trivial to put the server side on a different machine, using port numbers and specifying which machine things are running on if they're not on localhost. ZeroMQ, not by default but if you set it up that way, allows Unicode-based exchanges, and that makes it much easier to handle the data: you might not use special characters in your protocol, but on the websites you download you are almost certain to find non-ASCII characters at some point, and you have to deal with those, so you might as well set up the whole thing using Unicode.

If you look at the session we went through before, getting to some data, with a client-server-based solution it looks slightly different. You open the browser, but only if it's not already open. You click the login button, but only if you're not logged in yet. If you're not logged in but you're already at the OpenID site, you don't have to go to the OpenID site, et cetera. You don't have to do things that are already done; you just pick up where you left off last time, and you have to check which of those things are done. It might be that only the final page with the data has changed, so you skip all the initial steps, just check that they're done, and then directly get your data. Your turnaround time when starting your client program goes down from ten seconds or a minute to a fraction of a second, and then you have your data.

So if you define a protocol, what do you need? The protocol sends a command with some parameters and gets a result back, so let's look at which commands you need and what parameters they take. There are only very few of them. You have to be able to open a window, and I use a specific window ID for that, so I can open multiple windows on the server side; if you don't do that you essentially have only one window to work with, and it becomes very difficult to do many-to-one or have multiple clients running, because they would be competing for the same window. Using that window ID you can say: go to some URL, and the page will show up in the web browser that Selenium has opened in the meantime. The next thing you need in the protocol is selecting a specific item, based on an item ID that you can reuse per page, together with the window ID. Then you want to interact with that item: you might want to click on it, to select a radio button or follow a specific link; you might want to clear an input or text area, because there may already be something where you want to write, for example clearing out the old, incorrect password before giving the new one; and you want to type text into it. And then, very importantly: return the HTML starting at a particular ID. You can of course get the complete HTML page back, but that's inefficient.
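A minimal sketch of how such a command protocol can be carried over ZeroMQ between a client and the server that owns the Selenium session; the command names, the JSON message format and the port number are invented here to illustrate the idea, not the actual protocol of the talk:

```python
import zmq

# ---- server side: owns the Selenium session, keeps the browser open ----
def serve(driver, port=5555):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)
    sock.bind("tcp://*:%d" % port)
    while True:
        msg = sock.recv_json()              # JSON keeps the exchange unicode-safe
        cmd = msg.get("cmd")
        if cmd == "goto":
            driver.get(msg["url"])
            sock.send_json({"ok": True})
        elif cmd == "url":
            sock.send_json({"ok": True, "url": driver.current_url})
        elif cmd == "source":
            sock.send_json({"ok": True, "html": driver.page_source})
        else:
            sock.send_json({"ok": False, "error": "unknown command"})

# ---- client side: can crash and be restarted, the browser survives ----
def ask(sock, **msg):
    sock.send_json(msg)
    return sock.recv_json()

def client(port=5555):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REQ)
    sock.connect("tcp://localhost:%d" % port)
    # pick up where the last run left off: only navigate if necessary
    if "reports" not in ask(sock, cmd="url")["url"]:
        ask(sock, cmd="goto", url="https://example.com/reports")
    return ask(sock, cmd="source")["html"]
```

Because the server process keeps running, the browser, and the logged-in session inside it, survives any crash or restart of the client.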
About returning HTML starting at a particular ID: you often already know that you're only interested in, say, one table, so you select the table using Selenium and get just that table back. The other thing that is almost necessary to have is: what is the current URL that I'm looking at? Because if you go to an OpenID page and you tell the server to click somewhere, you need to know whether it actually got back to your original site before you continue working, so you want the client to be able to ask the server what the current URL is. You can extend this protocol with whatever makes things more efficient; this is essentially where I stopped a year and a half ago, after adding a few things. It may just be more efficient to do things on the client side and then push them to the server.

When you get the HTML back you need to analyze it. I used Beautiful Soup for that; it's faster than walking the tree in Selenium and getting the individual items, although that is of course not useful if you have to click on the items, because then you still have to do it on the server side. As I already indicated, it has CSS select support. There is one caveat, though: you get a piece of an HTML page back, and Beautiful Soup wants a whole HTML page, so you put the fragment into a template string, inside a minimal html/body wrapper, and then you can hand it over.

So the first problem I solved with the client-server architecture is that your clients can crash and you don't have to start from scratch. But the whole setup still has the problem that you need a desktop where you actually start the browser. If you want to run something on a headless server, or you don't want a browser popping up at some point while you're typing an email, there is a solution using PyVirtualDisplay. It creates a virtual display that you can use to start the browser. You will not actually see the display, but for debugging purposes you can still get at it if you start a VNC session. What I normally do is not use VNC or PyVirtualDisplay while I'm developing; once it's running, it's fine, and if my client crashes anyway, I use VNC to connect to the PyVirtualDisplay that was started and see, oh, the browser stopped because of whatever. Sometimes you get silly things, like a website requiring you to change your password every six months, and you haven't done that, so you get completely different pages than you expected, because you never programmed for that.

There are different ways of extending this. What I have already done is restrict the advertisements: I often use the Firefox browser in the backend with a configuration that blocks ads, which of course loads ad-heavy pages much faster. Something that doesn't work with Selenium, but that the client-server architecture is capable of, is using Tor, by starting Firefox with its own extensions so you can drive it that way; it's slightly less powerful than Selenium, but for most purposes it's good enough.

Then, about the availability of the software: like in the previous talk, the software is not yet on PyPI. I need to remove some client-side parts that are proprietary to the clients I develop software for, because you would recognize where I scrape from, so I need to take that out. But once it gets up there you'll be able to find it on PyPI as the ruamel browser client and ruamel browser server packages, and I will also update the YouTube video with that information when it is available.
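Coming back to the headless setup for a moment: a minimal sketch of starting the browser inside a PyVirtualDisplay so it can run on a server without a desktop (the geometry and URL are arbitrary):

```python
from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=False, size=(1280, 1024))
display.start()                  # from here on, windows open in the virtual display

driver = webdriver.Firefox()     # the browser never appears on your own desktop
try:
    driver.get("https://example.com/")
    print(driver.title)
finally:
    driver.quit()
    display.stop()
```

For debugging, as described above, you can still get at the display with a VNC session, for instance by running the virtual display with a VNC-capable backend, and watch the browser as if it were on a normal desktop.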
So that's almost the end of my talk. I can take questions now, or I can also give some real-world examples of what I use this for, outside of client work. Let's do the questions first, then the real-world examples. There is a microphone; one, two.

Hi. Usually we have these kinds of problems when the page is a single-page application or JavaScript-driven, and there is usually an API behind it, right?

If there is an API available, you might just want to use the API to get the data; I'm looking at pages that are not designed for that and don't have an API. The main problem is that you need to be sure the page is completely loaded. You can try to look for some specific element on the page, whether it's already there or not. If you check immediately you might not have the table at all; then the table gets there, but you don't know whether all the rows have been loaded. There might be some indication that there are going to be, say, 15 results while the table has 10 items, so you know 5 results still need to be loaded. And sometimes it's just waiting and hoping that everything arrives in time.

This setup with Selenium looks pretty complex and uses a lot of components. Isn't it easier to do something like sleep for one second and check the content of the page?

You still need Selenium to check whether the content is there, or to download the page at all. If you don't use Selenium and go back to using requests, anything that gets loaded by JavaScript you will not get at all, because requests doesn't handle the JavaScript.

One problem we had when using Selenium to access data was that these pages sometimes have date pickers and other elements that do not allow you to type in data, and those are usually very complicated to automate. Have you had these problems, and do you have ideas for handling them?

There are actually multiple things you can do, and I have seen these problems. If I recall correctly, there are Selenium calls that just write into a field, but there are also Selenium calls where you click somewhere and then send characters, and then you have to make sure your cursor is at the right position so they end up in the right place. I've done that, for instance with Khan Academy; that website has those kinds of problems, and you can get around them. It's not trivial, but there are different ways of getting the data in, and I would have to check whether the protocol has an option for choosing which of the two to use; I don't recall.

Thanks for your talk. One problem I ran into when using Selenium to do about the same thing, not exactly the same way but with the same tools, is that a lot of people don't actually want their data scraped, so they use services like Distil Networks or Cloudflare: proxies that try to detect scraping patterns, and when they think you're not human, they show a captcha. Did you encounter this problem?

One of the reasons for the client-server architecture is exactly that. One of the most frequent tells I've seen is that they notice you log in, say, seven times a day: why is that, why doesn't the cookie persist, those kinds of things. Stack Overflow is one of the examples I have: they detect how often you refresh and restrict that, and if you want to advance in the review queues, get a thousand reviews and get the gold badge, you have to do special things and load-balance where you're actually looking from.
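Coming back to the first question, about knowing when a JavaScript-filled table is complete: a minimal sketch of waiting until the table has as many rows as the page says it will have, assuming the driver from before and invented element IDs and selectors:

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By

# The page (hypothetically) announces how many results there will be.
expected = int(driver.find_element(By.ID, "result-count").text)

def table_complete(drv):
    rows = drv.find_elements(By.CSS_SELECTOR, "table#results tbody tr")
    return rows if len(rows) >= expected else False

# Poll until all rows are there, or give up after 30 seconds.
rows = WebDriverWait(driver, 30).until(table_complete)
print(len(rows), "rows loaded")
```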
It depends, of course, on the site. It's a bit like a thief versus a better lock: they will look at the patterns you're producing and try to detect them, but essentially, if you have your program behave like a normal person, they can hardly kick you out. For some of the sites I scrape for clients that means the scraping takes two hours, but they only want it done once a day, and the site can hardly disallow that. Say they put up ten references to ten PDF files on a given day; they can assume you need time to read each PDF, and they don't want you to download all ten within five seconds, but if you download one every two minutes, you can still provide your client with the ten PDF files that were uploaded by the end of the day. That is the way I handle it: I just have my program behave as if it were a human.

Do we have time for one last question? Yes. I just wanted to add that there are ways to run Selenium headless without using PyVirtualDisplay, with PhantomJS and Chromium.

Yes, there are ways to run Selenium headless: Selenium has some modes where you don't get a browser window. The disadvantage is that it's not using a real browser, so that might be detected, and the other thing is that if things go wrong you have nothing to look at, you just have your HTML structure. The nice thing about using PyVirtualDisplay is that you can use VNC and see the browser you would normally be using: oh, it's in that state, it's much more recognizable. If it now, after six months, asks you to change your password, you see that you have to change your password, instead of getting HTML back and wondering what it is actually trying to do. But that is also possible; there are multiple ways of addressing these things, and everything has advantages and disadvantages.

Okay, thank you, Anton. Thank you very much, everyone.