Alright, so I'm going to attempt to do this with links to the code examples, which are on the GitHub repository, so we'll see if the Wi-Fi makes that smooth or not. All the examples are shared and public; I assume Martin will send the links out afterwards.

So I'm going to give a survey of web scraping tools. What is web scraping? It's the simple process of getting a computer to consume content that's aimed at people: copying prices off shopping websites, financial data off stock exchange websites, pulling tweets off Twitter if you wanted, all sorts of things. There are lots of legal and other issues in this space, and I'm not talking about any of that. All the examples are either pages I wrote or pages that are totally fine with you doing this. If you want to do this for fun, have a good time. If you want to do this as part of your job, speak to your legal department. Seriously. It's not bad, it's just very, very complicated.

Python has a tremendous number of packages for web scraping, covering all kinds of use cases. Python handles text really well, so once you've grabbed a blob of text off a website, Python's great. Getting that blob of text off the website can be a bit of a mess, and many of the packages rather overstate what they can do. It's also important to say that web scraping is entirely dominated by network performance. We're going to see in some of the examples that the Wi-Fi here, or Chrome I suppose, is the thing slowing everything down. So no package is faster than any other; it makes no difference, because almost no time is spent in Python. All the time goes on waiting for DNS lookups, JavaScript running, and things like that.

Then there's the web content itself, which is what drives which scraping tools you need. If you're looking at really, really simple content, that's what I've got in the text parsing box here: basically HTML, and well-formed HTML at that. I assume most people here have authored some kind of web page before, and made mistakes with where to put the greater-than and less-than symbols when doing it by hand. Yes? Okay, I'll take that as a yes. If it's actually valid and it's just HTML, you can maybe even get away with no package at all and just use the Python built-ins. If you've got real-world text, down here in the lower left-hand corner, you're going to need something that handles invalid markup. And you get invalid markup on government websites, on e-commerce websites, and on really simple websites where you can't believe that in 150 characters they've managed to produce invalid text, but they have. In the upper right-hand corner we've got stuff that includes JavaScript, HTML5, and any of the hundreds of web add-ins that exist on some browsers and not others. You're not going to be able to handle that yourself; you're going to need something that processes that craziness. And in the bottom right-hand corner you've got 99% of web pages: pages that have errors and use ridiculous add-ins you can't possibly handle yourself. We've all experienced the constant upgrades in Chrome and Safari and everything else; that all lives in this corner down here on the right.

Okay, so just a brief delve into what doesn't count as complicated. Stuff that's not complicated is stuff you can probably do yourself, which roughly means content from the 90s, and also forms you can POST and GET yourself. People have done form posting, I'm assuming.
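As a point of reference, here's a minimal sketch of that do-it-yourself case, using only the requests library; the URL and form field names are made up for illustration:

```python
import requests

# Plain GET, no scraping library involved.
resp = requests.get("https://example.org/search")
print(resp.status_code)

# POSTing a form by hand; the endpoint and field names here are hypothetical.
resp = requests.post(
    "https://example.org/search",
    data={"query": "apples", "page": "1"},
)
print(resp.text[:200])  # first bit of the returned HTML
```

If the page you're dealing with is happy being driven like that, you may not need much else from this talk.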
Anything that's got JavaScript, CSS3, anything where the browser version matters, anything that freezes your mobile phone or crashes whatever version you're running, you're really not going to be able to do yourself. If it's just plain text, roll your own very thin tools.

Okay, so we're going to write the simplest possible scraper. Let me bring up the example HTML here, which needs to get zoomed in a bit. Is this text large enough? Can people see that HTML? Okay. I can render this if people want, but I'm assuming everybody's good on basic HTML. Yes? Okay. So that is a very, very simple web page, about as simple as you can get.

All right, so some scraper code. We import a couple of packages here; we can come back to what packages do what. I've put all the stuff up on GitHub, as I said before. This uses the requests package, which is a simple library for fetching things over the network. We pull the page in, then down here we're parsing: we just say, here's a page, assuming it's a valid page, parse it. And then here, and we're going to dig into this a little further, we're writing an XPath, which is a way of describing, within the HTML or XML blob of text, what stuff we want. Then we very simply run it, pull the results out, and print them. Going back to this simple page, it gets the entries under strong, and then this pulls out the text; again, we'll get into how that works in a second. It should pull out one, two, three, four, five. So if we go in here, make our text big enough, and run example one with Python 3, it should, there you go: one, two, three, four, five. A very, very simple scraper, probably the simplest possible scraper.

Okay, so this XPath language we had buried in there is a really big, heavy, complicated language, most of which exists for XML craziness that isn't relevant to normal web scraping. But you can still use it, and it makes life very simple. Basically, your data is in a big tree: tags are nested, and you navigate through the tags. You've got slashes to separate one level of tags from the next, wildcards, and queries. So this tag, which should look familiar to people, you would ask for like this: the img tag with the src attribute whose value is x, y, z. Pretty straightforward. And if you want the a tag whose href is some particular text, you do this. You could probably give a full-day talk on the weirdness in XPath; there will be a couple of further examples throughout this, but it's a really long, deep topic that isn't necessary at this point.

Say you want to pull out the contents of a link using that. Very, very similar code; we're going to be looking at this XPath. Let me pull up the example code for that one, and, you know what, I should pull up the second HTML blob first. Here we've got an href, a link to an actual US government website. If we go back and look at our example code, everything is exactly the same except for this, which is our XPath; other than that, no change. And when we run example two, we get back out, ooh, sorry, not all the links from that page, I forget what my own code does. This one grabs the link: we scrape, we parse the link out, we retrieve the next page, and then I'm just printing the headers out, because that page is actually a big blob with JavaScript and all kinds of stuff you don't want. So this is, I guess, the minimal web crawler rather than the minimal scraper.
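For reference, here's roughly what those first two examples boil down to; the URL is a stand-in and the XPaths are the simplest possible ones, so treat this as a sketch rather than the exact code from the repository:

```python
import requests
from lxml import html

# Fetch the page and parse it into an element tree (assumes valid HTML).
page = requests.get("https://example.org/simple.html")
tree = html.fromstring(page.content)

# Example one, roughly: print the text of every <strong> element.
for node in tree.xpath("//strong"):
    print(node.text_content())

# Example two, roughly: grab the first link's href, fetch that page,
# and just print the response headers (assumes the href is absolute).
hrefs = tree.xpath("//a/@href")
if hrefs:
    next_page = requests.get(hrefs[0])
    print(next_page.headers)
```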
So we've grabbed a link off a page, and we've chased it somewhere else. A simple place to start, okay? One last little tidbit on XPath, which is going to come up in a moment when we look at some of the more famous simple Python scraping packages: you can navigate through this tree using the standard directory-style dot-dot-slash sort of thing. So this right here says go up a level from whatever element you've got, grab a div, and apply whatever conditions you've got here.

We can look at some real code pulling from, ah, all righty, so this is a non-trivial web page. We've got a table here where we're talking about fruit: a bunch of types of fruit, apples and pears and so on, with colors and prices. Now we go into our code. What we're looking for here, and I hope this isn't too much XPath weirdness, is the entry that is the apple; then we go up one and over to where the price is, and up one and over to where the color is. This is a little bit contrived because it's a pretty flat table, but you can imagine that on a real web page with 20-level-deep nested HTML, this is a lot better than searching with incredibly long paths from scratch every time. And even if they change where the tags go, where the ads are listed, the headers and so on, this kind of relative navigation will in general still work. We'll come back to maintaining scraping code a bit later. So we run that: we find the first element we listed, the apple text, and then we can search off that element by relative path for the other entries and print them out. Just as before, this is not a very exciting demo. I think it's example three; nope, I'm wrong again, it's example 2a, and it prints out the color and the price from our table. So that was apple, red, five. Relatively straightforward navigation.

Okay. Now we move on to real web pages, which are not things I wrote and posted to GitHub myself this afternoon. There's a well-known package called Beautiful Soup that many people have probably heard of, and some people may have used to try to scrape web pages. It's famous because the learning curve is incredibly short: it's basically a two-function API, parse and search. And that's fine, because you can do really simple stuff really, really fast. As we'll see, though, you start to hit a lot of complexity very quickly precisely because it's a two-function API.

So if I now look at this, it's a proper US government web page that is invalid, as are a surprising number of government web pages around the world. If we try to run exactly what we had before, requests.get and just parsing with lxml, on this page, and this is example 3a, we get a really unattractive complaint about how the markup isn't valid. You can't fix that, because you don't control the web page. So you need to do something else. That something else, sorry, wrong link, is to load up this other package that it seems many people use. So with Beautiful Soup, we run Beautiful Soup on our page instead of running the simple straight XML parser, and now we can run its only real method, which is find_all, and then we can get stuff out of that and print it or not. Again, really simple to use; I've just taught you basically the entire API. If you want to do more complicated things, however, it starts to get really bad, as we'll see in the next example.
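Before that, here's roughly what the Beautiful Soup version of the broken-page scrape looks like; the URL and the tag and attribute names are stand-ins I've made up, not the real government page:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the (invalid) page; Beautiful Soup's parser shrugs off the bad markup
# instead of complaining the way a strict XML parser would.
page = requests.get("https://example.gov/broken-page.html")
soup = BeautifulSoup(page.text, "html.parser")

# find_all is essentially the whole search API: a tag name plus attribute filters.
for cell in soup.find_all("td", attrs={"class": "price"}):
    print(cell.get_text())
```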
So here's an example piece of HTML that should be relatively familiar to people: you've run a search on some site, and it's paging, page one, page two, page three, whatever, out of 10 pages, with the page info in there. We can build these little checks in Python, and that's all pretty straightforward; that's what Python's good at. But then, trying to search for this combination of conditions, because Beautiful Soup doesn't have the world's greatest search API, we end up building some ridiculous set of for loops and if conditions. This can spiral out of control pretty fast. Or, because functions are just objects in Python, we can pass a function into find_all that does the comparison for us. None of this is particularly quick to write, right? So our simple two-function API just turned into a two-function API plus me writing my own functions and passing them into their functions for every case. That's not fantastic. We'd be much better off writing this XPath and just passing it in ourselves; there's no real complexity in that logic.

All righty, okay, now we want to grab fields out. This is going to look pretty familiar. Once you've got your elements, from whatever your find_all and whatever your tags are, you just grab the text off them; that's where your stuff is, whatever the thing you're scraping for, however you got there, for loops, if logic, whatever it is. And then, because it's Python, and this is clearly not code anyone should use but it fits well on a one-slide presentation, you can do whatever you need to do to the text you got back out; in this case we're, I think, building a datetime object from it. Clearly I don't want anyone to think I'm recommending writing that, but it fits on one line. And that's why we're in Python in the first place: because it's really simple to play around with the text we've extracted.

Okay, next. Now, JavaScript. Beautiful Soup has no interest in JavaScript. JavaScript needs its own interpreter, its own machinery to run, and as everyone who's ever used a web browser knows, it can behave poorly at times. So now we need a proper browser-driving package. We're going to use Selenium; there are a few, but that's, I think, the most full-featured one around. And then we're going to need a browser. We'll come to this a little later, but your only real choices for the browser are Chrome and Firefox. You can try to use others, but the thing that attaches Chrome, Firefox, whatever, to Python is an evolving piece of software that doesn't exist for most browsers. So if for some reason you can't use Chrome or Firefox, you really can't do most of what I'm going to talk about from here on out.

Okay, the code is very straightforward; you'll notice it's no longer than before. We bring in the Selenium package, and this is all we need to do to launch Chrome. At this point we are attached to a running Chrome that will do, within reason and with some limitations, whatever we tell it. We pass it some URL with get, pretty much the same as it was with requests. Then you've got a couple more searching functions than Beautiful Soup gives you, but not tremendously more. This here is some crazy generated name for a field on this web page; we could find it through Chrome's developer mode, but that's a few minutes of people's lives we don't need to spend, and everybody's used the inspect element stuff, I assume. And then we just get the element back. A very, very similar search.
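In rough outline, and with a made-up URL and field name standing in for the real page, that Selenium code is something like this:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                  # launches an actual Chrome window
driver.get("https://example.gov/some-form")  # stand-in for the demo page

# "some_generated_field_name" is a placeholder for whatever name
# Inspect Element shows you on the real page.
element = driver.find_element(By.NAME, "some_generated_field_name")
print(element.text)

driver.quit()
```

The point is just how little there is to it: launch, get, find, read.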
There are similar versions of that call that take XPaths and whatnot. We call it, it clicks on the button, and the browser does whatever the browser does when you click on that button. So if I run example five, we'll see a Chrome pop up and load that web page. My hands are not on the computer; you can see it pressing the download button, and then it got unhappy because I didn't fill all the forms out, because filling the forms out makes the code longer. We'll do that in a second. Everybody saw the button press and the stuff up here? Okay. So we'll quit this and kill that.

All right, so let's download the file behind that button. This is a little bit more work. You have to do enough to make the JavaScript happy to give you the file, and file downloads aren't really part of the browser: the browser gets the content and throws it off to the operating system. So you end up having to do some slightly ugly things here, and sometimes there's a terrible while loop while you wait for the file to show up.

Okay, so this code is a little bit longer. We need to make ourselves a place to put the downloads. This is just stuff you end up copying, I guess from my examples or from Stack Overflow, because there's no documentation, to tell Chrome where to put the downloads. Then you create your Chrome, telling it where to put them; again, these are just strings you have to Google and find somewhere. After that it's all very straightforward. We fire up the browser as before and get the URL. Here there's some stuff we have to fill out on the page, so I'll talk through that as it clicks, but you have to click some buttons and set some fields. This is a brief overview of the Selenium API: you can search by XPaths, this sort of thing here; you can get a Select object, which is how you fill out the value on a drop-down form field. It clicks, it pops up, it enters this text as though it's the entry you clicked on. You could scrape the text out, figure it out, and feed the values back yourself if you wanted; the code just gets longer and longer and longer. This grabs a button, clicks the button, all the stuff that's needed to get the page ready. And then, as before, we grab and click our download button.

Now, this is the unfortunately somewhat ugly part. Because the browser, as I'm sure you've noticed, just leaves you alone when downloading files until the file shows up, we end up sitting here in some terrible while loop waiting for a new file to appear and for it to be a valid zip file. I'm not going to apologize, because I didn't actually write Chrome. So, fair enough. If we go ahead and run, if I can remember, example six, and depending on how fast the Wi-Fi is... okay, so we got this. We've filled out the form, we've clicked the button, I don't know why I'm pointing here, and then we've got a download going. It says five minutes, so it's just going to sit here waiting. We're not going to wait with it the whole time; believe me, it gets there eventually. I guess I can leave it running for six minutes.

All right. So at this point we can navigate JavaScript, we can pull down files, we can chase links; we can do most of what we need to do. Once you've got the browser up and running, using code that's only marginally more complicated than the basic text parsing, the browser is your parser. And the browser can parse anything you can read with a web browser, because that's how you read all your web pages.
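A rough sketch of that download flow is below. The preference strings are the undocumented ones that get copied around from examples and Stack Overflow, the URL and button name are placeholders, and depending on your Selenium version the keyword may be options= or chrome_options=:

```python
import os
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

download_dir = os.path.abspath("downloads")
os.makedirs(download_dir, exist_ok=True)

# Tell Chrome where to put downloads and not to prompt about them.
opts = webdriver.ChromeOptions()
opts.add_experimental_option("prefs", {
    "download.default_directory": download_dir,
    "download.prompt_for_download": False,
})
driver = webdriver.Chrome(options=opts)

driver.get("https://example.gov/data-download")          # placeholder URL
driver.find_element(By.NAME, "download_button").click()  # placeholder button name

# The browser hands the file to the OS, so we just poll the directory until a
# finished file shows up (Chrome uses a .crdownload suffix while still writing).
while not any(not f.endswith(".crdownload") for f in os.listdir(download_dir)):
    time.sleep(1)

driver.quit()
```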
Which browser you use matters, in different ways than it does when you're doing this as a person. You may have a personal preference for Firefox or Safari or Internet Explorer or whatever, but a lot of the complicated web content out there interacts with the browsers in different ways. Things that generate tag names don't generate them the same way in different browsers, so you end up having to stop the scrape halfway through, use the developer console, inspect, and see what things are called. I've come across pages that will only scrape in Firefox, or only in Chrome, or that won't scrape in either; many scrape in both. So what you really care about is how complicated your page is. If you're looking at really basic stuff, none of this matters, and the second slide is probably enough. If you've got a really complicated page that's a real mess, it's likely at best to work in one browser, or only be scrapable in one browser. Pages that don't put ID names and class names on things, or that use non-unique strings for things that are supposed to be unique, get really difficult. Again, you don't care much if your JavaScript is pretty simple; as stuff gets worse, it breaks down and you end up just having to test. And as versions change and the page gets rewritten, you end up having to retest, retest, retest, which is pretty terrible, because no one tells you when the pages change.

Okay. Now, as I said before, your only real choices are Chrome and Firefox for doing this. That's not because I care; it's because they're the only ones with well-supported drivers between the browser and Python. There are a whole bunch of other browsers out there that you can sort of hook up to Python. For a while PhantomJS was okay, and then it sort of wound down. There's an Opera driver; it doesn't really work. There are kind of drivers for these other two, but not really, they don't work, and that's sort of the end of that. None of this has anything to do with how the pages appear, by the way; the differences are in stuff that isn't visible, like how the HTML gets rewritten or the way the browser chooses to cache JavaScript values, and you can only tell by running it and seeing what comes out the other side. None of it is documented in any way. Having spelunked through parts of the Chrome source before, that's also a complete waste of time: you're never going to figure it out, because it's too big.

Now, all the browsing I've done so far, including this one that's still running, has had a screen up here. We don't actually need the screen; it isn't helping us, because the computer is doing everything. You can run these things headless. Chrome and Firefox both run headless, and some of the others support variants of headless modes. This puts you in a strange position where you'll be debugging your code, you'll get "this element isn't visible at this time", and you can't see anything. And sometimes those error messages are not the same as the ones you get when it's not in headless mode, because none of this is a commercial product. Debugging by printouts is quite literally the only thing you can do, because you can't really attach a debugger to Chrome; I did try that once, and there's no point, you're not going to manage it.

So I can show you how to do headless. First, this brings up a Firefox: it's exactly the same code we had before except we now call webdriver.Firefox. That's it; I don't think it could get much easier. If we want to go headless, this, I promise you, is the same code except it adds the option -headless.
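Sketched out, with a placeholder URL and with the caveat that the exact keyword (options=, firefox_options=, or a headless attribute) has shifted between Selenium releases, it's roughly:

```python
from selenium import webdriver

opts = webdriver.FirefoxOptions()
opts.add_argument("-headless")            # the "dash headless" option from the slide

driver = webdriver.Firefox(options=opts)  # same code as before, plus the options
driver.get("https://example.gov/data")    # placeholder URL
print(driver.title)                       # prove we loaded something without a window
driver.quit()
```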
There's a different set of options you have to add to get Chrome to do the same thing. I'm happy to give people these things, or maybe I'll put up an example for Chrome as well. So, if we run, this is example 7a... oh, there we go, the file from earlier appeared; I think that script sleeps afterwards, so I'm going to kill it anyway. If I run example 7, it should bring up a Firefox. Nope, that was the headless one, sorry. So this, there we go, there's a Firefox. A one-line change, a half-line change really: Firefox, loading the same page, perfectly happy. I'm going to kill this because we don't need it. And I can run a headless Firefox, and it goes through and pulls out some text from that page, I think it's a date field again, and it's doing fine. Nothing else is different here: same ridiculous date logic, same find-element call, it's just called something different up here.

Alrighty. You can run these things in Docker containers, and you can run them on AWS and Google Cloud. Sort of; there are problems you don't really expect. Getting this to run in a Docker container on AWS took some weird adjusting of Docker default flags and things you just shouldn't have to learn, because Firefox needs a lot of shared memory, and that weirdness isn't explained anywhere because it's just not a common use case. But you can get it to work. The browsers themselves are very easy to install: you get your Docker image running Ubuntu, apt-get whatever, and everything's happy. The driver glue between the browser and Python is not; I can give people links. You have to go download geckodriver for Firefox, manually unzip it, and put it in the right place. Once you get all that working, it's all quite smooth. It's just, again, not commercial stuff: the Chrome driver is a Google product, but I guess they don't care about this use case that much.

Okay, so we've been through a sampling of what we can do with this, so we can come back to our original little matrix. If you've got HTML that's well-formed, you can use the really basic stuff if you want; you don't have to, you can use a browser. Poorly formed, you end up having to go a step up, and then over here we end up using a driver and some kind of browser for stuff that's more of a mess.

Okay, so this part is really important. I wasted a lot of time trying to figure out how to work around problems in the text because I didn't want to use a browser, which is a little bit heavier. Of course there is no chance I can write a better piece of software than Chrome for handling mangled web pages, and I just needed to accept that. Once you realize that, if you use the library properly it's super, super short, you feed it XPaths, and everybody's happy; I should never have wasted my time in the first place. The memory usage is not a big deal. Yes, it does require more; I don't think you can run it on the cheapest model of Raspberry Pi, although maybe you can. Versions, though, change all the time: there are new versions of these drivers and of Chrome all the time, and they do matter. With incompatibilities between the little glue driver and the browser, if you go a couple of weeks without updating things, it'll fall apart.
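For completeness, a rough equivalent for Chrome, again with a placeholder URL and the usual caveat that the flag and keyword names have moved around between Chrome and Selenium versions:

```python
from selenium import webdriver

opts = webdriver.ChromeOptions()
opts.add_argument("--headless")           # Chrome spells its flag slightly differently

driver = webdriver.Chrome(options=opts)
driver.get("https://example.gov/data")    # placeholder URL
print(driver.title)
driver.quit()
```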
So, similar to what I said before: if you can't use Chrome or Firefox, you can't do this. If you can't change big pieces of software in your environment frequently, you can't do this. Sorry.

So in practice, what are we really looking at? You can use the simple stuff when your HTML is well-formed, which is really saying something like a REST API: it's not JSON objects, it's just HTML, but it's generated by a computer, and that's fine. Otherwise, really, just learn how to use a browser driver and a browser, spend the time to learn enough XPath to get the job done, and move on. The really short learning curve on some of these packages doesn't get you very far up the mountain.

These are real comments from recent releases of the driver software. They're not meant to slag off the driver software; they're meant to let you know that this all works, but you need to stay on top of versions and keep checking things. That Chrome crashes when you navigate it to Gmail tells you this is not the world's most mature piece of software. The second one, I believe, is from Firefox: they weren't logging the arguments correctly. That's fine; this stuff does not have heavy test coverage. If it works for your case, that's great. With web scraping you're not going to do any damage, particularly if it's in a Docker container; if it crashes, who really cares, as long as it gets the right answer out the other side. But it's only part way there.

And because I'm sure there are still some people thinking you can do all of this yourself: you can, but these things are much worse than you think they are until you've tried them. Redirects, errors, and the like: at least use the requests library and not straight urllib with no assistance, or just opening sockets yourself. You can do this with the http and socket modules that are built into Python, but that isn't going to work out. And most importantly, you're not going to make it any faster, because your JavaScript interpreter is worse than Google's. I think that's a pretty non-controversial statement, actually.

So I hope that's a reasonable overview. I didn't get too far into any of the APIs, because that's a long talk in and of itself, and the biggest problem I had was figuring out which package I needed to use, not how to use the package once I'd figured out which one I needed. If that makes sense. Questions? Anybody? Please.

Okay. So, for the lxml package there's a parser variant called soupparser that uses a gentler, more error-tolerant parser. You can use that, but you can't get back out of it what it thinks the page was supposed to be. What you really want to do, I think, is load the page up in Chrome and then walk the object tree inside Chrome; then you could dump it out yourself if you wanted, and you can see what decisions it's made. I've looked at bits of this from time to time. Real web pages, once all the add-ins are rolled in, get to be so huge, and the stuff they reference changes so often, that, I mean, if you're testing your own application, fair enough, but beyond that you're not going to manage to make your way through it. It's just horrible. Okay.
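As a side note on that lxml soupparser fallback: it hands the messy markup to Beautiful Soup and gives you back an lxml tree, so the XPath-style code from earlier still works. A tiny sketch, with a made-up snippet of broken HTML:

```python
from lxml.html import soupparser

broken = "<html><body><p>unclosed paragraph<td>stray cell</body>"

# Beautiful Soup does the tolerant parsing; what comes back is an lxml element,
# so the usual .xpath() calls are available on it.
tree = soupparser.fromstring(broken)
print(tree.xpath("//p/text()"))
```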
Okay, yes, so there are other tools you can use. Scrapy does more with XPaths, but I just think a lot of what it gives you, you should really rely on Chrome for instead. I've been through that exercise, and feeding the XPaths into Chrome just does everything. Scrapy's treatment of JavaScript and the like is basically non-existent. So there are better tools for particular cases, but all of a sudden the page you're scraping is going to add some new JavaScript add-in that the tool doesn't handle, and then you're going to have to start over. If you're testing your own application, have a good time, because you know it's not going to use JavaScript you can't handle. But nothing else is going to support much beyond, you know, Chrome; IE or Safari maybe, but there aren't really good drivers. So it's a great package for the slice it works in, and this is what I was struggling with at the beginning, but it just doesn't do everything.

You mean, how do you figure out how hard a page is going to be? Okay, so let's do this. This may be a little hard to see at this size; I don't know if people can read the text. Maybe. Okay. So, just to figure out where this download button is, and this is a page I already know my way around: we have to find some unique way of identifying it so we can click on it. In this case the name turns out to be unique, so it's not too bad, but we have to select and search. Then we have to figure out which buttons we need to click, and here, you know, that's on this slide. Similarly, you're going to have to select this in the menu, and the rest of it. I already know my way around this page, remember; I'm not starting from scratch, and this page isn't too bad. When you get pages of results, where you have to go page one, page two, page three, it takes longer to figure things out; you end up repeatedly clicking and waiting.

There is, in Selenium, and this is a thing where Scrapy for example doesn't have the concept, a way to tell it to wait on this line of code until such-and-such an object is visible. So if I click on page two of three, I can say: click, all right, now wait until the thing for page three is available, and it will hang out until that's done, then move on. That stuff can take a really long time to get right. And then you have horrible web pages that pop up a "dot dot dot processing dot dot dot" in the middle, and you have to wait for that to appear, catch the error in case it was too slow and never appeared, then wait for it to disappear, then wait three seconds anyway, then search for something else, and then move on. That can take three days to figure out.

A different question: at what point do I say, I give up, I'm just going to post the job on Mechanical Turk and let somebody else do it? I mean, after doing a fair number of these pages you get better at it. That problem with the dot-dot-dot craziness I can now handle fairly quickly; the Peruvian government web page that has it was frustrating for a while, but you see the same problems over and over. If you only need to do one page, and you'd enjoy doing it by hand? Yeah, you should probably just do that. But given the weirdness associated with some of this, I'm not sure how much luck you're going to have finding somebody who can do the really weird pages. And that works if it's a one-off scrape, but if you want to scrape it every day, or you want to scrape 20,000 versions of the page, I mean, you can't really do that by hand.
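Going back to that explicit wait for a moment, a rough sketch of the pattern, with a placeholder results URL and made-up pager markup; the driver is the same kind of Selenium browser object as in the earlier examples:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.org/results?page=2")   # placeholder paged-results URL

# Click the link to page 3, then block until the pager actually shows page 3 as
# current (the XPath matches made-up markup; adjust it for the real page).
driver.find_element(By.LINK_TEXT, "3").click()
WebDriverWait(driver, 30).until(
    EC.visibility_of_element_located(
        (By.XPATH, "//span[@class='current-page' and text()='3']")
    )
)
```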
Well, I'm not sure how to say this since it's being recorded, but if you wanted to go to a large e-commerce website and take prices from there, they may have anti-scraping defenses, and that may require work to get around, or not, or whatever; that's true for all kinds of websites. If you want to do that in bulk, you can't pay people to do it. I mean, I guess you could, maybe, but you'd need a huge amount of staff, and then when they change the web page you'd just be stuck. What I do for my work is actually mostly financial, so the sites are okay with it and the formats don't change that often.

So, the question is about pages that use AJAX-style infinite scrolling, where as you scroll down they keep loading more content and moving on to new pages: how do you handle those? It depends a little bit on how the page works. There are commands for resizing the window and for scrolling, and you can tell it to scroll down until something is visible, much as you can wait for something to be visible. Now, if it scrolls to a certain point and then something pops up and covers your element so you can't click on it, you might then have to click the X on the little pop-up window; it depends exactly on how the page was done. But yes, you can tell the browser to scroll down. Now, some versions of anti-scraping defenses might fight you there. I've seen this as a person, not while actually trying to scrape pages, honest, that's not just for the camera: you see stuff that sort of half follows your mouse around, and I remember the Citibank UK logon years ago used to follow your mouse so you could click the PIN numbers. There might be something somewhere that properly defeats you; I guess the question is how much time you're willing to spend trying to defeat that page.

The comment from the audience is that another way, when these pages go into that infinite-scroll mode, is to watch the new requests going through the network tab, figure out the logic, and then reproduce it programmatically. Sometimes you can. I've been surprised by pages where stuff's going back and forth and there are big blocks that look like, not a cookie, but some encoded thing, and you realize after a while they're not changing, so if you just replay that in your POST request it'll work fine. But then sometimes it doesn't, and you have to chase it. If you can see it on the screen somehow, through some scheme of clicking and scrolling, you can probably get either the Chrome or the Firefox driver to pull the thing out for you without having to do that sort of reverse engineering. If you're willing to do the reverse engineering you may not need the browser, but then you can't run the JavaScript without it, so it's for edge cases, maybe.

Please. The question is whether there's any real performance difference between something like Scrapy and Selenium if you're scraping tens of thousands of pages. So, all of the time is being spent on the network and in Chrome's JavaScript interpreter. If it's straight HTML and you really care about performance, you can shave some of this off, but then you're signing up to rewrite your entire system if that page changes. If you're going through Chrome, none of the time is spent in your five lines of Python code, or fifty lines of Python code; it doesn't matter.

Some of those element names looked auto-generated; can you match them with something like a regular expression? XPath has all kinds of complicated stuff you can do, contains() for example; it's not exactly regular expressions, but yes, it's a pretty complicated language. You can get big books on XPath and XSLT and XML parsing and searching. I'm not an expert in it; I can use it, but XML is a huge beast of a thing. Sorry.
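Back on the infinite-scroll question, the usual pattern is a loop like the one below: ask the browser to scroll to the bottom, give the page a moment to load, and stop when the page height stops growing. The URL is a placeholder, and the fixed sleep is a crude stand-in for a proper explicit wait:

```python
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.org/infinite-feed")   # placeholder infinite-scroll page

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scrolling to the bottom is what triggers a typical infinite-scroll
    # page's "load more results" JavaScript.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # crude; waiting for a known element to appear is nicer
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new loaded, assume we've reached the end
    last_height = new_height
```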
What are my options if I want to run this headless scraping on a server where I don't have a GUI, so I can't install Chrome? So, you can run, I do run, headless Chrome and Firefox on rented Amazon machines off in the middle of nowhere, and that's fine. You can fire them up in a Docker container and they're perfectly happy headless. You do need to get it working headless in the exact same container on your test computer, because the error messages are not the same for headless and non-headless; I guess they'll get better, and in a year maybe they'll be the same. Headless is no faster, and I don't think it uses any less memory; they just suppress the drawing of the window, so it's a very mild difference. So my only options are Chrome and Firefox? You can use Chrome and Firefox with Selenium, yes, you can do that. The thing is, these browsers are stateful in ways that are attached to your computer. They try not to mess around with your bookmarks and cookies and the rest of it, but they leave bits around, and you really don't want to run this on a computer that's doing anything else. The web cache is sort of disabled, but it's not 100% disabled. I've seen weird entries in my browsing history, and I don't mean weird as in embarrassing, I mean weird as in there's no chance I ever typed that ridiculous computer-generated URL. So you really want to containerize it.

As for how often pages change: I mean, it depends on the page. Some stuff we process data from hasn't, I think, changed since the 90s, and that's not necessarily a bad thing; the US government census web page is all based on forms, has a tremendous amount of information, and I don't think the data parts have meaningfully changed in decades. Other stuff seems to change all the time. The more you're in the commerce space, probably, the more likely you are to see things changing, and the more you're in some kind of old-line business, the less so; shipping manifests are probably not changing very often, but I don't actually know. And I'll be honest, the way I do it is that when error messages start to appear, I go and check what happened. But that depends on, you know, people, and how important it is that the data is there and correct. Maybe the Mechanical Turk idea is not so terrible for checking the pages.

Oh, sorry. Ah, okay, so non-ASCII content. This is tough. As long as you use a browser driver, you've got all your Python UTF handling correct, you're not calling encode and decode when you're not supposed to, and you're not trying to write anything out to files or to the screen, it's all fine in the browser version, because the browsers have to handle UTF-8 correctly. The other stuff is not quite so fine. You also start to have some edge-case issues with some of the packages: some of the Korean content is in UTF-16, and I can tell you lots of stuff doesn't do UTF-16 correctly. Okay. Thank you.