Hi, okay. You can hear me, right? Okay. My name is Ethan. I'm mostly a Python programmer, and I have about a week of experience with Puppeteer, so obviously I'm the most qualified person to give this talk. Let me get this going.

Who has heard of Puppeteer? Who knows what Puppeteer is? Excellent. Okay, so most people don't. Let me exit. Who's heard of Selenium? Oh, more people. Okay.

So let me get started with a general introduction to Puppeteer. Puppeteer is Google's web driver. A web driver is basically how you interact with a browser programmatically. You go to websites and you click around, but you don't want to click around yourself, you want your program to click around for you, so you use a web driver. You don't have to look at the code; there's nothing going on there. Just listen to me first.

As I mentioned, I'm mostly a Python programmer, and in Python land there's this thing called Selenium, which I think is one of the most popular web drivers around, and I think it's been the most dominant web driver for the last couple of years. I've used it before, and it is very troublesome to set up. You have to install lots of things, you have to link lots of things, and there are lots of things that always break when you try using it. So I had a new project where I needed to do some web scripting, and I heard of this thing called Puppeteer, which is supposed to be magical, and people are raving about it online, so I wanted to try it out. Let me give you a general introduction to what it does.

Before that, why you'd want to use Puppeteer. Selenium is kind of a cross-browser solution. If you're doing unit testing in the browser and you need to test across Firefox as well as Chrome and all the other browsers, then Selenium is the right library for you. Puppeteer is Google's library.
It only works with Chrome or Chromium, either one. It can do most of the things that Selenium can do, and it's actually quite powerful. I'll cover some of the things it can do in a bit. But it's not quite the right solution for you if you're looking for a cross-browser solution.

The main reason I went with Puppeteer is that it's really simple to set up. It's literally just npm install puppeteer. It comes with its own bundled Chromium, so you don't have to have Chrome installed and you don't need to link any libraries; you just do the npm install. Chromium is the open-source version of Chrome.

Let me show you what I've been doing. For a project at work I'm exploring some commodities data, so I need to scrape commodities news sites. I'm scraping fairly large volumes of data, and I've been trying to use Puppeteer. It works fairly well. Let me go through bits of the code. Give me a second.

My rough project structure is this: for every site I need to scrape, I write a little site config, which is basically what changes between the different sites. The general code is actually really simple. Let me go through that. Basically, I'm going through all the different sites, and for each article I'm getting the title, who the author is, when it was posted, and what the main body of the article is, and I'm saving it to a database. Like 90% of my code is saving things to the database; there are only a couple of lines of Puppeteer, honestly. Let me show you that bit. I start off with puppeteer.launch, which basically launches a Chrome browser.
If you notice, I have this headless false or true. You can launch it in either headless or non-headless mode. Headless mode means there's no GUI. There's no actual browser that pops up; it just runs in the background and you don't actually see it, but everything else should behave as per normal. If you run it in non-headless mode, an actual Chrome browser will pop up and you can see the things it's doing. If you're just testing, or you're writing a script and trying to figure out what you want it to do, it's useful to start with non-headless first, and then once you're actually running your stuff you can move to headless.

So the code is literally puppeteer.launch, which gives you a browser instance. You wait for a new page, and you go to a page. For me, I'm looping over a whole bunch of pages. You do newPage, and then, let me show you the actual getting stuff out of the page. It's this little bit, chrome.evaluate.

This bit took me a while to understand. There are two kinds of things going on. There is the process that's running Puppeteer, and separately there is a sandboxed thing that's running inside your Chrome browser. Everything inside evaluate actually runs in the browser; it's the equivalent of opening up your console and typing some code there. So you can do things like document.querySelector, or whatever standard JavaScript you'd use to interact with a website. You put it inside your evaluate block, and that's what runs in the context of the browser. So you don't actually need to learn anything new. If you know how to open the console and select the title, then you can use Puppeteer, because it's the same thing.
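The flow described here (launch, new page, go to a URL, evaluate a function in the page) might look roughly like the sketch below. It is not the speaker's actual code: the names `extractText` and `scrape` and the default arguments are assumptions made for illustration, and Puppeteer is loaded inside the function so the file can be read and loaded before `npm install puppeteer` has been run.

```javascript
// Runs inside the page, as if it had been typed into the browser console.
// It may only use browser globals (document) and the arguments passed in.
function extractText(selector) {
  const el = document.querySelector(selector);
  return el ? el.textContent.trim() : null;
}

// Runs in Node. `headless: false` pops up a visible browser window,
// which is handy while working out what the script should do.
async function scrape(url, selector, headless = true) {
  const puppeteer = require('puppeteer'); // assumes `npm install puppeteer`
  const browser = await puppeteer.launch({ headless });
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // Arguments after the function are serialized and handed to it in the page.
    return await page.evaluate(extractText, selector);
  } finally {
    await browser.close(); // always release the browser process
  }
}
```

A call such as `scrape('https://example.com', 'h1').then(console.log)` would print the text of the first `h1` on that page, or `null` if the selector matches nothing.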
All this chrome.evaluate block is doing is running that bit of code in the context of the browser and then returning a response.

One issue I had: you can actually pass something in from outside into the context of the browser, but it serializes that object. So if you try to pass in something with functions, the functions just kind of disappear. It doesn't serialize properly. So that's an issue. Was that clear? Do you get what I mean? If you notice, in evaluate I'm passing in this selector. The selector is something I'm trying to pass in from outside into the context of the browser. But if the object I'm passing in has any functions, they don't get serialized. They just disappear and you get undefined inside.

So in summary, if you're trying to do some kind of browser automation, Puppeteer is really easy to use. I didn't find the documentation that easy to follow; I think they still need to do a bit of work on the documentation. Other than that, it's been quite good. I still can't get around captchas, but I can't get around captchas with anything, so that doesn't help. The other issue I've faced is that headless and non-headless should give you the same results all the time, but I sometimes get different results and I don't know why. If anyone knows why, that would be great.

I'm only using it for data scraping, but you can use it with something like Mocha.js for unit testing or whatever. There are other things you can use it for, like filling in forms. It gives you all the normal things you can do in a browser: you can select an input and then page.type, or you can page.click, or you can do focus events, or you can listen to events, so you can wait for something to happen and act on it. So it's a fairly full-featured browser automation tool. It also lets you do things like take screenshots and create PDFs out of entire pages. So if you want to do testing like, I've made a change, is the output of my website the same as before, that kind of thing, you can do that as well.

Any questions?

What's that, sorry? The question was, do you get rate limited or blocked? I haven't as of yet. The sites that use captchas and stuff, I just get stuck right at the beginning, but the sites that don't implement them are usually fine. I'm probably doing at most a request a second, so it's not a huge volume.

What's that? Wait... I tried that, but generally if I hit a captcha and I try to manually fill it in, I still can't get through. I just get stuck in the captcha forever for some reason. Also, I have a lot of colleagues who are also scraping at the same time, so I think it just detects that my entire network has a lot of shit going on. Even if I'm just in a browser and going somewhere, sometimes it gives me captchas when I normally wouldn't be getting them.

Yes? So instead of opening a brand new Chromium browser, I should use an existing Chrome browser? Okay. Yeah, I'll try that.

I have no idea. No one has said anything, so I haven't done anything. But there was this whole court case about how, if data is publicly available, it should be fine to scrape, right? There was this thing that happened with... I'm not sure. Well out of my area of expertise.

A what? Oh, no, I have not tried that. Does anyone have any experience with that?

There's nothing much to say about it. We're using JS, mostly pure JS, with as few libraries as we can. We use a service that provides us... I don't remember, because we tested a lot of different ones.
Oh no, wait, it's not... Sorry, on that topic: while I was looking this up, I found this API which is for solving captchas. It's literally real people solving captchas for you.

Yeah, we use that.

In real time. I'm not sure how many people they have, but they can solve your captcha within a certain amount of time. They have a page with photos of all their workers, and it says how these people are from third-world countries and make about 200 USD a week, or a month, which is more than they would usually make, something like that. I was like, okay, that's dodgy, but I'm not sure. So use that if you need a captcha-solving service. I'll find it for you.

Since we've gone into captchas: what happens when you get those "I'm not a robot" prompts from Google? How would that service solve it?

That's a good question. Hmm. It might be that you need some kind of virtual Chrome browser that you both share, and they interact with it part of the time. Instead of having a browser on your own machine, could you open it on a remote desktop somewhere, something like that? I'm not really sure how that would work. I don't know.

My impression of the "I'm not a robot" captchas, and I may be wrong, is that maybe they look at your mouse movement, see that it's pretty slow and inaccurate, and decide you're a human.

I briefly looked this up, and it seems to be a lot more than that. Especially for things like Chrome, they even seem to look at your browsing history and what you've been doing before. If you've spent the last three hours browsing memes, then this is a person, just let him through. That kind of thing also seems to be the case. It's not just mouse movement.
There seem to be a lot of things.

I've encountered this type of captcha mostly on Chinese websites, where it's a slider. Are those captchas automatically solvable?

Sorry?

Sometimes it's an image, but they do it as a jigsaw puzzle piece, and you have to slide the piece from the left into the empty slot. There's a simpler version where you just drag the slider from left to right completely. I've seen the jigsaw puzzle version, but it only comes up on Chinese websites. I think Lazada uses that as well, though. I've seen it on Lazada Singapore. I get that type of captcha quite often. Can we solve it automatically, or does it really require human intervention? Anybody?

I have no experience with that. I've only been using Google's reCAPTCHA.

So it's the jigsaw puzzle style, where the distance is not fixed? Then how...

You have to stop at the right spot. Imagine it's an image with a hole in the middle, but the hole is not in the same place all the time. You have to drag the piece until it fits the hole.

So basically what you must do is build another human that has the AI to go and do all these things.

Any other questions?

I had some questions. I had four questions.

What were they?

I can't remember. So I saw in the code you have chrome.evaluate, and then you pass in a function with a selector argument, but I didn't understand how... Oh, okay, okay. So you pass in the argument at the end. Okay.

...over regular Selenium?

It's really easy to set up. Selenium is really troublesome. Okay, I haven't used it in about three years, but the last time I used Selenium, it wasn't as easy as just installing it. You had to install the driver, and then you had to link them, and then... I'm not sure. Maybe it's gotten a lot easier now.
I haven't touched it in...

Yeah, I actually set up Selenium most recently on my new computer. Especially if you only need Chrome, you just need to brew install chromedriver, and then if you use it with Python, it just hooks up like that. That's it. Then you just need to pip install whatever is necessary, selenium.

Yeah, so it's gotten better, I assume. It's been a while. I went down this route because I had just been hearing a lot about how good it is and I wanted to try it out, and it's okay. It's pretty good. Yeah, cool.
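The serialization caveat raised in the talk (functions vanishing from an object passed into evaluate) can be reproduced without a browser. The sketch below uses a JSON round trip as a stand-in for whatever Puppeteer does internally, which is an assumption; `simulateSerialization`, `siteConfig`, and its fields are hypothetical names, not the speaker's code.

```javascript
// Stand-in for argument serialization. Assumption: it behaves like a
// JSON round trip, which keeps plain data and drops function properties.
function simulateSerialization(arg) {
  return JSON.parse(JSON.stringify(arg));
}

// A hypothetical per-site config mixing plain data with a function.
const siteConfig = {
  titleSelector: 'h1',
  clean: (text) => text.trim(),
};

// What the code on the browser side would receive under this assumption.
const received = simulateSerialization(siteConfig);

console.log(received.titleSelector); // the string survives
console.log(received.clean);         // the function is gone (undefined)
```

One way around this, under the same assumption, is to pass only plain values such as selector strings into evaluate and to apply any functions on the Node side once the result comes back.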