All right, hello everybody. As you can see, if I faint, call 995. But if I'm just panting, don't worry, it's just the baby. So I'm going to talk about data scraping as well, but this is more of a story about one very specific case.

How many of you have used Google Trends before? Right. It's a very useful service where you can see how popular or unpopular certain keywords have become over time, and it's been used very frequently in academia for writing papers. The project I needed it for was for an economics professor who wanted to do some trend research. Initially I suggested Google Trends, because Google has all the data, and it provides this handy download button that gives you a CSV for the date range you want. Easy peasy, right? Better still, there's an NPM module that wraps it for you. It's not really an official API, but it handles the throttling and so on, which is nice.

The problem is that the keyword the professor wanted is a Chinese keyword. If you read Chinese, you know what weiwen (维稳) is: it's the official China-speak for maintaining stability, specifically social stability. So it's a very politically charged word. And as you know, Google is banned in China, so unless you can get over the Great Firewall, you can't really use Google. When you set the region to China for this keyword, you can't see the absolute values here, but the volume is very small. And the professor also needed prefecture-level data, meaning not just province level but city level. As you can see, even drilling down to one province, Xinjiang, there's already just no data. So we can't use Google.

Let's turn to Baidu. Naturally, Baidu has something similar, called Baidu Index, and nicely enough, Baidu Index has prefecture-level data. Here I'm showing Xinjiang data, where Google just doesn't have any. The trend goes up the way you'd expect, and after certain events you can see it move in the data, so it looks like reasonably good-quality data. So we decided to use Baidu Index.

Well, yes, whether Baidu manipulates the data is another concern, a separate one. That's why the professor actually checked major events in certain regions, and they do see spikes in those keywords. But to what degree is that spike real? In reality it may have spiked 100 times, while Baidu only let it show a 10-times spike. That's not something we can know. But it shows the trend, and that's good enough.

So naturally, you go and look for a library. I actually only discovered this particular library today; originally I was doing the scraping in Python, like anybody would. When you look at the library, the first thing you notice is that it depends on Tesseract. I was like, what the hell? You're just scraping some data, what do you need Tesseract for? What Tesseract is, is a good question. Anyone know? Yeah, correct, it's an OCR library: you give it a picture with some text on it, and it recognizes and extracts the text from the picture. So yeah, what the hell, right?
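To make the OCR step concrete, here is a minimal sketch of the kind of call involved, shown with the pytesseract wrapper purely for illustration (not necessarily how that library drives Tesseract, and the file name is a placeholder):

```python
from PIL import Image
import pytesseract

# A cropped screenshot containing nothing but the number we want to read.
img = Image.open("datapoint.png")  # placeholder file name

# Enlarging a tiny crop before OCR usually helps Tesseract.
img = img.resize((img.width * 4, img.height * 4))

text = pytesseract.image_to_string(img)
print(text.strip())  # e.g. "10542", if Tesseract read the digits correctly
```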
It turns out this is because of how Baidu Index worked back when I was doing this project. You can see this is a trend line, and when you hover over a data point, it issues a network call and returns an image. That whole black box is an image in itself, and the OCR is for recognizing the numbers in that image. This is where you flip a little table. You're like, all right, what the fuck, Baidu?

So there's an equivalent Python library that does the same thing: use Selenium, use Tesseract, and so on. I'm not going to make you read the code, just highlight what it does. This is the Selenium part: you ask Selenium to move the mouse by a little bit, so that it moves to the next time point, which triggers the hover, which triggers the network call. And look, there's a time.sleep. The sleep is there to throttle the request frequency so you don't get your IP banned. Then you find where that image is, figure out how long the number is so you can crop the image, enlarge it, OCR it, and finally get your number. (Are you going to give the talk, or am I going to give the talk?) So that's exactly the path I ended up going down, because you have the graph and you want to do something with the graph, but that's not the exact number, right?

After looking through the code, because it doesn't really come in library form and the mouse movement is not precise, I had to go into the actual script, read it, and modify it. You die a little inside. And after all that work, you still have a data completeness problem, because the mouse movement is not precise. I think what Baidu does is jiggle the data points a bit, so if you move the mouse by the same amount every time, you don't necessarily end up hovering over the right point. I ended up with about 70% data completeness. The OCR is also not that accurate. You'd think it would be easy, because the font is so standard, but you still get 1 versus 7, 4 versus 9, all these misrecognized digits. And the thing is, it's very hard to catch. How do you know something is off? Especially when the number is in the thousands or tens of thousands, one wrong digit puts you off by a lot. And the images are, I suppose, generated at runtime, so it's actually very slow. That slowness is another problem.

So the alternative solution is what Melvin already sort of foresaw. The graph itself is an SVG, so I grab the SVG and save it. I still have to do OCR, because I need the numeric values for the mean and the max, but the SVG gives you the physical distances and the range, so you estimate by scaling. You end up with an estimated number instead of an exact number. Here's the code, and now we're switching to JavaScript. It's very short, twenty-plus lines. The idea is that the mean and max you pass in are the actual, OCR'd mean and max, and then you just do the offset calculation. It's pretty straightforward math, and at the end you get a scaled, estimated y. (We can come back to this if you want. The inverse of d3.scale? I don't know what you're talking about.) Right, you're trying to get the numbers back out of the graph, instead of the other way around.
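The actual script was about twenty lines of JavaScript; as a rough illustration of the same idea in Python (the names and numbers below are made up), you fit the chart's linear scale from two reference points whose true values you know, say the OCR'd max and mean, and then invert it for every other point's pixel position:

```python
def make_inverse_scale(y_px_a, value_a, y_px_b, value_b):
    """Fit value = slope * y_px + intercept from two reference points
    (for example the OCR'd max and mean), and return a function that
    converts any pixel y into an estimated value. SVG y grows downward,
    so the slope is normally negative."""
    slope = (value_b - value_a) / (y_px_b - y_px_a)
    intercept = value_a - slope * y_px_a
    return lambda y_px: slope * y_px + intercept

# Made-up example: the max value 12000 sits at y=40px, the mean 4800 at y=220px.
estimate = make_inverse_scale(40, 12000, 220, 4800)
print(estimate(130))  # a point halfway between them: roughly 8400
```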
So the result is that I got 100% data completeness, because I have full control over the data. And the estimated numbers, based on a check of a single prefecture over the full time range, are about 1.5% off, which is totally acceptable for academic research purposes. On the plus side, the script also runs much faster, because I don't have to make however many network calls just to get images of the numbers. And I hit Baidu less as well.

Lessons learned. Yeah, "language doesn't matter, getting things done does" was actually my original title. Tim was like, that's propaganda against JavaScript, and I said it's just payback for him giving all those JavaScript talks at Ruby conferences. So yeah: estimate, because the world is not perfect. Getting the exact answer can be very tough, or it's just not worth it. When you only have 70% of the data and it's not necessarily accurate anyway, you might be better off just doing the estimation. And for web scraping in general, a rule of thumb I've found after doing so much of it is that it's always good to just save the HTML, because you will always find some use for it. You've extracted all this data, and then your professor says, I want more data, and you're like, do I have to go visit all those websites again? No. Save the HTML.

That's another thing, I don't know if Puppeteer can do that? Oh yeah, OK. I mean the original page source, in case you need to extract from it again. Yeah, OK. Selenium gives you the same thing.

So my question for you is: do you have any handy web scraping tricks you want to share? Or any questions for me?

Actually, I think a lot of people try to scrape Baidu data. If you can see this page, here is a screenshot of something on Taobao showing how much people charge to scrape a keyword from Baidu. So it's a whole business out there, because people want Baidu's data and Baidu doesn't want people to get their data. Another thing is that when I checked today, Baidu has actually updated their whole Index product and they no longer do the image thing, which also means my script doesn't work anymore. Now they use Canvas instead of SVG, but all the data is already shipped to the page, so you can just grab the page and parse it, and you're done.

Who here pays someone to scrape for you? You all do it yourselves? All right, OK, cool. I'd ask, at what point do you just pay someone to do it for you? I also have a slightly unrelated question for everybody: has any of you ever used Mechanical Turk for any data-gathering work? No? It's banned. Yeah, it's unavailable in many countries. Maybe too many of our jobs are manual, so they're trying to protect everybody's jobs.

Since you asked for a trick, one I've found quite useful is to listen for DOM node inserted events, because then you know when the page has changed, when something new has entered the page. Yeah, that's the thing. As in, if I'm the one doing the scraping, it's fine, I can deal with it. But the professor subsequently wanted me to scrape some Google Scholar data, and I don't want to sit there solving CAPTCHAs, and especially not helping Google with their machine learning by identifying cars and fire hydrants.
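Going back to the save-the-raw-HTML rule of thumb: here is a minimal Selenium sketch of that habit (the URL and file names are placeholders, not from the actual project):

```python
from pathlib import Path
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/report?region=xinjiang")  # placeholder URL

# Extract whatever you need for today's question...
# ...but also archive the raw page source, so that when the professor asks for
# more data later you can re-parse the saved file instead of re-crawling the site.
archive = Path("archive")
archive.mkdir(exist_ok=True)
(archive / "report-xinjiang.html").write_text(driver.page_source, encoding="utf-8")

driver.quit()
```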
So yeah, I was asking my friends at Google, like, hey, can you get us this Google Scholar data, and so on and so forth. And Google is like, uh... scary. I just want their data. I want their servers to respond; I don't want to smash their servers, that's not my objective. Sure. Oh yeah, of course, if there's an API, I always want to use the API. Yeah, absolutely. Yeah, Melvin?

That one is actually kind of legit. So at the company I work at, we work with a lot of banks, as in we maintain bank accounts with many banks and we have to know what's happening: we send money or receive money, and we have to check the statements and make sure the money went through. Some banks have APIs, and that's great. But some banks do not. These are banks we've actually talked to in real life; we genuinely do business with them, they're fine with what we do, and we need to get data from them legitimately. It's just that they don't expose an API. So we have quite a lot of scraper integrations for the less technologically advanced banks, and they're full of tricky things as well. With banks there's a new challenge, which is the OTP token, because you hit it at login and you have to figure out how to handle that process, and different banks have different strategies. So yeah, not all scraping is shady. Sometimes it's really just that they don't have an API, they're not going to build one unless you pay them money, and all you can do is scrape. I think that's how Mint started, mint.com. Cool.

Just sharing a tip: instead of grabbing and parsing the HTML, I will scroll around the page a little bit, trying to figure out whether the website itself makes HTTP request calls. In that case you just send the same HTTP request yourself and get back the structured response. That's something I do in the beginning stage of analyzing a scraping target.

Yeah, just adding on to what you said, it's sometimes worth looking for the data that way instead of going for the scraping solution, which is long and painful most of the time. I have two examples where the website had unprotected APIs, sort of hidden, but you can find them pretty quickly and just call them. No security, nothing, and you get everything very quickly. It's not going to work for banking data, I guess; I used it on some sort of font library site, and the other one was a materials database. So I guess they don't really mind if people take everything they have. They give it away anyway, just one item at a time.

All right, speaking of that: both Baidu and Google let you make data requests without anything, no login, and they give you back results. But Google, since that's what I worked with most recently, gives you about 150 requests if you're not logged in before they ban you, so you have to do IP rotation, use a different IP. If you're logged in, I think they give you about 500 before they ban you. And the interesting thing about Google is that they don't even rate limit you anymore; they just give you the CAPTCHA image to solve. So when I do web scraping with Google, I don't even rate limit myself. I let Google do it. I mean, they sort of slow down the response time over time. So yeah.
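To make the network-tab trick and the rate-limit handling above a bit more concrete, here is a minimal Python sketch. The endpoint, parameters, and proxy are all made up; the point is just that once you've found the JSON call the page itself makes, you can replay it directly and do the throttling (or IP rotation) yourself:

```python
import time
import requests

# Made-up endpoint standing in for whatever XHR call the page itself makes;
# the real URL and parameters come from watching the browser's network tab.
API_URL = "https://example.com/api/trend"

# Placeholder proxy, for when one IP starts getting rate-limited or banned.
PROXIES = {"https": "http://user:pass@proxy.example.net:8080"}

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})  # mimic the browser's request

for keyword in ["维稳"]:  # add more keywords as needed
    resp = session.get(API_URL, params={"q": keyword, "region": "xinjiang"},
                       proxies=PROXIES, timeout=30)
    resp.raise_for_status()
    data = resp.json()  # structured response: no HTML parsing, no OCR
    print(keyword, data)
    time.sleep(5)       # throttle so the IP doesn't get banned as quickly
```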