Okay, hello everyone. I'm really glad to be in this new room today, which is all about open science and tools and technologies. I'm here to speak about empowering social scientists with web mining tools. We will see together what web mining is, how we can teach researchers to do it, and what tools we developed to help them achieve amazing tasks.

So, hello everyone. I am Guillaume Plique, aka Yomguithereal on the Internet; it's a youthful mistake. I am a research engineer for a research laboratory in France called the Sciences Po médialab, but we will talk about that a bit more later.

So, what is web mining? Who here knows about web mining? Okay, that's nice, then I will skip ahead. Just a reminder for everyone: I will only talk about web mining as a tool to collect data from the web, and about how we then analyze this data and produce insights from it.

From a technical point of view, web mining is actually two or three things. The first one is scraping. What is scraping? Scraping is the act of reverse-engineering the HTML of a web page in order to extract back the data that produced the page. For instance, here you have an example, a page from the EchoJS website, which is basically a Hacker News for JavaScript. Scraping this page would mean opening your inspector, checking how the HTML rendering this visual page was written, and extracting from that HTML the data we are interested in: here, the titles of the shared articles, the links to those articles, and so on. So this is the first thing: extracting data from web pages through reverse engineering.

The second thing web mining covers is crawling. Crawling is a bit different: here we design a bot, a spider, a program which browses the web automatically and slowly composes a network of pages, of sites, etc. We are interested in two things: the actual content of those pages, and the network drawn by this whole navigation of the web. (Small code sketches of both scraping and crawling follow below.)

So: scraping, crawling, and the third thing is collecting data from APIs. Nowadays Facebook, Twitter or LinkedIn share some data with you, and we can leverage their APIs to collect data and then gain some insights. So, for the purpose of this talk, web mining will be scraping, crawling and APIs.

The question then is: why is this useful to (social) sciences? I'm putting "social" in brackets because it could basically be useful to any science, I guess: physics, chemistry and so on. But since I'm working for social scientists, I will speak from the point of view of the social sciences. So why is it useful to collect data on the web?

The bad take on this goes: every social science data collection is biased. If you do questionnaires, for instance, or if you do interviews, you get biased data, mostly due to what we call the observer's paradox: when you ask people something, their answers will be biased, because you are asking them the thing and you are in the room observing them. What is really interesting with the internet is that people express themselves without being asked to. They just go and express their opinion while nobody is observing; well, almost nobody, lol, because I am observing right now. And so it's less biased.
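To make the scraping idea concrete for the transcript, here is a minimal sketch of what extracting those titles and links could look like from the browser console. It is a hypothetical example: the 'article h2 a' selector is made up for an EchoJS-style list and would have to be adapted to whatever markup the inspector actually shows.

```javascript
// Minimal console-scraping sketch: collect the title and URL of each
// shared article. The 'article h2 a' selector is hypothetical; inspect
// the real page to find the right one.
const articles = [...document.querySelectorAll('article h2 a')].map(link => ({
  title: link.textContent.trim(),
  url: link.href
}));

console.table(articles);
```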
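Crawling can be sketched the same way. Below is a toy breadth-first crawler in a few lines of Node (18+, for the global fetch), nothing like a production crawler, just to show the two outputs mentioned above: page contents and the link network. The seed URL is a placeholder.

```javascript
// Toy crawler sketch (Node 18+): breadth-first traversal from a seed
// URL, recording which page links to which. A real crawler also needs
// politeness, robots.txt handling, deduplication, persistent queues...
const seed = 'https://example.com/'; // hypothetical starting point

async function crawl(start, maxPages = 10) {
  const queue = [start];
  const seen = new Set([start]);
  const edges = []; // the network drawn by the navigation

  while (queue.length > 0) {
    const url = queue.shift();
    let html;
    try {
      html = await (await fetch(url)).text();
    } catch {
      continue; // unreachable page: skip it
    }
    // Crude link extraction; a real crawler would parse the DOM instead.
    for (const [, href] of html.matchAll(/href="(https?:\/\/[^"]+)"/g)) {
      edges.push({source: url, target: href});
      if (!seen.has(href) && seen.size < maxPages) {
        seen.add(href);
        queue.push(href);
      }
    }
  }
  return edges;
}

crawl(seed).then(edges => console.log(edges.length, 'links found'));
```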
So the bad take concludes that web mining would be a superior source of data for the social sciences because it's not biased. The good take on this is that internet data comes with its own biases. If you collect data on Google Trends, for instance, you will of course find other biases, and you should be aware of them. To control and manage these new biases, you have to rely on meta-studies and on science and technology studies, a large field of the social sciences which studies exactly those issues. And so the conclusion, the good take, is that web mining is still another very interesting and very large data source. So why not collect it? We should, because it's a good thing and because we can.

The issue here is that web mining is hard. To perform web mining tasks, to scrape, to crawl, you need to know the web. And when I say the web, I mean the whole web: you need to know how DNS works, how HTTP works, HTML, CSS, JS, the DOM, Ajax, SSR, CSR, XPath, and so on. You have a lot of things to know and learn about the web before being able to reverse engineer it.

So how do you teach researchers, for instance social scientists, those web technologies? Basically the same way as everyone else: you could teach them CSS and HTML and so on, and try to empower them through this teaching. But there is a misconception in technology circles that says the web is an easy layer of technologies. It's really not, and we are truly standing on the shoulders of giants. Has someone here already tried to teach someone who is new to web technologies how the web actually works? Did someone do this job? Okay. Usually when you do, you notice that you are standing on a huge mountain of skills, which is really a daunting thing. So it's not easy to teach people about web technologies.

Another question is how to teach researchers how to scrape. Say they know a bit about web technologies, a bit of JavaScript and Python: how can we empower them and teach them how to scrape?

And then you have other issues, which are a bit different. For instance, you are fighting the platforms and their APIs: platforms will try to prevent you from scraping and crawling. You have legal issues in some countries: in Denmark, for example, teachers avoid teaching scraping because it's considered a bit like lock picking; it's seen as somewhat illegal, or at least grey. And you have to weasel around when you publish something based on scraping, because sometimes you have to say: oh no, I did not scrape, I had a monkey army clicking on the buttons really fast. So you have a lot of hoops to jump through.

And what's more, and this is something I really want to stress today: "Jupyterizing" researchers is not a solution. Sometimes we say: okay, we are going to empower researchers, we are going to teach them everything they need to know; they are going to learn Python, Jupyter, web technologies, and they are going to scrape by themselves. This sounds like a good solution, but it's not really applicable to the real world. In the social sciences especially, some researchers don't have the time nor the will to learn all those skills. And we, as a community, should be okay with that. It's okay.
Researchers don't have to learn those skills, and the question then is: how are we going to empower them all the same? What's more, and this is the second point against the Jupyterization of researchers: web mining is actually really, really, really hard. It really is a craftsmanship. Basically, web mining is a job, not a skill.

The internet, for instance, is a dirty, dirty, dirty place. There are conventions: you are supposed to code a website correctly, cleanly; but in practice almost everything is badly implemented. Browsers today are heuristical wonders: they run a lot of routines and programs to make sure that the messy web page you sent will still be read correctly. You have to know all of this when you want to do web mining.

What's more, you need to know about things which are considered advanced in computing: how to multi-thread a program, how to parallelize things, how to throttle your HTTP requests. And if you don't know how to do that, you will harm actual people. For instance, at the beginning of our journey, we did not know how to throttle HTTP requests, so we basically cut off our whole university's access to Google, which was a bit problematic. (A sketch of what throttling involves appears a bit further down.) You need to know about all this kind of stuff, which is really complicated, if you want to actually perform web mining. You need a lot of skills.

So what I mean here is that it really is a craftsmanship, it really is a job, and you can't expect people to be both researchers and web miners. The question then is: how are we going to empower researchers all the same? And the answer is: by designing tools suited to their research questions. So we need to have designers. Who is a designer in this room? Him. Yeah. We need more designers.

So how did we do that? I work for a laboratory called the Sciences Po médialab, and the seminal idea of the lab was to gather three kinds of people: social science researchers, designers (such as this guy), and engineers (such as me). We mix those people and we design tools suited to the researchers' questions and work. This is basically it. So what I propose here is to guide you through some of the tools we designed to really empower social scientists to perform web mining tasks.

The first one we did is called artoo.js; it's a bad pun on R2-D2. The idea started from the following observation: on the modern web you will very often encounter what is called dynamic rendering, meaning the page is not rendered on the server but on the client, using JavaScript and so on. It's really complicated, and if you want to emulate all this in order to scrape, it's kind of difficult. So the idea was to parasitize the web browser itself to perform web mining tasks.

I know it's a bit abstract, so I'm going to try a small demo. Everything will break now. This is how it works. Let's say you have a researcher who wants to scrape this web page and get the whole list as a CSV table. You go to the page and you inject some parasite code that helps you scrape the data and provide the researcher with it. So I use a bookmarklet, called artoo, which is loaded directly into the web page's context, and artoo is going to help me do some stuff.
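For the transcript, the kind of parasite code injected during this demo looks roughly like the following. This is a sketch, not the exact demo code: artoo.scrape and artoo.saveCsv are actual artoo.js helpers, but the selectors are purely illustrative.

```javascript
// Hypothetical artoo.js scrape, run from the injected bookmarklet.
// Each element matched by the first selector becomes one row of data.
var data = artoo.scrape('article', {
  title: {sel: 'h2 a'},              // text content of the link
  url:   {sel: 'h2 a', attr: 'href'} // href attribute of the link
});

// Hand the rows to the researcher as a downloadable CSV file.
artoo.saveCsv(data);
```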
First, it can play some sound, which is its most interesting feature. Then I am able to use something like old-school jQuery, and using CSS selectors, really basic stuff, I am able to scrape data. So here I'm just attempting to scrape the data from the website, but directly within the web page's JavaScript context. And once I have that, I'm also able to help the researchers by doing this. Yes, it doesn't work; sorry, it's a bad live-coding situation. Yeah, so now I have the data; that part did not work, but basically I've scraped the thing as a CSV file and I'm now able to give it to the researcher.

The main point here is that it's still code, but you can use this same code to generate bookmarklets, custom bookmarklets, for the researchers. It means I go to this kind of interface, I paste my code here, and I create something which is actually a bookmark for the researcher. They just have to copy it into their web browser, and then they only need to go to the page and click on the button, and it downloads the CSV for them. We have researchers who do really qualitative work on websites and just want to pick some lists and aggregate them, and for this kind of scenario we use this tool, artoo.js, to provide them with ad hoc, tailored bookmarklets. So this is the first thing: artoo.js.

The question then is: can we do something a bit more hefty? So we created something which is now called minet. What is the goal of minet? The goal of minet is to provide you with a command line tool which handles all the pesky details of web mining for you. Basically, all the things in bold on the slide are the things you are going to focus on, your actual work; everything around them is what minet handles for you, so you don't have to: multi-threading, multi-processing and so on. You just focus on your task, which is, for instance: I want to fetch one million pages from the web and then extract the actual content and data from them.

Do we have time for a demo? No, maybe not. So basically minet is a CLI tool, a UNIX tool which complies with the UNIX philosophy: it does one thing well, and this thing is web mining, which is a big, hefty thing, but it does it well. You can pipe commands and so on, and it looks like this, for instance. You use the command line tool to fetch a lot of pages from the web, then massively scrape, in parallel, from all those web pages. You can also extract the raw text content from articles, so you can do NLP stuff on it afterwards, and so on. It's a Swiss Army knife for web mining which is scalable and lo-fi: it doesn't run on any database, it works on CSV files, it pipes to stdout, and it's really, yeah, lo-fi.

What's more, and this is really important for us, and true for both artoo and minet: this is really localized data collection, on the researchers' own computers. Sometimes you need servers, but sometimes you don't, and it's important for researchers to be able to stay in control of their data by doing this work on their own computers. And basically, in the social sciences, we rarely do what people call big data™. We don't do that stuff. Everything fits on a single computer.
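As an aside, here is roughly what "handling the pesky details for you" means in practice: a minimal hand-rolled sketch of throttled, concurrent fetching in Node (18+, for the global fetch). This is not minet's actual implementation; a tool like minet layers retries, encoding detection, CSV output and much more on top of this kind of plumbing.

```javascript
// Fetch many URLs with a concurrency cap and a delay between two
// requests, so you don't cut your whole university off from Google.
// Not minet's code: just the kind of plumbing such a tool hides.
const CONCURRENCY = 4;    // number of parallel workers
const THROTTLE_MS = 500;  // pause between two requests on one worker

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function worker(queue, results) {
  while (queue.length > 0) {
    const url = queue.pop();
    try {
      const response = await fetch(url);
      results.push({url, status: response.status, body: await response.text()});
    } catch (error) {
      results.push({url, error: error.message});
    }
    await sleep(THROTTLE_MS); // throttle: be gentle with the servers
  }
}

async function fetchAll(urls) {
  const queue = [...urls];
  const results = [];
  await Promise.all(
    Array.from({length: CONCURRENCY}, () => worker(queue, results))
  );
  return results;
}

// Usage sketch with placeholder URLs:
fetchAll(['https://example.com/a', 'https://example.com/b']).then(console.log);
```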
And what's more, if you really do want to do some Jupyter stuff, and that's all right, you have a programmatic API which hides all this complexity for you. So if you want to say: okay, I want to fetch one million pages from the internet and I want to do it right, you can just write a simple for loop and all of that will be handled for you.

So we have seen how to enable researchers to scrape and to collect data from APIs. The question now is: what's the next step? Can we design something a bit more ambitious, something more like a GUI? And we actually can. In the lab, we are developing a tool called Hyphe, a web crawler with a dedicated interface which enables researchers to crawl a subset of the web and make sense of it without any technical knowledge.

So how do we enable researchers to crawl the web using this tool? Hyphe looks like this, for instance. You have an interface; everything works by pushing buttons and typing input with the keyboard. You don't have to know how to code, you don't have to know how to program, and you are still able to crawl millions of web pages and build a corpus, a subset of the web, on which you actually work.

And finally, we used design, actual designers, to serve a robust methodology which has been proven to work in many sociological studies, and we designed the tool to embody this methodology. And I just want to emphasize one last time that this is really non-trivial. Has someone here already tried to build a crawler and crawl the web, to build and program the spider? Is this really easy? No. You have to build some things yourself: for instance, for this particular tool we had to build our own indexing database to be able to index a multi-layered graph. There is a talk about it which was given here two years ago, if you want to check it out; it's about a data structure called the Traph, basically. That's the truth.

So, as a conclusion, my main point here is that researchers should not be expected to learn all the ins and outs of web mining and programming. There is always a trade-off, when you design tools suited to their needs, between scalability, that is how much data they can handle and fetch, and usability, that is how easy the tool is to use. We need to design a user path, and to do so we need to take a step back from what we are doing and take the time to abstract our design path. I hope that's what we are doing right now.

So, what's the future? We would like to build a GUI for minet, so that researchers can use it without needing to learn the UNIX command line. If anyone is up for it, we need people; we are also recruiting. And thank you for listening. Questions?

Yes? Whether we comply with what? The robots.txt. I will say to you officially that I do, but I do not. Basically, for us it's not an ethical question, it's more of a technical one: it's really heavy for us to fetch it and honor it, and we don't know how to do that very well, so we don't. But we could; we could do it. And yes, we use Scrapy in Hyphe, for instance, but I think the version we use does not respect robots.txt. Yes?
We have tried using those tools with researchers a lot; those are basically tools that scrape automatically by learning what you are trying to extract. It did not work very well. We also tried to design our own, but failed miserably, basically. It's something we are still interested in, but we haven't yet found the thing that really works for our researchers.

Do you have examples of questions that the social researchers were attacking with these tools? Yeah, sure. For instance, right now we are working... What? Repeat the question? Yeah, so the question is: is there an example of an actual research project or question for which we are actually using web mining? Currently we are working on a project which aims at studying how people in France share and read media. We are using web mining in multiple fashions: we collect the full text of articles from 400 media outlets, we collect all the tweets mentioning those URLs, we collect all the Facebook posts mentioning those URLs, and many more things, such as the YouTube videos from those media outlets, and so on. So we collect a really large amount of data to be able to see whether media polarize around political questions or not, etc.

Yes? To avoid what? Okay. So for Facebook there is a trick which is actually interesting: they still need to serve a mobile version of the site which is heavily used in India, and they cannot rate-limit it too aggressively, because they would block legitimate users. So you can hit it like a madman and it still works. It's not really a good option, because they won't serve you all the actual data: sometimes you get relative dates, for instance, and you need to parse those relative dates to be able to work with the data. But it's usually a good solution for scraping Facebook massively. So we found workarounds, and when we don't find workarounds, we use proxies, which let us hit from multiple angles quickly.

Not yet. It will soon, because we need to. But if you want, you can help us. Any other questions? Thanks.