All right, it's time, so welcome to the first interactive session of EuroPython 2021, by Mauro Pelucchi. Mauro is a senior data scientist and a big data engineer, and he also seems to have a lot of very interesting collaborations with the University of Milano-Bicocca, I don't know whether I pronounced that correctly. Oh, thank you. I'm sure he can also tell you a lot more about the collaborations and interesting projects he's involved with. Today he will show us "Data ingestion and big data: build a dataset from zero to solid". You can take it away, Mauro. If anybody at any point has any questions, you can just talk directly into Zoom if you want to, and you can also write in the chat, by posting in the Matrix room, in Argument Clinic. So, there you go, Mauro.

Thank you, Raquel, thank you for your beautiful introduction. So, my name is Mauro Pelucchi, not "Pelucci", but it's the same, because I usually work with a lot of English-speaking colleagues, so "Pelucci" is okay for me. Hello everyone and welcome to this first interactive session of EuroPython 2021. First of all I'd like to thank the EuroPython organizers, I'd like to thank you, Raquel, for assisting me today, and also thank everyone who is here with me today and will follow my presentation. I will try to make this session more interactive, less remote if possible. If you have any questions, please unmute and ask me directly. You can also use the Zoom chat; I have the Zoom chat on my screen, so if you want you can put your question in the chat, and it's a pleasure for me to try to answer you. Of course you can also use the Matrix server to send me questions, and Raquel will assist me with each question. The material for this interactive session is already shared on my GitHub repo: my GitHub repo is mauropelucchi, EuroPython 2021. I am putting the link directly in the chat, so you have it there. Please, Raquel, can you share the link on the Matrix server as well, so that anyone who joins later will have it.

So, let me talk about this interactive session. This session is about data ingestion and big data, and what we will try to do in one hour is to build a dataset with web scraping. The purpose of web scraping is to create a dataset. When I speak about a dataset, I'm speaking about rows and columns, because the result of a scraping activity is not arbitrary, it is not a sort of abstract data: the goal of today's session is to build something structured. It's like, let me use this simple analogy, an Excel file: we will try to produce a dataset with rows and columns. Of course, in some of these columns we will store something not very structured, like images and text. But the goal today is to start from a website, capture the data and build a dataset. We will use Python code, of course, since we are at a Python conference, and we will use Google Colab. It's only an exercise, but in this case Google Colab is very useful to make this interactive session more interactive. Of course you can reuse my code for your own purposes: if you want to create a dataset, if you want to reproduce this yourself, that's fine with me, not a problem. You can use Google Colab, but you can also move the notebook to your local environment, it's not a problem. In the notebook you will find some tips for moving the notebook from Google Colab
to a Jupyter notebook or JupyterLab, wherever you want it.

Of course, when we speak about data, you have to keep in mind the three Vs of big data. I have about twenty years of experience with data, and I have been working with big data since 2010, so for the last ten years. Usually when we think about big data we focus on the volume, the variety and the velocity: volume, because with big data we have a lot of data; velocity, because we can collect this data and this data is generated faster and faster, think of IoT systems and smartphones; and variety, because on the web we can collect images, videos, text, so the data has no default structure. These three features of big data are technology features. But remember that when we speak about big data we also have other features, and in my mind they are more important. We have veracity, because when we collect data from the web, as we will see today, we collect a lot of noise, and we don't have a clear picture of the data collected. The scope today is only to have a look at data ingestion, but remember that after the ingestion there are other mandatory steps before we can use big data to make decisions, because another important feature of big data is value. If we are not able to produce value from big data, from the data collected from the web, our exercise is not good, because the goal is not a technology exercise, it is not to collect everything. Of course, if we are able to build a system that collects everything from the web, that's nice, but we will spend a lot of money, we will spend a lot of time, and at the end of the story we will have a lot of data that nobody uses. The purpose of collecting data from the web is to produce value for the end users. So keep in mind that after the ingestion there is always a phase of data processing, data preprocessing, maybe with machine learning: a phase where we improve the quality of this data and prepare it for analysis. At the end of the story we store the data in the warehouse, because the purpose of a data science system is to support the end user in making decisions. This is the overall data flow of a data science system: data ingestion, data processing, data presentation, and finally some business intelligence tools, some models, and so on. So when we speak about big data, we have to remember the three traditional features, volume, variety and velocity, but we also have veracity and value.

Raquel already introduced me, so I don't want to speak a lot about myself, that is not the point today, but I am an engineer and a data scientist, so my role sits between the two main areas of data science: engineering and modeling. Currently I'm working for Emsi Burning Glass, a company that collects data from the web to produce value. Just to give you a full overview of such a system, I want to show you the final result of our scraping activities.
I show this because I am the person responsible, the main engineer, for a big project for the European Union. It's a public project, and it is called the real-time labour market information system on skill requirements. Let me simplify what that means. This project is about collecting job postings from the web. You know what a job posting on the web looks like: we have a title, a description. Our scope, the scope of the European Union for this project, is to collect every job posting published in the European Union, to create a warehouse and to support decision makers with this data.

Just to share it with you, this is the result: this is the final phase, the intelligence tool built on top of the data collected with web scraping. I'm sharing the link in the chat. This tool is called Skills OVATE, the Skills Online Vacancy Analysis Tool for Europe. This final dashboard is built with Tableau, but you could also use another BI tool; I am not a fan of Tableau. (Raquel, we have a question in the chat about whether the screen is shared correctly? Okay, thank you.) As you can see, we collect a lot of job postings from the web. We don't have the final number here, but over the last four years we have collected something like 400 million job postings. Of course, we have a process to clean the data, deduplicate it and improve its quality, and we use a lot of machine learning to normalize the data. As you can see, I can select a country, I'm starting from Italy because I am from Italy, and I can go deeper into the detail. So the result of a web scraping activity is not only collecting data: it is collecting the data, creating a dataset and supporting the decision maker. In this case, for example, if I go to the professions, if I switch to the occupation insights, I can see the main professions requested by employers in Italy. I go, for example, to Italy (the system is open, so you can try it yourself), then to ICT professionals, and of course it's funny, because the most requested profession in Italy, and not only in Italy, is software developer. Remember what a job posting on the web looks like: we can have a title, we can have a request for a "software developer", but not only; a lot of the time we have "Java developer", "Python developer", "C# developer". So the challenge does not end when we collect the data. The scope of today, of the next minutes, is to collect the data, but the complete exercise is to apply machine learning, or some rule-based systems, to normalize the data and produce value. Another nice tool produced by this system is this one. It's a simpler tool: for example, I can drill down into Germany and analyze the economic sectors in Germany, and this data too is collected with web scraping.

Let me continue now; I don't see questions in the chat, so I can go on. A small agenda for today. We will say something about data ingestion patterns, and we will see the difference between scraping and crawling. For example, for the European project I mentioned a couple of minutes ago we use scrapers but also crawlers, because it is difficult to maintain: we are running 500 different scrapers.
So in some cases we use scraping tasks, and in other cases we use crawling tasks. Of course, the data produced by the two methodologies is different, and we will see the main differences between these two approaches. The first part of today's agenda is to create a dataset: the goal is to navigate Indiegogo. Indiegogo is a crowdfunding platform, I don't know if you know it, and I think it's very good for scrapers and for scraping exercises; I will explain why I think so. The main goal today is to navigate Indiegogo with Python and Selenium and create a dataset. At the end we will talk about ethics, of course, because everything published on the web is public, so anyone can take it, and that makes the ethics challenge difficult to address. Why is it difficult? Because when you publish something on the web, you don't know whether someone will take it, and you don't know who will take it and for what purpose. So every time you run a scraping activity, every time you collect data from the web, you have to pay attention to this, because in this case the responsibility is not on the side of who publishes the data but on the side of who collects it. You have to keep asking yourself the question: can I collect this data? Can I collect this data for my purpose, or do I have to inform the original owner of the data? We will speak a little about ethics.

So let me start with the interesting part. I like to talk about this first, because the data ingestion exercise is only a part of a bigger picture. When we speak about the big picture here, it is the overall flow: we have, of course, to collect data from the web, but the data collected from the web has a lot of noise, a lot of incompleteness, and the accuracy is very low. So when we design a system based on data collected from the web, keep in mind that the main tasks are data ingestion, data processing, data presentation, and the final visualization of the data. And during these four steps we always have to measure the quality of the data: the completeness, the accuracy, the presence of empty values. The scope of the first task, the ingestion, is to collect the data; the scope of the processing is to increase the quality of the data; the scope of the presentation is to store the data in the overall system; and at the end, the scope of the user is to use the data to produce value. Please keep these four main steps in mind. Today the scope is to speak about the ingestion, so about scraping.

So please, Raquel, can you share the first Zoom poll? I wanted to make this session a bit more interactive. (Yes, I have shared it.) Thank you, Raquel. I wanted to ask you a question: what is web scraping? Of course, it's partly a joke; I have six responses and, of course, you are all right. I won't wait long on this poll, it's only to engage you. The right answer is B: the purpose of web scraping is to extract content from a web page, and during this session we will create an application in Python to run our scraping activity. There are a lot of tools to scrape data from the web, tools in different programming languages and different frameworks. There are also some ready-made tools that use drag and drop,
visual tools where you produce a dataset with no programming language at all. Of course, the plan today is to use Python: Python is very nice and very flexible for running scraping activities. But it's not only about the tools, and also within Python there are a lot of libraries. Today the plan is to use Selenium. Selenium is not really a scraping library: its purpose is not to scrape data from the web, but to automate repetitive tasks in a browser. Still, we can use Selenium to run scraping activities, because it's very nice and very flexible for creating applications that scrape data from the web.

So, the first distinction is between scraping and crawling. Let me start with crawling. Crawling is the job of Google: I think I'm not wrong if I associate the word crawler, the term crawling, with Google, Bing and so on, because the main task of these players is to collect everything from the web. And if you have in mind the job of Google, you have in mind the difference between crawling and scraping, because the scope of a crawler is to store all the data: the job of the Googlebot is to navigate the web and store every page in a data lake, in a file system. The scope of a scraper, of a scraping activity, is a little different. With a scraper we do collect the HTML code of the page, the entire text of the page, but the real scope of a scraping activity is to create a structured dataset. When I say structured dataset, think of an Excel file with rows and columns. Of course, in such a dataset we can store text, we can store an image, we can store a complex structure, it's not a problem; we have NoSQL DBMSs, we have Parquet files, we have JSON files, we don't have to produce an actual Excel file. But keep in mind that the scope of a scraper is to cut and paste some portions of the page into the dataset. In our case we will scrape data from Indiegogo. Indiegogo has a lot of campaigns about games, art, comics and so on, and our final dataset will contain the title, the ID, the URL, the description, the balance, the start date and the end date for each single campaign. So if on Indiegogo we find 5,000 different projects, our final dataset will contain 5,000 different rows, and every row will be composed of title, description, balance, currency, start date, end date, category and so on. I don't know if that's clear; can I continue? I don't know if there are questions on Matrix. Thank you, Jesper. OK.

So, the main task when we speak about scraping, and I try to keep the focus on this first topic, is to identify the web pages and understand the structure of a page. Before starting to write code, to program something, we have to explore (the real term is inspect) the web page structure, and understand where the portion that we have to take is stored. We have to understand something about HTML and something about the browser, but it's not too difficult, because we have a browser like Chrome to help us. For today's session I think it's better to use Chrome, because with Chrome we have a lot of nice features for scraping exercises.
With the browser we can identify the portion of the page that we need. So the first task in scraping is to identify these portions of the page and to link every portion of the web page to a field of the dataset; that is the main task. Because at the end of the story the code is very easy: we will see that we have about ten lines of Python and we can run this scraper. The first exercise is more complex, because we have to use a little creativity, a little experience, to identify the query to run on the web page to extract the data.

But before starting to program, before starting to inspect the page, please try to answer this question: why? Why do we scrape data? You can use the chat if you want, or you can unmute yourself. "Maybe because you can get insights into what people are interested in." Okay, of course, thank you; that's data collection. Thank you, Jesper. But data collection for what? Try to imagine: okay, I build a dataset, but what is the final purpose? Because when we scrape data we can collect everything, but we have to stop at some point, because we could build a scraper that scrapes everything from the page. So keep in mind the final goal of a web scraper. For example, as Osvaldo mentioned, we can create a scraper for research purposes. I work with a university, and a lot of the time someone calls me: "Okay, Mauro, can you please scrape data from Glassdoor, from a fan page, from a newspaper, because I want to run an academic research project." So the first reason to scrape data is to create a dataset for research purposes; I think it is one of the main ones. The second is business purposes, of course. For example, we collect data for the European Union, but I also collect data to sell to my stakeholders. When I use job postings, for example, what is the purpose of scraping them? Because with job postings I can run a competitive analysis between two employers, I can measure my online reputation, I can run a product and price comparison. A lot of the time my university students use web scraping, for example, to collect prices from Amazon and compare the prices on Amazon with the prices of another retailer, another company. It's normal. We can create a scraper to understand what a competitor is doing, we can use a scraper to collect data from social networks, we can try to detect fraudulent reviews, for example, because we can scrape data from every website. Everything published on the web can be taken by a web scraper, it's not a problem. You can scrape data from Amazon, Twitter, Facebook, LinkedIn, Reddit. In my ten years of experience with web scraping I have seen a lot of different types of sources: finance, a lot of newspapers, it's not a problem.

But remember the ethics aspect. When you collect data you have to keep asking yourself the question, and of course I ask it of myself too: can I collect this data for my purpose? Because the owner of the data has no visibility into what your purpose is. So the ethics burden is on our side: we have to keep asking ourselves, can I collect this data for my purpose? For example, I collect data about job postings, but never in my working life have I sold the raw data, the URLs, the sources, to an employment agency, because my goal is to sell aggregated assets, to sell analysis about the labour market.
My goal is not to create a "Mauro Pelucchi job board"; that is not my goal. My goal is to collect data from the job postings, to create some statistics, some analytics, some KPIs about the labour market, and to sell aggregated assets to my stakeholders. I am not a competitor of LinkedIn, for example, because my purpose is not to publish: I don't collect job posting data in order to republish job postings, because I think that would not be correct. So, again, when you collect data, keep asking yourself the question: can I collect this stuff for my purpose?

Of course, there are different tools to run scraping. We have visual tools: I mentioned import.io, Mozenda, Octoparse; if I remember correctly, octoparse.com is free, well, not completely free, but it has some free features. Visual tools are very good, but it is difficult to use them with complex websites, so I don't use this type of tool. Second, there are frameworks, for example Nutch with Solr, Nutch with Elasticsearch, HTTrack; I think one of the most famous is StormCrawler. These tools are crawlers: we can use Nutch or StormCrawler, for example, to collect everything from the web. It's not difficult; these tools are ready-made, we only have to download them and use them. Of course, we can also write custom code: custom code in Java with jsoup, or in Python with Scrapy, Beautiful Soup or Selenium. I only mention these four libraries in two languages, but of course there are many other languages. Beautiful Soup is different from Selenium because Selenium uses a browser: essentially we have a phantom browser to run the scrape. Beautiful Soup is only a parser: we have to pass it the web page as input, and then we can run the parsing and run queries on it (you will find a tiny sketch of this parser-only style at the end of this part).

So let me start. In this slide you can see my repo; let me re-share it in the chat, so if someone arrived late, you have my repo in the chat. If you navigate my repo, in the notebook folder you will find the notebook that we will use for this exercise; I am putting the link to the notebook in the chat as well. I don't know if there is a question in the chat; I think nothing on the Matrix server. Raquel? ("No, there was just a very nice answer about why we web scrape." Yeah, can you share the answer? "Yes, of course: to access the data automatically, e.g. daily, when it is not available via an API, for further analysis, collection and/or machine learning.") Thank you, Raquel.

So let me go ahead with my session, because this is the main part. We will start from Indiegogo and we will write some Python code to produce a file. The main tool to start a scraping activity is the browser. I like to use Chrome, and I suggest you use Chrome: other Chromium-based browsers have the same features, but with Chrome we have a lot of nice tools to inspect the website, and the main part of scraping is identifying the portions of the page. So let's go to the Indiegogo page; let me move the Zoom bar out of the way, and I am putting the link in the chat. Indiegogo, as you may know, is a site about campaigns: there are a lot of crowdfunding campaigns about audio, comics, tools, film, games and so on. Our goal is to identify each single part, for example the title, the description, the balance, the currency, and create our dataset. So let me open the site.
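As promised, here is a tiny sketch of the parser-only style, just to make the contrast with Selenium concrete. This is not part of the session's notebook, and the HTML string and class names are invented for the example:

```python
# Beautiful Soup never opens a browser: you hand it HTML you already have
# (a literal string here; in practice, e.g. the body of a requests.get call)
# and run CSS selector queries on it.
from bs4 import BeautifulSoup

html = """
<div class="card"><h5 class="card-title">My campaign</h5></div>
<div class="card"><h5 class="card-title">Another campaign</h5></div>
"""

soup = BeautifulSoup(html, "html.parser")
for title in soup.select("div.card h5.card-title"):   # CSS selector query
    print(title.get_text(strip=True))
```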
So first of all we start from the browser: in the browser we go to the main page. In the browser, under the options (I cannot zoom this part, sorry), if you use Chrome, under More tools, Developer tools, we have a nice tool to identify the portions of the page that we need. I am using the Elements tab in this case, and we will understand why Elements in a couple of minutes, but I also have this tool, the inspect tool. Why the inspect tool? Because with the inspect tool I can move over the page and identify each single part of it. For example, if I go here and click on this orange, orange-green box, as you can see, in this area I get the HTML code that the browser uses to show me the box. Why do I say "the browser"? Because I have the HTML code in two tabs, Elements and Sources. In Sources I have the HTML code received from the server by my browser; in Elements I have the HTML code reworked by the browser. So if I use Selenium, I am using a browser to run the scraping activity, and it's better to work from the Elements side. This is also why Selenium is better than other tools here: if the website uses a lot of Ajax, scripts and so on, everything can still be captured by Selenium, so it's not a problem.

As you can see, if I go here I can identify this box with discoverable-card; if I go to the second one, I have a lot of discoverable-card elements, one for each box. If I go down and click, as you can see, I can identify each single part. So my goal today is to scrape these discoverable-card elements and run queries on them to extract the title, the description, the balance. Of course I need a query language, because I have to run queries on this web page. There are a couple of query frameworks here; the first is XPath, which is not our focus today, because I usually use the CSS selector framework. Why CSS selectors? Because with CSS selectors you can use the HTML tags and also the classes, so it's easier to use the design of the page to identify the portions. It's easier for us to use, for example, the class: if the designer used the discoverable-card class to identify the project cards on the page, I can treat that as a feature of the website, and it becomes easy to run a query of this type. The queries in CSS selectors are very easy; the language is not complex like SQL, for example. We have, for example, div.class: if I run a query of the form element.class, I can take every portion of the page with that element and that class. For example, if I go here, in the Elements panel, I use Command+F (because I am on a Mac; otherwise Ctrl+F) to open the "find by string, selector, or XPath" feature, and I can write my query. For example, I write this query, and as you can see I am able to navigate the page: I found 14 different portions; in my mind there are only 12, but it's easy now to identify a single portion, because, as you can see, if I use the inspect tool I can go here and identify the class. I have the class discoverable-card-title, so I can run a query, for example, with div.discoverable-card-title (I put the query in the chat), and in this case I find 12 different titles, and if I move down through them with the arrow I can identify each single title on the page. So the complex part of scraping is exactly this: trying to identify the right query.
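As the queries pile up, it helps to keep them in one place. A small sketch of such a mapping; the class names are the ones read off the page during the session, so treat them as assumptions and check them against the live markup with the inspector:

```python
# CSS selector queries for the pieces of each campaign card.
# "tag.class" matches every <tag> carrying that class;
# "tag.class1.class2" requires both classes on the same element.
SELECTORS = {
    "card":        "div.discoverable-card",             # the whole campaign box
    "title":       "div.discoverable-card-title",
    "category":    "div.discoverable-card-category",
    "description": "div.discoverable-card-description",
}
```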
Of course, Indiegogo is very simple, and that's why I am using Indiegogo. For example, I can try to identify a single category: it's very easy because I have the class discoverable-card-category, so if I run a query with div.discoverable-card-category I can find each single category on the website. I put this query in the chat as well. This is the complex part, because it is not always as easy as in this case. If you are in difficulty, you can also get some help from the browser: if you right-click on a portion, you have the Copy menu, and under the Copy menu there is Copy selector. With Copy selector you get some help writing the query. For example, if I copy this one, as you can see I get a complex CSS selector query, but if I remove the first part I have the query that I need. Of course, in this case I have two classes: with the syntax element.class1.class2 I am requiring both the first class and the second class. So this is the main part of the exercise. Let me identify the description, for example: in this case it is discoverable-card-description, and also in this case it's very easy to identify the description part. I am putting this one in the chat too. So at the end of the story, the first task in scraping is to write a table with each element and its query (in this slide you have everything; of course I am navigating the site live, but in the slide you will find every single part): for example, for the link I use div.discoverable-card, for the category div.discoverable-card-category, for the title div.discoverable-card-title, and so on.

Now we can move to Selenium. I don't know if there are questions, Raquel? No, OK. Now it's very easy, because we have the queries, and we only have to use Selenium to run them. So I am going to Google Colab; let me open the notebook so you can see how. I go to File, Open notebook; you can use the link to my notebook on GitHub: if you go to the GitHub tab in Google Colab and paste the link, Colab opens the page and finds the notebook, and you can click on it. So we can start. The first thing: Google Colab is in the cloud, it is not on our local system, so we have to install the libraries. With pip install selenium and apt-get I am installing Selenium and installing the browser, so Selenium will use a phantom browser on Google Colab. Of course you can also run this notebook on your own machine, it's not a problem: you only have to download the notebook, open it in Jupyter for example, and download the ChromeDriver from the link you find in the notebook. Pay attention to your version of Google Chrome: the version of Google Chrome has to match the version of the ChromeDriver, and then you have to edit the connection as shown in the notebook. In any case, in the notebook you will find the code to run it locally. After this process we can import the libraries. As you can see, I am importing Selenium and the tqdm library (tqdm is only there to have a progress bar; it's a nice touch, not a functional requirement, only to make my notebook nicer), pandas to produce the output, and also pretty-print, which is there to print the JSON, the dictionaries we create, in a nicer way.
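Here is what that setup part looks like as a single Colab-style cell. This is a sketch, not the exact notebook cell; the apt package name is the one commonly used on Colab, and locally you would instead download a ChromeDriver matching your Chrome version, as described in the notebook:

```python
# Colab-style setup cell: install the Python binding plus a browser/driver pair.
!pip install selenium
!apt-get update -qq
!apt-get install -y -qq chromium-chromedriver

# Imports used through the rest of the session.
from selenium import webdriver
from tqdm import tqdm          # progress bar, purely cosmetic
import pandas as pd            # to display and export the final dataset
from pprint import pprint      # nicer printing of the scraped dictionaries
import time
```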
After the imports I can create the Chrome options. Why Chrome options? Because we have to give some input, some parameters, to the Chrome driver: for example headless, because I am in a cloud environment, so the phantom browser is completely invisible. If you run this notebook locally you can remove this option and you will see the browser open, the clicks and so on. Of course, we also have to set the user agent. What is the user agent? It is the signature of your browser when it makes a request to the server. In this case I am setting that signature. Why? Because when I run the scraping, the crawling activity, I want to emulate a user, so my phantom browser will present the signature of Chrome on a Mac. In some cases it's better to use a real signature, so set a real user agent; in other cases it's better to use a custom user agent: for example you can use "I am a bot and I am scraping your data"; you can put whatever you want, it's not a problem. Of course, if you want to better emulate a user, it's better to use a real signature.

Now we have everything and we can instantiate our phantom browser, in this case Chrome, so webdriver.Chrome. I pass the Chrome options as input, and the resulting object is my phantom web driver. As you can see, I don't see a browser, but if I run a get request for this page (it's the same page that I am using here) and then take a screenshot on that object, I can see the page. Let me open the image: if I go here and click on it, I can see the image in this panel, and as you can see my browser is fine, my phantom browser has navigated to the page. So now we can use Selenium to run our CSS queries. Let me check: we have a lot of nice features on this object. As you can see, we can add a cookie, we can close the page, we can get the current URL, we can delete cookies, we can get the page source, and we also have methods like find element by class name, find element by CSS selector, and the find elements versions of the same: find elements by XPath, find elements by CSS selector.
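Putting those pieces together, a minimal sketch of starting the headless browser and checking that it really rendered the page; the explore URL and the user-agent string are placeholders, so use whatever you copy from your own browser:

```python
options = webdriver.ChromeOptions()
options.add_argument("--headless")                   # no visible window on Colab
options.add_argument("--no-sandbox")                 # often needed in containerized environments
options.add_argument("user-agent=Mozilla/5.0 (Macintosh) ...")  # placeholder signature

driver = webdriver.Chrome(options=options)           # locally, point it at your ChromeDriver if needed
driver.get("https://www.indiegogo.com/explore/all")  # illustrative URL for the explore page
driver.save_screenshot("page.png")                   # quick check of what the phantom browser sees
print(driver.current_url, len(driver.page_source))
```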
What is the difference between the two? find_element returns one single element from your query; find_elements returns a list of elements. So if your query is wrong, find_element raises an error, while find_elements returns an empty list. I usually always use find_elements, because with find_elements I get a list, and with len on that list I can check whether my query is right or not and decide what to do depending on whether the element is present.

Let me go down a little. First of all I want to get each single box on the Indiegogo site, so I use the first query, div.discoverable-card. If I run the query and check the result, as you can see I am finding 12 campaigns. That's expected, because on the first page of Indiegogo there are 12 projects. Now I can write the main part of the scraper. As you can see I have about 20 lines of code, but they are more or less always the same: I loop over the projects in my list (this list is the result of the previous operation) and on each project I run the queries. So the queries we prepared earlier are run on each single project, on each single area of the Indiegogo site. For each one I use a basic condition: if the query returns at least one element, I take the first element and its text; if the query returns nothing, I leave the variable blank. With this approach I can scrape the first page of Indiegogo, and we can check the result; it's very fast. As you can see, I append my data to a list, and I get the balance, the category ("Productivity"), the currency (Euro), the description, the image, the title. I think this is the first one; if I go here, as you can see, this is the first and this is the second. So we have built our first scraper for the first page of Indiegogo. Now I create a function: I put this code inside a function only to organize it better, and the code is the same as before. The input of this function is the box, the card of the project; I run the scraper and return the dictionary. Now I run my function on the first page: my expectation is to find 12 campaigns. It's very fast, and it's nice because we also have tqdm.

Now we have to run the scraper over a lot of pages, because so far we only have 12 campaigns and our goal is to download everything. So the question is how to end the activity, and there is more than one possibility: I can run the scraper on the first 1,000 pages and stop when I get an error, a 404 page, or I can try to find out how many projects are on the website, so that I know the right number of projects. There are a lot of options; there is not only one way to stop a scraper. In this case, for example, I run the scraper (let me decrease the number from 10 to 5) on the first 5 pages. As you can see, I can also click on the cookie banner: this element, identified by an id, is the banner, and with Selenium we can click an element on the page. Then I run my scraper and download the 12 projects, and then I click on "show more" (with Selenium we can click on this object) and I get more results. So my goal is to emulate the user: go down, click, scrape the next 12 projects, go down, click, and so on. With these ten lines of code I click the cookie button if it is present, check and download the first page, click "show more", and then wait.
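A condensed sketch of that loop, assuming the SELECTORS mapping and the driver, tqdm, pandas and time imports from the earlier sketches; the "show more" selector and the cookie-banner id are guesses, so replace them with whatever the inspector shows you:

```python
from selenium.webdriver.common.by import By

def scrape_card(card):
    """Turn one campaign card into a flat record (one future row of the dataset)."""
    record = {}
    for field, query in SELECTORS.items():
        if field == "card":
            continue
        found = card.find_elements(By.CSS_SELECTOR, query)
        # find_elements returns an empty list instead of raising,
        # so a missing piece simply becomes a blank value.
        record[field] = found[0].text if found else ""
    return record

# Accept the cookie banner if it is there (placeholder id -- use the one you find).
banner = driver.find_elements(By.ID, "cookie-banner-accept")
if banner:
    banner[0].click()

# Click "show more" a few times to load more cards, then scrape everything visible.
for _ in range(5):
    more = driver.find_elements(By.CSS_SELECTOR, "button.show-more")  # selector is a guess
    if not more:
        break
    more[0].click()
    time.sleep(3)  # be polite with the server and let the new cards load

rows = [scrape_card(card)
        for card in tqdm(driver.find_elements(By.CSS_SELECTOR, SELECTORS["card"]))]
df = pd.DataFrame(rows)  # the structured dataset: one row per campaign
```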
I usually put in this time.sleep because I want to be polite: I don't want my scraper to put the web server in difficulty, so I use the sleep to be more polite with the web server and also to better emulate a user. As you can see, I put three seconds of wait between a click and the next page. As a result we have 251 projects, and for these projects we have all the titles. Now we can use pandas, for example, to visualize the dataset (we have the URL, the title, all the data) and to download the final data. So we have built our dataset. In the notebook you will find several other options for downloading additional fields, so please play with the notebook, it's not a problem.

Let me move towards the conclusion of this interactive session, because before the end we should keep in mind that we can also use the API exposed by the site owner. Remember that in the browser we have the Elements panel but also the Network panel. Let me show you: if I clear it with the clear button and then click "show more" on Indiegogo, the Fetch/XHR section shows me that there is this "discover" call. If I look at this "discover" call, I am lucky, because in the preview I can see a JSON document with 12 objects, as you can see, from 0 to 11. So I can, for example, copy the JSON and open it in an online JSON viewer (I am using the first one), and as you can see, with one request to this endpoint I get 12 projects, and for each project I have the category, the open date, the description, the title, the Indiegogo project ID, the image URL, the funds-raised amount, the funds-raised percentage, the currency: a lot of nice fields. So in this case I think it's better (let me go down to the end of my notebook) to use this request to run the scraper, because it's much faster. As you can see, with only the requests library I can run a POST; I can find the parameters here, because if I go to the Headers tab I can see the request payload to use, and I can use requests.post to emulate this behaviour. Let me run it only for the first 25: in about twenty seconds I can scrape 24,000 projects with a lot of details. So remember: before you start writing code, please investigate and inspect the website as well as you can, and get to know the website.
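A sketch of that requests-based approach; the endpoint path, the payload fields and the key holding the projects are exactly the things you have to read off the Network tab yourself, so everything below is a placeholder rather than the real Indiegogo API:

```python
import time
import requests

# Placeholders: copy the real URL and payload from the "Request payload"
# shown in the Headers tab of the Network panel.
url = "https://www.indiegogo.com/private_api/discover"
results = []
for page in range(1, 26):                       # first pages only; be gentle
    payload = {"page_num": page, "sort": "trending"}
    response = requests.post(url, json=payload, timeout=30)
    response.raise_for_status()
    # The key that holds the 12 projects is whatever you saw in the preview.
    results.extend(response.json().get("response", []))
    time.sleep(1)                               # politeness again

print(len(results), "projects collected")
```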
I have only one minute left, so let me finish with this point. There are a lot of legal questions about scraping and crawling data from the web: court cases, regulations and warnings issued by websites. Of course we have to be polite, and remember to ask yourself the question: can I scrape this data? The first step is to understand whether the stakeholder is happy for you to scrape the data. How do you check? You can go to the website and check, for example, the robots.txt. The robots.txt is what crawlers use (Googlebot and so on) to check which pages they may visit. As you can see, Indiegogo's is quite permissive: you can navigate every page of the site except the account pages; the projects and so on are allowed. So first of all, check whether the site lets you download the data. If I go to LinkedIn's robots.txt, for example, it's really different: there are a lot of Disallow rules and also a legal notice. Second, try to inform the stakeholder, because in my experience a lot of the time the stakeholder has no issue with you scraping the data; they only want to be informed about your activities, what you do with the data, what your project and your challenge are. So please check everything: the robots.txt, try to engage the stakeholder, keep an eye on the timing, on the rate limits and so on. Of course, a scraping activity is not simple; with Indiegogo it's very simple, but it can still require a lot of time. So before you start, try to find out whether the scraping is feasible, and try to engage the website owner. I think you get the idea. So I am at the end of my time, I think. Is that right, Raquel?

Yes, that's correct. I don't know if there are questions or not. Lillian said that she would run the Jupyter notebook from Google Colab; I think Jesper already said that. That's good.

So, one more suggestion for you: my notebook is very simple, but you can reuse it any time. I have a lot of students coming from economics degrees, marketing studies and so on who use my notebook to create some nice scrapers. Of course, Google Colab and my notebook are useful for one-shot scrapers, but if your goal is, for example, to investigate prices or to investigate reputation, it's good: you can adjust my notebook to your case and reuse it every time. Raquel, do we have some questions on Matrix?

I don't see any questions, so if anybody still has questions, feel free to ask now. Okay, thank you, Hugo; I'll post all of the links we shared here as well.

Yes, in my presentation, in the repo, you will also find my email and my LinkedIn profile. Feel free to write to me if you have issues with the notebook or if you want more information about my work on the ingestion, and also on the other parts of the work, because the ingestion is only the first part. Sometimes I think it is the nicest part, because you see the result of your work; of course it's only one part of a data science system, the first part. Thank you, Yash.

Yes, I think what Yash said is what I feel as well: I've seen quite a few of these web scraping tutorials, and I think this has been one of the best I've seen, very clear, so thank you so much. If anybody has any other questions, feel free to still ask on Matrix, and if you feel like you want to have another session with Mauro, if Mauro has the time, we could also book you into another session in the city walks; that's the open space, so it's not recorded, and you can just hang out with Mauro and ask any questions over there. That would be in the conference, in the city walks open space. (Yeah, I can do that as well.) You can join the Jitsi and have another video chat, or alternatively you could go to the Wonder.me lounge area and just hang out. Okay, thank you. Okay, thank you very much, Mauro. So now I guess we can sign off, have a break, and then be ready for the next session soon. Okay, thank you guys, and thank you again, Raquel, for your assistance. Bye bye.