Hi, I'm Kyra Villanueva, and I'm a software engineering intern at AgriList. AgriList is a startup focused on indoor farmers: we aggregate their data and build predictive models for better yields. I'm also a research assistant at the Lamont-Doherty Earth Observatory, where I'm building a database for their seeds; that laboratory focuses on climate change and extracting cores. And I'm a computer science major at Columbia University.

Today I'll be talking about web scraping using Scrapy. So what is web scraping? Web scraping is a technique for extracting information from websites, usually by translating unstructured data into structured data. The unstructured data is the HTML of a page; you turn it into a structured dataset that can later be saved to spreadsheets or databases. People have probably done the simplest kind of web scraping before: physically copying and pasting things from the web, like right-clicking and copying from a Wikipedia page. You can also use Unix grep with regular expressions, you can use a framework like Scrapy, or you can use visual web analyzers.

Here are some examples of what web scraping can actually do. Say you're shopping for a video game and want the best bang for your buck. I like horror games, so it would behoove me to point Scrapy at GameSpot, IGN, and all the other game websites, extract all the prices, and build my own little application that finds me the best price, instead of browsing through Steam sales. You could even include Steam itself and aggregate its prices too. Another example is weather data: say you're researching climate, or one of your variables requires climate data, and you don't have any sensors. You can pinpoint the coordinates of a specific region and grab data from NOAA, the National Oceanic and Atmospheric Administration, which publishes a bunch of datasets. Today we'll be going through a list of conifers, extracting just pine trees from a simple table; it's straight-up HTML.

For installing Scrapy there are a couple of requirements. It's currently unstable with Python 3 but very stable with Python 2.7. Pip comes with Python, lxml comes with most Linux distributions, and OpenSSL comes with most operating systems except Windows. Once you have all of that, you can move forward with installing Scrapy: just open up your terminal and pip install it. I already have Scrapy installed on my computer, as you can tell, so I'll just show a bunch of commands.

Before we get to scraping, what exactly is Scrapy? Quoting verbatim from the Scrapy documentation, because they worded it pretty well: "Scrapy is an application framework for crawling websites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival." You can find out more at scrapy.org. If you want to see the list of global commands Scrapy has, you can type scrapy -h in your terminal. And say I want to know what startproject is, or what it does.
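For reference, the terminal commands described so far look like this (a minimal sketch; the pip invocation assumes the Python 2.7 setup the talk targets):

    pip install scrapy        # install Scrapy and its dependencies
    scrapy -h                 # list Scrapy's global commands
    scrapy startproject -h    # help text for one specific command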
I can just run scrapy startproject -h, and it gives me what it does: this is how you start a Scrapy project, scrapy startproject followed by your project name. You can use whatever project name you like, but it helps if it's relevant to what you're doing. The help text tells you it creates a new project, along with the other things you can do with the command.

This is the basic structure of a Scrapy project (there's a sketch of the generated layout below). The root folder is where you'll be running your crawl commands, and it holds the project configuration. Your item fields are defined in items.py, further bot configuration goes in settings.py, and your spiders live in the spiders folder.

So let's build a bot. Go to whichever directory you feel comfortable in; I'm just going to my desktop. To start a project you say scrapy startproject, and I'm going to title it conifers, because we're going to be scraping conifers, or pine trees. Scrapy tells me the project was created and that I can cd into conifers and start building my bot. I use Sublime, so I'll open the project in Sublime. Now I'm given this whole directory with a bunch of generated files, and the part to look at is the spiders folder. I'll create a new file in there and save it as conifers_spider.py.

What I'm going to do first is clone the web page with the list of plants. The list is right here; this is what I'll be scraping: the genus, species, and common names of the conifers, all of this. One awesome thing about scraping is scale: I could copy and paste all of this by hand, but what if the page kept scrolling all the way down, practically endless? That's where scraping really comes in handy, because you can just say, "I want all these plants." So in the spider file I import scrapy and create a class called ConifersSpider. I name the bot conifers, and the allowed domain is the site where the list lives; that's the source domain. Then come the start URLs; you can add as many start URLs as you want, but since we're only working with one web page, I'll add just this one (a skeleton of the spider follows below). With that, we're set up to clone this whole HTML page.
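For reference, startproject generates a layout roughly like this, trimmed here to the files mentioned in the talk (your Scrapy version may add a few more):

    conifers/
        scrapy.cfg            # project configuration; run crawl commands from this root
        conifers/
            __init__.py
            items.py          # define the fields your scraped items will carry
            settings.py       # further configuration for your bot
            spiders/          # your spider files live here
                __init__.py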
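And here's a minimal sketch of conifers_spider.py as described so far. The domain and URL are placeholders, since the exact page isn't spelled out in the recording; substitute the site you're actually scraping:

    import scrapy

    class ConifersSpider(scrapy.Spider):
        name = "conifers"                           # the bot name used with `scrapy crawl`
        allowed_domains = ["example.org"]           # placeholder: the site hosting the plant list
        start_urls = [
            "http://example.org/plants/conifers/",  # placeholder: the list-of-conifers page
        ]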
Now I'll create a parse function, passing self and response. In it I build a filename: I take the URL, split it on the slashes, keep just the plant-type segment, and add an .html extension. Then I open a file with that name and write the response body to it. What this does is clone the web page. If I go back to my terminal, change directory into the root of the project, and say scrapy crawl conifers, it runs, and it has copied the whole web page. I can open that file up, and that's basically it: the whole web page is right there.

That clones the whole page, but the definition of web scraping is about more than cloning; what if you want just the raw data? For that, you look at fields. In items.py I create them: I want the common name, which is a scrapy.Field(), and I want genus and species the same way. This just creates the field properties for our conifer items (both files are sketched right after this walkthrough).

Back in conifers_spider.py, I comment out the file-writing code, save, and create a function with the same name, parse. For every row in this table I'm going to create a conifer item with genus, species, and common name properties. But first we have to import the item class: from conifers.items import ConiferItem, which pulls in exactly what we just defined. The loop is for row in response.xpath(...), which uses XPath to grab data out of the HTML. One thing about this: you really have to look at the page and be specific about what you're extracting. I'm starting at the body of the table, so that's tbody, and then I want all of its rows. From there I can say item = ConiferItem().

So the loop walks the table body and extracts everything. First the common name: inspecting the page, you can see it's a table column with a class of common-names, so you put that into your XPath. It doesn't stop there, because we don't have the text yet: the a link inside the cell holds the text, so you append a and then text(), and Scrapy extracts the text out of that link. That's the name. For the genus you do the same thing; you just have to keep referencing the web page you're looking at and keep digging through it. It's a column with a class called plant-name. When you reference classes or IDs in XPath you use brackets, like [@class="plant-name"], and the column has more markup nested inside it: under that class there's an a link, and inside the a link there's another span whose class name is genus, whose content is the text we're after.
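Here's a sketch of that first, page-cloning version of parse, as a method inside the spider class. The [-2] index is one common way to pull the second-to-last URL segment (the plant type); adjust it to match your actual URL:

    def parse(self, response):
        # Split the URL on slashes and keep the plant-type segment for the filename
        filename = response.url.split("/")[-2] + ".html"
        # Write the raw response body to disk -- this clones the page
        with open(filename, "wb") as f:
            f.write(response.body)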
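And a sketch of items.py with the three fields described above:

    import scrapy

    class ConiferItem(scrapy.Item):
        # One field per column we want out of the table
        common_name = scrapy.Field()
        genus = scrapy.Field()
        species = scrapy.Field()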
So we have the a link, then a span, and the span has a class called genus; that's what you extract. For the species I can just copy and paste the same thing, because they sit at the same level: the td class name stays the same, and I only change genus to species in the span's class name. Once you have everything you want to extract, specifically the text, you yield the item (the finished spider is sketched below for reference).

I save that and go back to my terminal, in the root directory again. I tell it scrapy crawl conifers, the name of the bot, with -o, which turns the scraped data into an output file; so if I want a JSON file, I give it a .json filename. Sometimes Sublime gives me errors and I have to retype a line because of indentation problems. And when you get a blank output, you often have to look at the way you wrote your XPath, because everything you're drilling into there can create a bunch of errors. Let me look back at the page... okay, cool. Now you have it all: this is a whole JSON file of the data I just extracted, which you can then use in your application and so on. Say I want it in a CSV file instead: I can do the same thing, scrapy crawl conifers -o trees.csv, and it shows up in your root folder. And there you go: I have the genus, the common name, and the species of each of the conifer plants. All of this material is available online. Thank you.
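For reference, here's a sketch of the finished spider, putting the XPath logic together. The class names (common-names, plant-name, genus, species) follow what's described in the talk, though the real page's markup may differ slightly, and the domain and URL are still placeholders:

    import scrapy
    from conifers.items import ConiferItem

    class ConifersSpider(scrapy.Spider):
        name = "conifers"
        allowed_domains = ["example.org"]                      # placeholder domain
        start_urls = ["http://example.org/plants/conifers/"]   # placeholder URL

        def parse(self, response):
            # Walk every row of the table body
            for row in response.xpath('//tbody/tr'):
                item = ConiferItem()
                # <td class="common-names"><a>Common name</a></td>
                item['common_name'] = row.xpath(
                    'td[@class="common-names"]/a/text()').extract()
                # <td class="plant-name"><a><span class="genus">...</span></a></td>
                item['genus'] = row.xpath(
                    'td[@class="plant-name"]/a/span[@class="genus"]/text()').extract()
                item['species'] = row.xpath(
                    'td[@class="plant-name"]/a/span[@class="species"]/text()').extract()
                yield item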
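And the crawl and export commands from the demo (the JSON filename here is illustrative; the format is inferred from the extension):

    scrapy crawl conifers                  # run the bot from the project root
    scrapy crawl conifers -o trees.json    # export the scraped items as JSON
    scrapy crawl conifers -o trees.csv     # export them as CSV instead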