And I've just been informed that I needed to turn the microphone on. So, with that said, I'm going to use the New York City DOB Building Information System website to showcase some basic web scraping techniques. Even though we could just lift all of this data from the open data portal, download a CSV file, and have everything we would ever want, I believe it's good practice to look at a known data set, really pick at it, and get a feel for how it works. Once you know how to do that, you can move on to a new data set that might be a little different, and you'll have some experience with solving these problems.

I particularly like the Building Information System because it's a very rich data set. It tells you pretty much anything you would want to know about the boiler permits, the work permits, the open violations, and many other things associated with a given property on the Department of Buildings website. And as I'll show you briefly, it has a very minimalist layout, so you don't really have to worry about anything major as far as HTML parsing goes. It's basically just a series of HTML tables, which I'll show you in a moment. The quickest way to get there is nyc.gov/bis, so I'll take you there right now.

This is where you land if you type in that link, and we're going to click on Building Information Search. Now, this becomes important later: when there's a lot of demand on the servers, a splash screen comes up for a moment, and that will disrupt the web scraping process. We need to be able to account for things like that; it's just part of web scraping that sometimes you have to put wait times into the process to get it to work properly. Sometimes the page is there right away, and sometimes there's that "wait a moment while we get your information" screen. If you wait long enough, it gives you the information you want.

So let me show you how to look up a given property. For this one, I'm going to choose the building from the 1984 Ghostbusters movie, where the Ghostbusters confronted Gozer and all that; it's based on a real Art Deco building, 55 Central Park West. So I hit this, and here is all the information. You've already supplied the address, and you get the borough, the zip code, and a unique identifier the DOB assigns to every building, called the Building Identification Number. For tax purposes there are the tax block and tax lot; people who work with city data will often search for a property by the so-called BBL, which is a concatenation of the borough, block, and lot.

As you can see, there's a lot of information here: alternate addresses for the building, the cross streets, whether it's a landmark, that it's an elevator apartment building, and many other bits besides. A lot of that information is useful right off the bat. For instance, in my line of work, it pays to know what the cross streets are: when you're dispatching a technician, if you can give them the cross street right away from a handy lookup table, you save a lot of the headaches of "oh, I thought you said the east side, not the west side." And if that's near peak traffic hours, you're losing a lot of time and a lot of business. Then the landmark status comes into play when you're writing up plans for alterations of, in my line of work, the boiler system.
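Since the BBL comes up so often in city data work, here is a quick sketch of how it's put together: the one-digit borough code followed by the zero-padded five-digit tax block and four-digit tax lot. The block and lot values below are made up for illustration.

    # BBL = 1-digit borough + 5-digit tax block + 4-digit tax lot, zero-padded.
    # The block and lot numbers here are hypothetical.
    borough, block, lot = 1, 1113, 29
    bbl = f"{borough}{block:05d}{lot:04d}"
    print(bbl)  # -> 1011130029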
But it could be the electrical system or anything, because when a building is landmarked, there are a lot of extra stipulations that come into play that you need to account for. It also helps to know what kind of building it is. Is it a walk-up? An elevator building? A condominium? That helps us tell what sort of heating equipment they're going to use, because if it's just a small walk-up, it's probably going to be a smaller unit, and so on.

Now, another nice thing about this layout. I'm going to use one of the developer features in the Chrome browser: right-click and go to Inspect. As you can see, you can go through the HTML, and what I've found is that this page is just a series of table elements. So it's a matter of finding the right table element that has the data you want, which narrows down a lot of the searching; all you need to know is which tag holds the text you're looking for. A table is a nested object: the main tag is <table>, inside that is the table body <tbody>, then the table rows <tr>, and finally <td>, the table data cells. Everything you're looking for is going to be held inside one of these open-td, close-td tags. So all you have to do is look for a tag that has the text you want, and you're done.

Another nice feature of this particular data set is that once you find the tag for the label you want — say, the table entry that has the words "cross streets" in it — the next element over will be the data itself. So you find the label, step to the next element, and you can populate your data table rather efficiently from essentially two commands.

One last thing I'd like to point out is that when you've run a query, the URL itself carries information about the query you've run: the part after the question mark, the query string, reads parameter1=value1&parameter2=value2, and so on. And it turns out (I'll spare you some of the details) that all the other values in there were extraneous, so if you just hit that trimmed URL again, you get the same exact page. That way you can feed these values to a query programmatically, and you don't have to manually type in all of that address information from before. So it lends itself very easily to being automated, and very easily to being searched.

OK. So let me tell you something about the two modules I've had you install. First there's the requests module, billed as "HTTP for Humans." It's simply a module for sending HTTP requests to websites. It has urllib3 as a dependency, but all you really need to know is a couple of basic commands. For instance, to reproduce that URL I showed you before, you import requests, set up your base URL (everything before the question mark), and set up your address parameters: the borough (1 stands for Manhattan), the house number 55, and the street, Central Park West. Then, to get that information off the web, you just run website = requests.get(base_url, params=params), sending that dictionary of parameters, and you get back exactly the URL I showed you before, question mark and all, with plus signs filling in the spaces. Pretty much no special formatting considerations; it does it all for you.
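As a concrete sketch of that call, here is roughly what it looks like. The servlet path and parameter names are my best reading of what's shown on screen, so treat them as assumptions and check them against the actual query string in your browser.

    import requests

    # Everything before the question mark. The exact host and servlet path
    # are assumptions based on what appears in the talk.
    base_url = "http://a810-bisweb.nyc.gov/bisweb/PropertyProfileOverviewServlet"

    # Address parameters; borough code 1 stands for Manhattan. The parameter
    # names are assumptions; read the real ones out of the query string.
    params = {"boro": "1", "houseno": "55", "street": "Central Park West"}

    website = requests.get(base_url, params=params)
    print(website.url)  # requests builds ?boro=1&houseno=55&street=Central+Park+West for you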
And then if you want to get the HTML code for that website, all you have to do is say website.text. (I've omitted the actual HTML this produces, for space.) All right, is everybody following along OK? OK.

So, the Beautiful Soup module has many useful features. Not only does it parse HTML, it also parses XML, or anything with an HTML style to it, that open-tag, close-tag format. It essentially converts the HTML file into a searchable tree data structure and lets you jump to specific branches of that tree with simple commands. And it's pretty lightweight; it's coded in Python itself, so it's not invoking twenty new dependencies or anything like that. If you need a little more speed — this is a bit more advanced — you can also swap out the parsing engine. You can use the default one they give you; you can throw in lxml, a C-based engine that runs a little faster because it's compiled C code rather than running in the Python interpreter; or you can use html5lib, which is slower than the others but completely HTML5 compliant, if that's something you're concerned about. Another advanced feature is that you can not only read the parse tree, you can also edit, add to, and delete from it with this package. But we're not going to go through that; we're just going to use the two most basic functions we'll need.

The find_all method is pretty much all we need; you can get very far in web scraping just knowing find_all, because web scraping is mostly about finding something, and this is the thing that finds things. You can search by tag: look for all of the table elements, or the td elements. You can search by the attributes that certain tags have. You can even search by text using the regular expressions package. And once you find the element you're looking for, you check that you got a result, and then one more call gets you the data from the next element over. That will be the general pattern here for parsing the website and extracting the data you want.

For now, I'll show you how one could go through the DOB BIS data and, given an address, find whether or not that address is a landmark, find the cross streets, and find the building class — whether it's an elevator building or whatever — at that address. So I'll go to my parsing code. I just invoke these four packages. The first and most fundamental function is the one I call getSoupFromBIS. I give it the base URL and the address parameters, and I also give it a number of times to retry the search, because as I showed you before, sometimes it gives you the load screen instead of the page you want. By default I tell it to wait a second between attempts and try ten times; if it keeps getting the load screen, it gives up after ten tries, or more or fewer depending on how you set it. So, for up to ten tries, I request that base URL with the parameters and check the text that comes back.
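Before we dig into that retry check, here is what the find-then-step pattern from a moment ago looks like as a minimal sketch; the exact label wording on the page is an assumption, so adjust the regular expression to match what you see in the HTML.

    import re
    from bs4 import BeautifulSoup

    html = website.text  # the page we fetched with requests above
    soup = BeautifulSoup(html, "html.parser")

    # Find the text node containing the label, then step forward to the
    # next <td>, which holds the value. "Cross Street" is assumed wording.
    label = soup.find(string=re.compile("Cross Street"))
    if label is not None:
        cross_streets = label.find_next("td").get_text(strip=True)
        print(cross_streets)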
If the phrase "visitor prioritization" occurs in the text of the HTML, then we know we've gotten the wait screen; that just happens to be the wording they use for the title of that HTML file. If I don't find that text — and here find returns negative one when the text is absent — then we've found our data. I also check that the page doesn't say the address is not in the property file, because it's possible to give this site a non-existent address like 50,000 Park Avenue. So if the address doesn't exist, or the site just can't give you the data in a timely manner, the function returns an empty result; otherwise it waits a second and tries again. And if it finds the data, it returns the parsed-tree version of the website, which is this Beautiful Soup object.

This is really a helper function, because the base URL differs depending on what you're looking for: you could be looking for building data, or you could be looking for boiler data. For instance, for this boiler-records function, I can swap out the base URL I had before, the property profile overview servlet, and put in the boiler compliance query servlet. I'm essentially able to reuse a lot of code and take a modular approach; that's the only reason I have this extra function here.

Then I have a function where I feed in the borough, the house number, and the street; it converts those three variables into the parameter dictionary and feeds that into the property-profile function, which in turn feeds the information into getSoupFromBIS. So you type in the query parameters and you get your parsed tree of the website for that building.

Finally, once you have that — we'll call it the soup — you use the find_all method to find the field with the landmark-status label. From that, if it exists, you jump to the next element, and that text will have the landmark status, whether it is a landmark or not. The same goes for the cross-street data, if those data exist, and finally for the building-class data, and it returns a dictionary of the found data. I'd also like to point out that if this did not get back a valid website, it won't freeze on you, because every time I do a search, I check whether there is actually something there. If I searched for 50,000 Park Avenue and got back an empty result, find_all would find no occurrences, so that branch simply isn't executed, and so on down the line. The values are initially set to empty strings, so you'd just get a dictionary of blanks; the code still executes, and if you have it in a loop, you can just skip over that rogue element and move on.

I did just that a bit earlier, so this might be a little hard to read, but what I did was throw this into a for loop over all the possible addresses on Park Avenue from 1 to 20. I printed out the address — i Park Avenue, New York, NY — followed by the result of my web scraping function: the dictionary showing the building class, the landmark status (if it's blank, it's not a landmark), and the cross streets. I don't do any kind of post-formatting right now.
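Putting the pieces together, here is a sketch of that whole pipeline as described. The function names, the wait-screen wording, the "not in property file" wording, and the on-page field labels are all stand-ins or assumptions, not a copy of the actual code.

    import re
    import time
    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "http://a810-bisweb.nyc.gov/bisweb/PropertyProfileOverviewServlet"

    def get_soup_from_bis(base_url, params, tries=10, wait=1.0):
        """Fetch a BIS page, retrying while the wait screen is up. Returns a
        BeautifulSoup tree, or None if we give up or the address is bad."""
        for _ in range(tries):
            text = requests.get(base_url, params=params).text
            if text.find("Visitor Prioritization") == -1:  # not the wait screen
                if "NOT IN PROPERTY FILE" in text.upper():  # assumed error wording
                    return None
                return BeautifulSoup(text, "html.parser")
            time.sleep(wait)  # wait screen came back: pause, then retry
        return None

    def get_building_profile(boro, houseno, street):
        """Return landmark status, cross streets, and building class for an address."""
        params = {"boro": str(boro), "houseno": str(houseno), "street": street}
        soup = get_soup_from_bis(BASE_URL, params)
        result = {"landmark": "", "cross_streets": "", "building_class": ""}
        if soup is None:
            return result  # dictionary of blanks, so a loop can skip bad addresses
        labels = {"landmark": "Landmark", "cross_streets": "Cross Street",
                  "building_class": "Building Class"}  # assumed on-page labels
        for key, label in labels.items():
            hit = soup.find(string=re.compile(label))
            if hit is not None:
                result[key] = hit.find_next("td").get_text(strip=True)
        return result

    # The naive sweep shown on screen: 1 through 20 Park Avenue in Manhattan.
    for i in range(1, 21):
        print(f"{i} Park Avenue, New York, NY", get_building_profile(1, i, "Park Avenue"))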
This is just to show you how it works; you can choose what to do with the data that are extracted. As you can see, it executes, and I found out that apparently there is no 14 Park Avenue. So if you're writing a script for a movie, you can have your character live at 14 Park Avenue and nobody will get upset about getting their mail. Same for 18 Park Avenue. And as you can see, there are a lot of elevator apartments and office buildings and a couple of garages. You can essentially devise any query scheme you can imagine for this data set, and this simple set of commands will return the dictionaries of all the information you want.

[Audience question] Yes, that's a good example. What I did here was essentially a naive approach to searching through these buildings. Certainly good practice for those cases — let's see, you said it was 9 Park Avenue, so we'll do that; I'm taking a shortcut here. So yes, in this case it's also attached to 11 Park Avenue, so it's very possible that 9 Park Avenue is considered the address for the garage beneath, and 11 Park Avenue is — well, in fact, that is a garage as well. But that is a good point, because sometimes you'll do this search, and this part that says "buildings on a given lot" is usually just one, but sometimes it can be two; I've seen it as high as eight in some cases. So it would also pay, when you're scraping, to pull this buildings-on-lot variable to make sure you're not skipping valid addresses.

Yes — and I figured that, at least for this talk, I didn't want to get everybody bogged down in the minutiae of the DOB system, as it were. But that's certainly a valid concern for any data set: understand the assumptions that are going on, and make sure you're not just getting garbage results because you're working from faulty assumptions.
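Following up on that point, one might pull that count with the same find-then-step pattern; the on-page label wording here is an assumption.

    # Same pattern as before: find the label, step to the next cell.
    # "Buildings on" as the label text is an assumption; soup comes from
    # get_soup_from_bis as in the earlier sketch.
    hit = soup.find(string=re.compile("Buildings on"))
    buildings_on_lot = int(hit.find_next("td").get_text(strip=True)) if hit else 1
    if buildings_on_lot > 1:
        print(f"This lot has {buildings_on_lot} buildings; check each one.")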
So then, just for your own information: if you want further documentation on the two packages I showed you, there is a wealth of information on these two sites, where the documentation is fleshed out in detail. There are also alternatives to Beautiful Soup. For instance, there's Scrapy, a more advanced web scraping framework; many of the things I did by hand here, that framework does within a more controlled structure. If you're interested in that, Cara Villanueva is giving a talk on it tomorrow afternoon, so I would recommend that.

Also, you'll sometimes run into a system that doesn't have a convenient URL like this one. For instance, here I can do a search for 55 Central Park West again in a different database that the New York City Department of Finance hosts. I search for it, I get the result, and then I have to go through this other process — and as you can see, the URL just says "BBL result"; it has no connection to what I've searched for. So sometimes you need something a bit more sophisticated than requests. I have found that the Selenium package is useful here, because it keeps all of the cookie and JavaScript handling internal. It's a bit top-heavy, but it gets the job done.

[Audience comment] Yes, right — he said that it's not just that you're sending GETs; the site is also using POSTs, which is why the query doesn't show up in the URL and you can't just replay it.

Well, as I said, it's a legal gray area, because again, I'm not a lawyer. I would say that for most cases, if you're not sure about it, I wouldn't put a cluster on the job; maybe just a small laptop will do. I suppose they could look in the logs and say, "hey, this IP address has been sending 50,000 requests a minute, maybe we should look into that." They can also tell what you are from the browser identification, so you can make yours look like a phone browser or Chrome or Safari so you don't trigger those anti-bot measures. And there might be cases where — in the US it's pretty clear what you can do within the gray area — but you might be purposely trying to get at something in another country, a human rights issue, where it's illegal in that country but it's a good thing to do. So you might want to spread your requests around so they can't easily pick up on your scraping.

OK, so, some good discussion on the process of web scraping. Does anybody else have any more questions? That is a good question, and I've been at this for a while and I don't really have a good answer at this point. I know that one of the city servers I depend on recently changed its format just a little bit, and I had to reconfigure my parsing logic. So yes, it's always worth double-checking every so often.
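For reference, a minimal sketch of the Selenium approach mentioned above might look like the following; the URL and the form-field name are hypothetical placeholders, not the Department of Finance's actual page.

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # drives a real browser, so cookies and JavaScript just work
    driver.get("https://example.nyc.gov/property-search")  # hypothetical search page
    box = driver.find_element(By.NAME, "address")          # hypothetical field name
    box.send_keys("55 Central Park West")
    box.submit()
    html = driver.page_source  # hand this to Beautiful Soup as before
    driver.quit()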