Hello. Hello. We are back. So the last real lesson of the day is web APIs. So Enrico, how do we start? Yes. So maybe it's good to start with understanding what we mean by a web API. I don't know if everybody knows what API means, but most likely everybody knows what UI means: user interface. So for example, let's say you want to search something with the browser, with the window, which is your user interface: you go to a website like google.com, and that's the interface that you work with. You type something and you get an answer, the search results. With an API, the interface is now an application programming interface. So the interface is no longer designed for humans, who can, you know, use the mouse and type and do things like that, but for machines. So this basically goes back to the concept of a server. I don't think we need to go into the details of what a server is, but in general we are all using these servers, meaning this software that, when we ask for a page or for some information, whether with our smartphones or with our computers, gives us an answer. So we are served the answer. What we're going to cover here is that in some cases it's very useful to interact with these sources of data that are not physically stored on the same machine where you are. Like the example that I gave: you might want to query something from Google, but instead of looking at the answer yourself, you can have Python look at the answer. And then of course there are other, more advanced uses of these so-called APIs, meaning that there are companies who might give you access to a specific API to do something. I guess you all heard about OpenAI and ChatGPT. That's another example: it would be impossible to store this huge GPT model on your local machine.
So you can, for example, via Python send the query, like the prompt, to the remote computer and then get an answer back. The simplest way of doing this programmatically with Python is with this package called requests. So requests is a Python library that makes requests to web servers, as I just gave you some examples of. You see here at the bottom the web page for requests is embedded, and they label it as "HTTP for Humans". Maybe I could ask Richard: what does HTTP mean, even though we type it almost every day? Yeah. So HTTP means Hypertext Transfer Protocol. And it's basically the way everything communicates with a web server. You send a request saying, I would like to see this web page, and it sends back saying, okay, here's the data you requested. And you can give other parameters, like what languages my web browser is configured to show, or things like that. Yeah. And "for humans" means that requests is trying to make it easy to use by humans. So they really care about the user interface of this. Exactly. Which basically simplifies your work. As usual, maybe one of the biggest lessons in this course is that often there's no need to reinvent any wheels. There are already very good packages that can do the heavy lifting for you. So here in this "retrieve data from an API" section, the introduction text is basically telling you that there is a list of many of these application programming interfaces that can be used without a key. If you start working with external APIs, sometimes you need to have a key. For example, in the OpenAI ChatGPT example that I was giving you earlier, if you want to query the remote model, you need to basically show that it's you, through a key, through a token. But in this case, we're going to use an API called cat fact, which is basically going to return some random fact about cats.
And so maybe enough with the talking, should we just get started with some Python? Okay, I'm arranging things. So I guess I will make a new notebook in my place and rename it. Okay. So how should I start? Yeah, so if you scroll a bit down, and you can also do the same for those who are watching, there's a section a little bit further down in the materials. We basically start with the import for this library. And then the code will also have this URL. So URL means uniform resource locator. It's basically, again, what we normally type daily: the address, in this case, of this remote server API. And then we can use this requests library by getting this URL. So what happens? So now there should be something inside this response. It's a response object. And with objects, you know that with dot and the name of the field, you can actually get some content. And here we see basically a string that starts with a b and loads of stuff. So the string looks like JSON, but it has a b in front. What does the b stand for, Richard? Oh, that's a thing I like talking about. So it looks like a string, but the b in front means Python is interpreting this as binary data. So basically, it's not assuming it has a text encoding. In Python, all the things that look like strings are either Unicode strings, meaning they should contain text and can have any Unicode character, like each index is a character, or bytes, with the b, which means each position in the string is a raw byte from 0 to 255. But I think we probably shouldn't go too much into that. It's something you can get deep into later. But then whatever comes after the b looks like JSON. Did we introduce before what we mean by JSON, the JSON format? I don't know, did we? Maybe I should say it again anyway. Yeah. So JSON is a way to turn objects into strings. So this looks like it would be a Python dictionary, but actually it's all encoded in a string.
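To make the bytes-versus-JSON point concrete, here is a minimal sketch. The cat fact payload below is made up for illustration, not a real API response:

```python
import json

# What response.content looks like: raw bytes (note the b prefix),
# not yet decoded into text or parsed into a Python object.
raw = b'{"fact": "Cats sleep a lot.", "length": 17}'

# response.json() essentially does this step for you:
# decode the bytes and parse the JSON string into a dictionary.
data = json.loads(raw)
print(type(data))    # <class 'dict'>
print(data["fact"])  # Cats sleep a lot.
```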
And basically, JSON is a way to do that encoding in a way that's standardized and shareable among different computers. And many, many APIs that return strings will actually give you the answer in JSON. And this is why this response object also has a .json() method. So what if we tried that part... So now is this basically like a dictionary? Let's see. Yes. It is a dictionary. So then we could now easily access, for example, using the keys, the different elements. All right. Instead, this is what you did earlier. Now it's listed as bytes. Yeah. So now basically you are putting this dictionary into a variable response_json, and then you're getting the key fact. Yeah. All right. So far, so good. So basically we can get some data, and get is one of the terms used in this HTTP protocol. But sometimes we also want to maybe send, or at least tell, the remote server that we don't want to just get the page, like in this specific case, but we want to filter according to what the remote API is allowing us. So often, again, you might have seen this when you browse the internet: you might see a URL and then a question mark followed by various values, like key-equals-value pairs. These are called parameters for the GET query. And in this specific case, for example, we use a remote API called universities.hipolabs.com, which is basically providing a list of all the universities on the planet. But you see here that now this URL has a parameter after a question mark: country equals Finland. So if we run this code, and we get the request from this URL, will it most likely give us a list of the universities in Finland? And as you can see, indeed it does. In this case, can we remind people what this colon two is? It's giving the first two elements of the list.
So if I remove it, then it will be a long response. Exactly. And we will see the whole list. But then this is the thing. So now we were passing this parameter through this question mark followed by the name of the parameter and the value. But with this requests library it is also possible to pass the parameters via the Python request itself. So should we try this bit of code, Richard, where now the URL is different, it's just the general URL for the API, but now we pass the parameter country equal to Finland. So I've just copied the whole thing. I guess the key point here is this params=params, which is a dictionary. Exactly. And now it will give the same results, but you can try, for example, with your own country: you can put Sweden, you can put Norway, and then programmatically you can get the list of all the universities that are at least listed in this remote API. So this is basically the main idea behind these remote APIs. Maybe rather than talking, we could let people try exercise one. It's an exercise where, in this case, you don't have to, how can I say, think up the solution; the solution actually follows after the exercise. So it's an opportunity for you to test another API. So even copy-pasting is fine. And explore the output and what it gives. We could assign 10 minutes. If this is too short for exercise one, you can also start with exercise two. And then later we will comment on what is going on with exercise two. Does this sound like a good plan, Richard? Yes. Yes. Very good. So let's have until 12.23. Start with exercise one. And if you have time, do exercise two. And then we will come back and do a recap and more stuff. Okay. Sounds good. See you later. Hello. There was a nice, interesting question in our notes document: are there any best practices or general guidelines?
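Under the hood, passing a params dictionary and typing the query string by hand are equivalent. A small offline sketch of that equivalence, using only the standard library (the endpoint is the same hipolabs URL used above):

```python
from urllib.parse import urlencode

base_url = "http://universities.hipolabs.com/search"
params = {"country": "Finland"}

# requests.get(base_url, params=params) builds this URL for you:
full_url = base_url + "?" + urlencode(params)
print(full_url)  # http://universities.hipolabs.com/search?country=Finland
```

This is why the two versions in the lesson return the same results: requests simply encodes the dictionary into the query string before sending the request.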
Well, when it comes to the coding, the examples that we gave you with this requests library are a good starting point for working with web APIs. But then one should also consider the ethical and sometimes even the legal aspects. Ethical can be in the sense that, you know, are you allowed to use this data? Is it ethical to scrape this data? Often we do this for research purposes, so we can do these things; we don't break any law. But then there's also the ethics of the fact that you are sending many queries to some remote computer: maybe you don't want to overload the remote server with, I don't know, 1000 queries every second. So often, let's say that you need to scrape many, many web pages, and we will soon show briefly a little example of that, maybe you want to have a break between every page, so that your Python code could scrape one page, wait for a few minutes, scrape another page, and so on. So there are no comments related to the exercises. So most likely they were easy and understandable enough. Maybe regarding exercise two, Richard: exercise two talks about something that we didn't really cover before, the headers. And in practice, if I understand it correctly, it's like what we tell the remote server: I'm on this computer with this operating system and this browser. But do we sometimes need to change these headers? Sometimes; I've done it occasionally. So sometimes the headers can be used for, for example, passing authentication information, so the API knows who you are and that you have permission to access it. But oftentimes there are other things in requests that can handle this for you. Or sometimes the headers would say things like what format you want the data in, or things like that. And I can't remember an exact time, but I've probably done it once or twice. Exactly.
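A minimal sketch of how custom headers are passed with requests. The header values here are made-up examples, and .prepare() only builds the request object without sending anything over the network, so we can inspect what would go out:

```python
import requests

# Example headers: a polite, identifiable User-Agent and a format preference.
# The values are illustrative; real APIs document which headers they expect.
headers = {
    "User-Agent": "my-research-script/0.1",
    "Accept": "application/json",
}

# Build (but do not send) a request, just to look at its headers.
# In practice you would simply call requests.get(url, headers=headers).
req = requests.Request("GET", "https://catfact.ninja/fact", headers=headers).prepare()
print(req.headers["User-Agent"])  # my-research-script/0.1
```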
In the learning materials there are some examples of these headers that, according to what you need to pass to the server, you would eventually modify. But often, if your goal is, for example, to scrape materials from the internet, it's okay to use the default headers of the requests library. And when it comes to scraping, exercise three, we don't have time to go through it, so please do it if you have time. The code that we see here is basically an example where we want to download the full HTML of a remote page. So now it's not an API in the sense that there's some software that, given the parameters that we pass, is going to return an answer like we had earlier in JSON. Now what is returned is the so-called HTML code, which is this structure, the language of the web pages that you see. So in practice, with the example that you see here, you request the HTML of some page, in this case example.com. And then there's another library that is used, if you scroll a little bit further down after this HTML: Beautiful Soup, which is one package that allows you to parse the HTML so that, again, you don't need to invent new wheels to extract the information that you need. Here, basically, it creates a structure of all the tags that are contained in this HTML file. And, for example, one can look at the title of the page or look for all the links. The links are marked with this tag a, and this is why you see find_all("a") there. So that will search for all the links. In this case, there was just one link on this web page. But it's time to wrap up. Maybe there's also something else interesting on this web page, which is saving the data. The example there is a useful example. There's nothing too difficult there, but it's a nice example of doing multiple queries, one after the other, like I mentioned earlier.
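The parsing step can be tried without any network access, by feeding Beautiful Soup a small HTML string. The HTML below is a made-up stand-in for what requests.get("https://example.com").text would return:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a downloaded page with a title and one link.
html = """<html><head><title>Example Domain</title></head>
<body><p>This domain is for use in examples.</p>
<a href="https://www.iana.org/domains/example">More information...</a></body></html>"""

soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)         # Example Domain
for link in soup.find_all("a"):  # every <a> tag is a link
    print(link.get("href"))      # https://www.iana.org/domains/example
```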
And you see at the very last slide, there is this sleep, if you scroll up a bit. Yeah. So this is exactly the thing. So we put one second between the multiple queries. In this case, there are three queries to this cat fact API. And then what's happening in this code is that the JSON replies from the API are stored: they are appended to the same text file. And this is a useful way to basically build a data set. So at the end, you will have a big text file where every line is a JSON string. And then most likely from this type of data, from this format, you want to move it into a pandas data frame to continue your analysis. I think we're done. Richard, do you have any comments on this data storage from APIs? Yes, I would say how this data storage relates to what we talked about yesterday, or two days ago. So you might have different ways of storing data for different purposes. So for example, when you're downloading, you want a file format that you can easily append data to, and that can't really get corrupted. So for the downloading, storing in a giant text file with one line per thing is easy and relatively foolproof. But when you're analyzing, then you might convert this to some other format which is more efficient for your analysis. There's a comment there: would a good way be to store it in an SQL server of some sort? I mean, it could be done, if you have that set up. Many of our researchers have access to a big file system, but not an SQL server. But another thing is saving it to disk first, just so you have it, because the file system is more reliable than an SQL server for, well, not getting corrupted. Yeah. I guess SQL would be a useful solution; if one is planning to store one million JSON files, maybe that's not the best. Yeah. I mean, it really depends on the size.
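The pattern described here, pause between queries and append one JSON string per line, might look roughly like this. To keep the sketch runnable offline, the API replies are hard-coded stand-ins for what response.json() would return:

```python
import json
import os
import tempfile
import time

# Stand-ins for replies from the cat fact API (made up for illustration).
replies = [
    {"fact": "Cats have five toes on their front paws.", "length": 40},
    {"fact": "A group of cats is called a clowder.", "length": 36},
]

# A fresh temporary file to collect the data set.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

for reply in replies:
    # Append one JSON string per line ("JSON Lines" style).
    with open(path, "a") as f:
        f.write(json.dumps(reply) + "\n")
    time.sleep(1)  # be polite: wait between requests to the server

# Later, rebuild the records line by line (e.g. before loading into pandas).
with open(path) as f:
    records = [json.loads(line) for line in f]
print(len(records))  # 2
```

Appending line by line means a crash halfway through only loses the current record, which is part of why this format is hard to corrupt.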
So even if you're scraping whole internet-level things, you have to really think about this and try to do it really well. And that's not easy. Yeah.