Let's do this again. So we have Navanatha with us. Am I pronouncing your name right? Yeah. All right, awesome. She's going to be talking about leveraging linked data using Python and SPARQL. A little bit about her: she's a data scientist with ACI Worldwide, she's the education co-lead for Women in AI Ireland, and she's a Python and data science instructor as well. So without further ado, Navanatha, maybe you can share your screen?

Yes. Can you see my screen? Yes, we can. All right, so can I start? Yeah, of course. There you go.

Okay. Thank you so much for that introduction. Today I'll be talking about leveraging linked data using Python and SPARQL. This is something I came across when I was doing my master's at Trinity College Dublin, three years back, and I thought it was really handy. I really love NLP applications, SPARQL, and data from Wikipedia. We all know it's very helpful for building models to be able to get text data from online resources, so this is a small presentation on how you can get that data from Wikipedia.

The first thing that comes to mind is: why Wikipedia? It's not just that it has loads of data; it's the nature of the data. It's cross-domain. It's multilingual: it contains data in more than 300 languages. It's freely accessible: if you need information about anything, your first step is usually to go to Wikipedia and search for it. And it's automatically evolving, because the data in Wikipedia is crowdsourced. All of these things make Wikipedia a really rich, informative database that we can leverage for NLP applications.

There are two ways of getting data from Wikipedia. One is traditional web scraping, and the other is using linked data. I'll give a demo of traditional web scraping using the wikipedia Python library. This library is basically a wrapper around Wikipedia that saves you from coding the HTML scraping from scratch: it lets you load a page into a WikipediaPage object and then access its contents, the URLs present on that page, and the image links, all through that library. Then we'll look a little bit into what linked data is and how we can leverage it using SPARQLWrapper, which is again a Python library that lets you query the open data behind Wikipedia.

So I'll share my notebook here to demo the traditional web scraping part. You will need the wikipedia Python library in order to use it. I'm using Google Colab, so Beautiful Soup is already available there. After the installations and imports, the first thing I do is use the search functionality of the wikipedia library. Like anything you'd search in Google Chrome, you can search here as well, and if Wikipedia finds that entity, it gives you a list of related pages: all the pages whose headings contain "Python". That's why I got this response. From that list, just as an example, I took the first one, the Python programming language that EuroPython is all about. Then, as you can see, I'm calling the page function and loading that particular page's contents as an object.
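[Editor's note] Here is a minimal sketch of that part of the demo, assuming the `wikipedia` package from PyPI; the query string and the printed properties mirror the notebook's description, but the exact titles returned depend on Wikipedia's current state:

```python
# pip install wikipedia
import wikipedia

# Search behaves like Wikipedia's own search box: it returns matching page titles.
print(wikipedia.search("Python"))

# Load one result as a WikipediaPage object; auto_suggest=False stops the
# library from "correcting" the title to a different page.
page = wikipedia.page("Python (programming language)", auto_suggest=False)
print(type(page))        # wikipedia.wikipedia.WikipediaPage
print(page.url)          # canonical URL of the article
print(page.links[:10])   # titles of Wikipedia pages linked from this one
print(page.images[:5])   # image URLs, handy for computer-vision corpora
```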
The type of this object is a WikipediaPage object, and it has different properties, like the content. The content is all the text of the article, and that's the most important part for us when building a corpus for natural language processing. The total length here is 39,061 characters for a single page, so you can imagine how rich that data can be; it's shown a bit collapsed here. This is a sample of the structure of how the page comes back from Wikipedia: the different sections are separated by headings and subheadings, so when you're processing this data you might want to keep or remove those, and you can use regex to filter them out. You can also access the URL of the page, the categories on the page, and the images.

How can you use this information? You can search for different topics, get all the related pages, and store that data. You can also walk through all the links related to a page: for example, you can query this list of linked pages, and there's another property for external links, so you can loop through those links, fetch those related Wikipedia pages as well, and build specialized corpora from pages related to the current page. For computer vision applications you can scrape the pictures from this data, and the main idea is that you get annotated data there too. For natural language processing you can likewise get annotated data, or write some logic to create those annotations automatically; we'll see how you can do that with linked open data.

The only caveat, the only problem here, is that the wikipedia library does not give you back the tables. But there's a really simple way of obtaining the tables from a Wikipedia page. The library also gives you the HTML layout of the Wikipedia page, and you can parse that with Beautiful Soup. It's a long text, I've just printed it out to show it. Then you can use Beautiful Soup's find_all on tables: the CSS class for Wikipedia's data tables is "wikitable", so you can query for that. This particular page contains only one table, but you can get the whole list of tables and parse them using pandas' read_html; you just have to convert the table to a string first, because it would still be a soup object.
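[Editor's note] A short sketch of that table-extraction trick, assuming the same `wikipedia` package plus beautifulsoup4 and pandas; StringIO is used because recent pandas versions prefer a file-like object over a raw string:

```python
from io import StringIO

import pandas as pd
import wikipedia
from bs4 import BeautifulSoup

# The wikipedia library exposes the raw HTML of the article.
page = wikipedia.page("Python (programming language)", auto_suggest=False)
soup = BeautifulSoup(page.html(), "html.parser")

# Wikipedia marks up its data tables with the CSS class "wikitable".
tables = soup.find_all("table", {"class": "wikitable"})

# Each soup object has to be converted back to a string for pandas to parse.
dfs = [pd.read_html(StringIO(str(t)))[0] for t in tables]
print(dfs[0].head())
```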
That's a brief demo of how you can scrape data from Wikipedia, and I'll go back to my presentation. Now that we've seen how to scrape the contents of Wikipedia, we'll look at how to get the data as linked data. The concept of linked data, as you can see, is about letting different things or entities in the universe, tangible and non-tangible, be tracked and related to the other entities they're connected to. It creates a mesh of things; this is one of the most used diagrams of linked data, and it's actually what the linked open data cloud looks like. We'll take a closer look at it, but first let's look at what DBpedia and Wikidata are.

The information we have on Wikipedia is all, in some sense, entities, and the entities are linked to other entities by relationships. This huge crowdsourced web of relationships is extracted to get structured information out of Wikipedia, and that information is made available on the web in the form of linked data. DBpedia was ideated first; it came before, and it started extracting semantic information from Wikipedia in a structured way. Wikidata came later; it's been around for a few years now. Wikidata is also a free and open knowledge base, but the difference is that DBpedia gives you data extracted from Wikipedia, while Wikidata gets its data from the other Wikimedia platforms as well, where they have more information on different things, like Wikibooks and so on. You can leverage both of these. Their endpoints are different; by endpoint I mean the URL from which you query the data (both endpoint URLs are sketched after this example).

To take a closer look at how this open data is built: I presented this once to the PyLadies Dublin chapter here in Ireland, so I took PyLadies as an example of an entity on the internet, and you will find PyLadies as a DBpedia resource. PyLadies has chapters in, say, New York, Dublin, and London; this is not an exhaustive list. The relationship of PyLadies with New York, Dublin, and London is location, so that's the simplest way this can look. Now, PyLadies is also an organization, so the type relationship comes in there. And like PyLadies, we have Women in AI, Women in Computing, and Women Who Code; all three are again entities, and they're also of type organization, so that's how these entities and PyLadies are related to each other. Women in AI is also very active in Dublin, so I can safely draw a line there and give it a location relationship with Dublin. Ideally, all three organizations would be connected to New York, Dublin, and London, and those three locations would also be connected to Women in AI, Women in Computing, and Women Who Code. If I joined all the dots, it would look like a really dense mesh, so I just wanted to keep it simple here.

It's not always the case that a relationship points to another entity; the object can also be a literal, just a simple string value. For example, for the PyLadies URL you'll find the website link for PyLadies, and that's basically going to be a string. Most Wikipedia pages also have an external-link relationship with other entities that are related to them but don't have any more specific relationship. And these links go both ways, because you'll find Python (programming language) linked on the PyLadies page and the PyLadies page linked from Python (programming language), so the arrow points both ways. This is a closer look at how a little fraction of that huge mesh of linked open data can look.
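[Editor's note] For reference, these are the public SPARQL endpoints of the two knowledge bases; the URLs are the ones in use at the time of writing, and both are shared, rate-limited services:

```python
# Public SPARQL endpoints; each also serves a web form where you can paste
# queries and pilot them interactively before putting them in code.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"
```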
So the question naturally comes: why linked data, why DBpedia or Wikidata? That they contain huge knowledge is a given, but with DBpedia and Wikidata, the way the data is stored actually retains the semantics of the information. That makes it rich for NLP applications where you need the semantics, where you need to understand what the data means. For example, if you've got PyLadies type organization and PyLadies location Dublin, you can naturally infer that PyLadies is an organization which is located, or has a chapter, in Dublin. The semantics of that information is right there. It's also easily and freely accessible, which is good for us because there's a Python wrapper available around it, and it's quite fast. So it's useful for data mining applications, for machine learning and NLP, and also for computer vision, as we saw.

How do we query this data? We use SPARQL queries. SPARQL is a semantic query language to retrieve and manipulate data stored in RDF format. The huge mesh you saw earlier is stored in RDF, a very simple triple-based data representation format. On the left-hand side I have the Resource Description Framework, which is a standard model for data interchange on the web, and a triple is represented as subject, predicate, and object. If Python is a type of programming language, then Python is the subject; the relationship between Python and programming language, type, is the predicate; and the object is programming language. Similarly, going back to the previous diagram, PyLadies would be a subject, the predicate would be type, and the object would be organization. PyLadies would also have a predicate called location, which is the relation, and the object would be Dublin. That's how the Resource Description Framework works.

This is a comparison of how a Wikipedia page and a resource page look and how they're related to each other, so you can browse them yourself. On the left you've got PyLadies with its introductory summary, and on the right-hand side you've got dbo:abstract, which is a property of the PyLadies DBpedia resource. The abstract holds the summary of PyLadies, and it's exactly the same text you have on the Wikipedia side. As for the dbo prefix: dbo is the DBpedia ontology namespace. The entities extracted from Wikipedia into DBpedia are modeled to have these particular properties consistently across most similar pages. dbo is basically a vocabulary; in object-oriented programming terms, you could say an object has some properties associated with it. These different vocabularies aggregate specific types of properties related to the kinds of things they address. For example, FOAF, the third point there, stands for "friend of a friend": it holds information about people, so it has foaf:firstName, foaf:lastName, gender, height, anything specific to the characteristics of human beings. So on the left-hand side what you're seeing is a web page, which has a URL, and on the right-hand side you're seeing a DBpedia resource, a linked open data resource, which is a URI, a Uniform Resource Identifier. That's not essentially a web page; it's basically a resource, the data instance itself.
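[Editor's note] To make the subject-predicate-object idea concrete, here is a toy sketch in plain Python; the triples are the ones from the PyLadies diagram, written as readable names rather than real DBpedia URIs:

```python
# Each fact is a (subject, predicate, object) triple.
triples = [
    ("PyLadies", "type", "Organization"),
    ("PyLadies", "location", "Dublin"),
    ("Women_in_AI", "type", "Organization"),
    ("Women_in_AI", "location", "Dublin"),
    ("PyLadies", "url", "https://pyladies.com"),  # the object can be a literal
]

# Querying is pattern matching over triples, which is exactly what SPARQL
# formalizes: "who is located in Dublin?" == (?s, location, Dublin).
print([s for s, p, o in triples if p == "location" and o == "Dublin"])
# ['PyLadies', 'Women_in_AI']
```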
If you look at dbo:wikiPageWikiLink, that's a property which contains all the different links present on this page: you can see Los Angeles is here, it's also mapped here, and mentorship is here, so you can find those related links as well. The last thing to notice is that you've got two versions of the abstract, one Spanish and one English, so it's helpful for multilingual NLP applications, or any application that's not in English.

So, how do you construct a SPARQL query? This is the most basic form of a SPARQL query; it's basically SELECT * in SQL terms. I'm selecting every subject, predicate, and object where at least one subject, one associated predicate, and one associated object are present. Every variable in a SPARQL query starts with a question mark, and here everything is a variable: all the subjects are bound to the variable ?subject, all the relationships to ?predicate, and all the related objects to ?object as the result of the query.

To narrow this down a bit: if I want to get just the entities and values connected by a relationship called label, I can use this query, where I select the entity as a variable, select the value as ?name, and require the relationship rdfs:label, which is, mostly, the string that is the heading of the page. That looks pretty simple.

And this is a slightly more advanced SPARQL query I wrote to get information on all the athletes on Wikipedia: their birthdays, their heights, their names, and the abstract, which summarizes who they are and what they do. Above the query I have the prefixes; consider those like importing the objects, importing the vocabulary. Here I'm saying: if this entity is of type athlete, get me its information. A semicolon means the next relationship also applies to the same entity, so for all the entities of type athlete, get me the birth date; then, using a semicolon again with dbo:height, if it has a height relationship, get me the height as well; then the name; then the abstract. You can also extend it with OPTIONAL. If I hadn't made dbo:country optional, the query would give me only the athletes that have a country associated with their resource and exclude the athletes that don't, which is why it's optional. Then I can use FILTER to filter the languages, and LIMIT and OFFSET in a similar way to how you'd use them in SQL. So that's a little example of how you can construct a SPARQL query to get data. You can get the data back in tabular format, in JSON, in XML; that's another advantage of SPARQL queries, you can obtain the data in whichever format you'd like.
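[Editor's note] The slide itself isn't reproduced in the transcript, so this is a plausible reconstruction of the athlete query from the description above; dbo:Athlete, dbo:birthDate, dbo:height, dbo:abstract, and dbo:country are DBpedia ontology terms, and the LIMIT/OFFSET values are arbitrary:

```python
ATHLETE_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?athlete ?name ?birthDate ?height ?abstract ?country
WHERE {
  ?athlete a dbo:Athlete ;             # entity of type athlete
           dbo:birthDate ?birthDate ;  # semicolon: same subject, next predicate
           dbo:height ?height ;
           rdfs:label ?name ;
           dbo:abstract ?abstract .
  OPTIONAL { ?athlete dbo:country ?country . }  # keep athletes with no country
  FILTER (lang(?name) = "en" && lang(?abstract) = "en")
}
LIMIT 100 OFFSET 0
"""
```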
Now I'll do a little demo of how to use SPARQLWrapper. The first thing we need is the SPARQLWrapper Python library: install it, import the relevant modules, and import pandas as well. This is the endpoint from which you can query the data, and you can actually pilot your queries there: you can execute your queries on the endpoint's web form and see what response you get. If you take a query directly from there, you might have to alter it a little, because I'm using Python strings here and may need some escape characters, so you need to take care of that. But you can execute SPARQL queries directly on the endpoint, check the response, and then plug the query into your code.

The first thing I need to do is initialize the wrapper with the endpoint. The next thing is to set the query, then set the return format; I'm using JSON because it's easier for me here. The next line executes the query, and then I use pandas' json_normalize to get the results into a DataFrame so I can work with them. As you can see, there are 18 results in total for Python, and they're in different languages, so if you want, you can use the FILTER keyword in the query, or the filtering capabilities of pandas.

Then I put the whole execution into a nice little function and tested other queries with it. Here I'm getting all the disambiguations for the Python resource page, because Python could mean anything: you get the same list of wiki page links that you got earlier from the wikipedia library, all listed here. Python could mean different things, the snake, the programming language, or anything else, and you can obtain that list from here. I selected Python, the programming language; you'll see I'm using escape characters for the brackets, which would not work in the endpoint's web form. I'm just taking the abstracts here, and you see there are 420 results, basically because the response contains multiple languages. At the end I filtered on the language of the abstract being English and the language of the label being English, and I got just one result, where the label value is English and the abstract, that is, the summary, is also English: a 1-by-6 DataFrame. So that's how you can query linked open data; it's very simple and very structured.

Now that we've seen how to use the wikipedia library and how to use SPARQLWrapper, we'll see how to connect these dots, build a corpus, and train a Word2Vec model on the localized, or specialized, data. I imported all the necessary Python libraries, used the same function I created in the previous notebook, and initialized the SPARQL endpoint. What I'm trying to do here is get hold of the Python programming language page. Like I said, the rdfs:label of a page gives you its heading or name.
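[Editor's note] Here is a condensed sketch of the end-to-end pipeline that the narration below walks through. It assumes SPARQLWrapper, wikipedia, nltk, gensim (4.x API), scikit-learn, and matplotlib are installed, and that NLTK's punkt and wordnet data have been downloaded. One simplification: the talk builds its page list from the links related to the Python page, whereas this sketch queries by type dbo:ProgrammingLanguage, so the counts will differ:

```python
import re

import matplotlib.pyplot as plt
import wikipedia
from gensim.models import Word2Vec
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.decomposition import PCA
from SPARQLWrapper import SPARQLWrapper, JSON

# 1. Ask DBpedia for the English labels of programming languages.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?name WHERE {
      ?lang a dbo:ProgrammingLanguage ;
            rdfs:label ?name .
      FILTER (lang(?name) = "en")
    } LIMIT 200
""")
sparql.setReturnFormat(JSON)
bindings = sparql.query().convert()["results"]["bindings"]
names = [b["name"]["value"] for b in bindings]

# 2. Load each label as a Wikipedia page and collect the article text.
corpus = []
for name in names:
    try:
        corpus.append(wikipedia.page(name, auto_suggest=False).content)
    except wikipedia.exceptions.WikipediaException:
        continue  # skip disambiguation pages and titles that don't resolve

# 3. Keep only letters/digits, lowercase, tokenize, lemmatize.
lemmatizer = WordNetLemmatizer()
sentences = [
    [lemmatizer.lemmatize(t)
     for t in word_tokenize(re.sub(r"[^a-z0-9 ]", " ", doc.lower()))]
    for doc in corpus
]

# 4. Train Word2Vec and inspect the neighbourhood of a word.
model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=5)
print(model.wv.most_similar("programming"))

# 5. Compress the vectors to two components with PCA and plot them.
words = model.wv.index_to_key[:100]  # 100 most frequent tokens
coords = PCA(n_components=2).fit_transform(model.wv[words])
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y), fontsize=7)
plt.show()
```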
If you check this data, the DBpedia label, say C++, matches the Wikipedia page title, apart from some escaped characters; if you query it, it gives you the same thing. So basically, the rdfs:label is the name you would search for and load as a wikipedia page. So I queried that, got the list, and extracted it: wikiPageList contains the list of all the pages I got. Then, with a for loop, I load the content for each of the pages in the list. Once I've got this, I have 223 page objects; just to visualize what it looks like, the first page alone has tons of data, and together I basically have the data for all programming languages, so I can say this is a corpus that contains information about different programming languages.

Then I did my own light form of text processing: I kept only the letters and digits, converted everything to lowercase, used NLTK's tokenize function to tokenize the data, and used a simple lemmatizing strategy to lemmatize it. I ran that over the data and then trained Gensim's Word2Vec model on it, and you can see how the model works. If I check the most similar words to "programming", I get "language", "computer", "implementations": these words tend to occur together or have some sort of similarity, and "programming" is most similar to "language". Then, to visualize it, I compressed the vectors into two components using principal component analysis and used matplotlib to plot them. You can see, for example, "link" and "external" go together, because "external link" occurs many times on a Wikipedia page, and "programming", "language", "compiler", and many of the programming languages cluster around here.

So that's a little introduction to how I can leverage the wikipedia Python library and how I can use linked data to create my own corpus, a specialized corpus, for NLP applications. This is done, and thank you, that was my talk. And I'm okay with time.

Awesome, that was a fantastic talk. I specifically loved the whole Word2Vec bit, and, me being an NLP nerd, I could already think about so many things; this is very exciting. So again, thank you so much for explaining the entire workflow. I think we have two questions, so I'm going to put this up on the screen. The first one is: what is the advantage of using RDF and SPARQL to store and query triples, instead of using a dedicated SQL table for each of the possible relations, the predicates?

So, if you had to store them in SQL tables, if you were doing it for your organization with an already set-up database: why would you want to invest time and energy in extracting the semantic information yourself? That's actually the work of thousands of researchers, who extract this semantic information and put it into proper structures. If you had to replicate that yourself and create your own SQL tables, I think that's going to take a good amount of work. And a SQL table is a tabular format; it's not meant for semantic information, so I'd push back on that. Does that make sense? I hope that answers the question of
whoever asked it.

Next up: do SPARQLWrapper and/or the wikipedia Python library support JSON-LD out of the box? I think so, I think so. Perfect; maybe we can double-check that in the breakout rooms next. I think those were the two questions. There was one question about the link for the notebook you were sharing, and I think someone found it. Yes, it's there: if you download the slides from EuroPython's website, the notebook links are there.

Perfect, so I will post this link over to the chat room, and all of you, feel free to go to the breakout room next and have a detailed conversation; you can set up a Jitsi chat and clear up any doubts about SPARQL and Wikipedia. All right, thank you so much. Thank you so much, bye-bye.