We're going to talk about using online data sources for looking at social science questions. Yulia and I will ask that you focus your attention, because it's a very detailed presentation with lots of information in it. But it does follow nicely from the two earlier presentations that we had about using online data and using social media data. I think a lot of us would like to use Twitter data and social media data. We can ask and answer, we think, lots of questions that are of interest to us as social scientists. Both Yulia and I are in politics, so if we're interested in candidate campaign messages, we can look at what candidates are saying on Twitter. It's good for events data. It's not so good for public opinion, for example, or for looking at national conversations. One of the questions that has been asked using online and social media data has been about ideological polarization in the public. How polarized are opinions about Brexit? How polarized are opinions on an ideological left-right continuum? Using Twitter data, some have found that the public has a highly segregated partisan structure, with limited connectivity between these different segregated communities. But what happens when we look at other online data sources? Twitter is about what's on Twitter, and we know, even just from the hands that were raised here, that it's a fairly selective audience that's on Twitter: by some estimates, about 10% of people actively use Twitter. So is that giving us a whole picture of the national conversation, or is it a very select element of the national conversation? When we look at other sources of online data, what has been found is that ideological segregation is low in absolute terms.
And this is from a paper by Gentzkow and Shapiro back in 2011 that uses a combination of social media, web browsing history, and analysis of news stories that are shared. When you look at all of those sources, the picture of ideological segregation is quite a bit different. So what we're going to do in this talk is cover different sorts of online sources of data that you can use, how to use them, and how to process them. What I'm going to do next is show you where we're going to end up in the talk, so you get a sense of where we are going in terms of what we're providing you for data processing and analysis. I'm not going to talk too much about this next slide; it's just a picture of where we are going. It shows you what happens when we start comparing different sources of online information, and what that tells us about different topics and different sorts of networks, okay? That's where we're going to end up in an hour, well, in about 56 minutes now. So what we'll cover in the talk: we're going to focus on the question of what sorts of information people are exposed to using different sources of data, and what that information exposure looks like, say, on social media compared to online traditional media, okay? That's our example, and we'll carry it throughout the talk. What we won't cover, but what was covered previously, which is very handy, is ethical considerations in terms of social media data; it's great that that was covered. And we're not going to cover anything in depth, okay? This is an overview, but you can go on to the blog and find more in-depth coverage. Another thing we won't cover is theory: we're not going to cover social science theory, methodological theory, or statistical theory.
This is, again, an overview about data and about the skills that you'll need to process and analyze the data, okay? This is our version of the flowchart of where we're going. The blue boxes are the sources of data. The green boxes are about processing those sources of data; URLs are again a source of data, but only after processing. And the orange circles are our results, the outputs from the analysis, okay? So that's our flowchart of where we are going. I'm going to start with survey and clickstream data, which is web browsing history data, and how we collect that, and then I'll hand over to Yulia to talk about it. In terms of survey data and clickstream data, the example that we talk about is information exposure: what sorts of information people are exposed to in this new web or social media environment. What we have are online survey data that we've collaborated with ICM Unlimited to collect. So we have measures of people's attitudes and behaviors, and also reports of what they're seeing and how they use social media and traditional media. Those were collected around Brexit, so the topic that we're looking at is Brexit, and we have about a thousand respondents in those survey data. We asked them various questions about their opinions about Brexit, but also about their social media behavior. Those respondents have agreed to install an app on one of their devices that collects where they go online, okay? So it may collect where they're going on Twitter, but it also collects wherever else they're going online: whether they're clicking on news sources, shopping sources, or entertainment sources. So it's collecting their whole web browsing history, and you'll all be familiar with web browsing histories.
You have your own web browsing history in your own browser; it allows you to click forward and click back. But that web browsing history data is also collected and may be used by your internet service provider as well. So about 900 people have agreed to have their web browsing history data collected, and that's about two weeks of data for each wave of the survey. Now, one of the things that we're concerned about as social scientists is whether or not these data are representative. Representative in terms of the population: are our individual respondents representative of the British population? But also, are we capturing what is representative of their online information exposure? Are we getting a representative sample of that, or is it selected in some way? You heard talk about how the Twitter data can be selected and how that sample can be censored. In a lot of cases, we're not sure how those data are censored. But in terms of our sample of respondents, we do know, because it is an online survey. We know that we have a younger sample on average, and we can compare it to the British Election Study, which is a face-to-face, in-person survey and is considered a sort of gold standard of election research. Our sample is younger than that sample, which is a national probability sample, and so we have fewer retired respondents. But it's similar in terms of gender and in terms of regions. So in some respects it reflects a representative sample, and in others it does not. Okay, so that's again a concern that we have as social scientists: are we looking at a representative sample? So that's two sources of data there. We've got their clickstream browsing history data, and then we've got survey responses, okay. Now, where do we get those? We worked with ICM to get those web browsing histories. But if you as a researcher are interested in how you get those for yourself at low cost, there are some sources for them.
One is a Bing extension that you can put on the browser; if you get people to agree to install that extension, it will allow you to access their web browsing histories. And there are other apps as well, such as Coscom. How you use them is detailed in papers that we've put up on our web blog for this particular course. So there are ways in which you can collect web browsing histories if you get the consent of respondents, okay. So again, those are two sources of data. Now, how do you clean those data? What do they look like? I'll give that to Yulia.
I have to apologize for my voice, I have a bit of a cold. Okay, so now the boring part. The idea here is that we want to show you an example of how you can collect these three types of data, the surveys, the clickstream data, and the social media data, then link them together, and then do some analysis that tells you more than you would have known by looking at these types of data separately. So this is where we're going with that. After collecting the survey and the clickstream data, we have to clean it, and this is not a trivial process. This is actually what the clickstream data looks like. It's a series of links. But these are not only links that people have actually clicked on. These are all the links that are loading in the background and that are being sent to the server. So among these links, you see lots of things such as ads, links to photos, links to videos, all sorts of widgets, Twitter and Facebook widgets, and so on. These are all in there. What we care about, however, is not that information. We care about the article itself. So this is what the page of an article looks like. Over here you have a map; the links that load together with the map are in blue over there. Over there are some links to other articles on the website, and you can see them in purple.
There is a widget that shows down there. So what we care about is the information in the article itself and not all these other links. We care about that red one over there, that link. The way that link shows up in our data is like this: in the middle of other sorts of links. Many of these are on the same domain, so they are still on the BBC web page. Many of these look extremely similar to the naked eye. However, the actual text of the article is hiding on the page that has that particular link. So how do we get from all those links that are being sent to the server to the actual news pages that we care about, that people are reading and extracting information from? In order to do that, we first compiled a list of UK news domains, and we did that by using Amazon Alexa Top Sites. Alexa Top Sites is a service by Amazon that ranks websites by popularity in all the countries in the world. It gives you 500 websites that are popular among users in a certain country. And the major advantage that it has over other providers is that it also does that by topic. So we could restrict the popular websites in the UK to those that were focusing on news, and we got 500 news websites. Out of those 500 news websites, we then did a manual cleaning: we read through them to see if they met our requirements. And then we restricted to 416 news domains. So we are working with these 416 news domains. Now, on these news domains, we need to identify where the articles are situated. To do that, we take advantage of the fact that most newspapers that are online, or other sources of news online, have a very clear database structure behind what you see on your screen. They hold articles in the same location, and you can identify the position of those articles by looking at the URLs and trying to figure out where they sit in that database.
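To make the idea of locating articles by their URL structure concrete, here is a minimal sketch in Python. It assumes that BBC article addresses take the form bbc.co.uk/news/, then a section or title, then an eight-digit number, as in the BBC example used in this talk. The exact pattern, the function name and the sample URLs are illustrative assumptions, not the expressions used in the project.

```python
import re

# Assumed pattern, based on the BBC example in this talk:
# bbc.co.uk/news/<section-or-title>-<eight digits>, optionally followed
# by a query string or fragment.
BBC_ARTICLE = re.compile(
    r"^https?://(?:www\.)?bbc\.co\.uk/news/[a-z0-9-]+-\d{8}(?:[?#].*)?$"
)

def is_article(url: str) -> bool:
    """Return True if the URL matches the assumed article pattern."""
    return BBC_ARTICLE.match(url) is not None

# Made-up clickstream rows: article requests mixed with background requests.
clickstream = [
    "http://www.bbc.co.uk/news/uk-politics-12345678",
    "http://www.bbc.co.uk/news/uk-politics",
    "http://static.bbc.co.uk/news/widget.js",
    "http://www.bbc.co.uk/news/uk-politics-12345678?ocid=socialflow_twitter",
]

# Keep only the rows that look like article pages.
articles = [u for u in clickstream if is_article(u)]
```

With several hundred domains, one such pattern per domain would be needed, so a dictionary mapping each domain to its compiled pattern is a natural next step.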
And so then we can write regular expressions, which are expressions that we write for extracting the exact pattern that we see in the URL. So for example, for the BBC website, we write a regular expression like this. The link, if you remember, looked like this: bbc.co.uk/news/uk-politics and then a number. So the general structure on the BBC website is going to be bbc.co.uk/news/, then the title of the article, followed by an eight-digit number. And this is what we are capturing with a regular expression. We can then restrict the clickstream data to links that follow that pattern. In the end, we end up with 26,000 unique news URLs in our clickstream data. So that's the clickstream data. The other major data source that we are using is Twitter data. You've already heard a presentation about that, so I'm not going to go into any details. We do have lots of materials on that on our website and on the GitHub page as well. I'm just going to provide an overview: collecting tweets and processing tweets, and how we did it in this particular project. So, collecting Twitter data. As you've heard before, the method that you choose depends on your programming skills, or your willingness to develop new programming skills, on the characteristics of the data that you want to collect, and also on your budget. You can use the Twitter APIs, and most researchers that are working intensively with Twitter data choose that method. It has the advantage that it's very flexible. You still have to follow the rate limits that Twitter imposes, so you cannot get all the data that you want in two seconds, but it is very flexible, and you can get a lot of data. The disadvantage is that it does require some programming skills.
However, there are many available packages in Python, such as Tweepy, or in R, such as twitteR, that can help you with that. We also provide some free scripts and tutorials on our GitHub page and on our website that go over how you collect Twitter data in Python. You can also use some commercial or free software. The advantage of that is that it's easy to access and easy to download the data. However, it's less flexible, and you may not get all the information that you need, or not in the format that you want for the analysis. And of course, depending on your budget, you can always purchase data from Twitter, which now owns Gnip, the company that was selling tweets until Twitter decided to cut the pipeline and purchase it. So that's convenient, but a bit expensive. Now, let's say you've collected the data. You have your Twitter data. What do you end up with? If you use a commercially available package, you may end up with a nice format, such as CSV. However, if you're working on an intensive project, the available tools are not good enough for you, and you want to collect your own data, like we did, then you end up with a JSON file, which is a way to structure data as dictionaries. You can structure hierarchical data that way, and it is very efficient for holding not only tweets but text data in general. So if you want to work with text data, most likely at some point you will end up using this format for storing your files. Now, the next question is where do you store all these data? It is quite big: we had 74 million tweets in our Brexit analysis. You need a way to store it and access it efficiently. We use MongoDB. You can also use relational databases, but non-relational databases such as MongoDB have the advantage of being able to handle this type of data much better. And then you have to extract the information. So you have this file over here. It looks like this.
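For illustration, a heavily trimmed, made-up tweet in this format might look like the one below. The field names (created_at, text, user, entities, urls, expanded_url) follow the tweet objects returned by Twitter's v1.1 API; all values, the short domain list and the helper names are invented for this sketch.

```python
import json
from urllib.parse import urlparse

# A made-up, heavily trimmed tweet. Field names follow Twitter's v1.1 API
# tweet objects; the values are invented.
raw = """
{
  "created_at": "Thu Jun 23 09:15:00 +0000 2016",
  "text": "Worth a read https://t.co/abc123",
  "user": {"screen_name": "example_user", "followers_count": 42},
  "entities": {
    "urls": [
      {"url": "https://t.co/abc123",
       "expanded_url": "http://www.bbc.co.uk/news/uk-politics-12345678"}
    ]
  }
}
"""

NEWS_DOMAINS = {"bbc.co.uk", "theguardian.com"}  # stand-in for the full list

def shared_links(tweet: dict) -> list:
    """Collect the expanded URLs listed under entities.urls."""
    return [u["expanded_url"] for u in tweet.get("entities", {}).get("urls", [])]

def domain(url: str) -> str:
    """Host name without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

tweet = json.loads(raw)  # the JSON text becomes a nested Python dictionary
news_links = [u for u in shared_links(tweet) if domain(u) in NEWS_DOMAINS]
```

The same two helper functions could be applied to each document read back from a database, since a stored tweet comes back as the same kind of nested dictionary.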
If you read through it, it has the structure of a dictionary. There are fields here that are the keys. One tells you when the tweet was created, and then it gives you the date. Then there's a field called entities, which gives you information about URLs and so on. You can extract the text. Of course, you can extract all sorts of user information. You can extract lists of friends and followers. And the thing that we are interested in at this stage is the links that were shared. In our data, we have 1.6 million tweets that link to news domains, so only to those 460 news domains that we've identified with Alexa. Okay, so we have URLs. We have URLs from clickstream data, and we have URLs from tweets: two major sources of data, clickstream and Twitter. What is the next step? Well, the next step is not directly getting the data from those pages that we know are articles; it's actually resolving the URLs. What do I mean by that? For example, you have a link over here that says goo.gl and then some numbers and letters. And then you have another one over here, a nice one, which says bbc.co.uk, news, politics, UK leaves the EU. Those two links point to the same page. It's the same thing. That first version is a shortened version of the second. You get the same thing when you write a tweet: by default, Twitter sometimes shortens links. The Twitter data that you can extract gives you a field called expanded URL. However, that field sometimes still holds URLs that were shortened, either by individual users or by other shortening services. Also, you can get links that have this structure over here, but then at the end there are all sorts of other indicators that show, oh, this link is coming from Twitter, or this was shared on Facebook. And so you want to make sure that all those URLs are reduced to exactly the same form. Why is that important?
Because we are working with very large numbers of links. We want to extract the content out of them, and it's very intensive in terms of computing power, and also costly, to extract the content from one million links if you know, for example, that 30% of them are going to be duplicates. So you want to make sure that all of your links are reduced to one main form: the thing that you see at the top of your browser when you go to that page. So how do we get that? We use Python packages that handle that, such as requests, urllib and urlclean. And we wrote a script, which we've also made available on our website, that shows you exactly how you do that. OK, so you've resolved the URLs. The next step is extracting the text, the title, and the author from the article pages. Now, you can do that in two ways. You can do it manually, as in you write your own script that scrapes those pages and extracts that information. There are tools for doing that: there's Python's Scrapy, which we use a lot, BeautifulSoup, or Selenium. Or there are tools that have already been written by someone else. Believe it or not, when you have 460 domains, so 460 websites, all with a different structure, it makes more sense to use a tool that has already been written and tested, and that you know for sure works very well, than to write your own program. And so we went for that. We took the easy way out on this one, and we used Diffbot, which is free for your first 10,000 URLs. OK, so we extracted the content out of those articles, out of the news stories that were either shared on Twitter or that people clicked on and read on their machines. The next thing that we need to do is turn all that text into numbers. So we have to process the text. We use standard text cleaning and natural language processing methods to do that. We turn everything to lower case, and remove punctuation and stop words. We stem the words.
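Going back to the step of reducing links to one main form: part of it can be sketched with the standard library alone. The sketch below only strips sharing tags, fragments and cosmetic differences; following the redirect of a shortened link needs a network request (for example with requests) and is not shown. The list of tracking parameters and the choice to force everything to http are assumptions made for this illustration, not the project's script.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed list of sharing/tracking parameters; adjust to what appears in your data.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"ocid", "fbclid", "ns_source", "ns_mchannel"}

def canonical(url: str) -> str:
    """Reduce a URL to one comparable form (no tracking tags, no fragment)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Keep only query parameters that are not known sharing tags.
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_NAMES and not k.startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    # Scheme is forced to http so that http and https variants collapse together.
    return urlunsplit(("http", host, path, urlencode(kept), ""))

links = [
    "http://www.bbc.co.uk/news/uk-politics-12345678",
    "https://bbc.co.uk/news/uk-politics-12345678?ocid=socialflow_twitter",
    "http://www.bbc.co.uk/news/uk-politics-12345678/#share",
]
unique = sorted({canonical(u) for u in links})  # duplicates collapse to one entry
```

Deduplicating on the canonical form before sending anything to a content extraction service is what saves the cost of processing the same page several times.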
We tokenize them, which is dividing them into tokens. We do some part-of-speech tagging. We extract n-grams. Then what can you do with this processed text? You can count keywords in it, which is an easy way of doing it. You can also measure distances and similarities. So for example, you want to know about the articles that people read on Brexit on the Telegraph website: how similar are they to the articles that they read on the Daily Mail website, versus the articles that they read on the Guardian website? You can do that. You can also do the thing that we are interested in right now in this presentation, which is topic extraction. We want to know the topics in these texts. To extract topics, we use topic models, and in this specific case we use latent Dirichlet allocation, or LDA, which is the most famous topic model, the one that everyone is using. The aim is to uncover hidden thematic structures in our documents. We want to know what the topics are, but these are hidden. We don't know what they are; we want to uncover them. The way LDA works is that it assumes that documents are a mixture of topics, and that topics generate words based on their probability distributions. In the model's generative story, you first determine the number of words in a document, then the mixture of topics in the document, and then, based on each topic's multinomial distribution, words are assigned to the document. Fitting the model works backwards from the observed words to recover the topics and the mixtures. OK, so there are multiple ways of doing topic modeling. You can use MALLET, which is a standalone program for doing that. We use Python's Gensim, which works pretty well. There is also quanteda in R, which does topic modeling very well. OK, so we had the URLs. There were two things that we were interested in with those URLs. First, the content behind the URLs. What is the actual text? What are we pulling out? What are the topics? But then we also want to know, what is the network behind those URLs?
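Returning briefly to the text processing steps: the cleaning, n-gram, keyword-count and similarity ideas described above can be sketched with the standard library. Stemming, part-of-speech tagging and topic models are left out because they need external packages. The stop word list, the sample sentences and the function names are invented for this illustration, and cosine similarity on word counts is only one of several possible similarity measures.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "on", "in", "and", "to", "is", "for"}  # tiny stand-in list

def tokens(text: str) -> list:
    """Lower-case, drop punctuation, split into words, remove stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def bigrams(words: list) -> list:
    """Adjacent word pairs (n-grams with n = 2)."""
    return list(zip(words, words[1:]))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented snippets standing in for article text.
doc_a = "The vote on the EU: Brexit and the economy."
doc_b = "Brexit vote, the economy and trade."
doc_c = "Football results for the weekend."

bag_a, bag_b, bag_c = (Counter(tokens(d)) for d in (doc_a, doc_b, doc_c))
brexit_mentions = bag_a["brexit"]   # simple keyword count
sim_ab = cosine(bag_a, bag_b)       # two texts on the same subject
sim_ac = cosine(bag_a, bag_c)       # unrelated texts
```

In this toy example the two Brexit snippets share most of their words, while the football snippet shares none with them, which is the kind of contrast a similarity measure is meant to pick up.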
Can we think of it in terms of a network of domains, or a network of articles, or a network of users? And the question is, why would we like to model this as a network? Not just because it's very popular, but also because we believe that there are interdependencies between observations. We want to model those interdependencies, which we cannot do with standard statistical approaches. We care about patterns of information exposure. And also, we want to see if we are able to detect communities and echo chambers. So this is why we go for network analysis. Now, what is a network? A network, and you have an example over here, is a mathematical construct of nodes and edges. You can have information both about the nodes and about the edges. In the example that we are working on right now, the nodes are domains: those 460 domains that we extracted initially. And a link between two domains is formed if a user reads articles on both of these domains, either in the clickstream data or in the Twitter data, and we keep those two sources separate for the analysis. We have another version of this that looks at the actual articles themselves. In that situation the nodes are articles, and the links between articles are, again, formed if the same user has read both of those articles. So now, data linking. We have all these three data sources. Most of the time in the social sciences, when you think about data linking, you think of data linking at the individual level, at the user level, and we think about it in this way as well. So for example, in the survey and the clickstream data, we link data at the user level. We are looking at individuals that answered our survey, but who were also in our clickstream panel. So that's the individual level.
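The domain network described above, with domains as nodes and a link whenever the same user read articles on both domains, can be sketched as follows. The visit records are made up, the edge weight (number of shared users) and the density measure are choices made for this illustration, and in practice one such table would be built per data source.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Made-up (user, domain) visits; in practice one table per source.
visits = [
    ("u1", "bbc.co.uk"), ("u1", "theguardian.com"),
    ("u2", "bbc.co.uk"), ("u2", "dailymail.co.uk"),
    ("u3", "bbc.co.uk"), ("u3", "theguardian.com"), ("u3", "dailymail.co.uk"),
    ("u4", "telegraph.co.uk"),
]

def domain_network(rows):
    """Edge weights: number of users who visited both domains of a pair."""
    by_user = defaultdict(set)
    for user, dom in rows:
        by_user[user].add(dom)
    edges = Counter()
    for doms in by_user.values():
        # Every pair of domains visited by the same user gets a link.
        for pair in combinations(sorted(doms), 2):
            edges[pair] += 1
    return edges

edges = domain_network(visits)
nodes = {d for _, d in visits}
# Density: share of possible domain pairs that are actually linked.
density = len(edges) / (len(nodes) * (len(nodes) - 1) / 2)
```

Swapping domains for article URLs in the visit records gives the article-level version of the same construction.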
However, you can also think about linking data at different levels, such as the domain level or the article level, if what you are interested in is the information that is generated on the domain or the information that the article is giving, and then how that is linked among different sources or among different networks. So the clickstream and the social media data we link at the domain and the article level. And across all three sources, survey, clickstream, and social media data, we can also link at the domain level and the article level. OK, so I'm going to hand it over to Susan.
What I'll do, after all that processing of the data, is show you some of the results that allow us to compare across the different sources of data, the different platforms, and the different ways that people are exposed to information. We'll first look at the first way of linking the data, which is at the individual level, comparing the survey to the clickstream data in terms of URLs. Then we'll look at the topics, comparing topics on Twitter versus in the clickstream data. And then we'll look at the networks of domains: the networks that are formed on Twitter versus the networks that we can see developing in the clickstream data. So first, we'll look at what sorts of analyses we're able to do by looking at the survey data linked to the extracted URLs. There are different ways in which we can process these URLs. We can look at the domains of those URLs and count up how many different domains there are. We can also classify the URLs as coming from news sources that are on the left or on the right. So that's some additional processing of the data that we can do. We can also do keyword searches within the long URLs that have the titles of the news stories.
So what we can do is, say, for example, tell how many times an individual in our survey clicked on a particular news source, for example the BBC. And we can then divide those survey respondents up by various sociodemographic characteristics or by their preferences, say, on Brexit. So that's the first bit of analysis. And this is what it looks like. I actually did my analysis not in Python or in R, but in Stata, and this is just a chunk of what the data looks like in Stata. What it has is the username, so the survey respondent ID. It has the time at which they clicked on the URL. It has the domain of the URL, and then it has the long URL. So that's simply the data that we were looking at in terms of their clickstream use. And then we can analyze it or aggregate it in various ways. So again, what I did from those various URLs was create various quantities of interest. The first bit was just to look at what was the most clicked-on site in all of the browsing history data that we had around Brexit. And it is the BBC. That was the most clicked-on site for news and information during the Brexit campaign. This confirms what the few articles that compare social media and other online use of information have found. And these are mostly direct clicks. So people are not getting this from social media platforms. It's not being shared to them. It's not being created or generated from their networks of friends. They're clicking directly on these links. Now, these are the top, I think there's 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, top 10 domains that were clicked on during the Brexit campaign. I left this in as a reminder that we have to be very careful when processing these data to ask ourselves the same questions that we always ask when we look at any source of data. Do these data make sense?
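The analysis here was done in Stata; purely as an illustration of the same kind of aggregation, the sketch below uses Python on made-up rows in the layout just described (respondent ID, click time, domain, long URL). Alongside the click ranking it counts distinct users per domain, which is one cheap way to ask whether a ranking makes sense. All names and values are invented.

```python
from collections import Counter

# Made-up rows in the layout described above:
# (respondent_id, timestamp, domain, long URL). Values are invented.
rows = [
    ("r01", "2016-06-20 08:01", "bbc.co.uk", "http://www.bbc.co.uk/news/uk-politics-12345678"),
    ("r02", "2016-06-20 08:05", "bbc.co.uk", "http://www.bbc.co.uk/news/business-23456789"),
    ("r02", "2016-06-20 09:10", "theguardian.com", "https://www.theguardian.com/politics/eu-vote"),
    ("r03", "2016-06-20 09:11", "localpaper.example", "http://localpaper.example/news/a"),
    ("r03", "2016-06-20 09:12", "localpaper.example", "http://localpaper.example/news/a"),
    ("r03", "2016-06-20 09:13", "localpaper.example", "http://localpaper.example/news/a"),
    ("r03", "2016-06-20 09:14", "localpaper.example", "http://localpaper.example/news/a"),
]

# Total clicks per domain, and how many distinct respondents produced them.
clicks_by_domain = Counter(dom for _, _, dom, _ in rows)
users_by_domain = {
    dom: len({rid for rid, _, d, _ in rows if d == dom}) for dom in clicks_by_domain
}

def top_user_share(dom: str) -> float:
    """Share of a domain's clicks that come from its single busiest user."""
    per_user = Counter(rid for rid, _, d, _ in rows if d == dom)
    return max(per_user.values()) / sum(per_user.values())

ranking = clicks_by_domain.most_common()
```

In this toy table the top-ranked domain owes all of its clicks to one respondent, so a domain with a high click count but a top-user share close to 1 is worth a second look before it is reported.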
Is there some sort of problem with the way that we're measuring the data? And there was a problem that we noticed in the data. Why would the Manchester Evening News be the second most frequently clicked-on news source? Is there some sort of problem with the representativeness of our sample? Do we have lots of people from Manchester in our sample? It wasn't that. It happened to be one user who was showing up in the data as having clicked on the Manchester Evening News about 3,000 times. So, again, there was a problem in the data. You have to be careful. Does it make sense? I think that's something that we always do as good social scientists when we're looking through the data. That initially didn't make sense, but once you clean the data, then it makes sense. Another thing that we looked at, and this was important for me as somebody who is interested in media exposure and whether we are measuring media exposure properly, is whether self-reports of news exposure are consistent with what we're seeing in actual behavior. Linking the clickstream or browsing history data to self-reports of news exposure allows us to see whether or not we're capturing the right thing when we ask people, how often did you look at news about the EU in the last week? So we asked that in the survey: a question about the days that you spent last week looking at news online about the EU. We also asked them, did you look on social media, et cetera? But then we also have the number of stories that they clicked on that had EU, or some keyword about the EU, in the title. And what we see is that, yes, this is a validation that in some ways our data are consistent and are representing what we think they're representing.
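The comparison of self-reports with clicks can be sketched as follows: count, for each respondent, the clicked URLs whose title part contains an EU keyword, then average those counts within each survey answer. The keyword list, the answer categories, the respondent IDs and the URLs are all assumptions for this illustration; the talk only says that the title contained EU or some keyword about the EU.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed keyword list. The keyword must not be glued to other letters,
# so that "euro" or "museum" do not count as "eu".
EU_PATTERN = re.compile(r"(?:^|[^a-z])(eu|brexit|referendum)(?:[^a-z]|$)")

def is_eu_story(url: str) -> bool:
    """True if the URL contains one of the assumed EU keywords as a word."""
    return EU_PATTERN.search(url.lower()) is not None

# Invented survey answers and clicked URLs per respondent.
self_report = {"r01": "every day", "r02": "once or twice", "r03": "not at all"}
clicked = {
    "r01": ["http://example.org/news/eu-referendum-live",
            "http://example.org/news/brexit-vote",
            "http://example.org/sport/euro-2016"],
    "r02": ["http://example.org/news/brexit-markets"],
    "r03": ["http://example.org/news/museum-opening"],
}

# Behavioural measure: number of EU-related stories each respondent clicked.
eu_clicks = {rid: sum(is_eu_story(u) for u in urls) for rid, urls in clicked.items()}

# Link to the survey at the respondent level and average within each answer.
by_answer = defaultdict(list)
for rid, answer in self_report.items():
    by_answer[answer].append(eu_clicks.get(rid, 0))
mean_clicks = {answer: mean(counts) for answer, counts in by_answer.items()}
```

If the self-reports are capturing real behaviour, the averages should fall as the reported frequency falls, which is the pattern to look for in the real data.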
So amongst those people who said that they looked at news online about the EU every day last week, they clicked on more stories about the EU than those who said most days; those who said most days clicked on more stories than those who said once or twice; and those clicked on more than those who said not at all. And then you wonder: those people who said not at all still clicked on some stories about the EU. Okay, so that's a comparison of self-reports to their actual behavior online. And then finally, this is looking at a coding of the URLs by whether they are from left- or right-leaning news sources, or from broadcasters. The green is BBC or broadcast URLs. The blue is URLs from Conservative-leaning papers, and the red is URLs from Labour-leaning papers. We can talk about how that was done afterwards in the Q&A if you want. But again, remember, we have their preferences for leave or remain. So we see amongst those who said they were in favor of remaining in the EU, this is their diet of online news exposure: mostly the BBC. Those who said leave were less likely to click on the BBC as a news source, and more likely to click on Conservative-leaning papers. But again, I think the question here is, does this look like the ideological segregation that we were led to believe online exposure would lead to? Okay, so that's the linking of surveys and URLs. And then, in terms of what Yulia was talking about, the extracted text and the topic modeling that she did, what does that tell us when we look at topics of stories across the clickstream and across Twitter? So these would be the stories that people clicked on in the clickstream, which could have been shared as well, versus stories that were shared on Twitter. So what does that look like in terms of topics?
Do we see the same sorts of topics and the same ranking of topics across these two different sources of information? What do you think? Yes, same topics? What might be some of the processes that lead to them being the same or different? It's late in the day. Who wants to put money on it? Who buys the first round? Who gets it right? Okay. Well, you can imagine, with all the build-up, they must be different, right? Okay, but here we have what the topics look like in the corpus of text. If you've pulled the slides up on your computers, you will see, because the text is really small, that here's the label of the topic, and then the keywords that are linked to that particular topic, okay? So that is what the keywords look like that are linked to the various topics. And next, what does that look like in terms of comparing the topics that were found in Twitter, which is the red dot, versus the topics that were found in our clickstream URLs? So here we see the Conservative topic, which, if you look at the keywords, is really stories about, you might read it as, the internal politics of leave and remain within the Conservative Party, and a lot on Boris Johnson comes up there, et cetera. What you see is that those types of stories were clicked on more in the clickstream than they were shared on Twitter, okay? It's a difference. Now, when you get to the topic labelled treasury, these are stories that were really about Osborne, about the market, about the economic impact of Brexit. Again, you see them in the clickstream more so than on Twitter, okay? But when you look at here, oh, I can't even read that, it is trade, the economy, stock markets, thank you. Thank you for reading that. Those are more likely to be on Twitter than in the clickstream, okay? Think about that.
And things like easy, sort of more, I want to call it, softer news, it was sort of about the history, not softer, but history, voting, again, in the clickstream more so than on Twitter. Think about how those topics differ and why they might differ across those different platforms, okay? But they do differ. Some are similar, but these up here differ, okay? So that's content, that's comparing the content and the topics from those news stories, okay?
Finally, as Yulia explained, from the URLs, we extracted the domains and we're able to compare networks of domains. So you've got the nodes being those domains and then how they're linked together by the users, okay? And whether, or how, the users then link two domains, okay? So this is the network of domains, a bipartite network, looking at domains and then the users on Twitter and then in the clickstream, okay? So what are some of the things that you might take away from that? It's really hard, I think, in these networks. Okay, yes, one is blue and one is red. One looks what? More dense, dense, compressed. This looks wider and more dispersed and more varied, okay? And that pretty much confirms what we would expect: more diverse or varied networks, not as densely concentrated in terms of how these domains are linked together. We can come back and discuss those in the Q&A, but that's just to get to those points of the different types of analyses that you can do across these different sources of data and, again, comparing across social media and other sources of online data, okay?
And this is just a reminder to us all that we are still social scientists and we are still interested in things like representativeness, as we've talked about, measurement error, and causal inference. With what certainty, and how, can we conclude that social media leads to segregation in news exposure? You know, to what extent are there echo chambers, and are these driven by social media, by Twitter?
You can also look at political mobilization or at social movements. Is it the case that you can tell that social media is now driving social movements and political protests? Can you tell that just from Twitter data? You know, with what certainty are you willing to draw that conclusion? So we're still concerned about representativeness in terms of people, in terms of opinions, in terms of information and data points, whether those are tweets or clickstream data. We're still concerned about measurement error. If we're measuring online exposure to information, are we capturing it in the right way? If we're measuring topics in stories, is LDA the most appropriate method to use? And ultimately we're also interested in causal inference. You know, is the platform causing segregation, ideological segregation? Or is it still self-selection, which we know is a process that's been around for a long time, self-selecting into being exposed to certain information, and confirmation biases? So we want to be careful. I mean, I think we're still concerned about these, and we still go back and ask ourselves these questions at each stage of the process. But I would argue, and I think Yulia would agree, that using multiple sources of linked data is one way to approach these questions about representativeness and measurement error and causal inference when using online sources of data.
And finally, just to point out, we went through a lot of information, and I think we're still under an hour. Are we still under an hour? Way under an hour. Gone quickly. There is an annotated bibliography for reading available on our website. But also what I'll show you, because there's time, or Yulia can show you, are the Jupyter notebooks that were created, which are in the GitHub repository and have the code for how Yulia processed the data and created the topic models as well as the networks.
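For anyone reading along without the notebooks, here is a minimal sketch of the kind of domain network comparison described earlier. It is an illustration only, not the notebook code. It assumes the data are simple (user, domain) pairs, links two domains whenever at least one user visited or shared both, and uses plain density as the summary measure. The function names, the `.example` domains, and the two toy sources are all made up.

```python
from itertools import combinations


def domain_projection(visits):
    """Project a bipartite user-domain network onto its domains.

    visits: iterable of (user, domain) pairs.
    Returns (nodes, edges), where an undirected edge joins two domains
    that share at least one user.
    """
    domains_by_user = {}
    for user, domain in visits:
        domains_by_user.setdefault(user, set()).add(domain)

    nodes = set()
    edges = set()
    for domains in domains_by_user.values():
        nodes |= domains
        # Sorting gives each undirected edge a single canonical form.
        for a, b in combinations(sorted(domains), 2):
            edges.add((a, b))
    return nodes, edges


def density(nodes, edges):
    """Share of possible domain pairs that are actually linked."""
    n = len(nodes)
    if n < 2:
        return 0.0
    return len(edges) / (n * (n - 1) / 2)


# Toy, made-up data; the numbers say nothing about the real results.
source_a = [
    ("u1", "news-a.example"), ("u1", "news-b.example"),
    ("u2", "news-a.example"), ("u2", "news-b.example"),
    ("u3", "news-b.example"), ("u3", "news-c.example"),
]
source_b = [
    ("u1", "news-a.example"), ("u1", "news-d.example"),
    ("u2", "news-b.example"), ("u3", "news-c.example"),
]

for name, visits in [("source_a", source_a), ("source_b", source_b)]:
    nodes, edges = domain_projection(visits)
    print(name, "domains:", len(nodes), "links:", len(edges),
          "density:", round(density(nodes, edges), 3))
```

Density is only one possible summary. A weighted projection that counts how many users two domains share would keep more information, at the cost of a slightly more involved comparison.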