Today, we're lucky to have Ali Hashmi speak to us. Ali, other than being a friend, is also a researcher at the Center for Civic Media. He's particularly interested in three things and the intersection of them: language, social identity, and power, and specifically the inequities of that power, and especially how that plays out in the journalism space. And that's what he's going to talk to us about today. So give a warm welcome to Ali.

Thank you, Sands, for the kind introduction. Because the audience is quite diverse, and my presentation touches upon disparate subjects like machine learning, critical theory, and journalism, I'm going to leave the more technical questions for the Q&A session, if that's OK with all of you. The title of my presentation today is Ideology and Text: Classifying and Analyzing Discourse Using Machine Learning.

So what is discourse? The term is used in a broad sense in a variety of contexts, but I'm using it in the Foucauldian sense. Foucault sees discourse as a set of statements which regulate and define disciplines, objects, and practices for a given epistemic space. In other words, discourse is not produced in a power vacuum. It perpetuates power relations, it drives policy, and it accounts for wider sets of attitudes and behaviors. More importantly, discourse constructs the objects that we use to understand and define our world. I'll show you this clipping from 1907 from Delhi, Oklahoma. This is from the Jim Crow days. What you see here is a cartoon and text that reflect the wider beliefs of the society of the time. The text, just like the cartoon, is a representation that defines the object it is speaking about. The idea of representation is really important in the Foucauldian idea of discourse. Now, I'll bring this example into our world.
So how would we understand discourse in our context? I'm using the Marilyn Diptych by Warhol as an analog for discourse in our world. As you see in the diptych, the actual image barely exists; you don't have the original image. What you have is a series of reproductions of the original image, and it is those reproductions which together constitute the image as a whole. So there is an absence of the original image, but through that series of reproductions, we are able to picture it. Text operates in much the same way in our world. It is not important whether something is true or not; what is important is the narrative it generates. And once you say something often enough, once you repeat it, it becomes truth. This is the truth effect.

So I'll go through some examples of how discourse operates. We all know how African-Americans are linked with crime: crime is more strongly linked with the social identity of African-Americans than with socioeconomic factors. And this becomes an organizing principle for our society, and for effects like segregation, policy bias, and racial stereotyping. The next picture shows you how single mothers are talked about in media. And then you have another headline, "BBC puts Muslims before you", which contextualizes Muslims as the other. So these are some examples. The next one is quite a pernicious one: the classic case of Iraq, where you see Powell's speech at the UN spawning a discourse that drove a narrative for war in global media sources. And then the Rotherham scandal that happened in the UK, where the onus of the actions of a few individuals was transferred to the entire community.
And you can see the terms that are being used there, for example "several sex attackers from Muslim communities". The entire community becomes responsible as a result of this particular discourse. These are some examples of how discourse operates in our world.

Now, there is a field called critical discourse analysis, which builds upon the work of Foucault and Gramsci. Its aim is to understand the link between the social and the text: it attempts to demystify power and social construction by understanding the link between ideology and text. It is very much a hypothesis-driven approach, because you are really trying to establish that link. However, there are some issues with critical discourse analysis. One critique is that researchers using this technique cherry-pick small amounts of data to support their preconceived ideologies. The other is that a single text by itself is quite insignificant, considering the rapid explosion of text all around us. There are 571 new websites created every minute; this is a statistic from 2014. It is difficult to make sense of what the discourse is without taking into account large amounts of data.

So how do we resolve this? There is another field, corpus linguistics, which is an agnostic way of understanding how language is produced in large amounts of data. Unlike critical discourse analysis, where we try to establish the relationship between ideology and text, corpus linguistics is focused on how language is produced and what patterns are out there. We don't go in with a hypothesis in this approach.
So as part of my framework, I have aimed to combine critical discourse analysis with corpus linguistics, using an analytical framework that informs the defining features, metrics, and design requirements for a tool that will machine-classify and analyze discourse. On the right side you have critical discourse analysis, which establishes the link between ideology and text, and you have corpus linguistics, which is an agnostic way of looking at the text. At their conjunction you have corpus-assisted critical discourse analysis. This is the framework I'm working with.

As an instance of this framework, I looked at the coverage of Islam in global mainstream media. My work was informed by two key thinkers: Edward Said and Samuel P. Huntington. Edward Said sees the coverage of Islam as marked by highly exaggerated stereotyping and belligerent hostility. On the other hand, Samuel P. Huntington sees the conflict between the so-called West and the so-called East in civilizational terms. The Huntington trope has contributed significantly to the development and deepening of fault lines between Muslim cultures in the Middle East and European and North American societies. That is a very dangerous theme, and I think it is the same theme that is picked up by extremist organizations like ISIS; they work within the same framework. That is why it's very important for us to understand the discourse that is generated as a result of Huntington's philosophy.

I looked at four different hypotheses. The first one is quite obvious: that 9/11 has influenced the framing of discourse on Islam in the US and, of course, in the world at large.
The second is that mainstream media largely portrays Muslims as a homogeneous group embroiled in conflict, either as aggressors or as victims, but largely as the former. The third is that Muslims are portrayed as the other, using the binary lens of us and them, and framed within Huntington's clash-of-civilizations frame. And the fourth is that Muslims are connected more strongly with terrorism than other religious groups. To implement the tool, I focused on the second hypothesis, that media discourse portrays Muslims as a homogeneous population embroiled in conflict. I picked this one because I thought it conflates the four hypotheses into one. So again, to recap: we have this hypothesis that we're working on, and then we use critical discourse analysis and corpus linguistics to come up with an analyzer and classifier tool.

Before we go into the actual implementation, let's discuss what I mean by conflict, because conflict is a very broad term. For the sake of implementation, I defined conflict as something that is characterized by extremism, force, or militancy. If there is, for example, an ongoing struggle between two groups which is non-militant, I do not characterize it as conflict in my definition. This is something we have to keep in mind; it was the basis for developing this particular tool.

I think most of you are familiar with Media Cloud. My tool is built on top of Media Cloud. For those of you who do not know what Media Cloud is, it is a joint venture of the Center for Civic Media and the Berkman Center. As part of the project, we have been curating more than 50,000 sources on a daily basis for the last nine or ten years.
That gives us open-source access to these media corpora to build tools upon. I'm using Media Cloud to extract articles about Islam using the Media Cloud query builder. How would I build a query to extract articles pertaining to Islam? I look for terms that are usually associated with Islam: Quran, hijab, Allah, all those terms. And I intentionally avoid terms related to, for example, extremism, because the whole idea is that we want to extract articles that generally deal with Islam. So this is a very high-level view of the structure. We then use a machine learning classifier to classify articles on the basis of conflict and non-conflict.

Now, most of the people here may not be familiar with machine learning, so I will just go through how the framework works. In supervised machine learning, you provide a labeled sample of articles, and that labeled training data is used to produce an inferred function. For example, you have articles which are classified as conflict and non-conflict; they are fed into the machine learning algorithm, which produces a function that allows us to map new examples, so that we can predict whether a new article is about conflict or not. One problem with this approach is that we need to code the articles: someone has to say, this is a conflict article and this is a non-conflict article. Because this tool is automated and runs on a daily basis, we had to automate certain parts of it. So we came up with a technique for labeling an article as a conflict article on the basis of conflict terms.
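As a rough illustration of the query-building step, here is a sketch of assembling a Boolean OR query over Islam-associated terms while excluding extremism-related ones. The term lists and the helper function are hypothetical; the actual tool uses the Media Cloud query builder.

```python
# Hypothetical sketch of the query-building step: combine terms commonly
# associated with Islam into one Boolean OR query, while deliberately
# leaving out extremism-related terms so the sample stays topic-general.
ISLAM_TERMS = ["islam", "muslim", "quran", "hijab", "allah", "imam", "mosque"]
EXCLUDED = {"extremism", "terrorism"}  # intentionally NOT part of the query

def build_query(terms):
    """Build a Solr-style Boolean query string from a list of terms."""
    kept = [t for t in terms if t not in EXCLUDED]
    return " OR ".join(kept)

query = build_query(ISLAM_TERMS)
print(query)  # islam OR muslim OR quran OR hijab OR allah OR imam OR mosque
```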
WordNet is a tool through which you can extract semantic networks of terms. For example, if you extract the semantic network of the term "conflict", you get terms like war, combat, force. I extracted that set. Then I go through all the articles, and if any conflict-related term is present, I classify the article as conflict; if the terms from the set are absent, I classify it as non-conflict. That, in a nutshell, is what's happening.

I have a slide on the heuristics. I won't go into the actual algorithm itself, but I'll briefly touch upon it. The goal is to optimize the classifier by iteratively increasing the number of conflict terms until we get a high level of accuracy on the existing data. For example, we will increase the conflict terms until we hit the maximum cross-validation score for the given amount of data. Once that is done, we serialize the classifier, which allows us to use it to make predictions on new data.

Now, because this is based on an automatically generated data set, how do we know that it is classifying conflict appropriately? For that, we developed a gold standard. In other words, we manually annotated 103 articles. Five different users coded these articles: they looked at the articles and classified them as conflict or non-conflict. The first goal was to ascertain whether there was inter-user agreement, because it's possible that my definition of conflict could be different from Sands' definition of conflict. And we found that there was a very high degree of agreement among the users.
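The term-presence labeling step just described can be sketched in a few lines. The seed set below is a hypothetical stand-in; in the actual tool the terms come from WordNet's semantic network for "conflict".

```python
import re

# Hypothetical seed set; the talk derives these terms from WordNet's
# semantic network for the term "conflict".
CONFLICT_TERMS = {"war", "combat", "force", "militancy", "insurgency", "battle"}

def label_article(text):
    """Label an article 'conflict' if any conflict term occurs in it,
    'non-conflict' if no term from the set is present."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return "conflict" if words & CONFLICT_TERMS else "non-conflict"
```

These automatically generated labels then serve as the training sample for the classifier described above.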
There's a statistic that is used to measure that: our kappa statistic indicates almost perfect agreement among the users. The users who created this set came from diverse backgrounds; they just looked at the text and classified it on the basis of conflict and non-conflict. Once we had the truth labels for all the articles, we evaluated our classifier against those truth labels. We have the human classification results, which classify an article as conflict or non-conflict, and we matched them against the machine classification results. For this particular study, we found a very high degree of accuracy for our classifier: 99%. That is actually very high, because this is a custom classifier we're building. Standard classifiers, for example on a topic like health or sports, are easy to construct, but when you're building custom classifiers the degree of accuracy is typically quite low. Remarkably, the heuristic was able to come up with a very high degree of accuracy.

The users? Well, that is subject to interpretation, because the kappa statistic's results can be interpreted in different ways. But by and large, there was a high degree of agreement, which means, for example, that out of 103 articles, we had consensus on, let's say, 98 of them. Just to give you an example.

This particular classifier is used to create an index, called the Said-Huntington index, which measures the polarity of conflict in an article. 100 means maximum conflict, and zero means minimum conflict. There are different weights assigned to that index.
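For two coders, the chance-corrected agreement mentioned above can be computed as Cohen's kappa; a minimal sketch follows. (With five coders, as in the study, Fleiss' kappa generalizes the same idea; the label sequences here are made up.)

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two raters' label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from label frequencies.
    """
    assert len(a) == len(b) and a
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe)
```

Values near 1 indicate almost perfect agreement; values near 0 indicate agreement no better than chance.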
First, we use the classifier to a priori determine whether the document has conflict or not; that has 50% weight. Second, I use the conflict terms themselves: because conflict is typified by certain terms, we give more weight to those terms, and 30% of the score comes from that. Finally, because a lot of studies have indicated that nation states are at the root of conflict, we assume that the presence of terms denoting nation states may indicate the likelihood of conflict between nation states, and we assign 20% weight to that. We can of course alter these weights, but largely the index is driven by the classifier.

Now I'll show you the actual page. This is the actual tool. In the tool, you can go through different media sources, and what it shows you, for a given media source, is the percentage of articles that were classified as conflict. For example, for Al Jazeera, 93% of the articles that deal with Islam were about conflict, which is quite remarkable, because the aim of Al Jazeera was to introduce to the Arab world reporting that is distant from propaganda. That was the whole aim, but they end up reporting on conflict in 93% of their articles on Islam. That was quite an interesting finding. Below that is the Times of India, and that was again an interesting insight: only 37% of its articles were classified as conflict, which is low. Because there is an ongoing situation between Pakistan and India, we expected the number to be higher, so that immediately pops out. This is the conflict scorecard that we produced as part of the study. You will see that when the majority of global media sources are reporting on Islam, they're predominantly reporting about conflict.
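The 50/30/20 weighting just described can be sketched as a simple score. The talk gives only the weights, not the exact way the three signals are combined, so treating the term signals as densities in [0, 1] is an assumption here.

```python
def said_huntington_index(is_conflict, conflict_term_ratio, nation_state_ratio):
    """Score an article from 0 (minimum conflict) to 100 (maximum conflict).

    Assumed combination: 50% from the classifier's verdict, 30% from
    conflict-term density, 20% from nation-state-term density, where the
    two density ratios are normalized to [0, 1].
    """
    return round(50.0 * (1.0 if is_conflict else 0.0)
                 + 30.0 * conflict_term_ratio
                 + 20.0 * nation_state_ratio)
```

An article the classifier flags as conflict starts at 50 and climbs toward 100 as conflict and nation-state terms become denser.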
So the tool confirms the hypothesis that we were aiming to deconstruct or establish. I'm not going to go into all the features of the tool, but for example, you can look into the conflict map for a particular news source, and you can also look into the actual articles and how they were classified, et cetera.

Another question we had in mind was to understand the underlying discourse for a given media source. For example, when the Times of India is talking about Islam, what different themes are present in the discourse? We used topic modeling to extract the topical frames or themes that constitute the Times of India articles. We used an algorithm called Latent Dirichlet allocation, which intuitively works like this: each article can have multiple topics. For example, if there is a sports article in which a player gets injured, you will have two different topics in there. One topic will deal with sports, with terms like scores, ground, player, and another set of terms will deal with the topic of health: injured, health, hospital. You are able to lump these clusters together and extract the topics. We ended up using five topics for each media source. That is an arbitrary number; you can have more than five topics as well, but we thought that was the number users were most comfortable with when looking at the data. What you see over there is a dashboard, which shows you, for all the given media sources, the topical frames or themes that are there. Right next to each is a square which shows you how it is classified: if it's green, it is of course a non-conflict frame.
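Latent Dirichlet allocation is normally run through a library; the toy collapsed Gibbs sampler below is only a sketch of the intuition that words are repeatedly reassigned to topics in proportion to how well each topic explains them. The corpus and parameters are made up.

```python
import random

def toy_lda(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns top words per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    wi = {w: i for i, w in enumerate(vocab)}
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]   # topic of each word
    ndk = [[0] * n_topics for _ in docs]                       # doc-topic counts
    nkw = [[0] * V for _ in range(n_topics)]                   # topic-word counts
    nk = [0] * n_topics                                        # words per topic
    for d, doc in enumerate(docs):
        for pos, w in enumerate(doc):
            k = z[d][pos]
            ndk[d][k] += 1; nkw[k][wi[w]] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for pos, w in enumerate(doc):
                k = z[d][pos]
                # remove this word's current assignment, then resample
                ndk[d][k] -= 1; nkw[k][wi[w]] -= 1; nk[k] -= 1
                weights = [(ndk[d][j] + alpha) * (nkw[j][wi[w]] + beta)
                           / (nk[j] + V * beta) for j in range(n_topics)]
                r = rng.random() * sum(weights)
                for j, wgt in enumerate(weights):
                    r -= wgt
                    if r <= 0:
                        k = j
                        break
                z[d][pos] = k
                ndk[d][k] += 1; nkw[k][wi[w]] += 1; nk[k] += 1
    top = []
    for k in range(n_topics):
        order = sorted(range(V), key=lambda i: -nkw[k][i])
        top.append([vocab[i] for i in order[:3]])
    return top
```

On a real corpus, a sports/injury article's words would split across a sports-like topic (scores, ground, player) and a health-like topic (injured, health, hospital), exactly the intuition described above.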
What is interesting here is that if you look at Al Jazeera, most of its frames are conflict frames, and if you look at the Times of India, you'll find only two of the frames are about conflict. Now we'll go deeper into what... We can't see the column heading. They're all conversations, basically; that's the column heading. I'll show you on the next slide.

All right, so if you click on a conversation, you can look at the cluster of frames. There is an index there which tells you how we have classified it. It gives you topic strength, which means how strong the topic is. The size of the circle denotes the weight of that particular term in that cluster. And we have color-coded the terms as red and orange on the basis of conflict and nation state: if a term is about conflict, we code it red, and if it is about a nation state, we color it orange. That allows you to quickly visualize what the frame is about. You can also click on a particular term and look at the actual concordance of that term, how it was mentioned, and how we classified that particular article. So this is how a conversation is structured.

I'm just going to show you our user results on this one. We showed these frames, or conversations, to the users, and the users agreed with our classification in 24 out of 25 cases, which is a high number. The other thing was that there were no conversations the users found incoherent. Because this is a statistically generated tool, it's possible to have topics which do not make any sense, so our results confirmed that the topic modeling is giving us robust results. Now I'll show you how this enables us to understand the discourse that underlies a particular media source.
The first cluster that we have, I don't know whether you can see the terms or not, but we have labeled it as domestic politics, because it has terms like minority, religion, Modi, government, Sikh, law. It pretty much exhibits the heterogeneity of voices that constitute Indian democracy, the largest democracy in the world. We've classified it as non-conflict, as you can see from the green color code. This is the first cluster that you extract, and it shows that we can extract frames and conversations which are not about conflict.

If you look at the second one, this is the classical conflict frame that you'll find in the majority of global media sources. This particular one deals with Syria and the Middle East, and you can see terms like military, terror, violence, ISIL, Syria. These terms suggest that the conversation represents the conflict that's happening in Syria. Again, this is our interpretation, but the whole purpose of this is to find coherent topics that are part of the underlying discourse.

The third one is also a conflict-based frame, but what is interesting is that the topic modeling was able to extract it as a separate conversation. This one deals with the conflict that was going on in Yemen. The study took place in April, when the Yemen conflict was a separate conflict happening in the Middle East. So even though the model treated it as a conflict frame, it treated it as a separate frame; it was able to tease out this particular topic as a separate, modular frame.

The fourth one deals with the Muslim community, because the terms here, for example Muslim, Urdu, Islam, typify the Muslim community in India.
The index for that one is seven, because it really doesn't deal with the militant conflict we were talking about earlier; that's why it's classified as a non-conflict frame. Finally, because this is all based on statistical algorithms, you have some aberrations as well. The fifth frame is actually quite coherent: it deals with civic and municipal politics in India, with terms like resident, house, local, district, mosque, temple, church. The presence of the term terror is an aberration for this particular topic, but by and large this is a coherent topic, and we've classified it as non-conflict because the majority of the terms were not about conflict. So using topic modeling, we were able to extract frames and conversations which were not about conflict, and we were also able to identify the dominant frames in the global media sources in our tool implementation.

In a nutshell, our text analysis tool uncovers the ideology that is embedded in the text, unlike other tools, which just focus on extracting patterns from data without attending to the ideology. The second thing is that it bridges critical theory with media analysis. Critical theory has been criticized for elitist theorizing and a lack of extended empirical studies; this tool provides empirical evidence on a real-time basis that we can use in newsrooms or in academia. Basically, it provides practical hermeneutics for media narratives.

I want to thank some people before I open the forum for questions: Ethan Zuckerman, who is my PI; Catherine Havasi; Edward Chapa; and of course, Sands Fish. And I guess I'll open the forum for questions now.

Yep. Thank you. I'm going to dive in here and ask the first question. It seems to me like your definition of conflict is really core to this entire project.
And I know that you put a lot of effort into trying to have that be a balanced definition of conflict. One thing you said earlier was that if there is ongoing conflict between two parties, you didn't consider that as part of your definition of conflict. I wonder if you could just elucidate a little more how you're defining conflict for this, and maybe how it might need to change if you were to focus on a different culture or a different topic.

So what I said was that if there's an ongoing conflict, let's say, for example, a minority that is having a conflict with the majority in a particular nation state, and that conflict is not marked by militancy, I do not treat it as conflict, because that is part of politics. At the point that it becomes militant, that there is use of force, then I will classify it as conflict. For me, the key linguistic marker which defines conflict really has to do with force and militancy. Have I answered the question?

Could you give a couple of examples? For example, Kashmir versus the Rohingya minority in Belgium? Sorry, Burma.

Yeah, OK. So both of them would be marked as conflict, because there is militant conflict there. In Kashmir you have an ongoing issue between India and Pakistan, and in Burma you have the Rohingyas; there is conflict marked by force between Buddhist monks and the Rohingya community. So both would be characterized as conflict situations for me.

I'm just curious how sensitive you think your results are to some of your parameters, like, for example, the number of LDA topics. Do you think that if you had searched for 20 LDA topics, maybe you would have found conflict frames that didn't bubble up when you searched for five? Maybe there were conflicts there that were subtopics.
OK, so if you use fewer topics, you end up with predominantly the dominant discourses. If you have three topics, you will see just the dominant discourses that are there. The idea is to increase the number to a point where you're able to extract topics which are not part of the dominant discourse. We wanted to take it further, to seven or eight or nine topics, but we found that it was very difficult for users to navigate what was going on there. Certainly, if we increase the number of topics, you will see a very different composition of topics from what we have here. You would have to go through some sort of study to answer that question, I guess.

Yes. Maybe you want to address this as you move along, but I'm wondering, as a practitioner: I work with Global Voices, an international community of journalists and bloggers and writers who think a lot about these issues and the language used in mainstream media discourse. And this is the kind of thing where we're always saying, oh, those publications portray things this way, but of course we don't have data; we just know it. But so what? What does this suggest for the publications you studied, or for alternative groups that are interested in using a different kind of language? If you were going to propose some kind of changes or actions out of this, what would we do?

This tool could be used in a number of different ways. You can use it in academia to understand the discourse itself. When it comes to the newsroom, you can look at what sort of coverage your particular organization is focusing on.
One of the future things I have on the list is to use Global Voices to see the difference between it and the mainstream sources, how the discourses and the coverage differ across the sources. That will allow organizations to see whether they're reporting on just the dominant discourses or whether they have other areas to focus on. For example, the Times of India is doing a great job; Al Jazeera is not. Al Jazeera, which has an agenda to come up with coverage of Islam that is more neutral, is actually involved in covering Islam in conflict terms. So I guess this tool could be used by them to understand their coverage. I think so.

I have a couple of questions. One draws on, what was your name? Ellery. Ellery's last question; the other is separate. On that one: are you planning on using this tool to reach out to these organizations, to connect them with your findings so that they see where they fall on the scale? Might Al Jazeera then be encouraged to do something differently? And would you come up with specific recommendations on how they might cover things in less conflict-laden terms, or is that something you leave to someone else?

I'm definitely not going in that direction, but certainly I was approached by Ivan Sigal of Global Voices; he was interested in understanding how the coverage in Global Voices differs from the mainstream media sources. That was the question he had in mind. And I think a lot of organizations would be interested in looking at this, if this particular discourse tool were available to them, to do an audit of how they're covering a particular issue.

How duplicable is this? This is something that you've done with respect to conflict and your hypothesis on Huntington versus more nuanced ways of seeing the world.
And is this the project, or would you contemplate having other hypotheses? And would you then need other user-focused groups to calibrate with the machine learning?

Yes. One of the questions we had about this tool was: can we generalize this for any given hypothesis? It's quite a challenging problem, because you can see that this is a very nuanced sort of analysis. When we are building this tool, we have to take into account what we are looking at. As a next step, we are going to look at the African-American community and link it to either crime or inequality in American media, and see how this tool is able to extract the discourse there. And the step after that would be to find some generalized parameters that we can use to build a tool that could answer any kind of hypothesis.

There was a question there. This is fascinating work, and it's the sort of thing you hear and think, yeah, obviously you do all these things together; in retrospect it seems almost common sense. But I've got two very different questions. One is methodological: unless I missed something, you're not actually using machine learning. You've built your classifier based on human learning of various terms, et cetera, and just hooked up your handcrafted classifier to the bulk data processing of a machine learning pipeline. The machine is not actually learning; you're not training it with data, you're providing it with the algorithm. That's the first question. The second question, which is very different: this seems so obvious in retrospect. Is there anybody else doing this sort of stuff? Marketers, you think? Or Google, or whatever?

OK, so I'll answer the first question.
I am, in fact, using machine learning here. In the first part, I'll just show you: we extract the articles from Media Cloud using our query, and then we classify the articles as conflict or non-conflict on the basis of the terms we get from the semantic network. That gives us the training sample, and then we do cross-validation on the existing data. So the classifier is generating its own training sample, and the reason we want to do that is that language is evolving in real time. And to ensure that the classifier corresponds with our definition of conflict, we have, as I mentioned, the gold standard we created: human-coded articles, which we also use to check whether the classifier achieves high accuracy on those particular articles. As for the heuristic, I did not go into depth about it, but it looks at a variety of classification algorithms and different features: I look at different kinds of grammars and different vectorizers, and I optimize the classifier based on the cross-validation score. Once the classifier is serialized, it is used to predict all the articles that are part of the extract, transform, load process for a given day. And the other part, the human-coded articles, provides an objective picture of whether the classifier is working appropriately or not. The second question was about marketing. Yes, Unilever was interested in using this particular tool for marketing insights.
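The pipeline described here can be sketched in a few lines. This is a minimal illustration, assuming Python with scikit-learn; the seed terms, the toy articles, and the choice of vectorizer and model are all illustrative assumptions, not the speaker's actual implementation:

```python
# Sketch of the described pipeline: weakly label articles as conflict /
# non-conflict using seed terms from a semantic network, then train and
# cross-validate a text classifier on those labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Stand-in for the semantic-network term list (hypothetical).
CONFLICT_TERMS = {"war", "attack", "violence"}

def weak_label(article: str) -> int:
    """Label an article 1 (conflict) if it contains any seed term."""
    return int(bool(set(article.lower().split()) & CONFLICT_TERMS))

articles = [  # toy stand-ins for Media Cloud query results
    "peace talks and a cultural festival in the city",
    "war erupts after the attack brings fresh violence",
    "community celebrates a religious holiday together",
    "armed attack sparks violence across the region",
    "markets rally on strong quarterly earnings",
    "war fears grow after a second attack",
]
y = [weak_label(a) for a in articles]

# One vectorizer/model combination, scored by cross-validation; the real
# heuristic searches over several vectorizers, n-gram ranges, and models.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("model", LogisticRegression()),
])
scores = cross_val_score(clf, articles, y, cv=2)
```

The serialized winner of this search would then be applied to each day's ETL batch, with the human-coded gold standard held out as an independent accuracy check.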
So it can have different kinds of applications, because they can ask different kinds of questions, but really we would have to change the pipeline for that. For our purposes, we were trying to link ideology with text in our implementation of this question. Sure, okay. Two questions. A technical question: are you running this on a NoSQL database? Would that be your ultimate platform? Second, and maybe related: I come from the commercial side, and I am interested in rhetorical strategies and influence strategies, and I have an ontology that I find intriguing to map onto this. Is there room, as you study discourse, to tease out from your data the rhetorical strategies that are involved in, or behind, the discourse? You are having an exchange, you are having a conflict, you are having an agreement, but what are the underlying strategies the parties are using to advance their point of view? Is there room in your system to find that, tease it out, or incorporate it? Yes. On the first question: I am not using a NoSQL database like MongoDB, but the data is JSON-based, so I can scale it using any kind of NoSQL database. That is not a problem; it is an architecture issue. Because this was developed as part of my thesis, I did not take into account that I would have to scale it as an architectural system.
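The point about JSON-based storage is that each classified article is a self-contained document, which maps directly onto a document store such as MongoDB if scaling is needed later. A small sketch, with hypothetical record fields:

```python
# Each classified article as one JSON document; this round-trips losslessly
# and can be bulk-loaded into any document-oriented NoSQL store later.
import json

record = {
    "article_id": "example-001",   # illustrative identifiers, not real data
    "source": "example-source",
    "date": "2015-04-01",
    "label": "conflict",
    "score": 0.87,
}

line = json.dumps(record)    # one JSON document per article (e.g. JSON Lines)
restored = json.loads(line)  # parses back to the identical record
```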
On your second question: I showed you some aspects of the tool, but the tool is quite nuanced. You can go and look into concordance and other aspects as well, and because corpus linguistics is itself a very broad field, you can incorporate a lot of its techniques, collocation and so on, to answer questions about what sort of rhetoric or techniques are used by different communities, or in different discourses, in a given media source or pipeline. So that is certainly doable. There was a question here. Yes, I was wondering whether the difference you saw between the Times of India and Al Jazeera reflects the fact that the Times of India's coverage is mostly of things inside India, and, except at the edges around Kashmir, relations between Muslims and everybody else in India are mostly not conflictual, whereas Al Jazeera is focusing on an area with a heavy amount of conflict. Right. The reason the Times of India is remarkable is that we are extracting articles on the basis of whether they are about Islam, and if you look at the mainstream media sources in India and Pakistan, you will find that religious discourse in those countries focuses on issues of conflict most of the time. What was remarkable about the Times of India was that its conflict coverage is actually lower than some of the media sources we have in Pakistan, which means they are looking at other aspects of Muslim life as well. Right, and that is true because India is a country where the Islamic population is mostly not in conflict with anybody else. Yes, that is true, but they are also focusing on other areas. So I think if Al Jazeera used the same model, they could focus on other aspects as well, not just the conflict part. That, I think, was the difference between the two. Yes, okay. Just something tied into that.
Did you have a threshold requirement before you considered a newspaper? For example, did you only look at newspapers that had, say, 100 articles a year on Islam? Otherwise there might be a problem that the coverage of Islam would revolve around sensational events, while the large portion of the coverage would be on mundane, day-to-day non-Islamic things, as with the Times of India. So was there a threshold requirement for considering a newspaper? No, there was no threshold requirement; I did not take that into account. Okay, yeah, Sam, go ahead. So this is piggybacking on your question, but a little more general, and it is not very practical, but since you are suggesting it for academia, perhaps you have had a chance to look at it. In general, the media publishes more articles about fires and murders and conflictual things. Have you thought about looking at whether the percentages are any different? I didn't get the question. If you look at the whole of the media published anywhere, and you compare the percentages of conflictual publications versus non-conflictual, is there any delta between those in your findings? I did not do a study on that, but within our sample of publications we found some, for example the Times of India, Dawn, and The News, that differed from other sources because they were not primarily focused on conflict. We did not really focus on what the delta is there, but that could be an interesting question to look into. I had a question about methodology. You had two classifiers, right? One based on WordNet, which is a keyword classifier.
It is not a classifier; it is just a set of terms related to conflict. It is a linguistic marker for us to say that if a particular article contains any of those terms, we will use it as a training sample for conflict articles. So you define those articles as conflict articles based on the keywords, right? Right. So what I am doing is, say, holding out the data at 60 percent and testing on 40 percent, and once the whole classifier is built and serialized, I use the human-coded sample to see whether the classifier performs equally well there, because it is easy to reach a really high level of accuracy on the data it was trained on. The other part helps because it is an independent data set. So what I am getting at is: if you are deciding what is a conflict article and what is not based on keywords, why do you then build a machine learning algorithm to classify further? Because if you, for example, build a classifier around a single conflict term per article, you will get really poor results. You want a classifier that is robust, so what I am doing is increasing the number of required conflict terms one by one, iteratively, and seeing what the classification score is. There are articles that do not contain any conflict terms, and there are articles that contain X number of conflict terms. The goal is that when we reach the maximum cross-validation score, we stop there, because we retain a large feature space, which allows us to classify articles that are new or not part of the original space. That is the way the heuristic functions.
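The thresholding heuristic just described, labeling an article as conflict only if it contains at least k seed terms, sweeping k upward, and keeping the threshold with the best cross-validation score, can be sketched as follows. The toy data, seed terms, and model choice are assumptions for illustration:

```python
# Sweep the minimum number of required conflict terms and keep the
# threshold whose weak labels give the best cross-validated classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

CONFLICT_TERMS = {"war", "attack", "violence", "clash"}  # hypothetical seeds

def term_count(article: str) -> int:
    return sum(1 for w in article.lower().split() if w in CONFLICT_TERMS)

articles = [  # toy stand-ins for the corpus
    "festival of lights celebrated across the country",
    "war erupts as attack triggers violence and clash",
    "schools reopen after the holiday season",
    "violence follows clash between armed groups",
    "markets rally on strong earnings report",
    "attack condemned as war fears grow",
]

best_k, best_score = None, -1.0
for k in range(1, 4):  # minimum number of conflict terms required
    y = [int(term_count(a) >= k) for a in articles]
    if min(sum(y), len(y) - sum(y)) < 2:
        break  # too few examples of one class to cross-validate
    X = CountVectorizer().fit_transform(articles)
    score = cross_val_score(MultinomialNB(), X, y, cv=2).mean()
    if score > best_score:
        best_k, best_score = k, score
```

Stopping at the score maximum, rather than demanding many seed terms, is what keeps the feature space large enough to generalize to unseen articles.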
For example, in the slide you see here, because the majority of the articles are about conflict, I use a pre-2001 corpus to extract articles that are not about conflict. That helps as well, and then you have the articles we are getting on a daily basis, which are used to create a training set that gives us the ideal classifier. Because if you just classify articles based on a single conflict term, you will not get robust results. Yes, yes. Well, I have always loved, and still love, the idea of bridging critical theory, of making critical theory empirically testable. I feel that is the overarching concept that is strongest about this, and you could take it in many different directions and start to iterate on that idea in a lot of interesting ways. I think that is really compelling, and I would be curious to hear from humanities people or critical theorists themselves about their response to it. Have you gotten any responses from them, and are they resistant to the idea of making it empirical? The other thing I was going to say, which I think relates back to Ellery's question, was about the practical application and potentially the theory of change here. I have thought about this in relation to the Center for Civic Media's tools more generally as well. I know that at this point this was a thesis project, and you are saying you have not built it in a more generalized way or at scale. But if there were a way to do that, and also for some of the Center for Civic Media's other tools, I am thinking of the geography classification work that we did and things other people are working on.
I could see the application, if we get those tools to a more generalized standpoint, at the point of editorial decisions. If, as the editor of Al Jazeera, you had real-time insight into what your coverage of Muslims has looked like so far, that you have basically been talking about Islam in terms of war and conflict, could you modulate that? Could you adjust it in some way? If you had those analytics in real time, then when you are deciding whether or not to run a story, or whether to make large-scale investments in longer-form stories, you might make different editorial decisions. So if we could scale the tools to where they are easy enough to use inside a news organization, that could be a really powerful feedback mechanism into the media as well. Yes, I think that is one of the things I have as part of the future work on this tool at the Center for Civic Media. I know there are a lot of technical things that make that difficult, and it would also be difficult from the news side: they would have to have access to their whole corpus to plug into it. Yes. But because we have Media Cloud available, we can leverage it to build a lot of these tools, which will enable us to flesh this out. As for critical theorists, one friend of mine, an anthropologist, found it quite interesting because he thought it could be used to look into discourse around different communities, for example how Christians are covered in Indonesia. But no one has given me formal feedback. There is a question over there.
Yeah, I was wondering whether you had done any stratified studies to look for the influence of possible confounding biases on some of the conclusions. For example, Al Jazeera might have more coverage of Syria and Iraq. Have you compared the coverage of Syria and Iraq in the Times of India versus Al Jazeera to see if you get the same difference in the use of conflict terms? Or are you partially seeing the effect that the Times of India may have more of a cultural orientation and look at other things, whereas Al Jazeera may be hard news, more focused on the Middle East, where there is a higher level of conflict? For example, in Pakistan there may be a lot of coverage of internal conflicts and internal terrorism that does not exist in India, which might also bias things toward a higher level of conflict terms. I did not look into that, which would be interesting: comparing the coverage of, say, the Syrian conflict in the Times of India and Al Jazeera. But having said that, if you look at the Middle Eastern geography, it is quite diverse, and Al Jazeera approaches the problem by treating it in the context of one larger Islamic world, which is the same paradigm Huntington has: using this as an overarching map for a lot of different communities, even though the Pakistani community is very different from the Indonesian community. Lumping them all under one appellation of religion creates the sort of stereotypes I talked about in my discussion. So with Al Jazeera, I think they could focus on other aspects if they treated these nation-states and communities in a disparate manner. I do not think they are doing that; again, this is my assumption based on the coverage I have seen on Al Jazeera. So Al Jazeera doesn't use terms like Shias versus Sunnis versus Sufis versus others?
They use those terms, but when they are talking about, for example, a conflict issue happening in Syria or Yemen or elsewhere, they do not focus on other aspects of those communities or the region. For example, it is a Qatar-based outlet, and if you look at their coverage of Qatar itself, it is not that diverse either; you would expect them to focus on some of the other areas that deal with daily life, and that is not there. So again, there are a lot of things we could look into, because there are so many aspects to it. This was my first attempt to look into the discourse question, and each of these dimensions suggests a different study to look at what the findings would be, which was beyond the scope of the project when I was working on it. I am just curious, following up on some of these other questions: can some of the difference between the Times of India and Al Jazeera be explained by Al Jazeera pitching stories to an international audience, and the Times of India going to a domestic audience? An international audience is interested in state-level or larger-scale things, rather than a domestic audience that is more into the day-to-day. Yeah, you can explain it through that, but my point is that if you are doing that, you are doing the same thing by looking at this in a homogenized manner, and you should not be doing that.
That is, again, my understanding: if you do that, you are creating these stereotypes about civilizational conflict, which pit one group against another at a higher level. That is the reason I am saying they have to come up with other ways of reporting, to report on other aspects of Muslim life. I am curious if you could take it a little deeper on the question of associating Islam with conflict. I look at the words in the circles and think, what about all the other words? There is so much nuance that might be brought to a story that does link these things, and another story that might include many of the same words could be hopelessly ignorant and incorrect. So I am curious whether, in a next iteration, there could be a new linguistic analysis layer on top of it, to dig deeper on that type of thing. I know that is not quite what this study was intending to suss out. Absolutely, but one idea I should mention here is the Foucauldian idea of discourse: that it produces certain kinds of representation, and that is where the whole thing comes in, where you are producing these objects that deal with just the conflict. That, I think, is what drives the whole thing. So we can look into that linguistic analysis, yes, but in this particular implementation I was looking at the idea of the discourse. By objects, are you talking about individual linguistic markers, or about a more complex object, or a co-occurrence of words? When Foucault talks about objects in discourse, he talks about énoncés, or statements. It could be anything.
So it could be anything that is part of a discipline or a practice, something that comes up in the media, something that is part of medical practice; all these statements are part of the discursive formation, which is a result of power relations. That is how he defines it. Well, thanks Ali, thank you very much.