Hi all, and welcome to this virtual Berkman Klein talk series. I still see some participants joining, so I'll take it slowly here. First of all, welcome to this series and to this rather unusual setting, at least for the Berkman Klein Center, in these very unusual times. On your screen you can see some of the house rules we have. Most notably, audio and video have been turned off for you, so you will only be able to see the slides as well as the presenters, that is, Adrian and me. If you have questions during the talk, submit them through the Zoom Q&A tool, and we'll go through those questions and answer them after the talk. After the talk, the public chat will also be enabled. So right now you can't chat with other folks, but once the talk is over, you'll be able to do that more interactively; we've learned in the past that it can be a little distracting if the public chat is enabled throughout. And finally, a word of caution, as usual with these Berkman talks: the webinar is being recorded. So if your question is read out, we might mention your name; keep that in mind, and if you don't want your name mentioned, please add that to your question. With that said, I'll now share my screen with you. Today, Adrian and I will talk about our research on bot detection, or more specifically on the issues of bot detection. The talk is structured as follows: first, I will guide you through our thinking, how we structured our paper, and the situation we're in right now when it comes to bot detection. Then Adrian will guide you through our results and give you some thoughts on what we think can, and maybe should, be done. But first and foremost, we're just very happy that you chose to join us online.
These times are quite stressful and uncertain, and we're very thankful that you're participating here; I hope you all are safe out there. Now, turning rather abruptly to bot detection, which is a much more abstract topic: you have to take a step back and think about the public interest, or really worry, about bots, and how it has increased over time. You can see the plot on the right, which is from Media Cloud. It shows media attention in US media between January 2015 and April 2020 for bots and for fake news, misinformation, or disinformation. It shows that after fall 2016, so after the presidential election in the US, public attention to, and probably worry about, bots and misinformation really skyrocketed. That's not only true for public and media interest, but also for academic research: Google Scholar gives me 18,900 hits when I search for bots and Twitter since 2016. I couldn't use Scopus because, like many of you, I'm working remotely, so I had to take Google Scholar as my database of reference. It shows that this field of bot research, and the question of what role bots play in our public discourse, is an urgent one. I think no one disputes that we would all love to know how many automated accounts spreading political or other messages are really out there; we all want to know, if we debate with someone online, whether that account is a person or really a bot. So there is a need and an urgency here. There have been studies and articles about this in recent years. This, for example, is a Pew study from 2018 called "Bots in the Twittersphere," which says an estimated two-thirds of tweeted links to popular websites are posted by automated accounts, not human beings.
Another study, covered by The Guardian this year, had the headline "Revealed: quarter of all tweets about climate crisis produced by bots." Then there's this one, I think by Gizmodo, an op-ed which says social media bots are damaging our democracy: on the internet, nobody knows you're a natural language processing system. So again, it's not only a worry; the claim is that it might be really bad for our democracy. And finally, all different kinds of topics: "Tweets about cannabis' health benefits are full of mistruths," which I believe was on The Conversation, about a Twitter analysis some academics had done. This is the interface we're looking at: academics doing research on these discourses on Twitter, asking how we can measure what's happening there, writing a study, and then it gets out into the media and the public discusses it. So the rigor of the academic research is especially warranted here, because while academic discussions and studies usually stay within academia, this topic does not: it's also out in the world and in the media. Now, when we talk about bots, we have to think about terminology, and we have some definitions here. For example, Chu et al. say bots are automated programs, but they also differentiate between bots and cyborgs, that is, bot-assisted humans or human-assisted bots; think of a Twitter program where sometimes a human will also write tweets from that account. But this is only one of many definitions you'll find when you look at the literature.
Another is by Kollanyi and Howard, back then at least at the OII, who define bots as accounts that automate interaction with other users. So here it's about interaction with other users, not just sending messages into the vortex. Third, from Bessi and Ferrara, who talk not about bots but about social bots, which emulate the activity of human users but operate at a much higher pace, while successfully keeping the artificial identity undisclosed. This again is a different definition, and you can see that from definition to definition, the lines between what's a bot and what's not begin to blur. And finally, Bot Sentinel is a website out there that came up with the term "trollbot," which they define as troll-like behavior with a repetitive, bot-like nature to the trolling, where it's really unclear whether that's even a bot at all; and in that sense, does it even matter if it's a bot or not? So you see that, definition-wise, you have a broad field. We believe it's important to take a step back and say: for us, bots are fully automated accounts, and that's the end of it. Not because we don't acknowledge that there are different ways bots are being used out there, but because we don't believe all these terminologies help us figure out what we want to study, namely: how many bots can we identify on Twitter? And when I talk about bots, there are some very obvious bots out there, like the Museum Bot or the Soviet Art Bot: often accounts that name their engineer in the description and link to the GitHub repo. So how do you identify bots, or how have scholars, journalists, and researchers thought about going about this?
There are different approaches. There's the simple threshold approach, which is really just saying, for example, that every account that tweets more than 50 times a day is a bot. There's the network analysis approach, where you look at follower communities. There's the machine learning approach, where you say: we know these accounts are bots and these accounts are humans, and we train an algorithm so that it will detect them going forward. There's digital forensics, which is a mixture of human inspection and these other tools. And finally there's human inspection, where you just look at the account and what it tweets and try to make sense of it: does the image fit, and so on. In our talk, we focus on machine learning algorithms, most notably Botometer, which can be understood as the gold standard in social science research for identifying bots on Twitter. Indeed, if you look through the literature, you'll find several hundred studies that have used Botometer, including in communication journals like Political Communication, but also, for example, in general science journals like Science. Botometer is by far the most used tool in that regard, so we believe it's valid to ask: how good is Botometer? I ran the two bots I just showed you through Botometer, and the scores here are 2.4 and 2.3. So it's in the middle; Botometer is essentially saying "it depends." On the scale, blue means not a bot, red means bot, and these are right in the middle. If you're wondering how Botometer works and have never used it, this is the web interface: you log in with your Twitter credentials and then you can check users. What you see here is what I did yesterday when I checked Adrian's and my accounts.
Apparently Adrian is a little more bot-like than me, but I think we're both relatively in the clear and tend not to be bots according to Botometer. What you see here are the universal scores. You can see this also on the next slide, where I looked at Adrian's account and where we get two important values: the complete automation probability, which is 1%, and the universal score. These are the two main values you also get from the Botometer API, and those are the values researchers usually work with. Most prominently, researchers take the universal score, which is rescaled to the interval 0 to 1, where 0 is not a bot and 1 is a bot, and they choose thresholds: above a threshold an account counts as a bot, below it, it doesn't. Pew, for example, had a threshold of around, I think, 0.43: everything below 0.43 was not a bot and everything above it was a bot. That's the universal score in that context. Now, when we talk about classifiers, and Botometer is a classifier, we have to talk about pitfalls and what can go wrong. We see this currently with the coronavirus and testing, where it's really important how good a test is; it's very similar with classifiers. So if you have a sample of 50% bots and 50% human users, the question is: what pitfalls can occur? Generally speaking, you will have a classified data set. Within it you will have true positives, in our case the bots we identified that really are bots. And you will have false positives, that is, humans that have been classified as bots but obviously aren't. This is something you will never entirely get rid of, but you obviously want to reduce the false positives and have a high true positive rate. However, an issue arises here.
That is not the only pitfall, because true positives and false positives tell only part of the story. There's also precision: of all the accounts that got classified as bots, how many were actually correctly classified, that is, how many real bots are among the accounts selected as bots? That is precision. And there's recall: how many of all the bots in our data set were correctly selected? So going forward in this talk, it's important to differentiate between true positives, false positives, precision, and recall; you'll find out later why we talk about this. More specifically, there's an issue if we only look at true positives and false positives, namely that a 50/50 data set doesn't really occur on Twitter. Reality looks more like this: according to Twitter, about 15% of all users are bots and the rest are humans, and that imbalance is usually not accounted for. You'll see the impact of that. So this is the theory, and these are the aspects we thought about over the last one and a half years while doing this project, starting early on with the question: how precise is this tool in detecting bots? We have four questions, bigger and smaller, in that regard: How good is the diagnostic ability of Botometer when used on five distinct sets of Twitter accounts? How good are the precision and the recall of Botometer scores when used on five distinct sets of Twitter accounts representing the bot-human ratio of the general Twitter population? So not only in our data sets, but also when we resample them to match the general Twitter population, which is what we're really interested in. And also: what's the difference between languages here?
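To make the four quantities concrete, here is a minimal sketch in plain Python (not from our paper; the toy labels are made up for illustration) that computes true positives, false positives, precision, and recall for a binary bot/human classification:

```python
def confusion_metrics(y_true, y_pred):
    """Compute TP, FP, precision, and recall for binary labels (1 = bot)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # bots caught
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # humans flagged
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # bots missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of flagged accounts that are bots
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of all bots that got flagged
    return tp, fp, precision, recall

# Toy example: 4 bots and 4 humans; the classifier flags 5 accounts as bots
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
tp, fp, precision, recall = confusion_metrics(y_true, y_pred)
print(tp, fp, precision, recall)  # 3 2 0.6 0.75
```

Note how precision and recall answer different questions: here 60% of the flagged accounts really are bots, while 75% of all bots were found.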
We know roughly that there's a difference in how well Botometer works, for example, in English versus Swedish, based on other studies. And finally: how stable is Botometer over time? Usually studies collect their data, run it through the Botometer API once, and then they have their results, and that's it. We were also interested in how these values change over time. So what we did is construct a collection of five different data sets: clear humans, clear bots, and a training set from Botometer. For the humans we used politicians; US politicians, to give some examples, usually verified, and German politicians, mostly verified. In any case, all accounts that were clearly human, where the media would notice and notify the public if something really weird and fishy were going on, that is, automation in one regard or another. And finally, we also looked at bots: we went to a wiki that lists Twitter bots, which are usually very transparent bots, so if we did human inspection we would all say, oh yes, that's a bot. We did this for new English-language bots as well as German bots. And then, as I said, the bots that were originally used to train Botometer. So we had a data set of 4,000-odd accounts that we then queried Botometer with, and we queried Botometer daily for three months, because we really wanted a lot of reference scores to then calculate our results. And with that, I'll now let Adrian guide you through the results. All right, many thanks, Jonas. So let me start with the five different data sets we created for our analysis. Jonas has already introduced the individual data sets, and to really test a classifier, a binary classifier,
we of course have to use data sets that contain both bots and human accounts. So the first data set we use is simply all the accounts we identified, combined; that's here in red, "all." Then we combined the German politicians and the German bots, but we had identified only very few German bots, which is why we also created a third combined data set with the German politicians together with the English-language bots. Then we have a fourth data set, the US politicians together with the English-language bots, and as a fifth data set we use the data set that the creators of Botometer used to train the classifier. Usually, when you want to report or analyze the diagnostic ability of a classifier, you report something called the ROC curve, which stands for receiver operating characteristic, and then you calculate the area under the curve, meaning the area covered below the curve; the larger that area, the better the classifier. What you plot is a point for every single threshold, and for us, the Botometer score goes from zero to one. We start on the right-hand side with a Botometer threshold of zero: if your threshold is zero and you classify every account with a score above zero as a bot, you get a perfect true positive rate, meaning you identify all bots in your data set. But at the same time, of course, every single human user in your data set is also wrongly classified as a bot, which means you get a false positive rate of one. That's why these visualizations start in the upper right corner.
As you increase the threshold, you can eventually use the highest possible threshold, just below one, and then the true positive rate maybe includes only a single bot, so below even 1%; at the same time you also reduce the false positive rate, of course, because no human account will be wrongly classified as a bot. But then you won't really identify any of the bots in your data either. In this visualization we compare the different data sets, and we used the mean over the three months: every day we measured the score for every account once, and for this part of the analysis we took the mean, the average score an account received over the three months. What you can see here already is that the ROC curve for the US politicians and bots data set is better than the other curves; the German politicians with English bots, as well as the German politicians with German bots, are a little worse, as the area under the curve is smaller. You see that here in the summary. This is usually what is reported in studies, and the Botometer creators also report the ROC in their paper, where it's actually pretty high. Here you see that the US politicians and bots receive the highest ROC AUC score, whereas the German politicians with the German bots, and the German politicians with the English-language bots, get rather low overall ROC AUC scores of 0.76 and 0.7. The problem with the ROC AUC approach is that you get relative values: you basically get a percentage for each part of the data set separately, one measurement for the bots and another one based only on the human accounts.
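The threshold sweep described above can be sketched in a few lines of plain Python. This is a generic illustration, not the Botometer implementation; the score lists are hypothetical mean scores for known bots and known humans:

```python
def roc_curve_points(scores_bots, scores_humans, thresholds):
    """True/false positive rate for each threshold (score > t => classified as bot)."""
    points = []
    for t in thresholds:
        tpr = sum(s > t for s in scores_bots) / len(scores_bots)    # bots caught
        fpr = sum(s > t for s in scores_humans) / len(scores_humans)  # humans flagged
        points.append((fpr, tpr))
    return points

def auc(points):
    """Area under the curve via the trapezoidal rule (points sorted by FPR)."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Hypothetical mean scores on a 0-1 scale
bots = [0.9, 0.8, 0.7, 0.4]
humans = [0.1, 0.2, 0.3, 0.6]
thresholds = [i / 100 for i in range(101)]
print(round(auc(roc_curve_points(bots, humans, thresholds)), 2))  # 0.94
```

At threshold 0 every account is flagged (upper right corner, TPR = FPR = 1); as the threshold rises, both rates fall toward the origin, exactly as described in the talk.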
Jonas has already mentioned it: when we use this classifier, our target population is not our training data set or an artificial data set that is balanced. In reality we want to analyze the Twitter population, and within the Twitter population, I think we can all agree, we have fewer bots than human users. Of course, there is no definitive number, but you find figures like 15% of the accounts active on Twitter actually being bots. So we tried to create new data sets with this kind of imbalance, because an imbalanced data set is what you will actually see in the real world when you analyze Twitter outside of your experimental or test setting. For each data set we created a new version via a random sample with replacement of 100,000 accounts, but we adjusted the probability weights during the sampling process so that in the end we get a data set with 15,000 bots and 85,000 human accounts. As I told you before, though, the ROC approach gives you relative values: no matter how imbalanced the data set is, as long as you sample from more or less the same population, you will get the same ROC AUC score. Here on the left-hand side you see the original scores, based on the training set and our data sets, most of which are balanced, and on the right-hand side, labeled "sample," you see the new data sets we created with the imbalance of only 15% bots; the scores are overall the same. However, it's different if you instead use precision as the measurement, which, with the imbalanced data set, asks: of all the identified accounts above the threshold, how many of them are actually bots?
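The resampling step can be sketched as follows. This is a simplified stand-in for our procedure (the account tuples and numbers here are toy values), showing how probability weights during sampling with replacement produce the 15/85 split:

```python
import random

def resample_imbalanced(accounts, n=100_000, bot_share=0.15, seed=42):
    """Resample with replacement so that ~bot_share of the new data set are bots.

    `accounts` is a list of (score, is_bot) tuples; weights are set so that
    bots are drawn with total probability `bot_share` regardless of the
    original balance of the data set.
    """
    rng = random.Random(seed)
    bots = [a for a in accounts if a[1] == 1]
    humans = [a for a in accounts if a[1] == 0]
    weights = [bot_share / len(bots) if a[1] == 1 else (1 - bot_share) / len(humans)
               for a in accounts]
    return rng.choices(accounts, weights=weights, k=n)

# Toy balanced data set: 50 bots, 50 humans -> resampled to roughly 15% bots
accounts = [(0.8, 1)] * 50 + [(0.2, 0)] * 50
sample = resample_imbalanced(accounts, n=10_000)
print(round(sum(is_bot for _, is_bot in sample) / len(sample), 2))  # ~0.15
```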
In an imbalanced data set, even a very small false positive rate leads to a high absolute number of human accounts wrongly classified as bots, while the number of true bots within the identified set is comparatively small. So what does that mean for the PR, the precision-recall area under the curve? On the left-hand side you see the PR scores for the original data sets, which in most cases are better, of course, because the data is more balanced. In contrast, on the right-hand side you see the PR scores for our newly created imbalanced data sets. The only value that actually increases is the German politicians and German bots: the PR score for the original data set is smaller because we had very few German bots, less than 15% overall, so when we created the new data set the score increased because we had to add more bots. But all the other values get worse. Let me show you visually what that means. If we compare the two PR curves, one for the original data sets and one for the resampled, newly created imbalanced data sets, you can see, for example for the "all" data set, that the curve gets pushed down: the precision, which is really the value we're interested in, gets worse. The curves otherwise look more or less the same because we sample from the same population, and the recall of course stays the same. And you see for the German politicians and bots on the right-hand side that the curve really gets pushed down. I'll explain on the next slide what this means and how to interpret these curves.
These curves read a little differently from the ROC curves. You again start on the right-hand side with a Botometer threshold of zero, and the thresholds increase toward the left-hand side, up to a threshold of one. So again, if we take the lowest possible threshold and say every account with a Botometer score above zero is classified as a bot, we get as our precision exactly the bot proportion of the population: in our newly created data set we have 15% bots and 85% human users, so the precision is exactly 15%, because we get all the bots but at the same time also all the humans, meaning 85% wrongly classified accounts. We also plotted points into this visualization showing thresholds used in already published studies. The points more toward the middle show the threshold of 0.43, which, as Jonas mentioned before, is the threshold from the Pew study. If you take, for example, the "all" data set, you can see the precision is not that great: it's about 0.5. So if you classify all accounts with a score higher than 0.43 as bots, 50% of the identified accounts will actually be bots, but the other 50% will be false positives, human users. If you do the same for the German politicians and bots data set, it performs a lot worse: it's more in the 30% range of actual bots among the identified accounts, and over 60% are human users. The points on the left-hand side mark the threshold of 0.76, which was used in the German context in a study published in Political Communication. As you can see, that's a rather conservative, pretty high threshold.
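The arithmetic behind this drop in precision is worth spelling out. Given a classifier's true positive rate and false positive rate at some threshold, precision follows directly from the bot share of the population. The operating point below is hypothetical, chosen only to show the effect:

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by a classifier's TPR and FPR at a given bot share."""
    tp = tpr * prevalence          # expected share of accounts that are bots and get flagged
    fp = fpr * (1 - prevalence)    # expected share of accounts that are humans but get flagged
    return tp / (tp + fp)

# Hypothetical operating point: the classifier catches 70% of bots
# and misclassifies 10% of humans as bots.
tpr, fpr = 0.70, 0.10

# On a balanced (50/50) test set, precision looks fine ...
print(round(precision_at_prevalence(tpr, fpr, 0.50), 2))  # 0.88
# ... but at a realistic 15% bot share it drops sharply.
print(round(precision_at_prevalence(tpr, fpr, 0.15), 2))  # 0.55
```

Same classifier, same threshold: only the population changed, and almost half of the flagged accounts are now human.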
And you still get a lot of wrongly classified human users into the data: only around 25% of the accounts classified as bots really are bots. At the same time, and this is the recall you see on the x-axis, the recall is very low if you use such a conservative threshold. Conservative here means a very high threshold, where you of course want to reduce false positives; but with a very high threshold you will identify very few bots out of the overall population of bots in your data set. What this visualization also shows is an obvious visual difference in classification quality between the English-language, more general corpora and the German corpora, which perform a lot worse. Here's a summary of all the scores. What's interesting for us is the last row with the PR scores: you can see that the German data sets perform a lot worse with Botometer, whereas the English-language data sets all have higher scores, for the ROC as well as the PR. These differences are also significant; you can read more about this in the paper, where there's also a test for significance. As a last step, we also wanted to find out whether Botometer gives stable scores, whether there's low or high volatility, and we selected a few of the accounts. Here we have a politician from the far-right Alternative für Deutschland, Alice Weidel. You see the scores we measured every day over the three months plotted here, and there's a lot of volatility. If you take the threshold from the Pew study, where every account with a score higher than 0.43 is a bot, then just at the end of March she would have been a bot.
Whereas on other days during these three months she would have been classified as human; there's a lot of volatility. Let's go to the next one: this is a US politician, actually a Republican, who has a very low Botometer score in the beginning, but then it increases, and toward the end, again based on the Pew study threshold, he would have been classified as a bot. And as a last one we have of course chosen a bot; Jonas identified this one. It's the Boston snow bot, and as you can see, it obviously tweets when there is snow. It seems like in March, I don't know Boston well, there was some snow, but from April on the bot apparently wasn't really active anymore. I think what we captured here is probably the activity level: it was far more active in March than in April, and maybe it even stopped in April. So what we see with these very few cases already is a lot of volatility. Most bot studies measure the bot score only once, on a specific day, not over time with an average, and this really can be an issue: depending on the day, an account will be a bot at a certain threshold, and on another day it won't. We also have a summary, also to be found in the paper, where we checked, over the three months and for every single threshold, whether an account was at least once above and at least once below the threshold. If you take, for example, on the left-hand side with the universal score, the threshold of 0.4, which is again very near the Pew study threshold, then you see that over 55%, this is the red line, of the new English-language bots we identified had at least one day with a score above that threshold of 0.4 and, during the three months, at least one day with a score below it.
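The crossing check we ran can be written down very compactly. The daily scores below are invented for illustration; the logic is simply "above the threshold on at least one day and below it on at least one other day":

```python
def crossed_threshold(daily_scores, threshold):
    """True if an account scored above the threshold on at least one day
    and below it on at least one other day within the observation window."""
    return (any(s > threshold for s in daily_scores)
            and any(s < threshold for s in daily_scores))

# Hypothetical daily scores for one account over part of the three months
print(crossed_threshold([0.31, 0.38, 0.47, 0.52, 0.35, 0.29], 0.4))  # True

# Share of accounts that flip classification at a given threshold
accounts = {
    "bot_a": [0.55, 0.62, 0.38],
    "bot_b": [0.81, 0.77, 0.90],
    "human_a": [0.12, 0.45, 0.20],
}
flip_share = sum(crossed_threshold(s, 0.4) for s in accounts.values()) / len(accounts)
print(round(flip_share, 2))  # 0.67
```

A single-day measurement picks one arbitrary point from such a series, which is exactly why the day of data collection can decide whether an account counts as a bot.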
And as you can see, the same holds true for the German bots. With the human users it depends; it's maybe less problematic but still pretty high. For example, for the German politicians, again on the left-hand side, if we take the threshold of 0.4, we still get around 30% of the users who at least once had a score above the threshold and at least once a score below it during these three months. All right, then we're at the end. So what's the conclusion, what are the learnings? We should of course be very careful when using Botometer, because the scores are not stable, and there are problems with language: that it works in English doesn't mean it works in other languages. And then, what we believe is really our strongest message: we should consider the imbalanced sample. The populations we're analyzing with a classifier are maybe not as balanced as our training data. This is a general problem; if you read the literature about classifiers in bioinformatics, there are a lot of discussions about tests being developed and then used in a real population. You also have to consider, and we haven't mentioned this before, what is more important, or rather more costly: false positives or false negatives. That goes beyond our paper, but it's definitely a question that every researcher approaching the problem of bots should answer. So how should we go forward? We recommend, as communication scientists, of course: always do some manual validation. Just because a classifier was validated in a prior study doesn't mean it still works with a new data set or new population. In the Swedish case, for example, some colleagues tested Botometer, got very bad scores, and so decided to create their own classifier for their specific Swedish data set. And then it worked: it got better ROC scores but also better precision.
You also need to validate over time. This is something very general: we think any kind of classifier that analyzes Twitter data or Twitter accounts, maybe beyond just bot analysis, should consider whether it's stable over time. Then, of course, validation in different languages is important. From, let's say, a European or Asian perspective that is maybe obvious, but for a lot of US researchers it is maybe less obvious when they create their classifiers. Now, if we move forward and say Botometer didn't work well, can we just create a new classifier? It's very easy, based on this data set, to create an almost perfect classifier: I can just build an ad hoc classifier that checks whether "bot" is mentioned in the description or in the username, and that classifier already works pretty well. But that is definitely not enough. Even when a new classifier becomes available that was validated once, researchers using it in the future will have to validate it again, like we say. So if you're interested in this, please read our paper; we explain there in more detail how to move ahead, and all the technical parts are explained in more detail too. And of course we also call out these kinds of black-box tools, where the code is not available and where the full data set with all the measurements used to create the classifier is not available either. That's a problem: we cannot reproduce exactly the same classifier one to one, and we need that to evaluate the quality of the classifier and really understand where it is biased. Right now we can only somehow reverse engineer it or do guesswork based on single examples where there is a bias. And since we call for more transparency, we of course also share our data: you'll find the code and the data in the Harvard Dataverse. So that's it from my side.
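The ad hoc classifier Adrian mentions in passing really is that trivial. Here's a sketch of it (the example accounts are hypothetical); it illustrates why near-perfect scores on a data set of self-disclosing bots say little about real-world performance:

```python
def naive_bot_classifier(username, description):
    """Ad hoc heuristic from the talk: flag accounts that disclose themselves
    by carrying 'bot' somewhere in the username or profile description.

    This only catches transparent, self-labeled bots; any covert bot, the
    kind studies actually worry about, sails straight through.
    """
    text = f"{username} {description}".lower()
    return "bot" in text

print(naive_bot_classifier("SovietArtBot", "Posting Soviet art every 6 hours"))  # True
print(naive_bot_classifier("jane_doe", "Political scientist"))                   # False
```

On a wiki of transparently labeled bots this heuristic looks excellent, which is exactly the validation trap the talk warns about.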
Many thanks for joining. Thank you so much, Adrian. I'll now have the double function of being both host and question-asker, and maybe I'll answer some questions as well. Also a special thank-you to Adrian, because he's in Taipei at National Taiwan University, where it's 12:40 a.m. right now. So thank you for staying up so late. The first question actually reached us via email before the event had even started, so we want to give space for that too. The question is: what do you think are the prospects for passing federal regulation on bots in the near future, and how do you see state laws on bots being carried out? As Adrian is an expert in that regard, I would want him to go first. It's a very interesting question, because about two years ago, before I came to Taipei, when I was still back in Switzerland, a group of Swiss members of parliament invited me to Bern, and we had a kind of closed-door discussion with some legal experts who had also written papers about bot regulation or potential bot regulation. They had this idea of regulating bots because they had read all these stories in the media about bots taking over democracy, or being the biggest threat to democracy. Even back then, I was rather skeptical. So let me answer now, based on our analysis. Now we can say it's very difficult. First of all, from a conceptual perspective, and I'm not a legal scholar, you need to be very clear what you mean by a bot if you want to regulate it, right? The object itself is difficult. Then, as a second point, it's very difficult to identify bots, to measure them. Often it's not really clear who is a bot: there are accounts that have a certain degree of automation but also humans behind them. So it's very difficult to identify them, and also, at least at this point, very difficult to define them.
However, and this was really the last point that we were discussing there, the legal experts back then, at least the Swiss ones, also agreed with me and the other people involved in this discussion that we probably have to ask: how big of a threat are bots in comparison to other threats to democracy, and what is the actual effect of these bots? A lot of political communication scholars, more traditional ones, say it's very difficult to change people's opinions in general, even with strong campaign techniques. So if there is an effect at all, it will probably be a very small one. At the same time, if we think about foreign interference, and now I'm speaking from a Swiss perspective but can also speak from a Taiwanese one, there was this debate before the presidential election here in January about the China threat, especially the fear of online warfare. However, if you take a broader perspective, I would say the biggest threats are actually offline. In the Taiwanese case, for example, there's the discussion of direct proxies; that is probably far more effective. This means going through organizations, persons, or even politicians, right? Of course that is happening, and if you want to fight foreign interference, you should start where the biggest threat is, and I strongly believe bots are probably not the biggest threat. At the same time, and this was also the conclusion in Bern back then, we should keep an eye on it. The situation may change, there will be new developments, and maybe in the future and in a certain context bots might be a problem. But at the moment we were advising against regulation, and I would still say the same today: it's very difficult, and I wonder how it would even be possible to regulate.
And then, at the same time, is it really the biggest issue we cope with at the moment? Thank you. I don't really have much to add to that. My intuition, before and after doing the study, is that if you do try to implement a regulation, the limits of the regulation will in the end be so tight that you won't capture a lot of aspects, which in turn raises the question of how effective the regulation would be in the first place. Plus, in the end, I don't see bots as the biggest problem, even when we think about misinformation. A second question then, and this is maybe one for you as well, Adrian, is from Bow Bow. She asks: could a classifier like Botometer use Bayesian methods to take into account the imbalance between bots and real humans on Twitter? Could it use the latest breakdown of bots versus humans as its prior? Theoretically, yes, you can. It's more about testing your classifier, though. During the validation phase, you need to take that into consideration. So if someone tells you, here I have a new classifier, it works perfectly well, it has a very high ROC score over 0.9, and you want to use it, you should of course validate it with your own data from the population. And when you use a data set to validate it again, you need to be careful not to use a balanced data set with 50% bots and 50% human users. You can use that to train the classifier, to add more information and put more emphasis on one class. But to validate it, and you have seen that now in the presentation with the precision, you should take this imbalance into account, and there you have different methods for how to test it, right? A Bayesian approach itself?
I'm a big fan of Bayesian regression models, and in general of this paradigm or way of thinking, but ad hoc I can't think of how to use it here. Of course, the mindset is similar to your idea, right? We take into account that we have some priors, that the population is not balanced. I would just say the validation gets better with such an approach; whether the classifier itself gets better, I don't know. Thank you. For the next questions, we have several methods questions and more general ones, and I'm just going to switch between those. The next one is from Rod, and he first thanks us for the fantastic perspective: given that Oxford Internet Institute research points to the increasing disinformation architecture of countries like India, Russia, and Iran, can you point to work, if any, being done on multilingual sentiment analysis? Shall I answer? Please go ahead. Okay, again, it's exactly the same. If you're talking about sentiment analysis, and let's ignore all the conceptual discussion about what sentiment is, let's say you can really measure sentiment and it works pretty well in one language. With multiple languages, if you're interested in a comparative perspective, where something is stronger or weaker, you need to validate it not only in every single language; you also have to check whether the scale, the strength of the sentiment, behaves similarly across languages. And I could start with the cultural question of whether sentiments are even comparable across cultures: maybe an expressed sentiment in Taiwanese culture would already be considered extremely strong, whereas in US culture it would have to be even stronger to count as extremely strong. Again, validate.
I also stress this with my students: don't just take these out-of-the-box solutions without any validation. Some of my colleagues in communication science have tried to revalidate a lot of the word lists that are used for sentiment analysis, and if you don't change these word lists at all and try to validate them in a new context, they don't work at all. The baseline, and this again comes from communication science, the gold standard, is usually human coding: you check whether the human coding matches what the sentiment classifier measures, whether there is a strong overlap. So across different languages it is very difficult, but what I can tell you is that if you have a word list for a specific language and stay within that language, and you adapt the word list for the specific context, it can actually work quite well. But again, you need to validate it. If it's not validated, I would be very skeptical. The next question relates to something about six minutes into the presentation, the "we want to know if we are debating a person or a bot" issue: is it your sense that there are bots that can pass that particular version of the Turing test? I'm going to take that question myself first, and I'm also excited to hear Adrian's opinion. From everything I've seen on Twitter, I would say no. A lot of these Twitter debates especially are usually short, so you don't have a lot of time, and in more heated discussions there is probably a lot of projection going on. I haven't seen anything that would point me to a different opinion on that. Yes, I agree with you. I can maybe add one more point from two German colleagues, Pascal Juergens and Simon Kruszynski, who analyzed astroturfing and automation, actually on Facebook.
And what they say is that it's actually the other way around: you would be surprised how many human users use very simple communication patterns that could be interpreted as automation, as if the text were automatically created. They had a very large data set of Facebook comments, and what they found is that the majority of people are not communicating like academics having a debate on Twitter. Most users write very short sentences, or not even sentences: a combination of emoji, or misspelled words, single words, and they are still human users. So we need to be very careful in the other direction too, careful not to label human users as bots when in reality they are actually human users. With regard to the Turing test and bots: as far as I know, there's not really a bot that has passed the test, but maybe we will see a change in the future. Thank you. The next question is more on the methods side: what is the total sample size for each of the groups, that is, the n? I had that slide up in the middle of the presentation. Overall there were 4,400-something accounts. The majority, 2,000-something accounts, were from the Botometer training set, and the rest were split roughly 50/50: a little over 1,000 humans and, again, a little over 1,000 bots. And the second question in that regard is about the size. One thing, yes, that's also a limitation we have to add: we have a very narrow data set, and we specifically chose this one, but at the same time these very homogeneous groups are also a kind of limitation. So maybe it will work differently with other data sets. Yes, I think that's important to note.
And the reason why we chose this data set was because we really wanted to be sure what we have in our data set, because otherwise it gets very tricky. The second part of the question is: is the size of n more important than the level of imbalance of the data set, and what is the impact of different values of n? A too-small n, of course, is problematic, and beyond that I would say it's more about whether the general population is homogeneous or heterogeneous in the context in which you want to use the classifier. As with traditional sampling, you can say: the more heterogeneous the population you want to analyze, the larger the n for your training data set needs to be to really get a good classifier. If you have a very homogeneous data set, you can even use a very small n and you will get a very good classifier. Thank you. The next question is from Maria, and she asks how this compares with the R package tweetbotornot. Which one is this? I have to preface the answer by saying we haven't checked other packages specifically against this. We know there are other bot detection options out there, which all have their advantages and disadvantages. We picked Botometer specifically since it has been used in most studies, and Adrian has pulled up the R package right now, I believe. Yeah, exactly, now I know which one it is. It really depends on what Mike Kearney was using to train his classifier, and as far as I understand, he probably used even the same lists, but he can maybe tell you more about this if you ask him. I think he also chose these obvious bots, so in that sense I strongly believe that package will probably have a better performance than Botometer if you use these accounts. But you need to check, right? If you use the accounts that were used to train the classifier, of course you will get an excellent performance.
So again, our recommendation is: even if you get an almost perfect classification with an artificial training data set, use the classifier with the population you want to analyze and then really validate what you get there. Take the accounts classified as bots and manually check how many of them are actually false positives. And then, at the same time, and this is more difficult because usually the majority of accounts will not be classified as bots and you may have a very large data set, you still need to check how many of the bots were not identified and were wrongly classified as human accounts. So you need this kind of manual classification anyway. Awesome, thank you. We have seven more minutes, so we'll try to keep our answers short to get through all the questions; I was told we have until 1:05. Mason asks: are you aware of any use of bot classifiers by platforms as part of their content management strategies, and if so, do you think these classifiers are doing more harm than good? I personally haven't talked to anyone within the social media platforms about that. I would assume that they have classifiers that they use, amongst others, for content management; obviously platforms don't want bots just rampaging around. And I think if they don't also invest heavily in human eyeballs that check and validate the accounts they're removing automatically, there might be some harm being done along the way. What do you think, Adrian? Yeah, I totally agree with you. They probably have their own classifiers, and they have a lot more information available. We, and also the creators of Botometer, only have the information that you actually see when you open a tweet or an account, basically what the API returns.
Twitter, of course, has all this back-end data: when users logged in, IP addresses, and so on. This information we lack, and I think if you have that information, it's easier to create a good classifier. But again, the platforms probably think hard about the false positive rate. There is a real cost: if the false positive rate is high and you block every account above a certain threshold, you block too many real users, which of course has consequences, and people become aware of it. So I think they are rather conservative with this. What is your opinion, Jonas? I agree. Okay, Paula asks: thank you Jonas and Adrian, fantastic work. It is clear that Botometer is not reliable for identifying bots, but what other methods do you recommend? I agree that bots are not the major problem, but they still pollute the public sphere. I think the quick answer to that, so we can get to the other questions as well, is: use several methods at the same time, and manually validate, which is annoying because it's time-intensive and you can't really automate your way around it, but you have to do it in any instance. I personally like network-based detection methods, but these are also very time-intensive, and there you also have to manually validate. What do you think, Adrian? I totally agree. There are methods to identify them, maybe quite clearly, but this really means you have to look into the data, do some data crunching, and look at it from different perspectives, and definitely not just use a classifier in that process. Yanting from NTU asks: since you have already figured out that the diagnostic ability of Botometer differs across languages,
do you think that different social media platforms would also affect the precision of Botometer, even if everything is in English and we use the same classifier for different platforms? Yeah, definitely. Let's assume there's another social media platform that has more or less exactly the same affordances: what you get back is exactly the same, you have something like shares, which is a retweet, you have replies, you have mentions, all the same. Even then, I would say, there are always cultural differences. We can even stay within Twitter, or we can move to another platform where it's maybe even more obvious, something like Instagram. In some countries, for example, the comment section is extremely important and a lot of things are happening there, while in other cultures the comment section is not so important. So if you create a classifier in one culture where comments are not so important, you might classify users that post a lot of comments as bots, because it looks like automation. And if you then use that classifier in the other culture, it would probably identify a lot of users who really do write comments as bots. So we need to be aware not only that a different platform has different affordances; within one platform, there are also different cultural spheres in which social media are used differently. Awesome, thank you. And so the final question of the day, or of the night for you: what is the ideal way to create gold standard data sets for bot research? Is it possible for humans to classify bots in the wild? There are two parts: the gold standard within communication science and the gold standard within computer science. Because we don't have enough time, let me talk about the one in communication science.
As I said with sentiment analysis, we assume that human coders can identify the sentiment if they read the text, though we know from content analysis that there is sometimes a lot of ambiguity; it's not that simple. If we apply the same logic to bots, we have to assume that we as humans can recognize bots. The gold standard in that sense is: you, as a human, take a sample from the population and manually classify these accounts; you can read the text and check the profile. Then you compare the values you get with the values the automatic classifier gets. From a communication science perspective, the gold standard, as I and I think most of my colleagues would say, is what we as human coders are seeing. And you can even go further: in communication science we say it's not enough that just one coder checks, especially with sentiment. If we compare what Jonas sees in a tweet with what I see in the same tweet, even if we're from the same culture, there is probably not 100% overlap, right? So you may even have to add more than one human coder, and then you compare the labels that we give as human coders with what the classifier gives. In our case we could test Botometer this way because we took obvious accounts; for the accounts we selected, it is very clear whether they are bots or humans. But in the general Twitter population there are probably also bots that are not labeled as bots, and then a human first has to identify them, which is manual coding. So the gold standard, to answer the question, is in my opinion the human coding. I think that sounds about right, and due to time I will just refer to Adrian's perfect answer.
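Comparing labels between two human coders, or between a coder and a classifier, is usually quantified with a chance-corrected agreement measure such as Cohen's kappa. A minimal sketch with made-up labels (the example data is hypothetical):

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders (or a coder and a classifier)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both assigned the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

coder = ["bot", "human", "human", "bot", "human", "human"]
clf   = ["bot", "human", "bot",   "bot", "human", "human"]
print(round(cohens_kappa(coder, clf), 2))  # 0.67
```

A kappa near 1 means the classifier reproduces the human gold standard almost perfectly; with more than one coder, you would first check the coders' agreement with each other before treating their labels as ground truth.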
And with that, I think we can wrap up this event. Thank you so much for attending. If you've got questions, please feel free to reach out; you can find us on Twitter and on our institutions' websites. I hope you all stay safe and healthy, and we'll see each other at the next virtual event. Thank you so much. Thanks for joining. Bye bye.