Today, I'm going to present our research: a quantitative evaluation of different models used to detect sockpuppet accounts on Wikipedia. This research was completed by me, Merlinita, a final-year undergraduate computer science student at the University of Edinburgh, and by my supervisor, Dr. Björn Ross, a lecturer in computational social science at the University of Edinburgh.

Let's start by clarifying the main terms. Sockpuppets are multiple accounts created by the same user and used for some malicious purpose. The user who creates the multiple accounts is called a sockmaster. There are many reasons why sockmasters create multiple accounts: for example, to circumvent a block, ban, or sanction on a primary account, or to create the illusion of greater support for a position. Obviously, all of these actions severely violate Wikipedia's integrity.

Wikipedia has a way to catch sockpuppet accounts, which I will now demonstrate. It is a two-step process. First, accounts are reported by other users; for example, all of these accounts have been reported. Then the reports are reviewed by administrators, clerks, and, where requested, CheckUsers, as indicated by the case statuses. Arguably, the current reporting procedure is too manual. This can lead to human biases and errors, and possibly an inconsistent review process.

The literature has tried to address this by suggesting many machine learning solutions. The solutions can be grouped into three groups depending on which features were used: some studies use textual features, others use non-textual features, and the majority of works combine different features, such as textual with non-textual, or account-based with content-based features. Even though many solutions have been suggested for this serious problem, no effort has been made to compare all of the solutions and find out which one is the best. Since all of these studies use different evaluation sets, which differ in size, namespaces used, and so on, the results reported in the papers cannot be used for direct comparison. Also, some of the papers are rather dated and do not account for the latest innovations in the field.

So this leads us to two research questions. The first is about recreating the models and evaluating them on the same dataset using the same standard metrics. The second is about bringing some innovation to this problem and using transformers to improve classification results for sockpuppet detection models.

To answer these questions, we needed to recreate an archetype model for each of the groups. The source code of the models was not publicly available, so we had to recreate them from scratch. Since all of the details can be found in the paper itself, we won't go into the implementation details here. To enable direct comparison, some models had to be modified, though, including changing Solorio's model from linking sockmasters with their sockpuppets to binary classification. This allowed us to compare it with the TMX and USE models, which were binary classifiers. However, using authorship attribution features for binary classification led to poor results, so we needed a better representative for that group. An additional reason to switch to another model was that the original Solorio study was dated: it was 10 years old. Much has changed in the technology field during those 10 years, and we believe transformers were the most significant revolution, so we switched to them.

We used the Hugging Face library to download these transformers, and we fine-tuned them on our classification task using the Trainer class. We selected four of the most popular transformers that were pretrained on Wikipedia, namely BERT, DistilBERT, RoBERTa, and XLNet.
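For concreteness, here is a minimal sketch of what such a fine-tuning setup can look like with the Hugging Face Trainer; the CSV layout, model checkpoint, and hyperparameters below are illustrative assumptions rather than our exact configuration.

```python
# Minimal sketch: fine-tuning a pretrained transformer for binary
# sockpuppet classification. Assumes CSVs with "text" and "label"
# columns; hyperparameters here are placeholders, not our exact setup.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

model_name = "roberta-base"  # likewise BERT, DistilBERT, or XLNet
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)  # 0 = legitimate, 1 = sockpuppet

data = load_dataset("csv", data_files={"train": "train.csv",
                                       "test": "test.csv"})

def tokenize(batch):
    # Pad/truncate each talk-page contribution to a fixed length.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["test"],
)
trainer.train()
```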
Once the models were recreated, we evaluated them all on the same dataset. As no datasets were publicly available, we created a new one from users' contributions to talk pages. The main method of creating the dataset was using MediaWiki API endpoints. Since sockpuppet accounts eventually get blocked, we used the API blocks endpoint to retrieve accounts blocked due to sockpuppetry, and in this way we fetched around 10,000 accounts. Then we used the API usercontribs and compare endpoints to fetch the contributions themselves. Since not all of these sockpuppets contribute to talk pages, we removed those who didn't, and the number of sockpuppets dropped to 3,483 accounts. Of course, we also needed a control group: for each sockpuppet, a matching non-sockpuppet account that contributed to the same talk page around the same time was found with the help of the API revisions endpoint. Careful selection of the control-group accounts resulted in a balanced dataset, as depicted in these charts. The final dataset consisted of around 5,000 users, who altogether made 150,000 contributions, and it covers more than 20 years of commenting behavior.
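As a rough illustration of the first step, fetching sockpuppetry blocks through the MediaWiki blocks endpoint might look like the sketch below; filtering on the word "sock" in the block reason is our assumption about how such blocks can be identified, not a documented flag.

```python
# Minimal sketch: retrieving accounts blocked for sockpuppetry via
# the MediaWiki API (action=query, list=blocks), with continuation.
import requests

API = "https://en.wikipedia.org/w/api.php"
session = requests.Session()

def blocked_sockpuppets():
    params = {"action": "query", "list": "blocks", "bklimit": 500,
              "bkprop": "user|reason|timestamp", "format": "json"}
    accounts = []
    while True:
        resp = session.get(API, params=params).json()
        for block in resp["query"]["blocks"]:
            reason = block.get("reason", "").lower()
            # Heuristic: keep blocks whose reason mentions sockpuppetry.
            if "sock" in reason and "user" in block:
                accounts.append(block["user"])
        if "continue" not in resp:
            break
        params.update(resp["continue"])  # standard API continuation
    return accounts
```

The usercontribs, compare, and revisions endpoints can be paged through with the same continuation pattern.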
We tested each of the recreated models three times, on three different data splits, in order to check whether a model's performance is independent of the split, and we report the average metrics. As for the machine learning models, we tried five different ML models in order to find the most suitable one.

As already mentioned, we found that the textual model recreated from the Solorio paper turned out to be ill-suited to our problem: as we can see, it can detect at most 38.5% of sockpuppets, which is of course not enough. From the remaining groups, the most important finding is that transformers achieve the best results. More specifically, the RoBERTa transformer achieved outstanding results: true positive rate, precision, and F-score are all above 84%, and the false positive rate is 9.8%, which is the best result among all of the models. Even though 9.8% looks like a small number on paper, in reality the majority of accounts are legitimate, so this rate would still translate into a high number of incorrectly blocked accounts; we should keep that in mind.

Let's conclude by summarizing the main contributions of our work. First, we directly compared three groups of approaches to detecting sockpuppets on Wikipedia. This was a gap in the research, which prevented us from knowing which model is the most promising one. Secondly, we found that transformers turned out to be the most suitable model for our problem. This is a novel approach: to the best of our knowledge, transformers have not been used on their own to detect sockpuppet accounts on Wikipedia. While Sakib's work used transformer embeddings, they were combined with other features rather than used alone, and in that study the final classification decision was still made by traditional machine learning models rather than by a transformer, as in our study.

Transformers have great potential for semi-automated sockpuppet detection on Wikipedia. Due to the still high false positive rate of 9.8%, full automation is not recommended, as the cost of false positives is too high. We cannot afford to incorrectly accuse a user of being a sockpuppet, because that user can lose trust in Wikipedia and may even stop using it. But we can still semi-automate the process. For example, we could use RoBERTa to flag suspicious accounts for administrators to review, or automate the second phase, where suspected accounts have already been reported and RoBERTa makes the final decision on whether to block an account or not; a rough sketch of the flagging step follows at the end.

And the final contribution is the dataset itself. The dataset can be used as a benchmark to evaluate other sockpuppet detection models, including future ones. It can also be used for similar and related problems, such as analyzing whether sockpuppets contribute to some topics more than others.
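To make the semi-automation idea concrete, here is a hypothetical sketch of the flagging step using the fine-tuned model; the checkpoint path, decision threshold, and input format are all assumptions for illustration.

```python
# Hypothetical sketch: flag suspicious accounts for human review
# using a fine-tuned RoBERTa checkpoint (path and threshold assumed).
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="out/checkpoint-final")  # fine-tuned RoBERTa

def flag_for_review(contributions, threshold=0.9):
    """Return accounts whose contributions look like sockpuppetry.

    `contributions` maps an account name to one of its talk-page posts.
    Flagged accounts are handed to administrators, not blocked outright.
    """
    flagged = []
    for account, text in contributions.items():
        pred = classifier(text, truncation=True)[0]
        # LABEL_1 = sockpuppet class in this assumed label mapping.
        if pred["label"] == "LABEL_1" and pred["score"] >= threshold:
            flagged.append(account)
    return flagged
```

A high threshold like the 0.9 used here trades recall for precision, which matches the goal of keeping incorrectly flagged legitimate accounts to a minimum.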