Hello everybody. My name is Tal Ayalon. I am the project manager for the Open Knowledge Repository of the World Bank. It's a pleasure to be with you today, and it's an honor and a privilege to present as part of the CNI annual meeting. Today I'd like to discuss something that has occupied a lot of our time as administrators of open institutional repositories and other open access repositories: accuracy in the reporting of usage statistics, and how we can make sure our statistics are as accurate as they can be. Let me share my screen. We can start our presentation from here: The United Rainbow Colors of Bots, Separating the Wheat from the Machine-Generated Chaff in Open Repository Statistics.

Now, what do I mean by that? For this presentation's purposes, we define bot activity as machine-generated interactions with a repository (visits, views, downloads, and so on) that are not part of the intended services offered by the repository. These range from malicious activity, through statistics boosting, to benign or even helpful web crawler activity on the repository server. Our challenge as administrators and as institutional users of open access repositories can be summed up by this excerpt from the Open Access and Analytics presentation at the last NISO conference: "While traditional paywall publishers can take advantage of industry-standard COUNTER reports to communicate usage to subscribing libraries, no similar standard exists for OA content. Instead, many organizations are stuck with proxy metrics like sessions and page views that struggle to discriminate between robotic access and genuine engagement." To help with this challenge on the OKR, the Open Knowledge Repository, I have come up with a tool that helps define, track, and mark bot activity. It's not a perfect system, but I hope at least some of you will find it helpful.

So how do we detect bot activity and remove it from a repository's real statistics and metrics? Introducing the rainbow scale of bot activity, from red, which is the easiest to detect, to violet, which is the hardest.

The red level is the easiest to detect: a DDoS attack, distributed denial of service. This is no-nonsense malicious activity by machines. Distributed denial of service attacks are attacks by bots on an online system, intended to overwhelm the server with false requests until it shuts down. Such an attack is obviously bot activity; it raises an alarm with us and with our DSpace vendor, resulting in a complete block of the offending IP addresses. A clarification: the OKR is built on DSpace, and our DSpace vendor cooperates with us in locating bot statistics and bot attacks like this one, which is the red color in the rainbow.

The next level is crawler HTTP agents. Many of the most common bots operate in the open and perform benign collection of information, for example the Google Scholar crawler, which indexes academic websites for relevant metadata. These bots announce themselves through HTTP agents that are easily identifiable to system administrators, and they do not pose a threat to the server. However, they should be marked so that the statistics they generate do not infiltrate the server's genuine, human-generated statistics. On the right side you can see a list of HTTP agents known to belong to bots, such as Scrapy and different versions of Ruby, which is another crawler. These are excluded from the statistics in the OKR by the configuration file you see on the right, so their interactions are not counted.
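To make the idea concrete, here is a minimal Python sketch of that kind of user-agent exclusion. The pattern list and the helper function are illustrative assumptions for this sketch, not the OKR's actual DSpace configuration, which maintains its own, much longer agent list.

```python
import re

# Illustrative list of user-agent patterns known to belong to crawlers.
# DSpace ships its own, much longer list; these entries are assumptions
# for the sketch, matching the kinds of agents mentioned above.
BOT_AGENT_PATTERNS = [
    r"scrapy",      # the Scrapy crawling framework
    r"ruby",        # Ruby-based HTTP clients
    r"googlebot",   # Google's general crawler
    r"bingbot",     # Bing's crawler
]
BOT_AGENT_RE = re.compile("|".join(BOT_AGENT_PATTERNS), re.IGNORECASE)

def is_bot_agent(user_agent: str) -> bool:
    """Return True if the HTTP User-Agent matches a known crawler pattern."""
    return bool(BOT_AGENT_RE.search(user_agent))

# Example: keep only human-looking hits from a parsed access log.
hits = [
    {"ip": "203.0.113.7", "agent": "Scrapy/2.11 (+https://scrapy.org)"},
    {"ip": "198.51.100.2", "agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
]
human_hits = [h for h in hits if not is_bot_agent(h["agent"])]
print(human_hits)  # only the Mozilla hit survives
```

Case-insensitive substring matching like this is deliberately aggressive; a production list is curated so that legitimate browsers are never caught by accident.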
The yellow layer is a bit harder to detect, but with a little manual work those bots can be weeded out as well. These are known blacklisted IP addresses. Cybersecurity websites often publish and maintain lists of IP addresses known to originate spam attacks and other malicious or disruptive activity, so that administrators can block them; these are known as block lists, and we at the Bank subscribe to several of them. Such lists are a great basis for a server's exclusion list, but new bot servers are created every day, so these relatively static lists cannot be relied upon as the only line of defense against bot activity. You could have a very rich IP block list, but if, for example, it is updated once a week and a new bot attack starts the day after the last update, the bot has up to six days to do whatever it wants before it is added to the block list. There is a delta of time that publicly available blacklists simply do not cover.

For that we have the next level, which is a bit harder to detect and filter out: newly found blacklisted addresses. Many registries of malicious and spam bots exist online. To get the most comprehensive and up-to-date information on blacklisted IP addresses, use a meta blacklist. All IP addresses that pass a certain threshold of interactions with the repository get analyzed by such meta-blacklist sites and are added to our block list if they appear on one or more trusted blacklists within the meta blacklist. New spam bots, and worse, are launched every day, so keeping the block list up to date is paramount. Sometimes an IP address raises no flag on any of the specific block lists you subscribe to; in that case you can compound your search with a meta-blacklist site that sends your query out to dozens of other blacklists and reports whether any of them has already flagged the address as malicious or bot-driven.

Now comes a part that is a bit of detective work. It is not an exact science, but it does help keep your block list up to date: non-blacklisted IP addresses behaving suspiciously, which is the blue color in our rainbow. Sometimes an IP address engages heavily with the repository, or engages in a strange, abnormal-looking way, yet does not appear on any blacklist. A deeper analysis, however, reveals behavior that appears non-human: SQL injection attempts, an IP address downloading a single item 10,000 times in a week (which has been known to happen), or an IP address downloading each of the repository's reports the exact same number of times. These cases are a judgment call. Factors that affect the blacklisting of such an address include: is it registered to an entity that owns other blacklisted addresses, which can be a strong indicator of machine-generated activity? And how inexplicable is the behavior? For example, an IP address from Eastern Europe downloading a report about South America 25,000 times.
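As a sketch of the yellow level, the check against subscribed block lists can be as simple as parsing each published list and testing membership. The list contents and format below are assumptions for illustration; real block lists commonly use one IP address or CIDR range per line.

```python
import ipaddress

# Illustrative block list in the common one-entry-per-line format; in
# practice this text would be downloaded from each list you subscribe to.
RAW_BLOCKLIST = """
# spam sources (example entries)
198.51.100.9
203.0.113.0/24
"""

def parse_blocklist(raw: str):
    """Parse bare addresses and CIDR ranges into network objects."""
    networks = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        networks.append(ipaddress.ip_network(line, strict=False))
    return networks

def is_blocklisted(ip: str, networks) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

networks = parse_blocklist(RAW_BLOCKLIST)
print(is_blocklisted("203.0.113.77", networks))  # True: inside the /24 range
print(is_blocklisted("192.0.2.1", networks))     # False: not listed
```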
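For the green level, one scriptable building block behind such compound lookups is the DNS-based blackhole list (DNSBL), a widely used mechanism that many meta-blacklist sites aggregate. This is a related technique substituted here for illustration, not necessarily the specific service the OKR uses; the zones named below are well-known public DNSBLs, and note that some of them refuse queries coming through public DNS resolvers.

```python
import socket

def dnsbl_listed(ipv4: str, zone: str) -> bool:
    """Check an IPv4 address against a DNS-based blackhole list.

    DNSBL protocol: reverse the octets, append the list's zone, resolve it.
    Any A record in the answer means the address is listed; NXDOMAIN
    (raised as socket.gaierror) means it is not.
    """
    query = ".".join(reversed(ipv4.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)
        return True
    except socket.gaierror:
        return False

# Compound the lookup across several lists, meta-blacklist style.
ZONES = ["zen.spamhaus.org", "bl.spamcop.net"]  # well-known public DNSBLs
suspect = "203.0.113.7"
flagged = [zone for zone in ZONES if dnsbl_listed(suspect, zone)]
print(f"{suspect} listed on: {flagged or 'no subscribed list'}")
```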
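Some of the blue-level detective work can also be automated with simple heuristics over the download log. The log shape, threshold, and flags below are invented for this sketch; the point is to surface addresses whose behavior looks non-human, such as one address hammering a single item, or downloading every report the exact same number of times.

```python
from collections import Counter

# Hypothetical parsed download log: (ip, item_id) pairs.
downloads = [("198.51.100.9", "report-123")] * 10_000 + [
    ("192.0.2.4", "report-123"),
    ("192.0.2.4", "report-456"),
]

WEEKLY_THRESHOLD = 1_000  # illustrative cut-off; tune to your repository

per_ip_item = Counter(downloads)  # download count per (ip, item) pair
suspects = set()

# Flag 1: one address hitting a single item far beyond plausible human use.
for (ip, item), count in per_ip_item.items():
    if count > WEEKLY_THRESHOLD:
        suspects.add(ip)

# Flag 2: an address whose download counts are identical across many items,
# which tends to indicate a scripted sweep rather than human browsing.
counts_by_ip = {}
for (ip, item), count in per_ip_item.items():
    counts_by_ip.setdefault(ip, []).append(count)
for ip, counts in counts_by_ip.items():
    if len(counts) >= 10 and len(set(counts)) == 1:
        suspects.add(ip)

print(suspects)  # -> {'198.51.100.9'}
```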
When you do that detective work on your repository statistics, you will often find very suspicious activity that no blacklist or meta blacklist detects.

Moving on to indigo, we have bots disguising themselves as VPNs. Most VPNs (virtual private networks) are legitimate repository users being funneled through one IP address, whether because of internet access restrictions or because they belong to a large institution. For example, when I see a lot of activity coming from a single IP address in, say, Bulgaria: the World Bank has a large office in Bulgaria, and it may simply be that the whole office uses one VPN to mask its activity, so everybody in that office who searches the OKR is registered as statistics from that one address. That does not make it machine-generated activity or statistics that should be excluded; those statistics are human-driven and welcome. Bots behaving like VPNs, that is, a large number of statistics originating from one IP address but distributed among many different HTTP agents, interacting with a wide range of items in what looks like a random pattern, are very hard to detect, because they look very much like human activity within a VPN. A strong indication of a bot posing as a VPN is its IP address being registered to a previous offender known to operate malicious or spam bots. Another, the converse of what we said earlier about HTTP agents, is a single IP address with a great deal of repository activity that uses only one or a very small number of HTTP agents. Everything looks as if it came from one computer, which usually means there is less chance this is legitimate VPN activity.

Which brings us to the hardest kind of machine-generated statistics to detect, the one where we sometimes just throw our hands in the air and say we are as close as we can get to purely human-generated statistics, but will probably never reach 100 percent accuracy with open access repositories: bot client arrays. Identifying bot activity is not an exact science; no system can rid itself of non-human-generated statistics completely. I call it the Heisenbot principle. If someone maliciously designs an array of bots with different IP addresses in different places, limits their activity below a reasonable threshold, and applies some randomization to their behavior, they could probably generate non-human statistics that we will not be able to detect.

Bots are becoming more sophisticated, and so should we. On future trends in the field of bot detection for repositories: the botnet detection market is estimated to grow at a compound annual growth rate of 37.6 percent over the forecast period of 2021 to 2026. The number of botnet attacks is rising quickly, so demand for botnet prevention is expected to grow. Bots also know how to forge the common attributes used for bot detection: they manipulate HTTP headers and their values, and they change their browser fingerprints. The best bots are able to mimic human behavior almost perfectly; for example, they can easily pass CAPTCHAs and can imitate realistic mouse movements and keystrokes. Below is a diagram of a plan for machine-learning bot detection, which will probably be the future trend and will help us a great deal with the parts of the rainbow where our detection is not yet good enough.
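A rough sketch of that indigo-level judgment call: aggregate traffic per IP address and compare its volume against the diversity of HTTP agents and items it touches. The log shape and cut-offs below are invented for illustration; heavy traffic through one or two agents leans bot, while many agents across many items leans legitimate VPN.

```python
from collections import defaultdict

# Hypothetical parsed log entries: (ip, user_agent, item_id).
log = (
    # VPN-like: one office address, many different browsers, many items.
    [("203.0.113.50", f"Mozilla/5.0 (browser {i})", f"item-{i}")
     for i in range(40)] * 20
    # Bot-like: heavy traffic, a single scripted client, one item.
    + [("198.51.100.9", "python-requests/2.31.0", "item-3")] * 800
)

agents = defaultdict(set)
items = defaultdict(set)
volume = defaultdict(int)
for ip, ua, item in log:
    agents[ip].add(ua)
    items[ip].add(item)
    volume[ip] += 1

for ip in volume:
    # Heavy traffic funneled through one or two agents is unlikely to be a
    # legitimate VPN, where many different browsers and devices would appear.
    if volume[ip] > 300 and len(agents[ip]) <= 2:
        print(f"{ip}: {volume[ip]} hits via {len(agents[ip])} agent(s), "
              f"{len(items[ip])} item(s) - suspicious")
```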
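As for the machine-learning direction, here is a deliberately tiny sketch of what such a pipeline could look like, using scikit-learn with invented per-IP features and labels (this is an illustration, not the plan shown on the slide). A real system would derive features from access logs and train on a manually labeled sample.

```python
from sklearn.ensemble import RandomForestClassifier

# Invented per-IP feature vectors: [requests_per_hour, distinct_agents,
# distinct_items, fraction_of_requests_at_night]. Purely illustrative.
X_train = [
    [12,   3,  9, 0.2],   # human-looking traffic
    [900,  1, 40, 0.9],   # bot-looking traffic
    [30,   5, 20, 0.3],
    [1500, 2,  2, 0.8],
]
y_train = [0, 1, 0, 1]    # 0 = human, 1 = bot

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Score a new address: estimated probability its traffic is machine generated.
print(clf.predict_proba([[700, 1, 5, 0.95]])[0][1])
```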
So this sounds a bit scary, but there is still a lot of brute-force, easily detectable bot activity going on, and there probably will be even after more sophisticated bots arrive, because those will also tend to be more expensive and harder to implement. So there is still a very good share of machine activity that we can detect. We will probably never be 100 percent perfect, but we can keep striving for it and keep doing our due diligence and our detective work on our statistics. Please let me know if you have any questions, anything to add, or anything I am missing in my bot detection approach; my email is down below and I would love to hear from you. Thanks a lot for your time, and I hope this presentation was helpful. Thank you.