Alexey Grigorev - Large Scale Vandalism Detection in Knowledge Bases





Published on Jul 26, 2017

Wikidata is a Knowledge Base where anybody can add new information. Unfortunately, it is targeted by vandals, who put inaccurate or offensive information there. To fight them, Wikidata employs moderators, who manually inspect each suggested edit. In this talk we will look into how we can use Machine Learning to automatically detect vandalic revisions and help the moderators.

Knowledge bases are an important source of information for many AI systems, which rely on them to enrich the information they process and deliver a better user experience. Building such knowledge bases is difficult, which is why the process is crowd-sourced. One such base is Wikidata, which allows anybody on the Internet to edit the content and add new information.

Unfortunately, Wikidata is often targeted by vandals, who misuse the system and put false or offensive information there. This may lead to incorrect behaviour of the AI systems. To keep the base clean, Wikidata employs moderators who manually inspect each revision and revert vandalic ones.

To help moderators fight vandals, the organizers of WSDM Cup 2017 challenged the participants to build a Machine Learning model that automatically detects whether an edit should be rolled back. In this talk we will discuss the second-place solution to the Cup: how to process half a terabyte of revisions, extract meaningful features, and create a production-ready model that scales to a large number of test examples.


PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
