Snooping Wikipedia Vandals with MapReduce

Abstract:

In this paper, we present and validate an algorithm that accurately identifies anomalous behavior in online collaborative social networks, based on how users interact with their peers. We focus on Wikipedia, where accurate ground truth for the classification of vandals can be reliably gathered by manual inspection of the page edit history. We develop distributed crawler and classifier tasks, both implemented in MapReduce, with which we are able to explore a very large dataset consisting of over 5 million articles collaboratively edited by 14 million authors, resulting in over 8 billion pairwise interactions. We represent Wikipedia as a signed network, where positive arcs imply constructive interaction between editors. We then isolate a set of high-reputation editors (i.e., nodes having many positive incoming links) and classify the remaining ones based on their interactions with high-reputation editors. We demonstrate that our approach is not only practically relevant (due to the size of our dataset), but also feasible (as it requires few MapReduce iterations) and accurate (over 95% true positive rate). At the same time, we are able to classify only about half of the editors in the dataset (recall of 50%), a limitation for which we outline some solutions under study.
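
The two-stage idea described in the abstract (select high-reputation editors, then label the rest by their interactions with them) can be illustrated with a small single-machine sketch. This is not the authors' MapReduce implementation: the function name `classify_editors`, the threshold `min_positive_in`, the arc direction (from the acting editor to the editor being judged), and the sign-sum decision rule are all assumptions made for illustration only.

```python
from collections import defaultdict


def classify_editors(edges, min_positive_in=2):
    """Toy two-pass classifier over a signed editor network.

    edges: iterable of (source, target, sign) tuples, sign being +1 or -1.
    min_positive_in: hypothetical threshold for "many positive incoming links".
    Returns (trusted, labels); editors in neither structure stay unclassified.
    """
    edges = list(edges)

    # Pass 1: count positive incoming arcs per editor (a map + reduce by target).
    positive_in = defaultdict(int)
    for _src, dst, sign in edges:
        if sign > 0:
            positive_in[dst] += 1
    trusted = {e for e, n in positive_in.items() if n >= min_positive_in}

    # Pass 2: for every non-trusted editor, sum the signs of the arcs
    # it receives from trusted editors (assumed decision rule).
    score = defaultdict(int)
    for src, dst, sign in edges:
        if src in trusted and dst not in trusted:
            score[dst] += sign

    labels = {}
    for editor, total in score.items():
        if total < 0:
            labels[editor] = "vandal"
        elif total > 0:
            labels[editor] = "benign"
        # total == 0: conflicting evidence, left unclassified
    return trusted, labels
```

In this sketch, editors who never interact with a trusted editor receive no label at all, which is one plausible reading of why only part of the dataset can be classified.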

Contributor : Admin Télécom Paristech
Submitted on : Thursday, February 25, 2016 - 12:52:49 PM
Last modification on : Tuesday, May 14, 2019 - 10:14:46 AM


  • HAL Id : hal-01279007, version 1


Michele Spina, D. Rossi, Mauro Sozio, Silviu Maniu, Bogdan Cautis. Snooping Wikipedia Vandals with MapReduce. IEEE ICC, Feb 2015, London, United Kingdom. ⟨hal-01279007⟩


