TokTrack: A Complete Token Provenance and Change Tracking Dataset for the English Wikipedia
Published: May 2017
Book title: TokTrack: A Complete Token Provenance and Change Tracking Dataset for the English Wikipedia
Pages: 408--417
Publisher: AAAI Press
Organization: International Conference on Web and Social Media (ICWSM)
Refereed publication
Abstract
We present a dataset that contains every instance of all tokens (≈ words) ever written in undeleted, non-redirect English Wikipedia articles until October 2016, in total 13,545,349,787 instances. Each token is annotated with (i) the article revision it was originally created in, and (ii) lists with all the revisions in which the token was ever deleted and (potentially) re-added and re-deleted from its article, enabling a complete and straightforward tracking of its history. This data would be exceedingly hard to create by an average potential user as it is (i) very expensive to compute and as (ii) accurately tracking the history of each token in revisioned documents is a non-trivial task. Adapting a state-of-the-art algorithm, we have produced a dataset that allows for a range of analyses and metrics, already popular in research and going beyond, to be generated on complete-Wikipedia scale; ensuring quality and allowing researchers to forego expensive text-comparison computation, which so far has hindered scalable usage. We show how this data enables, on token-level, computation of provenance, measuring survival of content over time, very detailed conflict metrics, and fine-grained interactions of editors like partial reverts and re-additions and other metrics, in the process gaining several novel insights.
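The annotation scheme described in the abstract (an origin revision plus lists of deletion and re-addition revisions per token) maps naturally onto a simple record structure. The sketch below is a hypothetical illustration in Python of what such a per-token record could look like; the field names (origin_rev_id, out_rev_ids, in_rev_ids) and the helper method are assumptions for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TokenRecord:
    """Hypothetical per-token record mirroring the annotations described
    in the abstract; field names are illustrative assumptions."""
    token: str                 # the token string (roughly one word)
    origin_rev_id: int         # revision in which the token was first written
    out_rev_ids: List[int] = field(default_factory=list)  # revisions where it was deleted
    in_rev_ids: List[int] = field(default_factory=list)   # revisions where it was re-added

    def was_contested(self) -> bool:
        """A token that was deleted and later re-added at least once is a
        simple indicator of back-and-forth editing on its article."""
        return bool(self.out_rev_ids) and bool(self.in_rev_ids)

# Example: a token created in revision 100, removed in 105, restored in 110.
rec = TokenRecord(token="controversial", origin_rev_id=100,
                  out_rev_ids=[105], in_rev_ids=[110])
print(rec.was_contested())  # True
```

Such a record is enough to derive the kinds of token-level measures the abstract mentions, e.g. provenance (who introduced a token and when) and survival (how long a token persists before its first deletion).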