BTC-2019: The 2019 Billion Triple Challenge Dataset

José-Miguel Herrera, Aidan Hogan, Tobias Käfer

Published: October 2019

Book title: Proceedings of the 18th International Semantic Web Conference
Pages: 163-180
Publisher: Springer

Refereed publication

Abstract
Six datasets have been published under the title of Billion Triple Challenge (BTC) since 2008. Each such dataset contains billions of triples extracted from millions of documents crawled from hundreds of domains. While these datasets were originally motivated by the annual ISWC competition from which they take their name, they have become widely used in other contexts, forming a key resource for a variety of research works concerned with managing and/or analysing diverse, real-world RDF data as found natively on the Web. Given that the last BTC dataset was published in 2014, we prepare and publish a new version, BTC-2019, containing 2.2 billion quads parsed from 2.6 million documents on 394 pay-level domains. This paper first motivates the BTC datasets with a survey of research works using these datasets. Next, we provide details of how the BTC-2019 crawl was configured. We then present and discuss a variety of statistics that aim to give insights into the content of BTC-2019. Finally, we discuss the hosting of the dataset and the ways in which it can be accessed, remixed and used.

Linked datasets

BTC

Research group

Web Science
