Book title: Proceedings of the International Workshop on Semantic Big Data (SBD@SIGMOD'19)
To benefit from mature database technology, RDF stores are built on top of relational databases and SPARQL queries are mapped into SQL. A shared-nothing computer cluster offers a way to achieve scalability by processing queries over large RDF datasets in a distributed fashion. To this end, this paper examines the impact of relational schema design when queries are mapped into Apache Spark SQL. The designs considered are a single triple table, a set of tables resulting from partitioning by predicate, a single wide table covering all properties, and a set of tables based on the application model specification, called a domain-dependent schema. For each of these approaches, the rows of the corresponding tables are stored in the distributed file system HDFS using the columnar storage format Parquet. Experiments using standard benchmarks demonstrate that the single wide property table approach, despite its simplicity, is superior to the other approaches. Further experiments demonstrate that this single-table approach remains attractive even when repartitioning by key (RDF subject) is applied before executing queries.
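The schema designs named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses Python's built-in SQLite in place of Apache Spark SQL over Parquet, the toy triples and the table names (`triples`, `vp_<predicate>`, `wide`) are hypothetical, every property is assumed to be single-valued, and the domain-dependent schema is omitted because it depends on an application model. The sketch builds three of the layouts from the same data and runs one star-shaped query ("name and age of every subject that has both") against each.

```python
import sqlite3

# Hypothetical toy data: (subject, predicate, object). Illustrative only.
TRIPLES = [
    ("alice", "name", "Alice"),
    ("alice", "age", "30"),
    ("bob", "name", "Bob"),
    ("bob", "age", "25"),
    ("carol", "name", "Carol"),  # carol has no age
]


def build_layouts(conn):
    """Create the three layouts from the same triples."""
    cur = conn.cursor()

    # 1. Single triple table: one row per triple.
    cur.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
    cur.executemany("INSERT INTO triples VALUES (?, ?, ?)", TRIPLES)

    # 2. Partitioning by predicate: one (s, o) table per predicate.
    predicates = sorted({p for _, p, _ in TRIPLES})
    for p in predicates:
        cur.execute(f'CREATE TABLE "vp_{p}" (s TEXT, o TEXT)')
        cur.execute(
            f'INSERT INTO "vp_{p}" SELECT s, o FROM triples WHERE p = ?', (p,)
        )

    # 3. Single wide property table: one row per subject, one column per
    #    predicate, NULL where a subject lacks the property.
    #    Assumption: each property has at most one value per subject.
    columns = ", ".join(f'"{p}" TEXT' for p in predicates)
    cur.execute(f"CREATE TABLE wide (s TEXT PRIMARY KEY, {columns})")
    for s in sorted({s for s, _, _ in TRIPLES}):
        cur.execute("INSERT INTO wide (s) VALUES (?)", (s,))
    for s, p, o in TRIPLES:
        cur.execute(f'UPDATE wide SET "{p}" = ? WHERE s = ?', (o, s))

    conn.commit()


# The same star query expressed against each layout.
STAR_QUERY = {
    # Triple table: one self-join per additional triple pattern.
    "triple_table": (
        "SELECT t1.o, t2.o FROM triples t1 JOIN triples t2 ON t1.s = t2.s "
        "WHERE t1.p = 'name' AND t2.p = 'age' ORDER BY t1.o"
    ),
    # Predicate partitioning: join the per-predicate tables on subject.
    "by_predicate": (
        'SELECT n.o, a.o FROM "vp_name" n JOIN "vp_age" a ON n.s = a.s '
        "ORDER BY n.o"
    ),
    # Wide table: no join, only a filter on non-NULL columns.
    "wide_table": (
        'SELECT "name", "age" FROM wide '
        'WHERE "name" IS NOT NULL AND "age" IS NOT NULL ORDER BY "name"'
    ),
}


def run_star_query():
    """Return the star-query result for every layout."""
    conn = sqlite3.connect(":memory:")
    build_layouts(conn)
    results = {k: conn.execute(q).fetchall() for k, q in STAR_QUERY.items()}
    conn.close()
    return results


if __name__ == "__main__":
    for layout, rows in run_star_query().items():
        print(layout, rows)
```

All three layouts return the same rows. What differs is the query shape: the triple table and the per-predicate tables need a join on the subject for each extra triple pattern, whereas the wide table answers a star query with a single scan and filter. This sketch only shows that structural difference and says nothing about performance on a cluster.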
DOI Link: 10.1145/3323878.3325804