Yong Tang. Great set of new references for me to look at, thank you! I’m on a similar quest… for data integration and OLAP type queries, powered by Spark over data in a data lake.
https://github.com/SANSA-Stack/SANSA-Stack is powered by Spark RDDs. This seems active but is built on the older Spark 1.x RDD structure.
OnTop VKG promises to map SPARQL to SQL (and thus to query big data SQL engines like Spark SQL).
S2RDF and some other papers on optimizing SPARQL queries over Spark.
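To make the SPARQL-to-SQL idea concrete, here is a minimal sketch of how a basic graph pattern can be turned into SQL self-joins over a single triples table. It is only an illustration and not how OnTop or S2RDF actually work: the `triples(s, p, o)` table, the `bgp_to_sql` helper, and the `ex:` sample data are all made up for this example, and the stdlib `sqlite3` module stands in for a SQL engine such as Spark SQL.

```python
import sqlite3


def bgp_to_sql(patterns):
    """Translate (s, p, o) triple patterns into one SQL query plus parameters.

    Hypothetical helper: terms starting with '?' are variables, everything
    else is a constant. Assumes at least one variable appears in the patterns.
    """
    select = {}   # variable name -> first column reference that binds it
    where = []    # join conditions and constant filters
    params = []   # constants, in the same order as the '?' placeholders
    for i, pattern in enumerate(patterns):
        for col, term in zip(("s", "p", "o"), pattern):
            ref = f"t{i}.{col}"
            if term.startswith("?"):
                if term in select:
                    # A repeated variable becomes a join condition.
                    where.append(f"{ref} = {select[term]}")
                else:
                    select[term] = ref
            else:
                # A constant becomes a parameterized filter.
                where.append(f"{ref} = ?")
                params.append(term)
    cols = ", ".join(f"{ref} AS {var[1:]}" for var, ref in select.items())
    tables = ", ".join(f"triples t{i}" for i in range(len(patterns)))
    sql = f"SELECT {cols} FROM {tables}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


# Toy data set (made up for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:alice", "ex:worksFor", "ex:acme"),
        ("ex:bob", "ex:worksFor", "ex:initech"),
        ("ex:acme", "ex:locatedIn", "ex:berlin"),
        ("ex:initech", "ex:locatedIn", "ex:austin"),
    ],
)

# Roughly: SELECT ?person ?org WHERE { ?person ex:worksFor ?org .
#                                      ?org ex:locatedIn ex:berlin }
patterns = [
    ("?person", "ex:worksFor", "?org"),
    ("?org", "ex:locatedIn", "ex:berlin"),
]
sql, params = bgp_to_sql(patterns)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Each triple pattern becomes one alias of the table, shared variables become join conditions, and constants become filters. Real systems add a lot on top of this (mappings to existing relational schemas, partitioning, join ordering), which is where the optimization papers come in.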
Douglas Moore, give the Grakn database a look. Graql (Grakn’s query language) natively handles both deductive reasoning via backward chaining (OLTP) and distributed analytics (OLAP) at the database level, providing strong abstraction over low-level constructs and complex relationships. https://github.com/graknlabs/grakn
"for data integration and OLAP type queries, powered by Spark over data in a data lake." Amazing! A semantic data lake for connecting the RDB world with the semantic world!
A semantic data lake is very interesting. I am also building an internal BI project and hope to use a semantic data warehouse or data lake. A recent paper on this topic: https://upcommons.upc.edu/bitstream/handle/2117/188695/SETLBI.pdf
The overall semantic data integration process (SETL) from the same authors: https://arxiv.org/pdf/2006.07180.pdf
Thanks for mentioning SANSA, Douglas Moore. The project is promising and I will see how it works.
As of the recent develop version and 0.8-snapshot, Scala binary version 2.12 (corresponding to spark-core_2.12) has been added to pom.xml, and the project now builds on Spark 2.
OK, I should have been more precise: it’s built on RDDs, not DataFrames.