hi all - what do people think of this new sharing protocol from databricks? are there any W3C or similar protocols for secure sharing of RDF? https://delta.io/sharing/
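For context on what the linked protocol actually covers: Delta Sharing is an open REST protocol in which the data provider issues a bearer token and the recipient enumerates shares, schemas, and tables over HTTPS. A minimal sketch of building (not sending) the first call in that exchange — the endpoint URL and token below are hypothetical placeholders, not a real server:

```python
# Hedged sketch of a minimal Delta Sharing client call.
# The endpoint and token are hypothetical; a real recipient gets both
# from a profile file issued by the data provider.
import urllib.request

ENDPOINT = "https://sharing.example.com/delta-sharing"  # hypothetical server
TOKEN = "<bearer-token>"                                # issued by the provider

def build_list_shares_request(endpoint: str, token: str) -> urllib.request.Request:
    """Build (but don't send) the REST call that lists shares a recipient can read."""
    return urllib.request.Request(
        url=f"{endpoint}/shares",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )

req = build_list_shares_request(ENDPOINT, TOKEN)
print(req.full_url)      # https://sharing.example.com/delta-sharing/shares
print(req.get_method())  # GET
```

Note that the security model here is just bearer tokens plus HTTPS — the protocol governs access to tables, not cryptographic verification of the data itself, which is part of why the RDF-signatures question is a separate one.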
Full disclosure: I have some history... I gave a guest lecture at Berkeley in 2009 for what became the founding team at Databricks, back when Matei Zaharia was a young grad student, and later got to work closely with him as their Director of Community Evangelism for Apache Spark. And now... Matei's a prof at my old dept at Stanford, my firm is doing work for Ion Stoica (Databricks' co-founder and former CEO) at Anyscale, and meanwhile at least three of the Databricks founding team have become billionaires 🙂 So it's a tangled web! Databricks is super-hyper invested in Delta Lake. Also, I'm scheduled to lead a Delta Lake panel on behalf of one of their competitors at Datanova on July 14 – we've got a review meeting this afternoon. That said, I really don't quite get their argument, and I've been privy to many of their arguments at early stages, some of which never went public. I'll need to formulate a more substantial opinion about Delta Lake within a few hours 😂
At the risk of rambling... this is a huge looming issue for KG work in enterprise. I'm currently working on a project with a large manufacturing firm, and really seeing substantive issues from the client's side. Based on that, I'd feel rather hesitant about whatever Databricks, SAP, etc., are trying to claim about "live data" – and to back that up, I'd reference work by Matthias Broecheler (who spoke at KGC), Jesse Anderson (expert on data engineering for streaming data in enterprise use cases), and gosh, even the S-1 from Confluent (or what it doesn't say). When one digs deep enough into the distributed systems issues (e.g., interconnect bottlenecks, memory object scheduling, etc.) of trying to work within 1B+ node graphs and handle versioning, semantic overlays, the complexity of required graph algorithms, or even the relative gateways (or lack thereof) among the various query languages... wow, that's provably complex, well beyond what even Intel, NVIDIA, AMD, etc., would claim possible with current architectures. I know for a fact that some vendors are absolutely fudging their numbers. There's just no way I can believe the Delta Lake approach would deliver in a large-scale graph context (with no aspersions toward my friends TD, et al., at Databricks), although I have hopes that some approaches borrowed from HPC work can apply.
Thanks for the interesting insights, Paco N. I must say I had no idea about the business-side intricacies there. However, there is a proposed W3C Working Group Charter on Linked Data Signatures, which would make secure sharing possible. It's not much at this stage, as the working group has just started forming, but it might be worth keeping an eye on: https://w3c.github.io/lds-wg-charter/index.html
Well said, Ellie and Bojan. Fluree is doing very well, and I'm impressed! The scale issues for certain kinds of popular use cases are going to be a constant. FWIW, this touches on "why" NVIDIA bought ARM, and the outcomes that showed up to some extent at GTC this year 🙂 I'd also recommend Mark Pesce's excellent "Geopolichips" mini-series within The Next Billion Seconds for more backstory https://nextbillionseconds.com/2021/05/17/geopolichips-1-why-is-there-a-global-shortage-of-computer-chips/
Thanks, Ellie and Paco! Yes, this sounds very similar to our access model, which uses SmartFunctions to grant/revoke access to the data for both reads and writes. And our ledger does essentially what it sounds like this protocol does: the deltas of transactions are pushed out to the query peers, which can scale horizontally as much as needed. This looks like really interesting technology though. I'm interested to see where it goes.
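The "push transaction deltas to query peers" model described above can be sketched in a few lines — the names (`Ledger`, `QueryPeer`) and the delta shape are made up for illustration, not Fluree's actual API. The key property is that every peer applies the same ordered deltas, so any number of horizontally added peers converge on the same queryable state:

```python
# Sketch of a delta-push ledger: an append-only transaction log whose
# deltas are pushed to subscribed query peers. Names and delta format
# are hypothetical, for illustration only.
class QueryPeer:
    def __init__(self):
        self.state = {}

    def apply(self, delta):
        op, key, value = delta
        if op == "assert":
            self.state[key] = value
        elif op == "retract":
            self.state.pop(key, None)

class Ledger:
    def __init__(self):
        self.log = []        # append-only transaction log
        self.peers = []

    def subscribe(self, peer):
        self.peers.append(peer)
        for delta in self.log:       # replay history for late joiners
            peer.apply(delta)

    def transact(self, delta):
        self.log.append(delta)
        for peer in self.peers:      # push the delta; peers never poll
            peer.apply(delta)

ledger = Ledger()
p1, p2 = QueryPeer(), QueryPeer()
ledger.subscribe(p1)
ledger.transact(("assert", "alice/role", "admin"))
ledger.subscribe(p2)                  # joins late, catches up via replay
ledger.transact(("retract", "alice/role", None))
print(p1.state == p2.state == {})     # True: peers converge
```

Because the log is the source of truth, read capacity scales by adding peers while writes stay serialized through the ledger — which is exactly the horizontal-scaling trade-off described above.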
Thank you very much Paco N. for such a comprehensive and lived response! And thank you Bojan Božić for that W3C link – that's exactly what I was wondering about. Unfortunately, I'm not clear what the headline is here. What I'm understanding is:
there are no well established protocols for secure sharing of linked data (and thus Databricks’ claim is not trivially false)
a W3C WG exists that deals with it, but it'll likely be years before it's usable in industry
Databricks' protocol is interesting as a foray, but is likely to have large shortcomings – not in the security, but in all of the arcane practicalities of scaling such a solution
Fluree offers something similar
Both are proprietary and so fundamentally flawed as a broad solution to sharing data and large-scale interoperability
Is that fair?
uhm, the SPARQL protocol?
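Worth spelling out, since it answers the original question at least partially: the SPARQL 1.1 Protocol is a W3C Recommendation for querying RDF over HTTP — the query travels in a `query` parameter (GET) or form body (POST), and content negotiation selects the result format. A minimal sketch of building (not sending) such a request; the DBpedia endpoint is used only as a familiar public example:

```python
# Build (but don't send) a SPARQL 1.1 Protocol query via HTTP GET.
# DBpedia's public endpoint is used as a familiar example target.
from urllib.parse import urlencode
import urllib.request

def build_sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """SPARQL Protocol: query in the `query` param, result format via Accept."""
    return urllib.request.Request(
        url=f"{endpoint}?{urlencode({'query': query})}",
        headers={"Accept": "application/sparql-results+json"},
        method="GET",
    )

req = build_sparql_request(
    "https://dbpedia.org/sparql",
    "SELECT ?s WHERE { ?s a ?o } LIMIT 5",
)
print(req.full_url)
```

The caveat that keeps this from settling the thread: the protocol delegates authentication and authorization entirely to standard HTTP mechanisms, so "secure sharing" in the sense of signed, verifiable linked data is exactly the gap the Linked Data Signatures charter is trying to fill.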