The Knowledge Graph Conference

Ask

Best Practices for Handling Entity Resolution in Data Management

Avy Faingezicht
·Jul 07, 2021 07:11 PM

Separately, I have a whole slew of duplicates in my data which will require entity resolution. We already have a way to resolve that “xyz inc” and “xyz” are the same company, which I assume in KG-land translates to a synonym relation between two nodes. I assume having N nodes for a single entity is not a good practice, so any pointers about how to handle these merge operations would be welcome

👍1

19 comments

    • Dimitris K.
      ·

      looks like you hit the main pain points of data integration here. there is no golden rule for any of these issues, and unfortunately not a lot of material online. for the ontology, it depends on your modelling use case; schema.org / dbpedia.org usually provide some base modelling for most types, but you need to see if they are too detailed, too generic, too loose, or too strict for your needs. you can find a list of online ontologies here. You could also construct your own ontology and optionally link your properties to other ontologies to ease automated data ingestion

      💯1
    • Dimitris K.
      ·

      wrt linking, this again depends on your use case and how you want to store or query your data. you may decide to do a "hard" deduplication, where you take all the duplicates and replace them with a single entity by fusing/merging all fields, or you may decide to keep the data in its original form and do the fusion/merging on demand or in real time
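The two strategies described here can be sketched in plain Python. The records, field names, and the "longest non-empty value wins" precedence rule are all invented for illustration, not a recommendation:

```python
def hard_merge(records):
    """'Hard' deduplication: fuse all duplicate records into one entity."""
    merged = {}
    for rec in records:
        for field, value in rec.items():
            # Keep the longest non-empty value seen so far (arbitrary rule).
            if value and len(str(value)) > len(str(merged.get(field, ""))):
                merged[field] = value
    return merged

def soft_merge(records):
    """'Soft' deduplication: keep originals, fuse on demand at query time."""
    return {"canonical": hard_merge(records), "sources": records}

dupes = [
    {"name": "xyz inc", "country": "US"},
    {"name": "xyz", "website": "https://xyz.example"},
]
print(hard_merge(dupes))
```

The trade-off: `hard_merge` is cheap to query but loses the original records; `soft_merge` keeps them, at the cost of merging every time you read.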

    • Dimitris K.
      ·

      a good provenance scheme is something you might also want to pay attention to; it can help you identify the source of data quality issues that surface in your data
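One lightweight way to keep provenance is to store each statement as a quad whose fourth element names the originating source. A minimal sketch; the `ex:` and `src:` identifiers are made up:

```python
# Each statement is a quad; the fourth element names the source it came from.
quads = [
    ("ex:acme", "ex:name", "xyz inc", "src:crm_export"),
    ("ex:acme", "ex:name", "xyz", "src:web_scrape"),
]

def sources_of(subject, predicate):
    """Return every source that asserted a value for subject/predicate."""
    return {g for s, p, o, g in quads if s == subject and p == predicate}
```

When a bad value surfaces, `sources_of` tells you which upstream source to debug.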

    • François S.
      ·

      I agree with Dimitris, this is the classical data integration use case that knowledge graphs are there to help solve. So at least you are talking to the right crowd here 🙂

      • regarding the integration of entities, the best approach to me is a combination of what Dimitris refers to: create a merged, canonical entity, while keeping it linked to the original entities. The original entities may or may not be part of the KG, depending on constraints related to the KG size. If they are not, you would store mappings in an external table. You will also need additional indexes that capture the provenance of the attribute values for your canonical entities.

      • regarding the ontology, in addition to what Dimitris recommends I would propose the use of the W3C org ontology. I have found it helpful in a variety of scenarios where the modeling of companies/organizations is core.
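The layout in the first bullet (a canonical entity, an external mapping table back to the originals, and a per-attribute provenance index) could look roughly like this; every identifier here is hypothetical:

```python
# A canonical entity stored in the KG.
canonical = {"id": "kg:company/42", "name": "XYZ Inc."}

# External mapping table: original entity ID -> canonical entity ID.
id_mapping = {
    "src1:xyz-inc": "kg:company/42",
    "src2:xyz": "kg:company/42",
}

# Provenance index: (canonical ID, attribute) -> source that supplied the value.
provenance = {
    ("kg:company/42", "name"): "src1:xyz-inc",
}

def resolve(original_id):
    """Map an original entity ID to its canonical entity, if known."""
    return id_mapping.get(original_id)
```

Keeping `id_mapping` outside the graph is what lets the KG hold only canonical entities when size is a constraint.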

    • François S.
      ·

      BTW here is a small rant: schema.org is great but its goal is to represent structured markup on web pages. It is thus biased towards properties and types that search engines are interested in annotating. I miss a library of core schemas that would allow to represent concepts central to businesses like the ones you mentioned here. Something modular and extensible.

      💯2
    • François S.
      ·

      Finally, Avy, you should check out this report https://drive.google.com/file/d/1RkLmfytkI0KHd7183bwkK265loC_5ybG/view?usp=sharing which goes into some detail around what we are discussing here (and gives a more complete idea of my view on the topic)

      👍1
    • Aaron B.
      ·

      "I miss a library of core schemas that would allow to represent concepts central to businesses..." Sounds kind of like gist, François S.?

      gist is Semantic Arts' minimalist upper ontology for the enterprise. It is designed to have the maximum coverage of typical business ontology concepts with the fewest number of primitives and the least amount of ambiguity.

      https://www.semanticarts.com/gist/

    • François S.
      ·

      Yes something like Gist indeed.

    • Avy Faingezicht
      ·

      Thanks, everyone! Dimitris K., Archivo seems super useful and simple to navigate. Since my space is so constrained, I was thinking of going with Wikidata’s P/Q identifiers, assuming I’d be able to map back to anything else from there. Is that a bad idea for a prototype?

    • Avy Faingezicht
      ·

      In terms of fusing/merging, I was indeed thinking along those lines. Kind of a three-step process:

      1. building a raw table with all known facts/statements across input sources

      2. running a deduplication/entity resolution process to produce new source1:X owl:sameAs source2:Y or equivalent triples

      3. for each connected component of the results of a ?s owl:sameAs ?o query, creating a property table/view, following rules to pick between conflicting statements across sources
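Step 3 here can be sketched with a small union-find that groups entities by connected component over the owl:sameAs pairs; the source IRIs are invented examples:

```python
def connected_components(pairs):
    """Group nodes into components, treating each pair as an undirected edge."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two components

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

same_as = [
    ("source1:X", "source2:Y"),
    ("source2:Y", "source3:Z"),
    ("source1:A", "source2:B"),
]
print(connected_components(same_as))
```

Each resulting component is one candidate canonical entity; the conflict-resolution rules then pick among its members' statements.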

    • Dimitris K.
      ·

      For prototyping, I would say that picking any ontology that is close to your model would work fine. For taking it to the next level, you should decide if you will stick with an external ontology or build your own, the former makes things easy at first but can get complicated if you need to extend your model in the future.

    • Dimitris K.
      ·

      the general steps for a KG pipeline are data extraction, data enrichment, linking, and fusion (with data quality checks in various parts of the pipeline). How these steps are implemented (technically) depends heavily on how you acquire and maintain your sources and whether you need manual curation. e.g. creating a KG from a few static input sources can take a different approach from one whose input sources change frequently. But, overall, what you describe sounds correct
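The pipeline stages listed here can be sketched as a skeleton; every function body below is a placeholder (exact normalized-name matching stands in for a real linking step, and the record shapes are invented):

```python
def extract(sources):
    """Pull raw records out of each input source."""
    return [rec for src in sources for rec in src]

def enrich(records):
    """Attach derived attributes (here: a normalized name)."""
    return [{**r, "name_norm": r.get("name", "").lower()} for r in records]

def link(records):
    """Emit candidate same-entity groups via exact normalized-name match."""
    by_name = {}
    for r in records:
        by_name.setdefault(r["name_norm"], []).append(r)
    return [group for group in by_name.values() if len(group) > 1]

def fuse(groups):
    """Merge each linked group into one canonical record (last value wins)."""
    return [{k: v for rec in group for k, v in rec.items()} for group in groups]
```

A real linking step would use fuzzy matching or blocking rather than exact equality, but the stage boundaries stay the same.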

      💯1
    • Rashif Rahman
      ·

      Deduplication, record linkage, and entity resolution (terms that need deduplicating themselves) is a challenge I have been tackling for one of my teams. Our findings have led us to prefer an LPG database over an RDF one. owl:sameAs is not really helpful when the central problem is the discovery of duplicates across arbitrary data sources (though in the same business model they are not expected to have similar data). With owl:sameAs, data storage and memory requirements ballooned, as all data is replicated for the same instances. Not to mention the RDF overhead in general for raw data in the 10s and 100s of GBs range. While this is OK for a community project, it does not (usually) make a lot of business sense.

      Some LPG DBs may offer handy NLP and similarity functions that help to an extent. When the entities are their own subgraphs, structural similarity does not yield good results. Until recently, graph embeddings helped with "homogeneous" data (friends, recipes, etc.) but not a domain with multiple types of entities and relationships. Taking into consideration all relationships as well as all properties (literal or otherwise) is also a key problem. But that is where we (I) left it, and I have yet to get up to speed on recent advances in entity resolution with the help of G(C)NNs. Interested to hear what you or anyone else has learned so far!

    • François S.
      ·

      I think the type of graph database does not matter here. While owl:sameAs is widely used in semantic web/linked-data scenarios, I agree that it is not necessarily the right tool for an enterprise KG.

    • François S.
      ·

      Indeed, the problem of identifying multiple equivalent entities is core to knowledge graph construction, as KGs by definition integrate multiple sources. There are two tasks here:

      • the first is to identify equivalent entities across the integrated data sources

      • the second is to represent and manage these equivalent entities

    💯1