- Author
- M.Sc. Daniel Obraczka
- Title
- Entity Resolution on Heterogeneous Knowledge Graphs
- Citable URL
- https://nbn-resolving.org/urn:nbn:de:bsz:15-qucosa2-972878
- Date of submission
- 16.12.2024
- Date of defense
- 28.04.2025
- Abstract (EN)
- With the ever-growing size of the internet, the need for structured information grows alongside the web. Knowledge Graphs (KGs) aim to address this information need by enabling users to dynamically represent complex relationships in a semantically enriched data structure. Typically, such graphs encompass information from various domains or data sources, which often necessitates merging two KGs. This involves identifying the same real-world entities and concepts across the separate sources, a process known as Entity Resolution (ER). While the flexibility and dedicated graph structure are great for modeling complicated information, it is challenging for data integration tools to handle the heterogeneous schemata and utilize the rich relational information present in KGs. A recent trend that addresses this issue is to utilize Knowledge Graph Embeddings (KGEs) to transform the information of the KG into dense vectors, translating the semantic and relational similarity between entities into distances between the entity embeddings in the vector space. Even though this research area is relatively new, improvements have come at a breakneck pace. Still, KGE-based Entity Resolution suffers from a variety of problems. The early methods, in particular, mainly incorporated the graph structure into the matching process, treating entity attribute values as second-class citizens. In our EAGER framework, we therefore investigated the use of KGEs in conjunction with attribute similarities as input for a variety of machine learning models. We introduced a new dataset family from the movie and TV show domain with a shallow graph structure. Together with datasets sampled from open KGs such as DBpedia and Wikidata, this allowed us to evaluate our approach across a spectrum of graph densities.
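The idea of combining embedding distances with attribute similarities as classifier input can be sketched as follows. This is a minimal illustration, not the EAGER implementation: the feature layout, the toy embeddings, and the use of stdlib `difflib` for string similarity are assumptions for the example.

```python
import math
from difflib import SequenceMatcher

def cosine(u, v):
    # plain cosine similarity between two dense vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pair_features(emb1, emb2, attrs1, attrs2):
    # one embedding-based feature plus one string similarity per shared attribute;
    # such vectors could then feed any standard classifier
    feats = [cosine(emb1, emb2)]
    for key in sorted(set(attrs1) & set(attrs2)):
        feats.append(SequenceMatcher(None, attrs1[key], attrs2[key]).ratio())
    return feats

# hypothetical entity pair from two movie KGs
e1, e2 = [0.9, 0.1, 0.3], [0.8, 0.2, 0.4]
a1 = {"label": "The Matrix", "year": "1999"}
a2 = {"label": "Matrix, The", "year": "1999"}
print(pair_features(e1, e2, a1, a2))
```

The combined feature vector lets a learned model weigh structural (embedding) evidence against attribute evidence per dataset, which is the intuition behind combining the two signals.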
While attribute similarities on their own perform better on shallow graphs and KGEs on their own perform better on densely connected graphs, our results show that combining the two generally yields significantly better results than either alone. Overall, our combination approach significantly outperformed state-of-the-art matching frameworks designed for tabular data. KGE-based matching focuses largely on learning the embeddings; in the final step, the nearest neighbors, usually determined via metrics like cosine similarity or Euclidean distance in the vector space, serve as matching candidates. A problem of high-dimensional data like embeddings is hubness: some data points become nearest neighbors of many other data points, while others are nearest neighbors of none. This is detrimental to match quality since, typically, one entity from one data source is matched with one or, at most, a few entities in the other data source. To address this problem, we investigated hubness-reduced nearest-neighbor search in our kiez framework. We mitigate the speed penalties incurred by hubness reduction by relying on Approximate Nearest Neighbor (ANN) search. Our results show that ANN lets us reap the quality benefits of hubness reduction while sacrificing practically nothing in terms of speed. One major bottleneck of ER lies in its a priori quadratic complexity: all entities of one data source must be compared with all entities of the other. A common solution is a blocking step, in which likely matching entities are placed into the same block and comparisons are performed only within blocks. Blocking traditionally comes from the database world, whose techniques can neither deal with the heterogeneous schemata of KGs nor exploit the rich relational structure that makes this data structure so powerful.
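Hubness is commonly quantified via the k-occurrence of each point, i.e., how often it appears in the k-nearest-neighbor lists of other points. The following is a brute-force sketch under assumed random Gaussian data, not the kiez API; in high dimensions the k-occurrence distribution typically becomes skewed, with a few hubs and many anti-hubs.

```python
import random

def k_occurrence(points, k):
    """Count how often each point appears in another point's k-NN list."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # brute-force squared Euclidean distances from point i to all others
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            counts[j] += 1
    return counts

random.seed(0)
dim, n, k = 50, 100, 5  # assumed toy sizes
pts = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
occ = k_occurrence(pts, k)
# hubs: points that are neighbors of many others; anti-hubs: neighbors of none
print("max k-occurrence:", max(occ), "anti-hubs:", occ.count(0))
```

Hubness-reduction methods rescale or rerank these neighbor lists so that the k-occurrence distribution flattens, which is what improves candidate quality in embedding-based matching.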
Furthermore, most blocking approaches rely on symbolic techniques, e.g., overlap on the token level. Embeddings from language models can capture the semantics of attribute information and enable the clustering of similar concepts that appear dissimilar when relying solely on string metrics. In our klinker framework, we therefore extend a relational blocking approach to enable the use of embedding-based techniques and, conversely, extend deep-learning-based tabular blockers to incorporate relational information. This allows us to investigate relational blocking across the neuro-symbolic spectrum. Our results show that our relational enhancements to state-of-the-art approaches significantly improve the results. The implemented (neuro-)symbolic methods can outperform even sophisticated deep-learning-based methods in terms of quality and speed, and our newly implemented hybrid methods, which combine symbolic and embedding-based techniques, point to promising directions for future research. We make all our results publicly available as open-source libraries and ensure the significance of our results through a rigorous Bayesian statistical testing regime.
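The symbolic token-overlap blocking mentioned above can be sketched as follows; this is a generic illustration with made-up entity names, not the klinker implementation. Entities sharing at least one token end up in a common block, so only a fraction of the quadratic pair space is compared.

```python
from collections import defaultdict
from itertools import product

def token_blocking(source_a, source_b):
    """Build candidate pairs from entities that share at least one name token."""
    blocks = defaultdict(lambda: ([], []))
    for eid, name in source_a.items():
        for tok in name.lower().split():
            blocks[tok][0].append(eid)
    for eid, name in source_b.items():
        for tok in name.lower().split():
            blocks[tok][1].append(eid)
    # candidate pairs come only from entities co-occurring in a block
    return {(a, b) for left, right in blocks.values() for a, b in product(left, right)}

# hypothetical entities from two movie KGs
kg1 = {"a1": "The Matrix", "a2": "Blade Runner"}
kg2 = {"b1": "Matrix Reloaded", "b2": "Blade Runner 2049"}
pairs = token_blocking(kg1, kg2)
print(pairs)  # 2 candidate pairs instead of the full 4 comparisons
```

Such purely symbolic blockers miss semantically similar but lexically different names; replacing the token index with nearest neighbors in an embedding space is the embedding-based counterpart discussed above.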
- Free keywords (EN)
- Entity Resolution, Knowledge Graph, Knowledge Graph Embedding, Data Integration
- Degree-granting / examining institution
- Universität Leipzig, Leipzig
- Version / review status
- Accepted version / postprint / author's version
- URN Qucosa
- urn:nbn:de:bsz:15-qucosa2-972878
- Qucosa publication date
- 27.05.2025
- Document type
- Dissertation
- Language of the document
- English
- License / rights statement
CC BY 4.0