Duplicate Table Detection with Xash

Media type: E-Article; Text
Title: Duplicate Table Detection with Xash
Contributor: Koch, Maximilian [Author]; Esmailoghli, Mahdi [Author]; Auer, Sören [Author]; Abedjan, Ziawasch [Author]; König-Ries, Birgitta [Author]; Scherzinger, Stefanie [Author]; Lehner, Wolfgang [Author]; Vossen, Gottfried [Author]
Imprint: Bonn : Ges. für Informatik, 2023
Published in: Datenbanksysteme für Business, Technologie und Web (BTW 2023) ; GI-Edition / Proceedings, Lecture Notes in Informatics ; P-331
Issue: Published version
Language: English
DOI: https://doi.org/10.15488/15086; https://doi.org/10.18420/btw2023-18
ISBN: 978-388579725-8
Keywords: data lakes ; duplicate table detection ; conference proceedings ; data discovery
Footnote: This data source also contains holdings records that do not lead to a full text.
Description: Data lakes are typically lightly curated and, as such, prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive, as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, to the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which serves like a Bloom filter and instantly identifies the existence of particular cell values. We show that, using Xash, it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.
Access State: Open Access
Rights information: Attribution - Share Alike (CC BY-SA)
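
Illustration: the description above mentions two ideas that a short sketch may make more concrete: a per-row super key that behaves like a Bloom filter over cell values, and a comparison that ignores row and column order. The Python below is only a minimal sketch under stated assumptions. It does not implement Xash; a generic SHA-256 digest stands in for it, and the key width (KEY_BITS), the number of bits set per value, and all function names are hypothetical choices made for this example. Cell values are compared by their string form.

```python
import hashlib
from collections import Counter
from itertools import permutations

KEY_BITS = 128  # hypothetical key width, not taken from the paper


def cell_bits(value, n_bits=KEY_BITS, k=2):
    """Map a cell value to a mask with up to k bits set (stand-in hash, not Xash)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    mask = 0
    for i in range(k):
        pos = int.from_bytes(digest[4 * i:4 * i + 4], "big") % n_bits
        mask |= 1 << pos
    return mask


def super_key(row):
    """OR the masks of all cells; the result does not depend on column order."""
    key = 0
    for value in row:
        key |= cell_bits(value)
    return key


def might_contain(key, value):
    """Bloom-filter style test: False means absent, True means possibly present."""
    bits = cell_bits(value)
    return (key & bits) == bits


def is_duplicate_candidate(a, b):
    """Cheap filter: compare the multisets of row super keys (row order ignored)."""
    return Counter(map(super_key, a)) == Counter(map(super_key, b))


def same_table(a, b):
    """Exact check by brute force over column permutations (small tables only)."""
    if len(a) != len(b):
        return False
    if not a:
        return True
    width = len(a[0])
    if any(len(row) != width for row in a + b):
        return False
    rows_a = Counter(tuple(str(v) for v in row) for row in a)
    for perm in permutations(range(width)):
        rows_b = Counter(tuple(str(row[i]) for i in perm) for row in b)
        if rows_a == rows_b:
            return True
    return False


def are_duplicates(a, b):
    """Run the cheap super-key filter first, then verify surviving candidates."""
    return is_duplicate_candidate(a, b) and same_table(a, b)


t1 = [("alice", 30, "berlin"), ("bob", 25, "bonn")]
t2 = [("bonn", "bob", 25), ("berlin", "alice", 30)]  # rows and columns reordered
t3 = [("alice", 30, "berlin"), ("bob", 26, "bonn")]  # one cell differs

print(are_duplicates(t1, t2))  # True
print(are_duplicates(t1, t3))  # False
```

In this sketch, true duplicates always pass the super-key filter, because OR-ing the cell masks is independent of column order and the multiset comparison is independent of row order. Tables that are not duplicates can still pass it as false positive candidates, which is why the exact check runs afterwards.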

