• Media type: E-Article
  • Title: Fine-grained semantic type discovery for heterogeneous sources using clustering
  • Contributor: Piai, Federico; Atzeni, Paolo; Merialdo, Paolo; Srivastava, Divesh
  • Imprint: Springer Science and Business Media LLC, 2023
  • Published in: The VLDB Journal
  • Language: English
  • DOI: 10.1007/s00778-022-00743-3
  • ISSN: 1066-8888; 0949-877X
  • Keywords: Hardware and Architecture; Information Systems
  • Description: Abstract: We focus on the key task of semantic type discovery over a set of heterogeneous sources, an important data preparation task. We consider the challenging setting of multiple Web data sources in a vertical domain, which present sparsity of data and a high degree of heterogeneity, even internally within each individual source. We assume each source provides a collection of entity specifications, i.e. entity descriptions, each expressed as a set of attribute name-value pairs. Semantic type discovery aims at clustering individual attribute name-value pairs that represent the same semantic concept. We take advantage of the opportunities arising from the redundancy of information across such sources and propose the iterative RaF-STD solution, which consists of three key steps: (i) a Bayesian model analysis of overlapping information across sources to match the most locally homogeneous attributes; (ii) a tagging approach, inspired by NLP techniques, to create (virtual) homogeneous attributes from portions of heterogeneous attribute values; and (iii) a novel use of classical techniques based on matching of attribute names and domains. Empirical evaluation on the DI2KG and WDC benchmarks demonstrates the superiority of RaF-STD over alternative approaches adapted from the literature.