• Media type: E-article; Other publication
  • Title: COCOA: COrrelation coefficient-aware data augmentation
  • Contributors: Esmailoghli, Mahdi [Author]; Quiané-Ruiz, Jorge-Arnulfo [Author]; Abedjan, Ziawasch [Author]; Velegrakis, Yannis [Author]; Zeinalipour, Demetris [Author]; Chrysanthis, Panos K. [Author]; Guerra, Francesco [Author]
  • Published: Konstanz, Germany : OpenProceedings.org, University of Konstanz, University Library, 2021
  • Published in: Advances in Database Technology - EDBT 2021
  • Edition: Published version
  • Language: English
  • DOI: https://doi.org/10.15488/16496; https://doi.org/10.5441/002/edbt.2021.30
  • ISBN: 978-3-89318-084-4
  • Keywords: Data augmentation ; Conference proceedings ; Correlation coefficient ; Data enrichments ; Data Science ; Index structure
  • Origin:
  • Notes: This data source also contains holdings records that do not lead to a full text.
  • Description: Correlation coefficients are among the most widely used measures in data science. Although linear correlations are fast and easy to calculate, they lack robustness and effectiveness in the presence of non-linear associations. Rank-based coefficients such as Spearman's are more suitable. However, rank-based measures first require sorting the values to obtain the ranks, which makes their calculation super-linear. One use case affected by this is data enrichment for Machine Learning (ML) through feature extraction from large databases. Finding the most promising features among millions of candidates to increase ML accuracy requires billions of correlation calculations. In this paper, we introduce an index structure that ensures rank-based correlation calculation in linear time. Our solution accelerates the correlation calculation by up to 500 times in the data enrichment setting.
  • Access status: Open access
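
The description's point about rank-based coefficients can be illustrated with a minimal sketch, assuming plain Python lists and average ranks for ties. The function names `average_ranks`, `pearson`, and `spearman` are hypothetical and chosen for illustration only. This shows the baseline, sort-based computation that the description calls super-linear; it is not the index structure introduced in the paper.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    # Sorting the indices by value is the O(n log n) step.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values equal to values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Linear (Pearson) correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman's coefficient: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For a monotonic but non-linear relationship such as `y = x ** 3`, `spearman` returns a value of essentially 1 while `pearson` stays below 1, which is the robustness gap the description refers to.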