• Media type: E-Article
  • Title: Screening for linearly and nonlinearly related variables in predictive cheminformatic models
  • Contributor: Hemmateenejad, Bahram; Baumann, Knut
  • Published: Wiley, 2018
  • Published in: Journal of Chemometrics, 32 (2018) 4
  • Language: English
  • DOI: 10.1002/cem.3009
  • ISSN: 0886-9383; 1099-128X
  • Description: Abstract: Feature selection has long been a prominent topic in the statistical literature and has become increasingly frequent and important in various research fields. Feature screening methods that rely on marginal correlation can be problematic. Another obstacle to selecting an important variable is the shading effect that a highly influential variable exerts on variables of lower importance. Feature selection can be even more complex in the presence of nonlinear relations. To overcome these limitations, a method for selecting linearly and nonlinearly correlated variables is presented. It combines nonparametric variable ranking methods with nonparametric regression methods through an iterative regression based on residuals. Here, the maximal information coefficient and distance correlation are used to rank the variables. The algorithm starts by modeling the relationship between the response and the top-ranking variable using the multivariate adaptive regression splines (MARS) method. In subsequent iterations, the top-ranking variables are selected based on their relationship with the successive residuals. The method is validated on two nonlinear simulated data sets and further validated by analysis of two real cheminformatic data sets: the toxicity of 1571 industrial chemicals and the aqueous solubility of a diverse set of 1708 organic molecules.
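
The residual-based screening loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. Several parts are assumptions made for illustration: only distance correlation is used for ranking (no maximal information coefficient), a simple binned-mean smoother stands in for MARS, each stage is fitted to the current residuals on one variable at a time, and the function names (`distance_correlation`, `fit_binned_smoother`, `screen_variables`) and the simulated data are hypothetical.

```python
import math
import random


def distance_correlation(x, y):
    """Sample distance correlation between two equal-length 1-D sequences."""
    n = len(x)

    def centered(v):
        # Pairwise absolute distances, then double centering.
        d = [[abs(a - b) for b in v] for a in v]
        row = [sum(r) / n for r in d]
        grand = sum(row) / n
        return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
                for i in range(n)]

    def mean_prod(p, q):
        return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)

    a, b = centered(x), centered(y)
    denom = math.sqrt(mean_prod(a, a) * mean_prod(b, b))
    return math.sqrt(max(mean_prod(a, b), 0.0) / denom) if denom > 0 else 0.0


def fit_binned_smoother(x, r, n_bins=8):
    """Stand-in for MARS (assumption): predict r from x by equal-count bin means."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    size = math.ceil(len(x) / n_bins)
    edges, means = [], []
    for start in range(0, len(order), size):
        idx = order[start:start + size]
        edges.append(x[idx[-1]])  # upper edge of this bin
        means.append(sum(r[i] for i in idx) / len(idx))

    def predict(value):
        for edge, mean in zip(edges, means):
            if value <= edge:
                return mean
        return means[-1]

    return predict


def screen_variables(columns, y, n_select=2):
    """Greedy screening: rank against residuals, fit, update residuals, repeat."""
    mean_y = sum(y) / len(y)
    residual = [v - mean_y for v in y]
    remaining = dict(columns)
    selected = []
    for _ in range(n_select):
        scores = {name: distance_correlation(col, residual)
                  for name, col in remaining.items()}
        best = max(scores, key=scores.get)
        selected.append(best)
        col = remaining.pop(best)
        predict = fit_binned_smoother(col, residual)
        residual = [r - predict(v) for r, v in zip(residual, col)]
    return selected


# Hypothetical simulated data: x1 and x2 act nonlinearly, x3 and x4 are noise.
random.seed(0)
n = 200
data = {name: [random.uniform(-2.0, 2.0) for _ in range(n)]
        for name in ("x1", "x2", "x3", "x4")}
y = [3.0 * math.sin(a) + b ** 2 + random.gauss(0.0, 0.1)
     for a, b in zip(data["x1"], data["x2"])]

selected = screen_variables(data, y)
print(selected)
```

Ranking against residuals rather than against the raw response is what addresses the shading effect mentioned in the abstract: once the dominant variable's contribution is removed, a weaker variable has a better chance of ranking first. A real application would swap in an actual MARS implementation and could add the maximal information coefficient as a second ranking criterion.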