• Media type: Dissertation; electronic university publication; e-book
  • Titel: Statistical learning techniques for text categorization with sparse labeled data
  • Contributors: Ifrim, Georgiana [author]
  • Published: Scientific publications of Saarland University (UdS), 2009
  • Language: English
  • DOI: https://doi.org/10.22028/D291-25940
  • Keywords: classification ; classification techniques ; sparse labeled training data ; learning ; search engine ; classifiers ; very few explicitly labeled training examples ; Inductive/Transductive Latent Model ; Structured Logistic Regression ; data
  • Notes: This data source also contains holdings records that do not lead to a full text.
  • Description: Many applications involve learning a supervised classifier from very few explicitly labeled training examples, since the cost of manually labeling training data is often prohibitively high. For instance, we expect a good classifier to learn our interests from a few example books or movies we like and to recommend similar ones in the future, or we expect a search engine to give more personalized search results based on whatever little it has learned from our past queries and clicked documents. There is thus a need for classification techniques capable of learning from sparse labeled data by exploiting additional information about the classification task at hand (e.g., background knowledge) or by employing more sophisticated features (e.g., n-gram sequences, trees, graphs). In this thesis, we focus on two approaches for overcoming the bottleneck of sparse labeled data. We first propose the Inductive/Transductive Latent Model (ILM/TLM), a new generative model for text documents. ILM/TLM has various building blocks designed to facilitate the integration of background knowledge (e.g., unlabeled documents, ontologies of concepts, encyclopedias) into the process of learning from small training data. Our method can be used for inductive and transductive learning and achieves significant gains over state-of-the-art methods for very small training sets. Second, we propose Structured Logistic Regression (SLR), a new coordinate-wise gradient ascent technique for learning logistic regression in the space of all (word or character) sequences in the training data. SLR exploits the inherent structure of the n-gram feature space in order to automatically provide a compact set of highly discriminative n-gram features.
Our detailed experimental study shows that while SLR achieves classification results similar to those of state-of-the-art methods (which use all n-gram features given explicitly), it is more than an order of magnitude faster than its competitors. The techniques presented in this thesis can be used ...
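To illustrate the general idea behind coordinate-wise learning over an n-gram feature space, the following is a minimal, simplified sketch, not the SLR algorithm from the thesis: it enumerates all character n-grams explicitly (whereas SLR avoids exactly this enumeration by exploiting the structure of the n-gram space), then performs greedy coordinate-wise gradient ascent on the logistic log-likelihood, updating only the single most promising feature weight per round. All function names and parameters here are illustrative assumptions.

```python
import math

def char_ngrams(text, n_max=3):
    """All character n-grams of text up to length n_max.
    Note: explicit enumeration is shown only for clarity; the actual SLR
    method searches this space implicitly via bounds, without enumerating it."""
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams

def fit_greedy_logreg(docs, labels, iters=20, lr=0.5, n_max=3):
    """Greedy coordinate-wise gradient ascent for binary logistic regression:
    each round, only the weight of the n-gram with the largest absolute
    log-likelihood gradient is updated, so the learned model touches only
    a compact subset of the full n-gram vocabulary."""
    feats = [char_ngrams(d, n_max) for d in docs]
    vocab = set().union(*feats)
    w = {}  # sparse weight vector: only selected n-grams get an entry
    for _ in range(iters):
        # predicted probability of the positive class for each document
        p = [1.0 / (1.0 + math.exp(-sum(w.get(g, 0.0) for g in f)))
             for f in feats]
        # gradient of the log-likelihood w.r.t. each candidate weight
        grad = {g: sum(labels[i] - p[i]
                       for i in range(len(docs)) if g in feats[i])
                for g in vocab}
        best = max(grad, key=lambda g: abs(grad[g]))
        if abs(grad[best]) < 1e-6:  # converged: no coordinate improves
            break
        w[best] = w.get(best, 0.0) + lr * grad[best]
    return w
```

Since at most one weight changes per iteration, the returned model contains far fewer features than the full vocabulary, mirroring the "compact set of highly discriminative n-gram features" described above.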
  • Access status: Open access