Conference paper

Vocabulary Pruning for Improved Context Recognition

In Proceedings of the International Joint Conference on Neural Networks — 2004, pp. 80-85

By Madsen, Rasmus Elsborg¹; Sigurdsson, Sigurdur¹; Hansen, Lars Kai¹,²; Larsen, Jan¹,²

From

Department of Informatics and Mathematical Modeling, Technical University of Denmark¹

Cognitive Systems, Department of Informatics and Mathematical Modeling, Technical University of Denmark²

Abstract

Language-independent 'bag-of-words' representations are surprisingly effective for text classification. The representation is high-dimensional, though, and contains many words that are non-consistent for text categorization. These non-consistent words reduce the generalization performance of subsequent classifiers, e.g., through ill-posed principal component transformations.

In this communication our aim is to study the effect of removing the least relevant words from the bag-of-words representation. We consider a new approach that uses neural-network-based sensitivity maps and information gain to determine term relevancy when pruning the vocabularies. With the reduced vocabularies, documents are classified using a latent semantic indexing representation and a probabilistic neural network classifier.

Reducing the bag-of-words vocabularies by 90%-98%, we find consistent classification improvement on two mid-size data sets. We also study the applicability of information gain and sensitivity maps for automated keyword generation.
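
The abstract does not give the paper's exact formulation of information gain, so the following is only a minimal sketch of the general idea, assuming the standard definition over binary term presence: IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)]. The function names (`entropy`, `information_gain`, `prune_vocabulary`), the `keep_fraction` parameter, and the toy documents are hypothetical and not taken from the paper. The sensitivity-map criterion is not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """Information gain of a term, based on its presence or absence per document.

    docs is a list of token sets; labels holds the class label of each document.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def prune_vocabulary(docs, labels, keep_fraction):
    """Keep the highest-scoring fraction of the vocabulary (at least one term)."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    n_keep = max(1, round(keep_fraction * len(vocab)))
    return ranked[:n_keep]


# Toy corpus (hypothetical): "the" occurs in every document and carries
# no class information, so it is among the first terms to be pruned.
docs = [
    {"goal", "match", "the"},
    {"match", "team", "the"},
    {"stock", "market", "the"},
    {"market", "trade", "the"},
]
labels = ["sport", "sport", "finance", "finance"]

kept = prune_vocabulary(docs, labels, keep_fraction=0.3)
print(kept)
```

Pruning 90%-98% of the vocabulary, as reported in the abstract, would correspond to a `keep_fraction` between 0.02 and 0.10 in this sketch. The toy corpus uses a larger value only because its vocabulary has seven terms.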

Language:	English
Publisher:	IEEE Press
Year:	2004
Pages:	80-85
Journal subtitle:	Special Session on Machine Learning for Text Mining
Types:	Conference paper
ORCIDs:	Hansen, Lars Kai and Larsen, Jan

Keywords

dimensionality reduction; information gain; neural networks; sensitivity; text classification
