Conference paper

Pruning the vocabulary for better context recognition

From

Department of Informatics and Mathematical Modeling, Technical University of Denmark

Cognitive Systems, Department of Informatics and Mathematical Modeling, Technical University of Denmark

Language independent 'bag-of-words' representations are surprisingly effective for text classification. The representation is high-dimensional, though, and contains many words that are inconsistent for text categorization. These inconsistent words reduce the generalization performance of subsequent classifiers, e.g., through ill-posed principal component transformations.

In this communication our aim is to study the effect of removing the least relevant words from the bag-of-words representation. We consider a new approach that uses neural network based sensitivity maps and information gain to determine term relevancy when pruning the vocabularies. With reduced vocabularies, documents are classified using a latent semantic indexing representation and a probabilistic neural network classifier.

Reducing the bag-of-words vocabularies by 90%-98%, we find consistent classification improvement on two mid-size data sets. We also study the applicability of information gain and sensitivity maps for automated keyword generation.
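
To make the information gain criterion mentioned in the abstract concrete, the following is a minimal sketch of vocabulary pruning by information gain. It is not the authors' implementation: the function names (`entropy`, `information_gain`, `prune_vocabulary`), the binary present/absent treatment of terms, the tie-breaking rule, and the toy documents are all assumptions made for illustration. The neural network sensitivity maps, the latent semantic indexing step, and the probabilistic neural network classifier are not shown.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels):
    """Map each term to IG(term) = H(class) - H(class | term present/absent).

    Assumption: terms are treated as binary features (present or absent
    in a document), which is one common formulation of information gain.
    """
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    h_class = entropy(list(Counter(labels).values()))
    gains = {}
    for term in vocab:
        present = Counter(l for s, l in zip(doc_sets, labels) if term in s)
        absent = Counter(l for s, l in zip(doc_sets, labels) if term not in s)
        n_present = sum(present.values())
        n_absent = n - n_present
        h_cond = 0.0
        if n_present:
            h_cond += n_present / n * entropy(list(present.values()))
        if n_absent:
            h_cond += n_absent / n * entropy(list(absent.values()))
        gains[term] = h_class - h_cond
    return gains


def prune_vocabulary(docs, labels, keep_fraction):
    """Keep the top `keep_fraction` of terms ranked by information gain."""
    gains = information_gain(docs, labels)
    n_keep = max(1, round(keep_fraction * len(gains)))
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(gains, key=lambda t: (-gains[t], t))
    return set(ranked[:n_keep])


# Hypothetical toy corpus: tokenized documents with class labels.
docs = [
    ["goal", "match", "the"],
    ["goal", "team", "the"],
    ["stock", "market", "the"],
    ["stock", "price", "the"],
]
labels = ["sport", "sport", "finance", "finance"]

kept = prune_vocabulary(docs, labels, keep_fraction=0.3)
print(sorted(kept))  # ['goal', 'stock']
```

In this toy corpus the word "the" occurs in every document and therefore has zero information gain, so it is among the first terms pruned. The 90%-98% reduction reported above would correspond to a `keep_fraction` between 0.02 and 0.10; the larger value here is used only because the toy vocabulary has seven terms.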

Language: English
Publisher: IEEE
Year: 2004
Pages: 483-488
Proceedings: 17th International Conference on Pattern Recognition
ISBN: 0769521282 and 9780769521282
ISSN: 1051-4651
Types: Conference paper
DOI: 10.1109/ICPR.2004.1334270
ORCIDs: Hansen, Lars Kai and Larsen, Jan
