Aaron Kershenbaum
Department of Computer Science
Polytechnic University
Hawthorne, N.Y. 10532, USA
akershen@duke.poly.edu
ABSTRACT
We consider the problem of assigning documents to categories in a hierarchically organized taxonomy, and the effect of modifying the topology of the hierarchy. Given a training corpus of documents already placed in one or more categories, we extract a vocabulary. This vocabulary, consisting of words that appear with high relative frequency within a given category, is associated with nodes in the hierarchy and characterizes each subject area. Each node's vocabulary is filtered, and its words are assigned weights with respect to the specific category. Test documents are then scanned for the presence of this vocabulary, and categories are ranked for each document according to which of these terms it contains. Finally, documents are assigned to categories based on these rankings, and precision and recall are measured.
We present an algorithm for associating words with individual categories within the hierarchy and demonstrate that precision and recall can be significantly improved by taking the topology of the hierarchy into account when solving the categorization problem. We also show that these results can be improved further by intelligently selecting intermediate categories in the hierarchy. Solving the problem iteratively, moving downward from the root of the taxonomy to the leaf nodes, we improve precision from 82% to 89% and recall from 82% to 87% on the much-studied Reuters-21578 corpus, with 135 categories organized in a three-level hierarchy.
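The pipeline outlined above (vocabulary extraction by relative frequency, filtering and weighting, ranking, and top-down descent through the hierarchy) can be illustrated with a minimal sketch. It is not the algorithm evaluated in this paper. The function names, the `min_ratio` and `top_k` filtering parameters, the log-ratio weighting, and the toy taxonomy are all assumptions chosen only for illustration.

```python
import math
from collections import Counter

def extract_vocabulary(docs_by_cat, min_ratio=1.5, top_k=50):
    """Map each category to {word: weight}. A word is kept when its relative
    frequency in the category is at least min_ratio times its relative
    frequency over all categories considered (hypothetical filter)."""
    counts = {c: Counter(t for d in docs for t in d) for c, docs in docs_by_cat.items()}
    total = Counter()
    for cnt in counts.values():
        total.update(cnt)
    total_tokens = sum(total.values())
    vocab = {}
    for c, cnt in counts.items():
        n = sum(cnt.values())
        weights = {}
        for word, k in cnt.items():
            ratio = (k / n) / (total[word] / total_tokens)
            if ratio >= min_ratio:
                weights[word] = math.log(ratio)  # assumed weighting scheme
        best = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        vocab[c] = dict(best)
    return vocab

def rank_categories(tokens, vocab):
    """Rank categories by the summed weight of vocabulary words present."""
    present = set(tokens)
    scores = {c: sum(w for t, w in words.items() if t in present)
              for c, words in vocab.items()}
    # Ties (including all-zero scores) fall back to alphabetical order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

def pooled_docs(node, children, leaf_docs):
    """Training documents of a node: its own if a leaf, else its descendants'."""
    kids = children.get(node, [])
    if not kids:
        return leaf_docs.get(node, [])
    return [d for k in kids for d in pooled_docs(k, children, leaf_docs)]

def train_top_down(root, children, leaf_docs):
    """Build one vocabulary model per internal node, contrasting its children."""
    models, stack = {}, [root]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if kids:
            models[node] = extract_vocabulary(
                {k: pooled_docs(k, children, leaf_docs) for k in kids})
            stack.extend(kids)
    return models

def classify_top_down(tokens, root, children, models):
    """Descend from the root, choosing the best-ranked child at each level."""
    node = root
    while children.get(node):
        node = rank_categories(tokens, models[node])[0][0]
    return node

# Toy three-level taxonomy (hypothetical data, not drawn from Reuters-21578).
children = {"root": ["commodities", "finance"],
            "commodities": ["wheat", "gold"],
            "finance": ["interest", "fx"]}
leaf_docs = {
    "wheat": [["wheat", "harvest", "tonnes", "export"], ["wheat", "crop", "tonnes"]],
    "gold": [["gold", "mine", "ounces", "export"], ["gold", "ounces", "bullion"]],
    "interest": [["bank", "rate", "interest", "cut"], ["interest", "rate", "bank"]],
    "fx": [["dollar", "yen", "bank", "exchange"], ["dollar", "exchange", "rate"]],
}
models = train_top_down("root", children, leaf_docs)
label = classify_top_down(["gold", "bullion", "price"], "root", children, models)
```

In this sketch each internal node contrasts only its own children, so a word such as "export", which is shared by both commodity leaves, is filtered out at the commodities node instead of competing against all leaf categories at once. This is one simple way a hierarchy's topology can influence the ranking; this sketch assigns a single leaf, whereas the corpus allows documents to belong to more than one category.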