Next: The Self Term Expansion Up: UPV-SI: Word Sense Induction Previous: UPV-SI: Word Sense Induction

Introduction

Word Sense Disambiguation (WSD) is a problem in computational linguistics that consists of determining the correct sense of a given ambiguous word. It is well known that supervised algorithms have obtained the best results in public evaluations, but their accuracy is closely related to the amount of hand-tagged data available. Constructing that kind of training data is difficult for real applications. Unsupervised WSD overcomes this drawback by using clustering algorithms, which do not need training data, to determine the possible senses of a given ambiguous word.

This paper describes a simple technique for unsupervised sense induction for ambiguous words. The approach is based on a self term expansion technique, which constructs a set of co-occurrence terms and then uses this set to expand the target dataset. The implemented system was evaluated in the task ``SemEval-2007 Task 2: Evaluating Word Sense Induction and Discrimination Systems'' [Agirre and A.2007]. The aim of the task was to permit a comparison across sense-induction and discrimination systems. Moreover, a comparison with supervised and knowledge-based systems is also possible, since the test corpus was borrowed from the well-known ``English lexical-sample'' task in SemEval-2007, with the usual training + test split.

The self term expansion method consists of replacing the terms of a document with a set of co-related terms. The goal is to improve natural language processing tasks such as the clustering of narrow-domain short texts. This process may be carried out in different ways, often simply by using a knowledge database. In information retrieval, for instance, the expansion of query terms is a widely investigated topic, and it has been shown to improve results compared with not using query expansion [Qiu and Frei1993, Ruge1992, R.Baeza-Yates and Ribeiro-Neto1999, Grefenstette1994, Rijsbergen1979].
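The general idea of expanding a document with terms drawn from the corpus itself can be illustrated with a minimal sketch. It is not the system described in this paper: the document-level co-occurrence counting, the `min_count` threshold, the choice to keep each original term next to its related terms, and the names `build_cooccurrence_lists` and `expand` are all assumptions made only for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence_lists(docs, min_count=2):
    """Map each term to the terms sharing a document with it at least min_count times.

    Assumption: co-occurrence is counted once per document (not per window).
    """
    pair_counts = Counter()
    for tokens in docs:
        # Count each unordered pair of distinct terms once per document.
        for a, b in combinations(sorted(set(tokens)), 2):
            pair_counts[(a, b)] += 1

    related = defaultdict(set)
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            related[a].add(b)
            related[b].add(a)
    return related


def expand(tokens, related):
    """Return the document with each term followed by its co-related terms.

    Assumption: the original term is kept; terms without related terms stay unchanged.
    """
    expanded = []
    for term in tokens:
        expanded.append(term)
        expanded.extend(sorted(related.get(term, ())))
    return expanded


# Toy corpus (hypothetical data): the same corpus is used both to build
# the co-occurrence lists and as the target of the expansion.
docs = [
    ["bank", "river", "water"],
    ["bank", "river", "shore"],
    ["bank", "money", "loan"],
    ["bank", "money", "account"],
]
related = build_cooccurrence_lists(docs, min_count=2)
expanded_docs = [expand(d, related) for d in docs]
```

In this toy corpus only the pairs (bank, river) and (bank, money) reach the threshold, so only those terms receive expansions. A real system would need a more careful selection criterion than a raw count.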

The availability of Machine Readable Resources (MRR) such as ``Dictionaries'', ``Thesauri'' and ``Lexicons'' has made it possible to apply term expansion to other fields of natural language processing, such as WSD. In [Banerjee and Pedersen2002] we may see the typical example of using an external knowledge database to determine the correct sense of a word given in some context. In this approach, every word close to the word whose sense we want to determine is expanded with its different senses by using the WordNet lexicon [Fellbaum1998]. Then, an overlapping factor is calculated in order to determine the correct sense of the ambiguous word. Several other approaches have made use of a similar procedure. Among those using dictionaries, the proposals presented in [Lesk1986,Wilks 1990,Nancy and Véronis1990] are the most successful in WSD. Yarowsky [Yarowsky1992] instead used thesauri in his experiments. Finally, the use of lexicons in WSD has been investigated in [Sussna1993,Resnik1995,Banerjee and Pedersen2002]. Although in some cases the knowledge resource does not seem to be used strictly for term expansion, the application of co-occurrence terms is included in these algorithms. As in information retrieval, term expansion with co-related terms has been shown to improve baseline results in WSD, provided that the external resource is selected carefully, with a priori knowledge of the domain and the broadness of the corpus (wide or narrow domain). Moreover, we have to be sure that the Lexical Data Base (LDB) has been suitably constructed. For these reasons, we consider that the use of an automatically self-constructed LDB (built from the same test corpora) may be of high benefit. This assumption is based on the intrinsic properties extracted from the corpus itself. Our proposal is related to the investigations presented in [Schütze1998] and [Purandare and Pedersen2004], where words are also expanded with co-occurrence terms for word sense discrimination. The main difference lies in the use of the same corpora for constructing the co-occurrence list.
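The overlap idea used by the dictionary-based and lexicon-based approaches mentioned above can be sketched in a simplified form. The code below is a plain gloss-overlap comparison, not the algorithm of [Banerjee and Pedersen2002] or of any other cited work: the sense inventory `TOY_SENSES` is a hypothetical stand-in for a real resource such as WordNet, and the function names are assumptions chosen for illustration.

```python
def overlap_score(gloss_tokens, context_tokens):
    """Number of distinct words shared by a sense gloss and the context."""
    return len(set(gloss_tokens) & set(context_tokens))


def pick_sense(senses, context_tokens):
    """Return the sense label whose gloss overlaps most with the context."""
    return max(senses, key=lambda label: overlap_score(senses[label], context_tokens))


# Hypothetical two-sense inventory for the word "bank" (not taken from WordNet).
TOY_SENSES = {
    "bank/finance": "institution that accepts deposits and lends money".split(),
    "bank/river": "sloping land beside a body of water".split(),
}

context = "he fished from the bank beside the water".split()
best = pick_sense(TOY_SENSES, context)
```

Here the river gloss shares two words with the context ("beside" and "water") and the finance gloss shares none. Real systems typically filter stop words and expand the context words as well, which this sketch leaves out.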

In the following sections we describe the self term expansion method and, thereafter, the results obtained in Task 2 of the SemEval-2007 competition.


David Pinto 2007-05-08