
The Self Term Expansion Method

In the literature, term co-occurrence is the most common technique used for the automatic construction of LDBs [Grefenstette1994,Frakes and Baeza-Yates1992]. A simple approach may use $n$-grams, which allow a word to be predicted from the previous words in a sample of text. The frequency of each $n$-gram is calculated and then filtered according to some threshold. The resulting $n$-grams constitute an LDB which may be used as an ``expansion dictionary'' for each term.
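To make the idea concrete, the following Python sketch builds such an $n$-gram based LDB; the token list, the value of $n$ and the frequency threshold are illustrative assumptions, not the exact settings of our experiments.

from collections import Counter

def build_ngram_ldb(tokens, n=2, threshold=5):
    # Count every contiguous n-gram in the token sequence.
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Keep only the n-grams whose frequency reaches the threshold;
    # the survivors constitute the LDB (the "expansion dictionary").
    return {gram: freq for gram, freq in ngrams.items() if freq >= threshold}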

On the other hand, an information theory-based co-occurrence measure is discussed in [Manning and Schütze2003]. This measure is named pointwise Mutual Information (MI), and its application to finding collocations is analysed by determining the degree of co-occurrence between two terms. This may be done by calculating the ratio between the number of times that both terms appear together (in the same context, though not necessarily in the same order) and the product of the number of times that each term occurs alone. Given two terms $X_1$ and $X_2$, the pointwise mutual information between $X_1$ and $X_2$ can be calculated as follows:


\begin{displaymath}MI(X_1, X_2) = \log_2 \frac{P(X_1 X_2)}{P(X_1) \times P(X_2)}\end{displaymath}

The numerator may be modified to take into account only bigrams, as presented in [Pinto2006], where an improvement in clustering short texts in narrow domains was obtained.
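As an illustration, the pointwise MI of the formula above may be estimated from corpus counts as in the following Python sketch; the tokenized corpus, the window size and the counting convention (both orders within a fixed window) are assumptions made for the example.

import math
from collections import Counter

def pointwise_mi(tokens, x1, x2, window=5):
    n = len(tokens)
    unigrams = Counter(tokens)
    if unigrams[x1] == 0 or unigrams[x2] == 0:
        return float('-inf')
    # Count how many times x1 and x2 appear together within the
    # window, in either order (same context, not necessarily same order).
    together = 0
    for i, tok in enumerate(tokens):
        if tok == x1:
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            together += context.count(x2)
    p_x1, p_x2 = unigrams[x1] / n, unigrams[x2] / n
    p_both = together / n
    # MI(X1, X2) = log2(P(X1 X2) / (P(X1) * P(X2)))
    return math.log2(p_both / (p_x1 * p_x2)) if p_both > 0 else float('-inf')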

We have used the pointwise MI to obtain a co-occurrence list from the same target dataset. This list is then used to expand every term of the original data. Since the co-occurrence formula captures relations between related terms, the self term expansion magnifies the meaningful information more than the noise. Therefore, executing the clustering algorithm on the expanded corpus should outperform executing it on the non-expanded data.
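A minimal sketch of the expansion step is shown below; it assumes the co-occurrence list has been turned into a dictionary mapping each term to its co-occurring terms (as in Table 1), and that each sentence is a list of stemmed tokens.

def expand_corpus(sentences, cooccurrence):
    expanded = []
    for sentence in sentences:
        # Keep the original terms and append, for every term,
        # the terms that co-occur with it in the target dataset.
        new_sentence = list(sentence)
        for term in sentence:
            new_sentence.extend(cooccurrence.get(term, []))
        expanded.append(new_sentence)
    return expanded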

In order to fully appreciate the self term expansion method, in Table 1 we show the co-occurrence lists for some words related to the verb ``kill'' in the test corpus. Since the MI is calculated after preprocessing the corpus, we present the stemmed version of the terms.


Table 1: An example of co-occurrence terms

Word      Co-occurrence terms
soldier   kill rape women think shoot peopl old man kill death beat
grenad    todai live guerrilla fight explod
death     shoot run rape person peopl outsid murder life lebanon kill convict...
temblor   tuesdai peopl least kill earthquak


For task #2 of SemEval 2007, a set of 100 ambiguous words (35 nouns and 65 verbs) was provided. We preprocessed this original dataset by eliminating stopwords and then applying the Porter stemmer [Porter1980]. Thereafter, when applying the pointwise MI, we required each term to occur at least three times (see [Manning and Schütze2003]), whereas the maximum separation between the two terms was five words. Finally, we selected the unsupervised KStar clustering method [Shin and Han2003] for our experiments, defining the average of the similarities among all the sentences of a given ambiguous word as the stop criterion for this clustering method. The input similarity matrix for the clustering method was calculated by using the Jaccard coefficient.
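The similarity matrix and the stop criterion can be sketched as follows; the KStar algorithm itself is not reproduced here, and the representation of each sentence as a set of stemmed terms is an assumption of the example.

def jaccard(a, b):
    # Jaccard coefficient between two sentences viewed as term sets.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity_matrix(sentences):
    # Pairwise similarities fed to the clustering method, together
    # with their average, used as the stop criterion for KStar.
    matrix = [[jaccard(s, t) for t in sentences] for s in sentences]
    n = len(sentences)
    pairs = [matrix[i][j] for i in range(n) for j in range(i + 1, n)]
    average = sum(pairs) / len(pairs) if pairs else 0.0
    return matrix, average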

