
The Self Term Expansion Method

In the literature, term co-occurrence is the most common technique used for the automatic construction of LDBs [Grefenstette1994,Frakes and Baeza-Yates1992]. A simple approach may use $n$-grams, which allow a word to be predicted from the previous words in a sample of text. The frequency of each $n$-gram is calculated and then filtered according to some threshold. The resulting $n$-grams constitute an LDB which may be used as an ``expansion dictionary'' for each term.
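To make the idea concrete, the following Python sketch builds such an $n$-gram based LDB; the token list, the value of $n$ and the frequency threshold are illustrative assumptions, not the exact settings of our experiments.

from collections import Counter

def build_ngram_ldb(tokens, n=2, threshold=5):
    # Count every contiguous n-gram in the token sequence.
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Keep only the n-grams whose frequency reaches the threshold;
    # the survivors constitute the LDB (the "expansion dictionary").
    return {gram: freq for gram, freq in ngrams.items() if freq >= threshold}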

On the other hand, an information theory-based co-occurrence measure is discussed in [Manning and Schütze2003]. This measure is named pointwise Mutual Information (MI), and its application to finding collocations is analysed by determining the degree of co-occurrence between two terms. This may be done by calculating the ratio between the number of times that both terms appear together (in the same context, though not necessarily in the same order) and the product of the number of times that each term occurs alone. Given two terms $X_1$ and $X_2$, the pointwise mutual information between $X_1$ and $X_2$ can be calculated as follows:


\begin{displaymath}MI(X_1, X_2) = \log_2 \frac{P(X_1 X_2)}{P(X_1) \times P(X_2)}\end{displaymath}

The numerator may be modified to take into account only bigrams, as presented in [Pinto2006], where an improvement in clustering short texts in narrow domains was obtained.
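As an illustration, the pointwise MI of the formula above may be estimated from corpus counts as in the following Python sketch; the tokenized corpus, the window size and the counting convention (both orders within a fixed window) are assumptions made for the example.

import math
from collections import Counter

def pointwise_mi(tokens, x1, x2, window=5):
    n = len(tokens)
    unigrams = Counter(tokens)
    if unigrams[x1] == 0 or unigrams[x2] == 0:
        return float('-inf')
    # Count how many times x1 and x2 appear together within the
    # window, in either order (same context, not necessarily same order).
    together = 0
    for i, tok in enumerate(tokens):
        if tok == x1:
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            together += context.count(x2)
    p_x1, p_x2 = unigrams[x1] / n, unigrams[x2] / n
    p_both = together / n
    # MI(X1, X2) = log2(P(X1 X2) / (P(X1) * P(X2)))
    return math.log2(p_both / (p_x1 * p_x2)) if p_both > 0 else float('-inf')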

We have used the pointwise MI to obtain a co-occurrence list from the same target dataset. This list is then used to expand every term of the original data. Since the co-occurrence formula captures relations between related terms, the self term expansion magnifies the meaningful information more than the noise. Therefore, executing the clustering algorithm on the expanded corpus should outperform executing it on the non-expanded data.
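A minimal sketch of the expansion step is shown below; it assumes the co-occurrence list has been turned into a dictionary mapping each term to its co-occurring terms (as in Table 1), and that each sentence is a list of stemmed tokens.

def expand_corpus(sentences, cooccurrence):
    expanded = []
    for sentence in sentences:
        # Keep the original terms and append, for every term,
        # the terms that co-occur with it in the target dataset.
        new_sentence = list(sentence)
        for term in sentence:
            new_sentence.extend(cooccurrence.get(term, []))
        expanded.append(new_sentence)
    return expanded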

In order to fully appreciate the self term expansion method, in Table 1 we show the co-occurrence lists for some words related to the verb ``kill'' in the test corpus. Since the MI is calculated after preprocessing the corpus, we present the stemmed version of the terms.


Table 1: An example of co-occurrence terms

Word      Co-occurrence terms
soldier   kill rape women think shoot peopl old man kill death beat
grenad    todai live guerrilla fight explod
death     shoot run rape person peopl outsid murder life lebanon kill convict...
temblor   tuesdai peopl least kill earthquak


For task #2 of SemEval 2007, a set of 100 ambiguous words (35 nouns and 65 verbs) was provided. We preprocessed this original dataset by eliminating stopwords and then applying the Porter stemmer [Porter1980]. Thereafter, when applying the pointwise MI, we required each term to occur at least three times (see [Manning and Schütze2003]), whereas the maximum separation between the two terms was five words. Finally, we selected the unsupervised KStar clustering method [Shin and Han2003] for our experiments, defining the average of the similarities among all the sentences of a given ambiguous word as the stop criterion for this clustering method. The input similarity matrix for the clustering method was calculated by using the Jaccard coefficient.
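The similarity matrix and the stop criterion can be sketched as follows; the KStar algorithm itself is not reproduced here, and the representation of each sentence as a set of stemmed terms is an assumption of the example.

def jaccard(a, b):
    # Jaccard coefficient between two sentences viewed as term sets.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity_matrix(sentences):
    # Pairwise similarities fed to the clustering method, together
    # with their average, used as the stop criterion for KStar.
    matrix = [[jaccard(s, t) for t in sentences] for s in sentences]
    n = len(sentences)
    pairs = [matrix[i][j] for i in range(n) for j in range(i + 1, n)]
    average = sum(pairs) / len(pairs) if pairs else 0.0
    return matrix, average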

