In this paper we have proposed a new use of the Transition Point technique for clustering abstracts in a narrow domain. As a corpus,
we used a set of documents originally stored at CERN in the High Energy Physics domain (hep-ex), which allowed us to experiment with a
real collection made up of very short texts. After running three unsupervised methods (DF, TS and TP), we found that TP
outperforms the other two methods over a subset of hep-ex. However, when the whole collection was used, a new filtering method had
to be developed in order to improve the previous results. This method, named TPMI, uses a dictionary of related terms constructed
over the same collection by using mutual information.
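As an illustration only, one way a dictionary of related terms could be built from a collection is with pointwise mutual information estimated from document-level co-occurrence. The sketch below is not the procedure used for TPMI: the function name `related_terms`, the `min_pmi` threshold and the choice of counting co-occurrence per document are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def related_terms(docs, min_pmi=0.0):
    """Map each term to the terms it co-occurs with at a PMI above min_pmi.

    docs is a list of tokenized documents (lists of strings).
    Hypothetical helper: the threshold and the document-level counts
    are assumptions, not details taken from the paper.
    """
    n_docs = len(docs)
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing both terms of a pair
    for doc in docs:
        terms = sorted(set(doc))  # sort so each pair has one canonical order
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))

    related = {}
    for (a, b), n_ab in pair_df.items():
        # PMI(a, b) = log2(P(a, b) / (P(a) * P(b))), with probabilities
        # estimated as document counts divided by n_docs.
        pmi = math.log2(n_ab * n_docs / (term_df[a] * term_df[b]))
        if pmi > min_pmi:
            related.setdefault(a, set()).add(b)
            related.setdefault(b, set()).add(a)
    return related
```

Calling `related_terms` on a list of tokenized abstracts returns, for each term, the set of terms that co-occur with it more often than chance would predict. Raising `min_pmi` keeps only the more strongly associated pairs.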
After computing a baseline for both experiments, we verified that our approaches outperformed it.
We observed that there are no methods for determining the number of terms that a term selection method must obtain in order to
carry out the clustering task. Due to the instability of TP, we carried out an analysis to explain this behaviour and thus
to be able to determine the number of terms needed for the task. It is very important to continue studying stability
control for these methods, since this is in fact the key to clustering very short texts.
Clustering abstracts in a narrow domain has received little attention from the computational linguistics community, and it
is therefore very important to continue with the experiments in this area. In particular, finding a successful method for stabilizing
the results is left as future work.
David Pinto
2006-05-25