
Conclusions

In this paper we have proposed a new use of the Transition Point technique for clustering abstracts in a narrow domain. As a corpus we used a set of documents from the High Energy Physics domain, originally stored at CERN, which allowed us to experiment with a real collection made up of very short texts (hep-ex). After running three unsupervised methods (DF, TS and TP), we found that TP outperforms the other two over a subset of hep-ex. However, when the whole collection was used, a new filtering method had to be developed in order to improve on the previous results. This method, named TPMI, uses a dictionary of related terms constructed over the same collection by means of mutual information. After computing a baseline for both experiments, we verified that our approaches outperformed it.

We observed that there are no methods for determining how many terms a term selection method must obtain in order to carry out the clustering task. Because of the instability of TP, we analysed this behaviour in order to explain it and thus be able to determine the number of terms needed for the task. It is very important to continue studying stability control for these methods, since this is in fact the key to clustering very short texts.

Clustering abstracts in a narrow domain has received little attention from the computational linguistics community, and it is therefore very important to continue experimenting in this area. In particular, determining a successful method for stabilizing the results will be our next task.


David Pinto 2006-05-25