Description of the experiments


Clustering short texts from a narrow domain basically involves two steps. The first step is feature selection. We used the three unsupervised techniques described in Section 2 to sort the vocabulary of each corpus in non-increasing order of score; we then selected different percentages of the vocabulary (from 20% to 90%) in order to determine the behaviour of each technique on different subsets of the vocabulary. The second step is the clustering itself. Five different clustering methods were applied for this comparison: Single Link Clustering (SLC), Complete Link Clustering (CLC), K-Nearest Neighbour analysis (KNN), KStar [Shin and Han, 2003] and a modified version of the KStar method (NN1). The aim of this comparative study of clustering algorithms was to investigate whether there is a close relationship between a specific clustering method and a specific feature selection technique.
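The vocabulary reduction step can be illustrated with a minimal sketch. It assumes that a feature selection technique has already assigned a numeric score to every term; the function name `select_vocabulary`, the toy scores, and the 10-point step between 20% and 90% are assumptions made for illustration and are not taken from the experiments themselves.

```python
def select_vocabulary(scores, fraction):
    """Return the top `fraction` of terms, ranked by non-increasing score.

    `scores` maps each term to the value assigned by some feature
    selection technique (hypothetical input for this sketch).
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")
    # Sort by score, highest first; break ties alphabetically so the
    # ranking is deterministic.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    # Keep at least one term, even for very small vocabularies.
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


# Toy scores, invented for illustration only.
toy_scores = {"cluster": 3.0, "the": 0.1, "corpus": 2.0, "of": 0.2, "term": 1.0}

# Assumed step of 10 percentage points between 20% and 90%.
subsets = {pct: select_vocabulary(toy_scores, pct / 100) for pct in range(20, 100, 10)}
```

Each entry of `subsets` would then be the reduced vocabulary handed to a clustering method, so that every technique can be compared at every vocabulary size.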

In order to obtain a reliable description of our experiments, we carried out a cross-validation evaluation. This process consists of splitting the original corpus into a predefined number of partitions and then averaging the results obtained over all the partitions. By applying this process we can ensure that our results reflect stable behaviour of the algorithms, that is, they are not an accident of combining a specific clustering method with a specific data collection. In our case, we used 4 partitions for the CICLing collection and 30 partitions for the hep-ex collection.
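The partition-and-average procedure can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the partitions were formed, so contiguous partitions of near-equal size are assumed here, and `evaluate` is a hypothetical placeholder for running one feature selection technique plus one clustering method on a partition and returning a quality score.

```python
from statistics import mean


def split_into_partitions(documents, n_partitions):
    """Split `documents` into contiguous partitions of near-equal size.

    Contiguous splitting is an assumption of this sketch; the source
    does not specify how partitions were drawn.
    """
    if not 1 <= n_partitions <= len(documents):
        raise ValueError("n_partitions must be between 1 and len(documents)")
    base, extra = divmod(len(documents), n_partitions)
    partitions, start = [], 0
    for i in range(n_partitions):
        # The first `extra` partitions receive one additional document.
        size = base + (1 if i < extra else 0)
        partitions.append(documents[start:start + size])
        start += size
    return partitions


def averaged_score(documents, n_partitions, evaluate):
    """Average `evaluate(partition)` over all partitions.

    `evaluate` is a hypothetical callable standing in for one
    clustering run and its evaluation measure.
    """
    partitions = split_into_partitions(documents, n_partitions)
    return mean(evaluate(partition) for partition in partitions)
```

With this sketch, the setting described above would correspond to calling `averaged_score` with `n_partitions=4` for the CICLing collection and `n_partitions=30` for the hep-ex collection.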

David Pinto 2006-05-25