Performance Measurement


We used the $F$-measure (commonly used in information retrieval [16]) to determine which method obtains the best performance. Given a set of clusters $\{G_1,\ldots,G_m\}$ and a set of classes $\{C_1,\ldots,C_n\}$, the $F$-measure between a cluster $G_i$ and a class $C_j$ is given by the following formula:

$\displaystyle F_{ij}=\frac{2\cdot P_{ij}\cdot R_{ij}}{P_{ij}+R_{ij}},$

(3)

where $1\le i\le m$ and $1\le j\le n$. The precision $P_{ij}$ and the recall $R_{ij}$ are defined as follows:

$\displaystyle P_{ij}=\frac{\mbox{Number of texts from cluster }i\mbox{ in class }j} {\mbox{Number of texts from cluster }i},$

(4)

and

$\displaystyle R_{ij}=\frac{\mbox{Number of texts from cluster }i\mbox{ in class }j} {\mbox{Number of texts in class }j}.$

(5)
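
As an illustration with hypothetical counts (not taken from the data set), suppose that cluster $i$ contains 10 texts, that 8 of them belong to class $j$, and that class $j$ contains 16 texts in total. Then

$\displaystyle P_{ij}=\frac{8}{10}=0.8,\qquad R_{ij}=\frac{8}{16}=0.5,\qquad F_{ij}=\frac{2\cdot 0.8\cdot 0.5}{0.8+0.5}\approx 0.615.$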

The global performance of the clustering is calculated from the values of $F_{ij}$. This measure is named the $F$-measure and is defined as follows:

$\displaystyle F=\sum_{1\le i\le m}\frac{\vert G_i\vert}{\vert D\vert}\max_{1\le j\le n}F_{ij}.$

(6)
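
Equations (3) to (6) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments. It assumes that $\vert G_i\vert$ is the number of texts in cluster $i$ and that $\vert D\vert$ is the total number of texts in the collection. The function name `overall_f_measure` and its two parallel label lists are hypothetical choices made for this example.

```python
from collections import Counter


def overall_f_measure(cluster_labels, class_labels):
    """Overall F-measure of a clustering against reference classes.

    cluster_labels[k] is the cluster assigned to text k and
    class_labels[k] is its reference class (hypothetical input format).
    """
    if len(cluster_labels) != len(class_labels):
        raise ValueError("need one cluster label and one class label per text")
    total = len(cluster_labels)  # |D|, assumed to be the number of texts
    if total == 0:
        raise ValueError("the collection is empty")

    cluster_sizes = Counter(cluster_labels)             # |G_i|
    class_sizes = Counter(class_labels)                 # texts in class j
    overlap = Counter(zip(cluster_labels, class_labels))  # texts from cluster i in class j

    # Pairs with no shared text never appear in `overlap`; their F_ij is
    # treated as 0, so they cannot win the maximum in Eq. (6).
    best_f = {g: 0.0 for g in cluster_sizes}
    for (g, c), shared in overlap.items():
        precision = shared / cluster_sizes[g]           # Eq. (4)
        recall = shared / class_sizes[c]                # Eq. (5)
        f_gc = 2 * precision * recall / (precision + recall)  # Eq. (3)
        best_f[g] = max(best_f[g], f_gc)

    # Eq. (6): best per-cluster F_ij, weighted by relative cluster size.
    return sum(cluster_sizes[g] / total * best_f[g] for g in cluster_sizes)
```

For example, `overall_f_measure(["a", "a", "a", "b", "b"], ["x", "x", "y", "y", "y"])` pairs cluster `a` with class `x` and cluster `b` with class `y`. A clustering that reproduces the classes exactly scores 1.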

David Pinto 2006-05-25