
Performance measurement

To determine which method performs best, we used the $F$-measure, which is commonly used in information retrieval [24]. Given a set of clusters $\{G_1,\ldots,G_m\}$ and a set of classes $\{C_1,\ldots,C_n\}$, the $F$-measure between a cluster $G_i$ and a class $C_j$ is given by the following formula.

$\displaystyle F_{ij}=\frac{2\cdot P_{ij}\cdot R_{ij}}{P_{ij}+R_{ij}},$ (7)

where $1\le i\le m$ and $1\le j\le n$. The precision $P_{ij}$ and the recall $R_{ij}$ are defined as follows:

$\displaystyle P_{ij}=\frac{\mbox{Number of texts from cluster }i\mbox{ in class }j} {\mbox{Number of texts from cluster }i},$ (8)

and

$\displaystyle R_{ij}=\frac{\mbox{Number of texts from cluster }i\mbox{ in class }j} {\mbox{Number of texts in class }j}.$ (9)
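As a minimal sketch of Equations (7) to (9), the following Python function computes $P_{ij}$, $R_{ij}$ and $F_{ij}$ for one cluster and one class. The function name `pairwise_f` and the representation of clusters and classes as sets of document identifiers are assumptions made for illustration and are not taken from the original experiments. Returning zero when the cluster and the class share no documents is likewise an assumed convention, used to avoid the undefined 0/0 case in Equation (7).

```python
def pairwise_f(cluster, klass):
    """Return (P_ij, R_ij, F_ij) for one cluster and one class.

    Both arguments are sets of document identifiers (an assumed
    representation chosen for this sketch).
    """
    overlap = len(cluster & klass)  # texts from cluster i in class j
    if overlap == 0:
        # Assumed convention: avoid 0/0 in Equation (7).
        return 0.0, 0.0, 0.0
    precision = overlap / len(cluster)  # Equation (8)
    recall = overlap / len(klass)       # Equation (9)
    f = 2 * precision * recall / (precision + recall)  # Equation (7)
    return precision, recall, f
```

For instance, a cluster of four texts that shares two of them with a class of six texts has $P_{ij}=2/4$ and $R_{ij}=2/6$, which gives $F_{ij}=0.4$.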

The global performance of a clustering method is computed from the values of $F_{ij}$: the best $F_{ij}$ of each cluster is weighted by the cardinality of that cluster and normalized by the total number of documents in the collection ($\vert D\vert$). The resulting measure is also called the $F$-measure and is given in Equation (10).

$\displaystyle F=\sum_{1\le i\le m}\frac{\vert G_i\vert}{\vert D\vert}\max_{1\le j\le n}F_{ij}.$ (10)
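Equation (10) can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the code used in the experiments: the name `f_measure` is hypothetical, clusters and classes are assumed to be non-empty lists of sets of document identifiers, $\vert D\vert$ is taken to be the sum of the cluster sizes (which assumes that the clusters partition the collection), and $F_{ij}$ is set to zero when a cluster and a class share no documents.

```python
def f_measure(clusters, classes):
    """Global F-measure of Equation (10).

    clusters, classes: non-empty lists of sets of document identifiers
    (an assumed representation chosen for this sketch).
    """
    def f_ij(g, c):
        overlap = len(g & c)  # texts from cluster i in class j
        if overlap == 0:
            return 0.0        # assumed convention for the 0/0 case
        p = overlap / len(g)  # Equation (8)
        r = overlap / len(c)  # Equation (9)
        return 2 * p * r / (p + r)  # Equation (7)

    # |D|, assuming the clusters partition the collection
    total = sum(len(g) for g in clusters)
    # Weight the best-matching class of each cluster by |G_i| / |D|
    return sum(
        len(g) / total * max(f_ij(g, c) for c in classes)
        for g in clusters
    )


# Toy example with six documents, two clusters and two classes
clusters = [{1, 2, 3}, {4, 5, 6}]
classes = [{1, 2}, {3, 4, 5, 6}]
score = f_measure(clusters, classes)
```

In this toy example the first cluster best matches the first class with $F_{11}=0.8$ and the second cluster best matches the second class with $F_{22}=6/7$, so the global value is $0.5\cdot 0.8+0.5\cdot 6/7\approx 0.829$. A clustering that reproduces the classes exactly obtains $F=1$.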


David Pinto 2007-05-08