
Indexing reduction

After our first participation in WebCLEF [4], we carried out further experiments using only the Spanish-language documents of the EuroGOV corpus. We observed that a value of $NTP=0.4$ in the reduction process given by Equation (1) was adequate. Therefore, in this test we carried out a single run with that value. The evaluation corpus for this run was built by reducing every text with the TP technique, using a neighbourhood of 40% around the TP, and then enriching the resulting set of terms with related terms, as described by Equation (2).
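The neighbourhood selection can be illustrated with a minimal sketch. It assumes a formulation that is common in the transition-point literature: the TP is derived from the number of terms that occur exactly once, and a term is kept when its frequency lies within $NTP$ of the TP. Equation (1) is not reproduced in this section and may differ in detail, so the function names `transition_point` and `tp_neighbourhood`, as well as the formula itself, are assumptions made for illustration only. The enrichment step of Equation (2) is not covered.

```python
import math
from collections import Counter


def transition_point(freqs):
    """Assumed TP formula: (sqrt(8 * I1 + 1) - 1) / 2,
    where I1 is the number of terms with frequency exactly 1."""
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def tp_neighbourhood(tokens, ntp=0.4):
    """Keep the terms whose frequency falls inside the assumed
    neighbourhood [TP * (1 - ntp), TP * (1 + ntp)]."""
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    lower, upper = tp * (1 - ntp), tp * (1 + ntp)
    return {term for term, f in freqs.items() if lower <= f <= upper}


# Toy text (hypothetical): three terms occur once, so TP = 2 and the
# neighbourhood for ntp = 0.4 is roughly [1.2, 2.8].
toy_tokens = "a a a b b c d e".split()
reduced_terms = tp_neighbourhood(toy_tokens, ntp=0.4)
```

In this toy text only the term with frequency 2 falls inside the neighbourhood, so the reduced representation keeps one term out of five.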

Table 1 shows the size of each evaluation corpus used, that is, the vocabulary composed of the representations of all its texts, $\vert TP_{SET}'\vert$, together with the percentage of reduction obtained for each one with respect to the original vocabulary. As can be seen, the TP technique reduced the vocabulary by more than 95% in every domain, which implies a reduction in the time required by any search engine indexing process.


Table 1: Vocabulary size and percentage of reduction.

Domain              DE       AT       BE       DK       SI
Size (KB)        2,588    2,317    6,796    1,189    6,729
Reduction (%)     95.3     97.2     98.0     97.9     97.1

Domain              ES       EE       IE       IT       SK
Size (KB)       16,271    4,838    2,632   11,913   14,668
Reduction (%)     98.5     97.2     96.0     98.4     97.5

Domain              LU       MT       NL       LV       PT
Size (KB)        3,212    4,817   20,324   21,213    9,134
Reduction (%)     99.2     95.7     97.7     97.8     97.6

Domain              FR       CY       GR       HU       UK
Size (KB)       22,083   18,814      340   10,440   14,239
Reduction (%)     95.8     96.5     97.4     98.8     96.1
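
The percentage of reduction reported in Table 1 can be read as the share of the original vocabulary that is removed by the reduction. The following sketch shows that arithmetic. The function name `reduction_percentage` and the sizes in the example are hypothetical, since the original vocabulary sizes are not given in this section.

```python
def reduction_percentage(original_size, reduced_size):
    """Percentage of the original vocabulary removed by the reduction."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 100.0 * (original_size - reduced_size) / original_size


# Hypothetical sizes, not taken from Table 1: a vocabulary of 10,000
# terms reduced to 300 terms.
example = reduction_percentage(10_000, 300)
```

With these illustrative sizes the reduction is 97%, which is in the range reported for the domains in Table 1.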


David Pinto 2007-05-08