Next: Evaluation of the results Up: Using Query-Relevant Documents Pairs Previous: Maximum likehood estimation


The EuroGOV corpus

We used a subset of the EuroGOV corpus to evaluate the QRDP model. The subset consists of Spanish web pages originally obtained from European government-related sites and used in the WebCLEF track of the Cross-Language Evaluation Forum (CLEF) [8]. A fuller description of the corpus is given in [7].

We refined the evaluation corpus by keeping only those documents automatically identified as Spanish by the TextCat language identification program. A set of 134 supervised queries in English was used for the evaluation. The same pre-processing step was applied to both the web pages and the queries: it removed punctuation symbols, Spanish and English stopwords, numbers, HTML tags, script code and cascading style sheet code.
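The pre-processing step described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the stopword set is a tiny hypothetical sample rather than the lists actually used, the name `preprocess` is made up for this example, and the regular expressions are a simplification that does not handle malformed HTML or character entities.

```python
import re
import string

# Tiny illustrative stopword sample (assumption); a real run would load
# complete Spanish and English stopword lists.
STOPWORDS = {
    "el", "la", "los", "las", "de", "del", "y", "en", "que", "un", "una",
    "the", "of", "and", "in", "a", "to", "is",
}

# Script and style elements are dropped together with their contents.
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
TAG_RE = re.compile(r"<[^>]+>")      # any remaining HTML tag
NUMBER_RE = re.compile(r"\d+")       # runs of digits
# Map ASCII punctuation plus common Spanish marks to spaces.
PUNCT_TABLE = str.maketrans({c: " " for c in string.punctuation + "¿¡«»"})


def preprocess(text):
    """Return the list of lowercased content tokens of a page or query."""
    text = SCRIPT_STYLE_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]


page = (
    "<html><style>p {color: red;}</style><script>var x = 1;</script>"
    "<p>El Ministerio de Economía, 2005.</p></html>"
)
page_tokens = preprocess(page)
query_tokens = preprocess("The Ministry of Health in Spain?")
```

Applying the same function to pages and queries keeps the two vocabularies comparable, which is the reason for sharing a single pre-processing step.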

For convenience, we built a training corpus made up of pairs of a query and its target web page. We observed that reducing the size of this corpus may improve both indexing time and search engine precision. We therefore applied a term selection technique, called transition point (TP), to keep only the mid-frequency terms that represent each document (see [9] and [10] for further details).

To this end, one term frequency value from the web page vocabulary is selected as the transition point, and a neighbourhood around the TP is used as the threshold that determines which terms are selected. Using four different neighbourhood thresholds (10%, 20%, 40% and 60%) produced four reduced corpora which, together with the full corpus, give five corpora for the evaluation. Table 1 shows the size of each test corpus and the percentage of reduction obtained. As can be seen, the TP technique achieved a high reduction (between 75% and 89%), which also reduced the time needed to construct the statistical dictionary.


Table 1: Test corpora
Corpus   Size (≈ Kb)   Reduction (%)
Full     117            0
TP60      29           75.37
TP40      20           82.55
TP20      19           83.25
TP10      13           89.25
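The transition point selection used to build the reduced corpora can be sketched as follows. The section does not state how the TP value is computed, so this sketch assumes the formula commonly given in the transition-point literature, TP = (sqrt(8 * I1 + 1) - 1) / 2, where I1 is the number of terms that occur exactly once; it also assumes that the neighbourhood is the frequency interval [TP * (1 - u), TP * (1 + u)] for a threshold u. The function names are hypothetical, and [9] and [10] remain the references for the exact procedure.

```python
import math
from collections import Counter


def transition_point(freqs):
    """Assumed TP formula: (sqrt(8 * I1 + 1) - 1) / 2, with I1 the
    number of terms whose frequency is exactly 1."""
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def select_terms(tokens, u):
    """Keep the terms whose frequency lies within a neighbourhood of
    relative width u around the transition point (assumed definition)."""
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    low, high = tp * (1 - u), tp * (1 + u)
    return {term for term, f in freqs.items() if low <= f <= high}


# Toy document: six terms occurring once, plus terms with frequency
# 3, 2, 4 and 10.
tokens = (
    ["a1", "a2", "a3", "a4", "a5", "a6"]
    + ["gobierno"] * 3
    + ["ministerio"] * 2
    + ["ley"] * 4
    + ["de"] * 10
)
tp_value = transition_point(Counter(tokens))
narrow = select_terms(tokens, 0.10)
wide = select_terms(tokens, 0.40)
```

In this toy example a wider neighbourhood keeps more mid-frequency terms, while both the rare terms and the very frequent one are discarded in every setting.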


David Pinto 2007-10-05