We refined the evaluation corpus by keeping those documents automatically identified as being in the ``Spanish'' language by the TexCat language identification program. For the evaluation of this corpus, a set of 134 supervised queries in the ``English'' language was used. A pre-processing step was applied to both the web pages and the queries; it consisted of eliminating punctuation symbols, Spanish and English stopwords, numbers, HTML tags, script code, and cascading style sheet code.
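The pre-processing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stopword list here is a tiny placeholder (the actual system uses full Spanish and English stopword lists), and the regular expressions are one plausible way to strip the markup.

```python
import re

# Placeholder stopword set; the paper removes full Spanish and
# English stopword lists.
STOPWORDS = {"the", "of", "and", "el", "la", "de", "y"}

def preprocess(text):
    """Remove script/style blocks, HTML tags, punctuation,
    numbers, and stopwords from a web page or query."""
    # Drop <script> and <style> blocks together with their contents.
    text = re.sub(r"(?is)<(script|style).*?>.*?</\1>", " ", text)
    # Drop any remaining HTML tags.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Drop punctuation symbols and digits.
    text = re.sub(r"[^\w\s]|\d", " ", text)
    # Tokenize, lowercase, and filter stopwords.
    return [t for t in text.lower().split() if t not in STOPWORDS]
```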
For convenience, we built a training corpus comprising query-target web page pairs. We observed that indexing time and search engine precision could potentially be improved by reducing the size of this corpus. Therefore, we applied a term selection technique called the transition point (TP), which keeps only the mid-frequency terms to represent each document (see [9] and [10] for further details).
For this purpose, a term frequency value of the web page vocabulary is selected as the transition point, and a neighbourhood around the TP is then used as the threshold for determining which terms are selected. Using four different neighbourhood thresholds (10%, 20%, 40%, and 60%), together with the full corpus, we obtained five corpora for the evaluation. Table 1 shows the size of every test corpus, as well as the percentage of reduction obtained for each. As can be seen, the TP technique achieved a high percentage of reduction (between 75% and 89%), which also reduced the time needed to construct the statistical dictionary.
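The selection step can be sketched as follows. The TP estimate used here, TP = (sqrt(8*I1 + 1) - 1) / 2 with I1 the number of terms occurring exactly once, is a common formulation in the transition point literature (see [9] and [10]); the paper may use a different variant, and the `neighbourhood` parameter corresponds to the thresholds above (0.1, 0.2, 0.4, 0.6).

```python
import math
from collections import Counter

def transition_point_terms(tokens, neighbourhood=0.4):
    """Select the mid-frequency terms that lie within a
    neighbourhood of the transition point (TP)."""
    freq = Counter(tokens)
    # I1: number of vocabulary terms occurring exactly once.
    i1 = sum(1 for f in freq.values() if f == 1)
    # Common TP estimate derived from the law of low-frequency terms.
    tp = (math.sqrt(8 * i1 + 1) - 1) / 2
    lo, hi = tp * (1 - neighbourhood), tp * (1 + neighbourhood)
    # Keep only terms whose frequency falls inside [lo, hi].
    return {t for t, f in freq.items() if lo <= f <= hi}
```

With a 40% neighbourhood, terms whose frequencies are far above or below the TP (very common or hapax terms) are discarded, which is what yields the corpus size reductions reported in Table 1.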
Table 1. Size of each test corpus and its percentage of reduction.

Corpus | Size (KB) | Reduction (%)
Full   | 117       | 0
TP60   | 29        | 75.37
TP40   | 20        | 82.55
TP20   | 19        | 83.25
TP10   | 13        | 89.25