

Corpus

For the experiments, we used the EuroGOV corpus provided by the WebCLEF forum, which is described in detail in [6]. We indexed only 20 of its 27 domains, namely DE, AT, BE, DK, SI, ES, EE, IE, IT, SK, LU, MT, NL, LV, PT, FR, CY, GR, HU, and UK (we did not index the EU, RU, FI, PL, SE, CZ, and LT domains). Consequently, only 1,470 of the 1,939 topics were evaluated, approximately 75.81% of the total. Although we presented the MRR over all 1,939 topics in Section 3.3, 469 of those topics were not indexed.

The preprocessing phase of the EuroGOV corpus was carried out with two scripts that obtain index terms for each document. The first script uses regular expressions to exclude all information enclosed by the characters $<$ and $>$. Although this script obtains very good results, it is very slow, and we therefore used it with only three domains of the EuroGOV collection: Spanish (ES), French (FR), and German (DE).
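The regex-based tag stripping described above can be sketched as follows (a minimal illustration of the technique, not the authors' actual script):

```python
import re

# Remove everything enclosed by '<' and '>', keeping only the text between tags.
# Replacing tags with a space avoids gluing adjacent words together.
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(html: str) -> str:
    text = TAG_RE.sub(" ", html)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

print(strip_tags("<html><body><p>Hola mundo</p></body></html>"))
```

Such a regex pass is simple but must rescan the whole document for every substitution, which is one plausible reason this approach is slow on large collections.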

For the remaining domains, we wrote a script based on the HTML syntax that extracts all terms considered interesting for indexing, i.e., those different from script code (JavaScript, VBScript, cascading style sheets, etc.) and HTML markup. This script sped up our indexing process, but it did not take into account that some web pages were incorrectly written, and we therefore missed important information from those documents.
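An HTML-syntax-aware extractor of this kind might look like the following sketch, which skips the content of script and style elements while collecting all other text as index terms (a hypothetical reconstruction; the paper does not show the original script):

```python
from html.parser import HTMLParser

class TermExtractor(HTMLParser):
    """Collect text tokens from HTML, skipping <script> and <style> content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.terms = []
        self._skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.terms.extend(data.split())

def extract_terms(html: str):
    parser = TermExtractor()
    parser.feed(html)
    parser.close()
    return parser.terms

print(extract_terms("<p>uno dos</p><script>var x = 1;</script><p>tres</p>"))
```

Because such a parser relies on well-formed start and end tags, malformed pages (e.g. an unclosed `<script>`) can silently swallow legitimate text, which matches the information loss reported above.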

For every page in the EuroGOV corpus, we also determined its language with TextCat [8], a widely used language identification program. We constructed our evaluation corpus from those documents identified as being in one of the languages listed above.
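TextCat classifies languages by comparing character n-gram profiles. The following toy sketch illustrates the idea with simple trigram overlap (TextCat itself uses ranked n-gram profiles and an out-of-place distance measure; this is a simplified, hypothetical version with made-up training snippets):

```python
from collections import Counter

def profile(text: str, n: int = 3) -> Counter:
    """Build a character n-gram frequency profile, padded with spaces."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def identify(text: str, profiles: dict) -> str:
    """Return the language whose profile shares the most n-grams with the text."""
    doc = profile(text)
    def overlap(lang):
        return sum(min(doc[g], profiles[lang][g]) for g in doc)
    return max(profiles, key=overlap)

# Tiny illustrative training texts (not real training data).
profiles = {
    "es": profile("el gobierno de este estado publica la informacion oficial"),
    "de": profile("die regierung dieses staates veroeffentlicht amtliche informationen"),
}
print(identify("la informacion del gobierno", profiles))
```

In practice a real identifier is trained on substantial per-language corpora; with the two toy profiles above, a short Spanish query is nevertheless ranked correctly.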

Another preprocessing problem concerned the character-set encoding, which made the analysis even more difficult. Although the EuroGOV corpus is distributed in UTF-8, the documents that make up the corpus do not necessarily use this encoding. For some domains the encoding is declared in the HTML meta tag, but we found that this declaration is very often wrong. We consider character-set detection the most difficult problem in the preprocessing step. Finally, we eliminated stopwords for each language (except Greek) and punctuation symbols. For the evaluation of this corpus, WebCLEF-2006 provided a set of queries, which we processed with the same preprocessing described above.
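Since a declared charset may be wrong, a defensive decoding step has to verify each candidate encoding before trusting it. The sketch below shows one plausible best-effort strategy (an assumption for illustration; the paper does not give the exact procedure): try the meta-declared charset first, then fall back to UTF-8 and Latin-1.

```python
import re

# Find a declared charset in raw HTML bytes, e.g. <meta charset="utf-8">
# or <meta ... content="text/html; charset=iso-8859-1">.
META_RE = re.compile(rb'charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

def decode_page(raw: bytes) -> str:
    candidates = []
    declared = META_RE.search(raw)
    if declared:
        candidates.append(declared.group(1).decode("ascii", "ignore"))
    candidates += ["utf-8", "iso-8859-1"]  # fallbacks; Latin-1 never fails
    for enc in candidates:
        try:
            return raw.decode(enc)
        except (LookupError, UnicodeDecodeError):
            continue  # wrong or unknown declaration: try the next candidate
    return raw.decode("utf-8", "replace")

page = '<meta charset="utf-8"><p>año</p>'.encode("utf-8")
print(decode_page(page))
```

The hard cases are documents whose declared charset decodes without error but is nevertheless wrong, which is consistent with charset detection being the most difficult part of this preprocessing step.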


David Pinto 2007-05-08