Indexing reduction


After our first participation in WebCLEF [4], we carried out more experiments using only the Spanish-language documents from the EuroGOV corpus. We observed that a value of

using the reduction process shown in Equation (1) was adequate. Therefore, in this test we carried out one run with that value. Moreover, this run used the evaluation corpus obtained by reducing every text with the TP technique, using a neighbourhood of 40% around TP, and enriching the resulting set of terms with related terms as described by Equation (2).
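As a rough illustration of the term selection described above, the following Python sketch keeps the terms whose frequency falls within 40% of the transition point. Equation (1) is not reproduced in this section, so the sketch assumes a commonly used closed-form estimate of the transition point, TP = (sqrt(8 I_1 + 1) - 1) / 2, where I_1 is the number of terms occurring exactly once. The function names and the inclusive frequency interval are hypothetical, and the enrichment step of Equation (2) is not covered.

```python
import math
from collections import Counter


def transition_point(freqs):
    """Estimate TP from a term-frequency mapping (assumed closed form)."""
    # I_1: number of terms that occur exactly once in the text.
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def tp_neighbourhood(tokens, u=0.4):
    """Return the terms whose frequency lies within a fraction u of TP."""
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    low, high = tp * (1 - u), tp * (1 + u)
    # Assumption: both ends of the interval are inclusive.
    return {term for term, f in freqs.items() if low <= f <= high}


if __name__ == "__main__":
    # Toy text: ten terms occurring once, plus a few repeated terms.
    tokens = [f"h{i}" for i in range(10)]
    tokens += ["a"] * 4 + ["b"] * 3 + ["c"] * 6 + ["d"] * 2
    print(sorted(tp_neighbourhood(tokens)))
```

Mid-frequency terms survive the selection, while both very rare and very frequent terms are dropped, which is where the vocabulary reduction comes from.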

Table 1 shows the size of each evaluation corpus used (the vocabulary composed of the representations of all texts, $\vert TP_{SET}'\vert$), as well as the percentage of reduction obtained for each one with respect to the original vocabulary. As we can see, the TP technique obtained a vocabulary reduction of more than 95% in every domain, which implies a time reduction for any search engine indexing process.

**Table 1:** Vocabulary size and percentage of reduction.

| Domain | Size (KB) | Reduction (%) |
|--------|-----------|---------------|
| DE | 2,588 | 95.3 |
| AT | 2,317 | 97.2 |
| BE | 6,796 | 98.0 |
| DK | 1,189 | 97.9 |
| SI | 6,729 | 97.1 |
| ES | 16,271 | 98.5 |
| EE | 4,838 | 97.2 |
| IE | 2,632 | 96.0 |
| IT | 11,913 | 98.4 |
| SK | 14,668 | 97.5 |
| LU | 3,212 | 99.2 |
| MT | 4,817 | 95.7 |
| NL | 20,324 | 97.7 |
| LV | 21,213 | 97.8 |
| PT | 9,134 | 97.6 |
| FR | 22,083 | 95.8 |
| CY | 18,814 | 96.5 |
| GR | 340 | 97.4 |
| HU | 10,440 | 98.8 |
| UK | 14,239 | 96.1 |
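
The claim that every domain exceeds a 95% reduction can be checked directly against the figures in Table 1. The short Python sketch below does so. The helper `reduction_percentage` assumes the usual definition of reduction relative to the original vocabulary, 100 × (original − reduced) / original; its name is hypothetical, and it is shown only to make that definition explicit, since the original vocabulary sizes are not listed here.

```python
# Reduction percentages copied from Table 1, keyed by domain.
REDUCTION = {
    "DE": 95.3, "AT": 97.2, "BE": 98.0, "DK": 97.9, "SI": 97.1,
    "ES": 98.5, "EE": 97.2, "IE": 96.0, "IT": 98.4, "SK": 97.5,
    "LU": 99.2, "MT": 95.7, "NL": 97.7, "LV": 97.8, "PT": 97.6,
    "FR": 95.8, "CY": 96.5, "GR": 97.4, "HU": 98.8, "UK": 96.1,
}


def reduction_percentage(original, reduced):
    """Assumed definition: share of the original vocabulary removed."""
    return 100.0 * (original - reduced) / original


# Domains with the smallest and largest reduction, and the average.
lowest = min(REDUCTION, key=REDUCTION.get)
highest = max(REDUCTION, key=REDUCTION.get)
mean = sum(REDUCTION.values()) / len(REDUCTION)

print(f"lowest:  {lowest} ({REDUCTION[lowest]}%)")
print(f"highest: {highest} ({REDUCTION[highest]}%)")
print(f"mean:    {mean:.2f}%")
print("all above 95%:", all(v > 95.0 for v in REDUCTION.values()))
```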


David Pinto 2007-05-08