Table 1 shows the size of each evaluation corpus, the vocabulary produced by the representation of all its texts, and the percentage of reduction obtained with respect to the original vocabulary. As can be seen, the TP technique reduced the vocabulary by more than 95% in every corpus, which implies a shorter indexing time for any search engine.
| Domain | DE | AT | BE | DK | SI |
| Size (KB) | 2,588 | 2,317 | 6,796 | 1,189 | 6,729 |
| Reduction (%) | 95.3 | 97.2 | 98.0 | 97.9 | 97.1 |

| Domain | ES | EE | IE | IT | SK |
| Size (KB) | 16,271 | 4,838 | 2,632 | 11,913 | 14,668 |
| Reduction (%) | 98.5 | 97.2 | 96.0 | 98.4 | 97.5 |

| Domain | LU | MT | NL | LV | PT |
| Size (KB) | 3,212 | 4,817 | 20,324 | 21,213 | 9,134 |
| Reduction (%) | 99.2 | 95.7 | 97.7 | 97.8 | 97.6 |

| Domain | FR | CY | GR | HU | UK |
| Size (KB) | 22,083 | 18,814 | 340 | 10,440 | 14,239 |
| Reduction (%) | 95.8 | 96.5 | 97.4 | 98.8 | 96.1 |
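
For readers who wish to reproduce the reduction column, the following minimal sketch illustrates one plausible way to compute it. It assumes that the reduction is the relative decrease in the number of distinct terms, that is, 100 × (1 − |V_reduced| / |V_original|), and that a vocabulary is the set of distinct lowercase whitespace-separated tokens. Both assumptions, the function names `vocabulary` and `reduction_percentage`, and the toy texts are illustrative only and are not taken from the experiments reported here.

```python
def vocabulary(texts):
    """Return the set of distinct lowercase whitespace-separated tokens.

    Assumption: a simple whitespace tokenizer; the tokenization used
    for the corpora in Table 1 may differ.
    """
    return {token for text in texts for token in text.lower().split()}


def reduction_percentage(original_vocab, reduced_vocab):
    """Relative decrease in vocabulary size, expressed as a percentage."""
    if not original_vocab:
        raise ValueError("original vocabulary must not be empty")
    return 100.0 * (1.0 - len(reduced_vocab) / len(original_vocab))


# Toy corpus (hypothetical): the full texts and a reduced representation
# that keeps only a few selected terms per text.
original_texts = ["the cat sat on the mat", "the dog sat on the log"]
reduced_texts = ["cat mat", "dog log"]

original_vocab = vocabulary(original_texts)  # 7 distinct terms
reduced_vocab = vocabulary(reduced_texts)    # 4 distinct terms

print(round(reduction_percentage(original_vocab, reduced_vocab), 1))  # 42.9
```

Under this definition, a reduction above 95% means that fewer than one in twenty of the original distinct terms remain to be indexed.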