In Cross-Language Information Retrieval (CLIR), the usual approach is to first translate the query into the target language and then retrieve documents in that language with a conventional monolingual information retrieval system. The translation system may be of any type (rule-based, statistical, hybrid, etc.). For instance, in [1] and [2] a statistical machine translation system is used, which must first be trained on parallel texts. See [3], [4], and [5] for surveys of CLIR.
From our perspective, the above two-step approach is too sensitive to translation errors produced during the first step. Even with a very accurate retrieval system, translation errors prevent the correct retrieval of relevant documents. To overcome this drawback, we propose to use a set of queries, each with its respective set of relevant documents, as the training set for a direct probabilistic cross-lingual information retrieval system that integrates both steps into a single one. The system is based on the IBM alignment model 1 (IBM-1) for statistical machine translation [6]. Probabilistic approaches that use parallel corpora to translate the input queries by means of a statistical dictionary have been used in CLIR for many years (see [2]). However, our aim is not to translate queries but to obtain a set of associated words for a given query. A parallel corpus therefore does not make sense for our purpose, since we need to find a possible set of relevant documents for each given query. To our knowledge, this approach has not been presented previously in the literature.
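To make the idea concrete, the following is a minimal sketch of standard IBM-1 expectation-maximization training applied to query-relevant document pairs instead of parallel sentences. It is an illustration under stated assumptions, not the exact model described in Sections 2 and 3: the function names `train_ibm1` and `score`, the direction of the probability table t(q|d), the absence of a NULL word, the smoothing floor, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate t(q|d) with IBM-1 EM.

    pairs: list of (query_tokens, document_tokens). Each pair is a query
    and one document judged relevant to it (assumed input format).
    """
    query_vocab = {q for query, _ in pairs for q in query}
    # Uniform initialization over the query vocabulary.
    t = defaultdict(lambda: 1.0 / len(query_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts for (q, d)
        total = defaultdict(float)  # expected counts for d
        for query, doc in pairs:
            if not doc:
                continue  # nothing to align against
            for q in query:
                # E-step: distribute q over all words of the document.
                norm = sum(t[(q, d)] for d in doc)
                for d in doc:
                    frac = t[(q, d)] / norm
                    count[(q, d)] += frac
                    total[d] += frac
        # M-step: renormalize so that sum_q t(q|d) = 1 for every d.
        t = defaultdict(
            float, {(q, d): c / total[d] for (q, d), c in count.items()}
        )
    return t


def score(query, doc, t, floor=1e-9):
    """Log of the IBM-1 likelihood of the query given the document.

    The floor is an assumed smoothing constant that avoids log(0).
    """
    return sum(
        math.log(max(sum(t[(q, d)] for d in doc) / len(doc), floor))
        for q in query
    )


# Toy English queries paired with Spanish "relevant documents".
pairs = [
    (["tax", "office"], ["oficina", "impuestos"]),
    (["tax"], ["impuestos"]),
]
t = train_ibm1(pairs)
ranked = sorted([["impuestos"], ["oficina"]],
                key=lambda doc: score(["tax"], doc, t), reverse=True)
```

In this sketch a document is ranked directly by the likelihood of the source-language query under the learned table, so no explicit translation of the query is produced at any point.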
We carried out experiments on a subset of the EuroGOV corpus [7], which was first used in the bilingual English to Spanish subtask of WebCLEF 2005 [8]. We also propose a reduction of the document index in order to improve the precision of our approach and to decrease its storage requirements. The corpus reduction is based on a technique for selecting mid-frequency terms, named the Transition Point (TP), which has been used in other research works for the same purpose [9,10]. We evaluated four different percentages of TP and observed that precision can be improved by reducing the number of terms for a given corpus.
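The selection of mid-frequency terms can be sketched as follows. The sketch assumes one commonly cited formulation of the transition point, TP = (sqrt(8 * I1 + 1) - 1) / 2, where I1 is the number of terms that occur exactly once, and it keeps the terms whose frequency lies within a percentage-wide neighbourhood of TP. Whether this matches the exact variant used in [9,10], and the names `transition_point`, `select_mid_frequency_terms`, and `neighbourhood`, are assumptions made for illustration.

```python
import math
from collections import Counter


def transition_point(freqs):
    """Transition point from the number of terms with frequency 1 (I1)."""
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def select_mid_frequency_terms(tokens, neighbourhood=0.4):
    """Keep terms whose frequency is within +/- neighbourhood of TP.

    neighbourhood is a fraction of TP (0.4 means 40%); the default value
    is an assumed setting, not one reported in the text.
    """
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    low, high = tp * (1 - neighbourhood), tp * (1 + neighbourhood)
    return {term for term, f in freqs.items() if low <= f <= high}


# Toy document: one very frequent term, two mid-frequency terms,
# and six terms that occur only once (so I1 = 6 and TP = 3).
tokens = (["the"] * 10 + ["portal"] * 3 + ["tax"] * 2
          + ["a", "b", "c", "d", "e", "f"])
kept = select_mid_frequency_terms(tokens, neighbourhood=0.4)
```

Indexing only the terms in `kept` discards both the most frequent terms and the terms that occur once, which is the source of the reduction in index size.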
Sections 2 and 3 describe the query-relevant document pairs model in detail. Section 4 introduces the corpus used in the experiments and explains how we implemented the reduction process. The results of the evaluation are presented in Section 5 and discussed in Section 6.