Next: The QRDP probabilistic model Up: Using Query-Relevant Documents Pairs Previous: Using Query-Relevant Documents Pairs

Introduction

The fast growth of the Internet and the increasing multilinguality of the web pose an additional challenge for language technology. Novel techniques for managing data are therefore needed, especially when the information is in multiple languages. There are many situations in which users are interested in information written in a language other than their native one. A common scenario is that of a user who has some comprehension of a given language but is not proficient enough to confidently formulate a search request in that language. A search engine that can deal with this cross-lingual problem would therefore be of great benefit.

In Cross-Language Information Retrieval (CLIR), the usual approach is to first translate the query into the target language and then retrieve documents in this language by using a conventional, monolingual information retrieval system. The translation system may be of any type (rule-based, statistical, hybrid, etc.). For instance, in [1] and [2] a statistical machine translation system is used, which must first be trained on parallel texts. See [3], [4], and [5] for surveys of CLIR.

From our perspective, the above two-step approach is too sensitive to translation errors produced during the first step. Even with a very accurate retrieval system, translation errors prevent the correct retrieval of relevant documents. To overcome this drawback, we propose to use a set of queries, together with their respective sets of relevant documents, as the training set for a direct probabilistic cross-lingual information retrieval system which integrates both steps into a single one. This is done on the basis of the IBM alignment model 1 (IBM-1) for statistical machine translation [6]. Probabilistic approaches that use parallel corpora to translate the input queries by means of a statistical dictionary have been used in CLIR for many years (see [2]). However, our aim is not to translate queries but to obtain a set of associated words for a given query. A parallel corpus therefore does not make sense for our purpose, since we need to find a possible set of relevant documents for each given query. To our knowledge, this approach has not been presented earlier in the literature.
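The idea of learning word associations directly from query-relevant document pairs can be illustrated with a minimal sketch of IBM-1 style EM training. This is not the model defined in the following sections: the function name `train_ibm1`, the toy data, the direction of the dictionary (document word given query word) and the omission of the NULL word are all assumptions made here for illustration.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate t[(doc_word, query_word)] = p(doc_word | query_word) with EM.

    `pairs` is a list of (query_tokens, document_tokens) tuples, where each
    document is assumed to be relevant to its query (hypothetical input format).
    """
    doc_vocab = {w for _, doc in pairs for w in doc}
    uniform = 1.0 / len(doc_vocab)
    # Start from a uniform dictionary over the document vocabulary.
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected (doc_word, query_word) counts
        total = defaultdict(float)  # expected counts per query word

        # E-step: distribute each document word over the query words.
        for query, doc in pairs:
            for dw in doc:
                norm = sum(t[(dw, qw)] for qw in query)
                for qw in query:
                    frac = t[(dw, qw)] / norm
                    count[(dw, qw)] += frac
                    total[qw] += frac

        # M-step: renormalise the expected counts per query word.
        t = defaultdict(float)
        for (dw, qw), c in count.items():
            t[(dw, qw)] = c / total[qw]

    return t


# Toy English queries paired with toy Spanish "relevant documents".
pairs = [
    (["house"], ["casa"]),
    (["house", "blue"], ["casa", "azul"]),
]
t = train_ibm1(pairs)
```

In this toy run, "casa" ends up more strongly associated with "house" than "azul" is, although no sentence-aligned parallel corpus was used, only queries and the documents judged relevant to them.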

We carried out experiments on a subset of the EuroGOV corpus [7], which was first used in the bilingual English to Spanish subtask of WebCLEF 2005 [8]. We also propose a reduction of the document index in order to improve the precision of our approach and to reduce its storage requirements. The corpus reduction is based on a technique for selecting mid-frequency terms, named the Transition Point (TP), which has been used in other research works with the same purpose [9,10]. We evaluated four different percentages of TP and observed that it is possible to improve precision by reducing the number of terms for a given corpus.
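The selection of mid-frequency terms can be sketched as follows. This section does not give the TP formula, so the sketch assumes one formulation commonly cited in the TP literature, TP = (sqrt(8 * I1 + 1) - 1) / 2, where I1 is the number of terms occurring exactly once. The neighbourhood parameter `width` and both function names are hypothetical, and the definition actually used in the experiments may differ.

```python
import math
from collections import Counter


def transition_point(freqs):
    """Transition point under the assumed formula (I1 = terms with frequency 1)."""
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def select_mid_frequency_terms(tokens, width=0.4):
    """Keep terms whose frequency lies within +/- `width` (a fraction) of TP."""
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    low, high = tp * (1 - width), tp * (1 + width)
    return {term for term, f in freqs.items() if low <= f <= high}


# Toy text: three terms occur once, so TP = (sqrt(25) - 1) / 2 = 2.
tokens = ["a", "b", "c", "d", "d", "e", "e", "e", "e", "e"]
kept = select_mid_frequency_terms(tokens, width=0.4)
```

With a neighbourhood of 40% around TP = 2, only terms with frequency between 1.2 and 2.8 are kept, which discards both the rare terms and the very frequent one.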

Sections 2 and 3 describe the query-relevant document pairs model in detail. Section 4 introduces the corpus used in the experiments and explains how we implemented the reduction process. The results obtained in the evaluation are presented in Section 5 and discussed in Section 6.


David Pinto 2007-10-05