Statistical machine translation has often been applied to CLIR in the literature; what we propose in this paper is to derive the translation (association) dictionary from pairs of queries and their relevant documents. The probabilistic model assumes that the order of the words in the query is not important, so each position in a document is equally likely to be connected to each position in the query. Although this assumption is unrealistic in machine translation, we consider the IBM-1 model to be particularly well suited to our approach.
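As an illustration of how such an association dictionary can be estimated, the following is a minimal sketch of standard IBM-1 training by expectation-maximisation, not the exact implementation used in our experiments. It assumes that each training pair is a tokenised query together with the tokens of one relevant document, that the table stores t(d | q), the probability of a document word given a query word, and that the NULL word is omitted for brevity. The function name `train_ibm1`, its parameters, and the toy data are hypothetical.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate t(d | q) with IBM-1 EM.

    pairs: list of (query_words, document_words) tuples (hypothetical format).
    Returns a dict-like table keyed by (document_word, query_word).
    """
    doc_vocab = {d for _, doc in pairs for d in doc}
    if not doc_vocab:
        return defaultdict(float)

    # Uniform initialisation: every document word is equally likely
    # to be associated with every query word.
    uniform = 1.0 / len(doc_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts for (d, q)
        total = defaultdict(float)  # expected counts for q

        # E-step: word order is ignored, so each document position may
        # connect to any query position; only t(d | q) weights the choice.
        for query, doc in pairs:
            if not query:
                continue  # nothing to align against
            for d in doc:
                norm = sum(t[(d, q)] for q in query)
                for q in query:
                    frac = t[(d, q)] / norm
                    count[(d, q)] += frac
                    total[q] += frac

        # M-step: renormalise the expected counts per query word.
        t = defaultdict(
            float,
            {(d, q): c / total[q] for (d, q), c in count.items()},
        )
    return t


# Toy cross-language data (invented for illustration only).
pairs = [
    (["casa"], ["house"]),
    (["casa", "verde"], ["green", "house"]),
]
table = train_ibm1(pairs, iterations=10)
```

On this toy data, the association between "casa" and "house" grows with each iteration, which in turn pushes "verde" towards "green". Because the uniform alignment prior cancels out in the E-step, only the lexical table has to be stored, which is what makes the model convenient for deriving a dictionary from query-document pairs.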
We used a term selection technique to reduce the size of the training corpus, with good results. For instance, with a reduction of 82.5%, the results can improve on those obtained with the complete corpus.
Finally, we emphasize that the QRDP probabilistic model is language independent and can therefore be employed to model cross-language query-document pairs in any language.