Using Query-Relevant Documents Pairs for Cross-Lingual Information Retrieval
David Pinto (1,2), Alfons Juan (1), Paolo Rosso (1)
(1) Department of Information Systems and Computation,
Polytechnic University of Valencia, Spain
(2) Faculty of Computer Science,
B. Autonomous University of Puebla, Mexico
{dpinto, ajuan, prosso}@dsic.upv.es
Abstract:
The World Wide Web is a natural setting for cross-lingual
information retrieval. The European Union is a typical example of a
multilingual scenario, where users have to deal with
information published in at least 20 languages. Given queries in some
source language and a target corpus in another language, the typical
approach consists of translating either the query or the target
dataset into the other language. Other approaches use parallel corpora to
obtain a statistical dictionary of words across the different languages.
In this work, we propose to use a training corpus made up of a set of
Query-Relevant Document Pairs (QRDP) in a probabilistic cross-lingual
information retrieval approach based on the IBM alignment
Model 1 for statistical machine translation. Our approach has two main
advantages over those that use direct translation or parallel corpora.
First, we do not obtain a translation of the query, but a set of
associated words that share their meaning in some way; the resulting
dictionary is therefore, in a broad sense, more semantic than a
translation dictionary. Second, since the queries are supervised, we
work in a more restricted domain than when using a general
parallel corpus (it is well known that results in a restricted domain
are better than those obtained in a general one). To assess the
quality of our approach, we compared its results with those obtained
by directly translating the queries with a query translation system,
and observed promising results.
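
To make the idea concrete, the following is a minimal sketch of how IBM
Model 1 could be trained with EM on query-relevant document pairs and then
used to rank documents. It is not the authors' implementation. The names
train_ibm1 and score, the direction of the dictionary p(query word |
document word), the omission of the NULL word, and the toy English/Spanish
data are all assumptions made for illustration.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate t[(q_word, d_word)] = p(q_word | d_word) with EM.

    pairs: list of (query_tokens, document_tokens), the query in the
    source language and a relevant document in the target language.
    Simplification (assumption): no NULL word is added to the document.
    """
    query_vocab = {w for q, _ in pairs for w in q}
    # Uniform initialisation over the query vocabulary.
    uniform = 1.0 / len(query_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)   # expected co-occurrence counts
        total = defaultdict(float)   # expected counts per document word
        # E-step: distribute each query word over the document words.
        for q, d in pairs:
            for qw in q:
                norm = sum(t[(qw, dw)] for dw in d)
                for dw in d:
                    frac = t[(qw, dw)] / norm
                    count[(qw, dw)] += frac
                    total[dw] += frac
        # M-step: renormalise so that sum over q_word of p(q_word | d_word) = 1.
        t = defaultdict(float)
        for (qw, dw), c in count.items():
            t[(qw, dw)] = c / total[dw]
    return t


def score(query, doc, t):
    """Model 1 likelihood of the query given the document (up to a constant)."""
    s = 1.0
    for qw in query:
        s *= sum(t.get((qw, dw), 0.0) for dw in doc) / len(doc)
    return s


# Invented toy data: (query in English, relevant document in Spanish).
toy_pairs = [
    (["house"], ["casa"]),
    (["house", "green"], ["casa", "verde"]),
    (["green"], ["verde"]),
]
toy_dictionary = train_ibm1(toy_pairs)
```

Under these assumptions, ranking a target-language collection for a
source-language query amounts to sorting its documents by score. Because
the dictionary is learned from queries and their relevant documents rather
than from sentence-aligned translations, a query word can be associated
with any document word it tends to co-occur with, not only with its
literal translation.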
David Pinto
2007-10-05