next up previous
Next: Maximum likehood estimation Up: Using Query-Relevant Documents Pairs Previous: Introduction

The QRDP probabilistic model

Lex $ x$ be a query text in a certain input (source) language, and let $ y_1, y_2, \cdots, y_W$ be a collection of $ W$ web pages in a different output (target) language. Let $ \mathcal{X}$ and $ \mathcal{Y}$ be their associated input and output vocabularies, respectively. Given a number $ k < W$, we have to find the $ k$ most relevant web pages with respect to the input query $ x$. To do this, we have followed a probabilistic approach in which the $ k$ most relevant web pages are computed as those most probable given $ x$, i.e.,

$\displaystyle \{y_1^*(x), \cdots, y_k^*(x)\} = \operatornamewithlimits{arg max}...
... \\ Vert S\vert=k}} \operatornamewithlimits{min }_{y \in S} \,\, p(y\,\vert\,x)$ (1)

In the particular case of k=1, Equation (1) is simplified to

$\displaystyle y_1^*(x)=\operatornamewithlimits{arg max}_{y=y_1, \cdots, y_W} p(y\,\vert\,x)$ (2)

In this work, $ p(y\,\vert\,x)$ is modelled by using the well-known IBM alignment model 1 (IBM-1) for statistical machine translation [6,11]. This model assumes that each word in the web page is connected to exactly one word in the query. Also, it is assumed that the query has an initial ``null'' word to which words in the web page with no direct connexion are linked. Formally, a hidden variable $ a=a_1a_2\cdots a_{\vert y\vert}$ is introduced to reveal, for each position $ i$ in the web page, the query word position $ a_i\in\{0,1,\dotsc,\vert x\vert\}$ to which it is connected. Thus,

$\displaystyle p(y\,\vert\,x) = \sum_{a\in\mathcal{A}(x,y)} p(y,a\,\vert\,x)$ (3)

where $ \mathcal{A}(x,y)$ denotes the set of all possible alignments between $ x$ and $ y$. The alignment-completed probability $ p(y,a\,\vert\,x)$ can be decomposed in terms of individual, web page position-dependent probabilities as:

$\displaystyle p(y,a\,\vert\,x)$ $\displaystyle =\prod_{i=1}^{\vert y\vert} p(y_i,a_i\,\vert\,a_1^{i-1},y_1^{i-1},x)$ (4)
  $\displaystyle =\prod_{i=1}^{\vert y\vert} p(a_i\,\vert\,a_1^{i-1},y_1^{i-1},x) p(y_i\,\vert\,a_1^{i},y_1^{i-1},x)$ (5)

In the case of the IBM-1 model, it is assumed that $ a_i$ is uniformly distributed

$\displaystyle p(a_i\,\vert\,a_1^{i-1}, y_1^{i-1},x) = \frac{1}{\vert x\vert+1}$ (6)

and that $ y_i$ only depends on the query word to which it is connected

$\displaystyle p(y_i\,\vert\,a_1^i, y_1^{i-1}, x)=p(y_i\,\vert\,x_{a_i})$ (7)

By sustitution of (6) and (7) in (5); and thereafter (5) in (3), we may write the IBM-1 model as follows by some straighforward manipulations:

$\displaystyle p(y\,\vert\,x)$ $\displaystyle = \sum_{a\in\mathcal{A}(x,y)} \prod_{i=1}^{\vert y\vert} \frac{1}{(\vert x\vert+1)} p(y_i\,\vert\,x_{a_i})$ (8)
  $\displaystyle = \frac{1}{(\vert x\vert+1)^{\vert y\vert}} \prod_{i=1}^{\vert y\vert} \sum_{j=0}^{\vert x\vert} p(y_i\,\vert\,x_j)$ (9)

Note that this model is governed only by a statistical dictionary $ \ensuremath{\boldsymbol{\Theta}}$={$ p(w\vert v)$, for all $ v \in \mathcal{X}$ and $ w \in \mathcal{Y}$}. The model assumes that the order of the words in the query is not important. Therefore, each position in a document is equally likely to be connected to each position in the query. Although this assumption is unrealistic in machine translation, we consider the IBM-1 model is particularly well-suited for our approach.

next up previous
Next: Maximum likehood estimation Up: Using Query-Relevant Documents Pairs Previous: Introduction
David Pinto 2007-10-05