

The QRDP probabilistic model

Let $x$ be a query text in a certain input (source) language, and let $y_1, y_2, \cdots, y_W$ be a collection of $W$ web pages in a different output (target) language. Let $\mathcal{X}$ and $\mathcal{Y}$ be their associated input and output vocabularies, respectively. Given a number $k < W$, we must find the $k$ web pages most relevant to the input query $x$. To do this, we follow a probabilistic approach in which the $k$ most relevant web pages are computed as those most probable given $x$, i.e.,

$\displaystyle \{y_1^*(x), \cdots, y_k^*(x)\} = \operatornamewithlimits{arg max}_{\substack{S \subseteq \{y_1, \cdots, y_W\} \\ \vert S\vert=k}} \operatornamewithlimits{min}_{y \in S} \,\, p(y\,\vert\,x)$ (1)

In the particular case of $k=1$, Equation (1) simplifies to

$\displaystyle y_1^*(x)=\operatornamewithlimits{arg max}_{y=y_1, \cdots, y_W} p(y\,\vert\,x)$ (2)
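Computationally, the criterion in Equations (1) and (2) amounts to a top-$k$ selection: the size-$k$ set whose minimum probability is maximal is simply the set of the $k$ highest-scoring pages. The following minimal Python sketch assumes that web pages are given as tokenized word sequences and that some model of $\log p(y\,\vert\,x)$ is available (e.g., the IBM-1 scorer derived below); the function name top_k_pages is our own, hypothetical choice:

import heapq

def top_k_pages(x, pages, k, log_prob):
    """Equation (1): the size-k set S maximizing min_{y in S} p(y|x)
    consists of the k pages with the highest p(y|x); with k=1 this
    reduces to Equation (2). log_prob(y, x) models log p(y|x)."""
    return heapq.nlargest(k, pages, key=lambda y: log_prob(y, x))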

In this work, $p(y\,\vert\,x)$ is modelled using the well-known IBM alignment model 1 (IBM-1) for statistical machine translation [6,11]. This model assumes that each word in the web page is connected to exactly one word in the query. It is also assumed that the query has an initial ``null'' word, to which words in the web page with no direct connection are linked. Formally, a hidden variable $a=a_1a_2\cdots a_{\vert y\vert}$ is introduced to reveal, for each position $i$ in the web page, the query word position $a_i\in\{0,1,\dotsc,\vert x\vert\}$ to which it is connected. Thus,

$\displaystyle p(y\,\vert\,x) = \sum_{a\in\mathcal{A}(x,y)} p(y,a\,\vert\,x)$ (3)

where $\mathcal{A}(x,y)$ denotes the set of all possible alignments between $x$ and $y$. The alignment-completed probability $p(y,a\,\vert\,x)$ can be decomposed into individual, web-page position-dependent probabilities as:

$\displaystyle p(y,a\,\vert\,x) = \prod_{i=1}^{\vert y\vert} p(y_i,a_i\,\vert\,a_1^{i-1},y_1^{i-1},x)$ (4)

$\displaystyle \phantom{p(y,a\,\vert\,x)} = \prod_{i=1}^{\vert y\vert} p(a_i\,\vert\,a_1^{i-1},y_1^{i-1},x)\, p(y_i\,\vert\,a_1^{i},y_1^{i-1},x)$ (5)

In the case of the IBM-1 model, it is assumed that $ a_i$ is uniformly distributed

$\displaystyle p(a_i\,\vert\,a_1^{i-1}, y_1^{i-1},x) = \frac{1}{\vert x\vert+1}$ (6)

and that $ y_i$ only depends on the query word to which it is connected

$\displaystyle p(y_i\,\vert\,a_1^i, y_1^{i-1}, x)=p(y_i\,\vert\,x_{a_i})$ (7)

Substituting (6) and (7) into (5), and then (5) into (3), we may write the IBM-1 model, after some straightforward manipulations, as follows (the key step from (8) to (9) is that, because each $a_i$ ranges independently over $\{0,1,\dotsc,\vert x\vert\}$, the sum over all alignments of a product of per-position terms factorizes into a product of per-position sums):

$\displaystyle p(y\,\vert\,x) = \sum_{a\in\mathcal{A}(x,y)} \prod_{i=1}^{\vert y\vert} \frac{1}{(\vert x\vert+1)}\, p(y_i\,\vert\,x_{a_i})$ (8)

$\displaystyle \phantom{p(y\,\vert\,x)} = \frac{1}{(\vert x\vert+1)^{\vert y\vert}} \prod_{i=1}^{\vert y\vert} \sum_{j=0}^{\vert x\vert} p(y_i\,\vert\,x_j)$ (9)
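Equation (9) translates directly into code. The sketch below assumes (our choice, not prescribed by the model) that the dictionary is stored as a mapping t[(w, v)] holding $p(w\,\vert\,v)$ and that queries and pages are token lists; it works in log space, a standard precaution against underflow on long pages:

import math

def ibm1_log_prob(y, x, t):
    """log p(y|x) under Equation (9). y is a web page and x a query,
    both token lists; t[(w, v)] holds the dictionary entry p(w|v)."""
    x = ["<null>"] + list(x)              # position 0: the ``null'' word
    log_p = -len(y) * math.log(len(x))    # the 1/(|x|+1)^|y| factor
    for y_i in y:                         # product over page positions i
        s = sum(t.get((y_i, x_j), 0.0) for x_j in x)  # sum over j = 0..|x|
        if s == 0.0:
            return float("-inf")          # y_i unreachable from this query
        log_p += math.log(s)
    return log_p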

Note that this model is governed only by a statistical dictionary $\boldsymbol{\Theta} = \{p(w\,\vert\,v) : v \in \mathcal{X},\, w \in \mathcal{Y}\}$. The model assumes that the order of the words in the query is not important: each position in a document is equally likely to be connected to each position in the query. Although this assumption is unrealistic in machine translation, we consider the IBM-1 model to be particularly well-suited for our approach.
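As a toy illustration of this word-order independence (with an invented two-entry dictionary), permuting the query words leaves the score of the previous sketch unchanged, since Equation (9) sums over all query positions for every page word:

# Hypothetical toy dictionary Theta: t[(w, v)] = p(w|v), where w is a
# target-language (web page) word and v a source-language (query) word.
t = {("casa", "house"): 0.8, ("casa", "<null>"): 0.1,
     ("blanca", "white"): 0.7, ("blanca", "<null>"): 0.1}

page, query = ["casa", "blanca"], ["white", "house"]
assert math.isclose(ibm1_log_prob(page, query, t),
                    ibm1_log_prob(page, list(reversed(query)), t))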

