
The Kullback-Leibler Distance

In 1951, Kullback and Leibler studied a measure of information from a statistical viewpoint; this measure involves two probability distributions associated with the same experiment [13]. The Kullback-Leibler (KL) divergence measures how different two probability distributions (over the same event space) are. The KL divergence of the probability distributions $ P$ and $ Q$ on a finite set $ X$ is defined as shown in Equation 1.

$\displaystyle D_{KL}(P\vert\vert Q) = \sum_{x \in X}{P(x) \log \frac{P(x)}{Q(x)}}$ (1)
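Equation 1 can be sketched in a few lines of Python (not part of the original work; a minimal illustration assuming distributions are represented as dicts over a shared finite event space, with $ Q(x) > 0$ wherever $ P(x) > 0$):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(P||Q) over a shared finite event space.

    p, q: dicts mapping events to probabilities. Assumes q[x] > 0 for every
    event where p[x] > 0 (the back-off smoothing below guarantees this).
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

P = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"a": 0.4, "b": 0.4, "c": 0.2}
print(kl_divergence(P, Q))  # non-negative; zero only when P == Q
print(kl_divergence(Q, P))  # generally different: D_KL is not symmetric
```

The two printed values differ, which is exactly the asymmetry that motivates the symmetric variants discussed next.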

Since the KL divergence is a non-symmetric information-theoretical measure of the distance of $ P$ from $ Q$, it is not strictly a distance metric. Over the years, various symmetric generalizations of this measure have been introduced in the literature. We have therefore used the following symmetric Kullback-Leibler divergences, i.e., Kullback-Leibler Distances (KLD), in our experiments. The KLDs correspond to the definitions of Kullback and Leibler [13], Bigi [4], Jensen [10], and Bennett [2] [27], respectively.

$\displaystyle D_{KLD1}(P\vert\vert Q) = D_{KL}(P\vert\vert Q) + D_{KL}(Q\vert\vert P)$ (2)

$\displaystyle D_{KLD2}(P\vert\vert Q) = \sum_{x \in X}{(P(x)-Q(x)) \log \frac{P(x)}{Q(x)}}$ (3)

$\displaystyle D_{KLD3}(P\vert\vert Q) = \frac{1}{2} \left [ D_{KL} \left ( P\vert\vert\frac{P+Q}{2} \right ) + D_{KL} \left ( Q\vert\vert\frac{P+Q}{2} \right ) \right ]$ (4)

$\displaystyle D_{KLD4}(P\vert\vert Q) = \max \left ( D_{KL}(P\vert\vert Q), D_{KL}(Q\vert\vert P) \right )$ (5)
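The four symmetric variants in Equations (2)–(5) can be sketched as follows (an illustrative implementation, not the authors' code; distributions are again dicts over a shared event space):

```python
import math

def kl(p, q):
    """D_KL(P||Q); assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def kld1(p, q):
    # Kullback and Leibler (Eq. 2): symmetrised sum of both directions
    return kl(p, q) + kl(q, p)

def kld2(p, q):
    # Bigi (Eq. 3): weighted-difference form
    return sum((p[x] - q[x]) * math.log(p[x] / q[x])
               for x in p if p[x] > 0 and q[x] > 0)

def kld3(p, q):
    # Jensen (Eq. 4): average divergence to the midpoint distribution
    m = {x: (p[x] + q[x]) / 2 for x in p}
    return 0.5 * (kl(p, m) + kl(q, m))

def kld4(p, q):
    # Bennett (Eq. 5): larger of the two directed divergences
    return max(kl(p, q), kl(q, p))
```

Note that when both distributions assign non-zero probability to every event, Equation (3) expands algebraically into Equation (2); the variants differ in behaviour mainly at the extremes and in scale.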

KL and KLD have been used in many natural language applications such as query expansion [8], language models [3], and categorization [4]. They have also been used, for instance, in natural language and speech processing applications based on statistical language modeling [9], and in information retrieval, for topic identification [5]. In this paper, we calculate the corpus document similarities as an inverse function of the distances defined in Equations (2), (3), (4), and (5).

In the text clustering model proposed in this paper, a document $ j$ is represented by a term vector of probabilities $ \overrightarrow{d_j}$, and the distance measure is, therefore, the KLD (the symmetric Kullback-Leibler divergence) between a pair of documents $ \overrightarrow{d_i}$ and $ \overrightarrow{d_j}$.

In practice, not all the terms in the vocabulary ($ V$) appear in a given document $ d_j$. A smoothing model based on back-off is therefore proposed: the frequencies of the terms appearing in the document are discounted, whereas all the terms which do not appear in the document are given an epsilon ($ \varepsilon$) probability, equal to the probability of unknown words. Let $ V(d_j) \subset V$ be the vocabulary of the terms which do appear in the document $ d_j$. For the terms not in $ V(d_j)$, it is necessary to introduce a back-off probability for $ P(t_k, d_j)$; otherwise the distance measure would be infinite. The use of a back-off probability to overcome the data sparseness problem has been extensively studied in statistical language modelling (see, for instance, [17]). The resulting definition of the document probability $ P(t_k, d_j)$ is:

$\displaystyle P(t_k, d_j) = \begin{cases}\beta*P(t_k \vert d_j), & \mbox{ if } t_k \mbox{ occurs in the document } d_j \\ \varepsilon, & \mbox{ otherwise} \end{cases}$ (6)


$\displaystyle P(t_k \vert d_j) = \frac{tf(t_k, d_j)}{\sum_{t_x \in d_j}tf(t_x, d_j)} $

where $ P(t_k \vert d_j)$ is the probability of the term $ t_k$ in the document $ d_j$; $ \beta$ is a normalisation coefficient which varies according to the size of the document; and $ \varepsilon$ is a threshold probability for all the terms not in $ d_j$.

Equation 6 must respect the following property:

$\displaystyle \sum_{k \in d_j}{\beta*P(t_k \vert d_j)} + \sum_{k \in V, k \notin d_j}{\varepsilon} = 1$

and $ \beta$ can be easily estimated for a document with the following computation:

$\displaystyle \beta = 1 - \sum_{k \in V, k \notin d_j}{\varepsilon} $
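The smoothing scheme of Equation 6, together with this estimate of $ \beta$, can be sketched as follows (an illustrative implementation under assumed inputs, not the authors' code; the value of $ \varepsilon$ here is an arbitrary placeholder):

```python
from collections import Counter

EPSILON = 1e-6  # assumed value for the unseen-term probability threshold

def backoff_distribution(doc_terms, vocabulary):
    """Smoothed term probabilities P(t_k, d_j) following Equation 6 (sketch).

    Terms occurring in the document get their relative frequency scaled
    by beta = 1 - (number of unseen vocabulary terms) * EPSILON; every
    vocabulary term absent from the document gets the constant EPSILON,
    so the smoothed probabilities still sum to one.
    """
    tf = Counter(doc_terms)
    total = sum(tf.values())
    n_unseen = len(vocabulary) - len(tf)
    beta = 1.0 - n_unseen * EPSILON
    return {t: beta * tf[t] / total if t in tf else EPSILON
            for t in vocabulary}

vocab = {"kl", "divergence", "cluster", "corpus", "text"}
dist = backoff_distribution(["kl", "kl", "divergence"], vocab)
```

Because every term in $ V$ receives a non-zero probability, the logarithms in the KLD computations above are always defined, which is the whole point of the back-off.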

David Pinto 2007-05-08