Let be the transition point of the text . We can calculate the MI score of each term as . TPMI then assigns the final score:
The results obtained with this refined method are shown in Figure 2, where this approach obtains the best value of the measure. On the whole collection, the DF and TS methods gave clustering results very similar to those obtained on the hep-ex subset. The TS method reached the maximum value (0.5925) with 43% of terms, which corresponds to a collection vocabulary size of 2,644 terms, and only 3,318 terms satisfy the threshold . Although the DF method is very stable, its values remain below the baseline (0.5919). The TPMI method showed a high peak () when taking 20 terms, giving a vocabulary size of 42,167 terms.