Constructing and Examining Personalized Cooccurrence-based Thesauri on Web Pages

Sen Yoshida
NTT Communication Science Laboratories, NTT Corporation
2-4, Hikaridai, Seika-cho
Soraku-gun, Kyoto 619-0237 Japan
+81 774 93 5235
Takashi Yukawa
Nagaoka University of Technology
1603-1 Kamitomioka-cho
Nagaoka-shi, Niigata 940-2188 Japan
+81 258 47 9532
Kazuhiro Kuwabara
NTT Communication Science Laboratories, NTT Corporation
2-4, Hikaridai, Seika-cho
Soraku-gun, Kyoto 619-0237 Japan
+81 774 93 5230


Interests and knowledge differ from person to person. Such differences are reflected in the semantics of words and terms. For example, one user may regard the terms ``notebook computer'' and ``PDA'' as similar from the perspective of portability, while regarding the term ``desktop computer'' as dissimilar on this basis. For another user, the terms ``notebook computer'' and ``desktop computer'' are similar because both are computers, but the term ``PDA'' is not.

The authors have proposed a collaborative Information Retrieval (IR) scheme for personal documents with a mechanism for adapting to divergence in word semantics [4]. This scheme is targeted to IR on personal repositories. A personal repository is storage space in a personal computer that stores personal documents such as emails and technical papers that the user writes or downloads. The IR function of a personal repository provides a search capability to the user as well as a knowledge circulation capability to the user's acquaintances by accepting queries from them and answering the queries.

This collaborative IR scheme personalizes the IR function by utilizing concept bases, which are personal cooccurrence-based thesauri constructed from documents in personal repositories. A cooccurrence-based thesaurus [2] represents the similarity between two words as the cosine of the corresponding vectors, and it is constructed by reducing the dimensionality of a cooccurrence matrix for a corpus. The collaborative IR scheme constructs a cooccurrence-based thesaurus by regarding the set of documents in the personal repository as a corpus. Therefore, the constructed thesaurus reflects the tendency of word usage in the documents written or read by the user.

When a search is performed on a personal repository in response to a query from another user, it might be problematic to utilize a thesaurus that is personalized to the repository's owner, because the understood meaning of a query keyword may differ between the query sender and the repository owner. To overcome this difficulty, the scheme performs automated relevance feedback for each query.

Although the feasibility of this scheme was validated in the authors' previous paper, the validation experiment used an intentionally created `personal' corpus. Therefore, it is important to examine how much personal cooccurrence-based thesauri really differ from person to person. Additionally, a systematic method for examining differences in personal thesauri can be used to find relations among people based on their interests and knowledge, as the method proposed by Hamasaki et al. [1] does by measuring the differences among the structures of bookmark folders of Web browsers.

At the development stage of the collaborative IR scheme, privacy issues prevent us from preparing real data on personal repositories for experiments. Fortunately, however, we can obtain a substitute for a personal repository by retrieving Web pages starting at the homepage or the bookmark file of the user.

Accordingly, we present in this paper a method of constructing a personalized cooccurrence-based thesaurus from a personalized corpus obtained from the Web. We also propose a scale to measure the difference between two thesauri. Moreover, we report on the results of our experiment.


We propose a method for constructing a personalized cooccurrence-based thesaurus, which is divided into two parts: personal corpus retrieval and thesaurus construction.

A personal corpus is a set of Web pages that are retrieved from the Web by using a Web crawler. The Web crawler starts its task at the user's homepage or bookmark file and then gathers Web pages by following links in a breadth-first order until the number of pages in the corpus reaches into the thousands.
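The breadth-first gathering step described above can be illustrated with a minimal sketch. It is not the crawler used in the experiment: the function name `crawl_breadth_first`, the `get_links` callable, and the in-memory `toy_web` link graph are all hypothetical stand-ins, and no network access is performed.

```python
from collections import deque

def crawl_breadth_first(start_url, get_links, max_pages):
    """Collect up to max_pages (url, hops) pairs in breadth-first order.

    get_links is a hypothetical callable that returns the outgoing
    links of a page; a real crawler would fetch and parse HTML here.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    corpus = []
    while queue and len(corpus) < max_pages:
        url, hops = queue.popleft()
        corpus.append((url, hops))
        for link in get_links(url):
            if link not in seen:  # never enqueue the same page twice
                seen.add(link)
                queue.append((link, hops + 1))
    return corpus

# Toy link graph standing in for the Web (assumption for illustration).
toy_web = {
    "home": ["a", "b"],
    "a": ["c", "home"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": [],
}

pages = crawl_breadth_first("home", lambda u: toy_web.get(u, []), max_pages=5)
print(pages)
```

Because the queue is first-in first-out, all pages at a given number of hops from the starting page are collected before any page that is one hop further away.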

After the corpus is retrieved, it is used as the basis for constructing a cooccurrence-based thesaurus, in the same way as in [2,3]. The thesaurus construction process is outlined as follows. First, a cooccurrence matrix $A$ is built for all pages in the corpus. An element $a_{i j}$ of $A$ corresponds to the number of times the word $w_i$ and the word $w_j$ cooccur within the corpus. Then the matrix is decomposed with the singular value decomposition (SVD) method so that

\begin{displaymath}A = U\Sigma V^T\end{displaymath}

Finally, the reduction $U_k$ of the left singular matrix $U$ is obtained by keeping the columns that correspond to the $k$ ($100 \leq k \leq 200$) largest singular values. This $U_k$ is called a cooccurrence-based thesaurus, and each of its row vectors is the feature vector of the corresponding word.
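The truncation step can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the function name `build_thesaurus` and the small random matrix are hypothetical, and a realistic matrix would call for a sparse or truncated SVD routine rather than the dense decomposition used here.

```python
import numpy as np

def build_thesaurus(A, k):
    """Return U_k: the first k columns of U in the SVD A = U S V^T.

    np.linalg.svd returns singular values in descending order, so the
    first k columns of U correspond to the k largest singular values.
    """
    U, singular_values, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k]

# Toy 6-word by 4-word count matrix (random data, for illustration only).
rng = np.random.default_rng(0)
A = rng.integers(0, 10, size=(6, 4)).astype(float)

U_k = build_thesaurus(A, k=2)
print(U_k.shape)  # one 2-dimensional feature vector per row word
```

Each row of `U_k` plays the role of the feature vector of one word.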


In this section, we propose a scale for measuring the differences between personal thesauri built by the method described in the previous section.

The scale is based on the root mean square of the differences between the similarity values of words in each thesaurus. Namely, for a given pair of words, the cosine value of their vectors in thesaurus $S$ generally differs from that in thesaurus $T$. We calculate the difference between the cosine value in $S$ and the cosine value in $T$ for every pair of words and aggregate those difference values by taking their root mean square.

More precisely, we define the scale as follows. The feature vector of the word $v$ in thesaurus $S$ is the corresponding row vector of $S$, which we denote as $\vec{v_S}$. The similarity value between two words $v$ and $w$ in thesaurus $S$ is

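\begin{displaymath}\mathit{sim}_{v w}^S = \frac{\vec{v_S}\cdot\vec{w_S}}{\vert\vec{v_S}\vert\,\vert\vec{w_S}\vert}\end{displaymath}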

Accordingly, the difference between the similarity values of the words $v$ and $w$ in $S$ and $T$ is

\begin{displaymath}d_{v w} = \vert\mathit{sim}_{v w}^S-\mathit{sim}_{v w}^T\vert\end{displaymath}
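As a minimal sketch of the pairwise difference, assuming (as stated earlier in the paper) that the similarity of two words is the cosine of their feature vectors: the dictionary representation of a thesaurus and the names `cosine` and `pair_difference` are assumptions made for illustration, and the toy vectors are not taken from the experiment.

```python
import numpy as np

def cosine(x, y):
    """Cosine of the angle between two feature vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def pair_difference(S, T, v, w):
    """|sim^S_vw - sim^T_vw| for words v and w present in both thesauri.

    S and T map each word to its feature vector (hypothetical layout).
    """
    return abs(cosine(S[v], S[w]) - cosine(T[v], T[w]))

# Toy thesauri: the two words are orthogonal in S but identical in T.
S = {"web": np.array([1.0, 0.0]), "search": np.array([0.0, 1.0])}
T = {"web": np.array([1.0, 0.0]), "search": np.array([1.0, 0.0])}

d = pair_difference(S, T, "web", "search")
print(d)
```

Only cosines are compared, so the two thesauri do not need to share a coordinate system or even the same number of dimensions.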

The difference $d_{v w}$ can be calculated when both $v$ and $w$ are included in both $S$ and $T$. However, this is not applicable if $v$ or $w$ is not included in $S$ or $T$. By considering this, we extend the definition of $d_{v w}$ to

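\begin{displaymath}d_{v w}' = \left\{\begin{array}{ll}\vert\mathit{sim}_{v w}^S-\mathit{sim}_{v w}^T\vert & (v, w \in S;\;\; v, w \in T)\\ \ldots & \ldots\\ \ldots & (v\,\mathrm{or}\,w \notin S;\;\; v\,\mathrm{or}\,w \notin T)\end{array}\right.\end{displaymath}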

Here, we aggregate the difference values over the whole vocabulary. Let $m$ be the number of words in a thesaurus. By calculating the above formula for all pairs of words, we get $m^2$ difference values. The scale is the root mean square of them:

\begin{displaymath}d = \sqrt{\frac{\sum_{i = 1}^m\sum_{j = 1}^m{d_{v_i w_j}'}^2}{m^2}}\end{displaymath}
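The root-mean-square aggregation can be sketched as below. The sketch makes a simplifying assumption that goes beyond the text: both thesauri are given as arrays whose rows are aligned to the same $m$ words, so every word is present in both and the special handling of missing words is not reproduced. The names `similarity_matrix` and `thesaurus_difference` are hypothetical.

```python
import numpy as np

def similarity_matrix(vectors):
    """All pairwise cosine similarities between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T

def thesaurus_difference(S, T):
    """Root mean square of the m*m pairwise similarity differences.

    Assumes row i of S and row i of T describe the same word
    (shared vocabulary only; missing words are not handled).
    """
    diff = np.abs(similarity_matrix(S) - similarity_matrix(T))
    m = diff.shape[0]
    return float(np.sqrt(np.sum(diff ** 2) / m ** 2))

# Toy thesauri with m = 2 words each.
S = np.array([[1.0, 0.0], [0.0, 1.0]])
T = np.array([[1.0, 0.0], [1.0, 0.0]])

d = thesaurus_difference(S, T)
print(d)
```

In this toy case the two off-diagonal pairs each differ by 1 and the two diagonal pairs by 0, so the result is the square root of 2/4.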


We built personalized cooccurrence-based thesauri as follows. We selected six computer science researchers as target users, and each user's personal corpus was retrieved by a Web crawler. The personal corpus retrieval results are shown in Table 1.

Table 1: Number of hops, HTML documents, and words.
Person  Starting from  Hops  Documents  Words
p       homepage       3      5,424     100,299
q       homepage       4      4,972      57,626
r       homepage       3      2,437      54,289
s       homepage       4      1,143      30,917
t       bookmark       2     10,564     169,762
u       bookmark       2     15,173     207,311

We examined how extensively documents were shared within each pair of these personal corpora. The proportion of shared documents was about 5% for each pair of corpora. This indicates that the personal corpora differ substantially from each other.

A cooccurrence matrix and the corresponding cooccurrence-based thesaurus were constructed as follows. First, for a corpus that includes tens of thousands of words, we made a 5,000 by 2,000 cooccurrence matrix, $C$. Each row of $C$ corresponds to one of the 5,000 most frequent words in the corpus, each column corresponds to one of the 2,000 most frequent words, and element $c_{ij}$ records the number of times that words $i$ and $j$ cooccur. Words $i$ and $j$ are regarded as cooccurring if word $j$ falls within a symmetric window of total size 41 centered on word $i$. The cooccurrence-based thesaurus is then produced by applying SVD to reduce the dimensionality of matrix $C$ to 5,000 by 100.
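The windowed counting can be sketched as follows, on a toy scale. Several details are assumptions rather than facts from the experiment: the corpus is treated as a single token stream (so windows may cross page boundaries), tokenization is a plain whitespace split, and the function name `cooccurrence_matrix` is hypothetical. A window of total size 41 corresponds to 20 positions on each side of the center word.

```python
from collections import Counter
import numpy as np

def cooccurrence_matrix(tokens, n_rows, n_cols, window=41):
    """Count cooccurrences of frequent words within a symmetric window.

    Rows are the n_rows most frequent words, columns the n_cols most
    frequent words; the window is centered on the row word.
    """
    half = window // 2  # 20 positions on each side when window == 41
    ranked = [w for w, _ in Counter(tokens).most_common()]
    row_index = {w: i for i, w in enumerate(ranked[:n_rows])}
    col_index = {w: j for j, w in enumerate(ranked[:n_cols])}
    C = np.zeros((len(row_index), len(col_index)))
    for pos, word in enumerate(tokens):
        i = row_index.get(word)
        if i is None:
            continue  # center word is not among the row words
        lo = max(0, pos - half)
        hi = min(len(tokens), pos + half + 1)
        for other in range(lo, hi):
            if other == pos:
                continue  # skip the center word itself
            j = col_index.get(tokens[other])
            if j is not None:
                C[i, j] += 1
    return C, row_index, col_index

# Toy corpus with a window of total size 3 (one word on each side).
tokens = "web search engine web applications web search".split()
C, rows, cols = cooccurrence_matrix(tokens, n_rows=3, n_cols=2, window=3)
print(rows)
print(C)
```

With the sizes used in the experiment, the call would take `n_rows=5000`, `n_cols=2000`, and the default `window=41`.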

For instance, Table 2 compares the similarity values of two word pairs in the thesauri of user `p' and user `q'. These results show that for user `p', the word ``Web'' is semantically close to the word ``applications''; however, for user `q' it is distant. On the other hand, for user `p', the word ``Web'' is not so similar to the word ``Search'', but for user `q', it is similar.

Table 2: Sample similarity values
User  Word 1  Word 2        Similarity value
p     Web     applications  0.257
q     Web     applications  0.055
p     Web     Search        0.102
q     Web     Search        0.389
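As a worked illustration, applying the pairwise difference $d_{v w}$ defined earlier to the values in Table 2, with $S$ and $T$ taken to be the thesauri of user `p' and user `q', gives

\begin{displaymath}d_{\mathrm{Web},\,\mathrm{applications}} = \vert 0.257 - 0.055\vert = 0.202, \qquad d_{\mathrm{Web},\,\mathrm{Search}} = \vert 0.102 - 0.389\vert = 0.287\end{displaymath}

These two values are only individual terms; the scale itself aggregates such terms over all word pairs.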

Table 3 shows the difference values among these thesauri as measured by the proposed scale. Underlined values indicate that the corresponding users are actually coauthors of a paper. This result shows that coauthors, who are expected to share interests and knowledge, tend to have low difference values.

Table 3: Difference among personalized thesauri
     q      r      s      t      u
p  0.551  0.429  0.424  0.450  0.359
q         0.631  0.549  0.464  0.547
r                0.481  0.541  0.467
s                       0.489  0.468
t                              0.389


In this paper we described a method for obtaining a personalized corpus from Web pages and for constructing a personalized thesaurus for the corpus. To measure the differences among thesauri, we proposed a scale. The results reveal differences between users' thesauri, and the difference values measured by the proposed scale reflect the users' similarity in interests and knowledge.

A personal corpus is a substitute for a personal repository, which contains not only Web pages but also various other types of information sources. To realize a collaborative personal repository system, the method needs to be adapted to such varied sources. Further work will involve combining the thesaurus equalization method proposed in [4] and the thesaurus construction method proposed in this paper.


  1. M. Hamasaki and H. Takeda. Experimental results for a method to discover human relationship based on www bookmarks. Proc. 5th KES 2001, pages 1291-1295, 2001.
  2. H. Schütze and J. O. Pedersen. A cooccurrence-based thesaurus and two applications to information retrieval. Information Processing & Management, 33(3):307-318, 1997.
  3. T. Yukawa, K. Kasahara, T. Kato, and T. Kita. An expert recommendation system using concept-based relevance discernment. Proc. 13th ICTAI 2001, pages 257-264, 2001.
  4. T. Yukawa, S. Yoshida, and K. Kuwabara. Collaborative information retrieval for a personal repository system based on vector space model. Working Notes of 5th PRIMA 2002, pages 101-112, 2002.