The presence of replicas or near-replicas of documents is very
common on the Web. Whilst replication
can improve information accessibility for the users,
the presence of near-replicas can
hinder the effectiveness of search engines.
We propose a method to detect similar pages, in
particular replicas and near-replicas, which is based
on a pair of signatures. The first signature is obtained by
a random projection of
the bag-of-words representation of the page contents. The second
signature, referred to as Hypelink Map, is computed by a
recursive equation which exploits the connectivity among the Web
pages to encode the page context. The
experimental results show that on the given dataset
replicas and near-replicas can be detected with a precision
and recall of 93%.
Web page near-replicas, duplicates detection, Random Projections, Hyperlink analysis
The success of a search engine mainly depends on its capability to provide satisfactory answers to the users' queries in the first positions of the result set. In order to reduce the effect of information flooding, many criteria have been used to define a ranking among the returned documents. However, another important issue is the redundancy of the information on the Web. Documents on the Web can be replicated many times on different hosts, or the same document can be accessed by different URLs because of aliases. Partial or complete mirrors of sites are quite common to ease the access to popular resources and to distribute the network and server load. Replicas and near-replicas waste storage in the search engine indexes, they reduce the quality of the query results replicating the same information, and they can also hinder the effectiveness of the ranking techniques based on link analysis.
In  different techniques to detect mirror sites are analyzed and compared.
The proposed algorithms do not consider the page contents, but are
based on features such as the IP addresses of the sites, the hostnames,
the structure of the path in the URL and the similarities in the
connectivity to external hosts for the candidate mirrors.
A different approach based on hash signatures of the page
contents is presented in . In this paper we propose a technique for finding lists of similar documents,
and in particular replicas and near-replicas, based on a pair of signatures
which take into account both the document contents and the hyperlink structure.
2. The Document Signatures
Each document is mapped to a pair of signatures. The Random Projection (RP)
signature maps the document contents to a low-dimensional real valued
vector, preserving the notion of similarity in the original space
(i.e. near representations are mapped to near points in the projected space).
The Hyperlink Map (HP) signature is then obtained by propagating
the RP signatures through the hyperlinks between the pages,
providing a method to distinguish pages also on the basis
of the pages they link.
Appropriate index structures for multidimensional data (e.g R-trees, etc.)
can be used to reduce the cost of -nearest-neighbors and range searches
using the RP and HM signatures.
2.1 The Random Projection Signature
Recently Random Projection has been proposed as a technique for dimensionality reduction . A random projection from to dimensions is obtained by a projection matrix whose entries may be initialized using a uniform random distribution in [-1,+1]. The Random Projection (RP) signature of the page in the repository is computed from the random projection of the bag-of-words representation of . Given the vector such that , the RP signature is obtained by normalizing the vectors in order to have each component of the signature vectors in the interval [0,1].
In the experimental evaluation we investigated
the effect of the dimension showing that good accuracy can be obtained
using very low values for . For the dataset used in the experiments
we found that no substantial improvements can be achieved by choosing .
2.2 Hyperlink Map Signature
In order to encode the information related to the context of the page in the hyperlinked environment, we can define an iterative equation which combines the signatures of the neighbor pages, using a scheme similar to the computation of the PageRank . A simple implementation of this scheme is represented by the following system of equations
where is the -th component of the signature for the page , is the -th component of the RP signature for the page , is a dumping factor which modulates the effect of the signature propagation from the pages linked by page , and is the iteration step. At the first iteration . The vector represents the Hyperlink Map (HM) signature of page at iteration . Finally, in order to obtain values in the interval [0,1], the vectors are multiplied by , being .
At each iteration, equation (1) propagates the HM
signatures from each node to its parents, combining both
the page contents, encoded by the value, and
the structure of the page neighborhood. In order to include the contribution of
the descendents which are links away from a given page, the
equation should be iterated at least times. Significant
signatures can be computed using only few iterations (3-5
iterations in the experiments).
3. Experimental Results
We collected about documents by crawling the Web. The bag-of-words representation of each page was based on a dictionary containing about 2 millions words. In the experiments, two documents , having RP signatures and and HM signatures and , were considered to collide with tolerance iff and . Thus two documents collide if their RP and HM signatures lay in two hypercubes having side lengths smaller than and , respectively.
We measured the accuracy of the algorithm in terms of recall and precision. The recall is defined as the number of detected (near)replicas divided by the total number of (near)replicas present in the dataset. Since it is not known a priori which documents are replicated in the dataset, we manually sampled 100 pages having at least one replica or near-replica and we used this set as a reference. Thus, the recall was computed as the fraction of the documents in this set returned by the algorithm.
The precision is defined as the number of documents that have a (near)replica in the dataset divided by the total number of documents returned by the algorithm. The precision was evaluated by checking all the lists of colliding documents returned by the algorithm using the cosine correlation of the bag-of-words representations. This algorithm is slow but highly accurate in evaluating the document similarity.
The curves in figure 1 show how the precision and the recall vary with respect to the choice of the two thresholds and . The three curves depending on show the improvement provided by the use of the HM signature for three different values of . The fourth curve () reports the behavior of the algorithm when using only the RP signature. For all values of , when reducing the threshold the precision increases, while the recall is reduced. Anyway, in all the cases for appropriate values of the threshold the precision is significantly improved with a negligible reduction in the recall. This motivates the combined use of the RP and HM signatures. The break even point is 90% for the RP signature and 93% for the combination of the RP and HM signatures.
An interesting case is that of pages related to Web forums or repositories of newsgroup messages. These sites show a high redundancy because they allow different views of the same contents. The following one is an example of two URLs which refer to the same message:
Another example which was detected in the dataset consists of a list of messages ordered by date, by subject, by thread or by author. Even if the different pages differ for the internal organization and for some small details, the four pages have essentially the same informative content:
We proposed the use of two low dimensional signatures to
effectively detect replicas and near-replicas among a set of Web pages.
The two signatures encode the page contents and the page
hypertextual context, respectively. In particular, by using the HM signature
documents are compared not only on the basis of their contents
but also on their hypertextual context.
The experimental results
show that the use of both signatures can yield high recall
and precision in finding replicas and near-replicas.
This approach can be extended to broader definitions of similarity
between pages than replication or near-replication.