\documentclass{www2003-submission}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\hyphenation{PageRank} \hyphenation{Richardson}
\begin{document}
\title{Usage Aware PageRank%: A PageRank reformulation that takes into account usage statistics
\titlenote{A longer version of this paper with experimental results on a site level is available as a technical report at http://www.cs.umn.edu, TR number 03-010}
}
\numberofauthors{4}
\author{
\alignauthor B. Uygar Oztekin\\ %\titlenote{}\\
\affaddr{University of Minnesota,}\\
\affaddr{Dept. of Computer Science,}\\
\affaddr{Army HPC Research Center}\\
\email{oztekin@cs.umn.edu}
\alignauthor Levent Ert\"oz\\ %\titlenote{}\\
\affaddr{University of Minnesota,}\\
\affaddr{Dept. of Computer Science,}\\
\affaddr{Army HPC Research Center}\\
\email{ertoz@cs.umn.edu}
\alignauthor Vipin Kumar\\ %\titlenote{}\\
\affaddr{University of Minnesota,}\\
\affaddr{Dept. of Computer Science,}\\
\affaddr{Army HPC Research Center}\\
\email{kumar@cs.umn.edu}
}
\additionalauthors{Jaideep Srivastavva, Dept. of Computer Science,
University of Minnesota, srivasta@cs.umn.edu}
\maketitle
\begin{abstract}
Traditional link analysis approaches assume equal weights assigned
to different links and pages. In original PageRank formulation,
the user model assumes that the user has equal probability to
follow each link from a given page, thus the score of a page
equally affects all of the pages it points to. It also assumes
that the probability for a user to go to a URL directly without
following a link is the same for all URLs. In this paper, we
investigate different weighting schemes that take into account the
probability to go to a page directly (by typing or using
bookmarks), as well as the relative probability to follow a link
from a given page. Both of these probabilities can be approximated
from usage logs if they are available. We introduce a natural
extension to the original PageRank formulation that we will call
Usage aware PageRank (UPR). The new formulation combines static
link structure graph with the usage graph that will be obtained
via web logs or other means. It is also quite general; how much
emphasis will be given to the graphs is controlled by a parameter.
If the parameter is set to zero, the algorithm becomes equivalent
to the original PageRank, if it is set to one, the emphasis shifts
to the usage graph, and for values in between, both of the graphs
will be used with weights specified by the parameter. UPR is also
quite inexpensive. After a onetime precalculation step, an
iteration of UPR takes about the same time as a PageRank
iteration.
\end{abstract}
\category{H.3.3}{Information Storage and Retrieval}{Information
Search and Retrieval}[search process, information retrieval,
information filtering]
\begin{terms}
Algorithms, performance, experimentation
\end{terms}
\begin{keywords}
%\textbf{Keywords:}
Pagerank extension, usage statistics, link analysis, UPR, usage
aware pagerank
\end{keywords}
\vfill\eject
\section{Usage aware PageRank}
Original PageRank formulation focuses on the static structure of
the site, completely ignoring usage. However, users visiting web
pages either directly or by following a link can be considered as
an implicit indication of the importance of these pages. Usage
aware PageRank formulation is an extension to the original
PageRank which makes use of usage statistics as well as the static
structure. The PageRank formula given in the original paper is as follows:\\
\begin{equation}\label{pagerank}
PR(A) = (1-d) +
d\times(\frac{PR(T_{1})}{C(T_{1})}+\ldots+\frac{PR(T_{n})}{C(T_{n})})
\end{equation}
where $T_{1} \ldots T_{n}$ are pages pointing to page $A$, the
parameter $d$ is the damping factor, and $C(A)$ is the number of
links going out of page $A$. %PageRank is based on a random walk model.
A user will access a page either following a link, or by typing
the URL (or using bookmarks). We can think of the damping factor
as the probability that a user will follow a link, instead of
typing the URL. Note that, PageRanks of all pages will add up to
the total number of pages ($n$) using the above formula. By
dividing $(1-d)$ term by $n$, we can obtain a true probability
distribution for PageRanks. The formula is based on a random walk
model, if the user chooses to follow a link, all the links on the
page the user is looking at have equal chances of being clicked
on. Similarly, if the user chooses not to follow a link and to
start over, all the pages have equal chance of being accessed. In
some sense, PageRank of a page shows the probability that a random
walk user will visit a particular page. In this formulation, a
page's PageRank contributes equally to the PageRanks of the pages
it points to, normalized by total number of outgoing links from
that page.
The random walk user in the original PageRank formulation does not
differentiate between links on a given page while deciding which
one to follow, nor does it differentiate between pages when it
decides to start from a new page. Consider a scenario where we
have a web page, $P_1$, which is bookmarked by various users,
therefore visited more often than other pages. Consider a link on
that page, $L_{12}$, that points to page $P_2$, which is followed
by majority of its visitors, but other links on page $P_1$ are
hardly used. Intuitively, the probability of users visiting page
$P_1$ is higher than the probability of users visiting other
pages. Similarly, PageRank of $P_1$ should contribute more to the
PageRank of $P_2$, since $L_{12}$ is followed more often than
other links on $P_1$. Obviously, the random walk model doesn't
capture these differences. We extend the PageRank formulation by
taking usage statistics into account. The Usage aware PageRank,
UPR, of a page is defined as follows:\\
\begin{equation}\label{upr}
\begin{split}
UPR(A) &= (1-d)\times \left(\frac{1-a}{n}+a\times
W_{no-follow}(A)\right)+\\
& d\times \left(
\begin{split}
(1-a) & \times \sum{\frac{UPR(T_i)}{C(TS_i)}}+\\
a & \times\sum{\frac{UPR(T_i)\times W(TU_i\rightarrow A)}{W(TU_i)}}
\end{split}
\right)
\end{split}
\end{equation}
where $TS_i$ are the pages $T_i$ that contain hyperlinks
(structural links) to page $A$, $C(TS_i)$ is the number of
hyperlinks on $T_i$, $TU_i$ are the pages that users followed a
link from to go to page A, $W(TU_i\rightarrow A)$ is the total
weight of link traversals from page $T_i$ to page $A$, $W(TU_i)$
is the total weight of all link traversals from page $T_i$,
$W_{no-follow}(A)$ is the total weight of accesses to page $A$
without following a link divided by total weight of all such
accesses, $d$ is the damping factor, $n$ is the total number of
pages, and $a$ is the emphasis given to the usage statistics. The
simplest way of assigning weights to page / link accesses is to
use counts obtained from the logs, in which case, UPR with
emphasis only on the usage graph, will have similar interpretation
as PageRank, but with accurate probability estimates. When $a$ is
equal to zero, UPR becomes equivalent to PR. In order to avoid
sink effects, i.e., guarantee that UPR of all pages add up to $1$,
for pages that don't contain any hyperlinks, it is possible to
create artificial links to all other pages.
In some sense, UPR is based on a ``biased random walk model''. The
biased random walk user distinguishes between pages as well as
links. He/she prefers some links over others (perhaps the chosen
link is more visible, placed towards the top, or has a better
anchor text), and prefers some pages over others (perhaps the page
is a good ``hub'' to start from, it is in his bookmarks, or has a
short and easy to type URL).
Note that, once a value of $a$ is fixed, the formula can further
be simplified by precomputing the matrix that will be used in the
iterations, in which case number of computations required for each
subsequent iteration will be about the same as a regular PageRank
iteration. The formula can be rearranged as follows:
\begin{equation}\label{upr}
\begin{split}
UPR(A) & = (1-d)\times \left(\frac{1-a}{n}+a\times W_{no-follow}(A)\right)+\\
& d\times \sum{UPR(T_i)\left(\frac{1-a}{C(TS_i)}+\frac{a\times W(TU_i\rightarrow A)}{W(TU_i)}\right)}
\end{split}
\end{equation}
Note that only the $UPR(T_i)$ term in the summation changes during
iterations, once $a$ is fixed, the remaining part can be
incorporated in a precalculated matrix.
%\subsection{Extensions to UPR}
In our formulation, we used an emphasis factor, $a$, that
represents the relative importance of the usage graph to the
structure graph. A slight variation of the formulation can be
obtained by using two separate emphasis factors for the two
components of the formula corresponding to the probability that a
user will type a URL, and the probability that a user will follow
a link on a given page. We can adjust the weights of the
structure graph and the usage graph separately for these two
components. Deemphasizing the usage statistics about going to a
page directly may be desirable in some cases, for instance, when
we want to reduce the effects of browser home pages (which may get
unintentional hits every time a user opens up the browser), and
when we do not have full usage statistics for a subset of pages.
Note that, weights of links going out of a given page is
normalized for each page in our formulation. If we do not have
usage statistics about the links of a given page, all of them can
be treated equally. However, if we assign initial weights
reflecting the number of hits on a page without following a link,
portions of the dataset that do not contain usage statistics will
be penalized. Note that a similar situation occurs if the usage
sampling rate for different domains or sites are different; these
pages will be assigned low initial scores. By splitting the
parameter $a$ into two, it may be possible to reduce such
undesirable effects in presence of partial or uneven information,
without affecting the second portion of the formula. The modified
formula is very similar to the original formula, and is as
follows:
\begin{equation}\label{upr}
\begin{split}
UPR(A) &= (1-d)\times \left(\frac{1-a_{1}}{n}+a_{1}\times
W_{no-follow}(A)\right)+\\
& d\times \left(
\begin{split}
(1-a_{2}) & \times \sum{\frac{UPR(T_i)}{C(TS_i)}}+\\
a_{2} & \times\sum{\frac{UPR(T_i)\times W(TU_i\rightarrow A)}{W(TU_i)}}
\end{split}
\right)
\end{split}
\end{equation}
where $a_{1}$ and $a{2}$ are the new emphasis parameters,
separated for the two portions of the formula. Similarly,
iterations using this formula too can be optimized as discussed
before (for a given value of $a_2$).
There are certain types of user behavior that may reduce the
accuracy of the probability estimates obtained from the logs.
Every time a user opens up a browser, the browser's home page gets
a hit, even if the user's intent is to visit another page. Also,
equal number of hits to a page from several users should be
weighted higher than a user hitting the same page several times,
so that our estimates better reflect the behavior of all users.
Similarly, several users following a link should be weighted
higher than a user following the same link several times. In the
UPR formulation, instead of using counts of links followed, and
counts of hits to pages with empty referrer fields, we can
deemphasize successive accesses to pages / links from the same
user in a time window by using modified counts which is a log
transform of the counts:
\begin{equation}\label{modcount}
Modified Count = \lg(1+Count)
\end{equation}
Note that if there were no accesses to a particular page, the
modified count would be $\lg(1+0)=0$, if there was a single
access, the modified count would be $\lg(1+1)=1$, and the modified
count would lower the weight of subsequent accesses from the same
user. After the transformation, adding up all the modified counts
from all users for all time windows will give us estimates that
better reflect overall user behavior.
We applied regular PR and UPR with simple counts as well as
modified counts to our department's web site for which we have
full usage statistics. Although characteristics for intranet vs
internet settings can be significantly different, preliminary
results show that UPR is promising on the site level and helps
alleviate some of the problems associated in applying PageRank on
a single site. These results will be presented in the poster
session. Some of the results and a longer paper is also available
as a the technical report at http://www.cs.umn.edu, report number
03-010.
\end{document}