Overview of CAM

Consider a user who is worried about a hurricane in progress and
wants to keep abreast of hurricane-related updates. To achieve
this, he poses a continuous *m*-keyword query
*q* = {*w*_{1}, *w*_{2}, ..., *w*_{m}}. In this
section, we present an overview of the major ingredients of the CAM
approach.

Let *Q* denote the set of all queries submitted in the system and
*ω*_{i} denote the importance of the
*i*^{th} query. These are
input to the system. Let *P* denote
the set of web pages relevant to the continuous queries,
*Q*_{p_i} be the set of queries for
which the *i*^{th} page is found
to be relevant, and *r*_{i,j} be the estimated relevance of the
*i*^{th} page for the *j*^{th} query. It is positive for
all queries *q* ∈ *Q*_{p_i} and zero for all
*q* ∈ *Q* − *Q*_{p_i}. These relevance measures are
initially calculated during the tracking period (and get updated
continually, as explained later).

It is clear that not all pages will be equally important for
each query in the system. So we rank the pages by assigning a
*weight* to each page based on its relevance to the queries. The
*weight* of page *i*, computed as
*W*_{i} = Σ_{*q*_{j} ∈ *Q*_{p_i}} *r*_{i,j}, denotes
the value of the current version of the page. If the page gets updated
before its current version is *monitored*, we assume that we
incur a loss of *W*_{i}.
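As a concrete illustration of this weighting, the sketch below computes each page's weight as the sum of its relevance scores over the queries for which it is relevant. The page names, query names, and scores are hypothetical values chosen for illustration, not data from the paper.

```python
# Hypothetical relevance table r[i][j]: estimated relevance of page i
# for query j; entries exist only for queries in Q_{p_i}, and are zero
# (i.e., absent) for all other queries.
r = {
    "page1": {"q1": 0.8, "q3": 0.2},
    "page2": {"q2": 0.5},
}

# Weight of each page: the sum of its relevance scores over the queries
# for which it is relevant. This is the value of the page's current
# version, and the loss W_i incurred if an update goes unmonitored.
W = {page: sum(scores.values()) for page, scores in r.items()}

print(W)  # {'page1': 1.0, 'page2': 0.5}
```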

Let *C* denote the total number
of *monitoring tasks* that can be employed in a single
monitoring epoch. *C* is derived as
an aggregation of the resources needed for monitoring, including
CPU cycles, communication bandwidth, and memory. E.g., Heydon and
Najork [9] report that with two 533MHz
Alpha processors, 2GB of RAM, 118GB of local disk, and a 100Mbit/sec
FDDI connection to the Internet, their crawler (Mercator)
downloaded 112 documents/sec and 1,682 KB/sec (on average).
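For a rough sense of scale, the sketch below estimates an upper bound on *C* from download throughput alone, using Mercator's reported rate and an assumed one-hour epoch length; in CAM, *C* aggregates CPU, bandwidth, and memory as well, so this is only an illustrative ceiling.

```python
# Back-of-the-envelope bound on C from download throughput alone.
# The epoch length is an assumed value; in CAM, C also reflects
# CPU cycles, communication bandwidth, and memory.
docs_per_sec = 112   # Mercator's reported download rate [9]
T = 60 * 60          # hypothetical epoch length: one hour, in seconds

C = docs_per_sec * T  # upper bound on monitoring tasks per epoch
print(C)  # 403200
```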

*λ*_{i} is the
estimated number of changes that occur in page *i* in *T*
time units. Henceforth we will call it the *change
frequency* for page *i*.
Suppose *U*_{i} denotes the
sequence of time instances
*u*_{i,1}, *u*_{i,2}, ..., *u*_{i,p_i} at which the tracking
phase determines that possible updates occur to page *i*. We assume
0 ≤ *u*_{i,1} ≤ *u*_{i,2} ≤ ^{...} ≤ *u*_{i,p_i} ≤ *T*,
with *u*_{i,0} = 0 and
*u*_{i,p_i} = *T*.
*p*_{i} is the total number
of update instances for the *i*^{th} page during *T*, i.e., the cardinality of the sequence *U*_{i}
(*p*_{i} = |*U*_{i}|).

Note that a page may not actually be updated at these time instances,
so there is a probability
*ρ*_{i,j} associated with each time instance *u*_{i,j} that denotes the
chance of the
*i*^{th} page being updated
at the
*j*^{th} instance. The
overall goal of the resource allocation and scheduling phases is to
monitor in such a way that the monitoring events occur just after
updates are expected to take place. The number of missed updates is
an indication of the amount of lost information, and minimizing it
is the goal of the system.
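Under this reading, each possible update instance that no monitoring event follows contributes its page's weight times its update probability to the expected loss. The sketch below illustrates this with made-up numbers; it is my reading of the objective, not the paper's exact formulation.

```python
# Expected information loss for one page over one epoch: each possible
# update instance u_{i,j} that is not followed by a monitoring event
# contributes W_i * rho_{i,j}. All numbers here are hypothetical.
W_i = 1.0                          # weight: loss per missed update
rho = [0.9, 0.1, 0.6]              # update probabilities at u_{i,1..3}
monitored = [True, False, False]   # which instances a task covers

expected_loss = sum(W_i * p for p, m in zip(rho, monitored) if not m)
print(expected_loss)  # 0.7
```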

With these considerations in mind, decisions are made about the
allocation of a given number of monitoring tasks among a set of
relevant pages, while also deciding the best instants within an
epoch at which these allocated tasks should occur. The basic idea is
that these monitoring epochs of length *T* repeat every *T* units of time, and we make
decisions pertaining to the monitoring tasks to be carried out in
one monitoring epoch using both new data and the results from
previous epochs.
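One simple way to realize this idea is a greedy allocation: rank all (page, update instance) pairs by expected payoff *W*_{i} · *ρ*_{i,j} and spend the *C* monitoring tasks on the top-ranked pairs, scheduling each task just after its update instance. The sketch below is an illustrative simplification with assumed data, not CAM's actual allocation algorithm.

```python
# Greedy sketch (an illustrative simplification, not CAM's algorithm):
# spend C monitoring tasks on the (page, instance) pairs with the
# highest expected payoff W_i * rho_{i,j}, monitoring just after u_{i,j}.
C = 3  # monitoring tasks available in this epoch (hypothetical)

# page -> (weight W_i, list of (u_{i,j}, rho_{i,j})) -- assumed data
pages = {
    "a": (2.0, [(10, 0.9), (40, 0.2)]),
    "b": (0.5, [(15, 0.8), (30, 0.7)]),
}

candidates = [
    (W * rho, page, u)
    for page, (W, instances) in pages.items()
    for u, rho in instances
]
# Keep the C most valuable monitoring events, then order them by time.
schedule = sorted(sorted(candidates, reverse=True)[:C],
                  key=lambda t: t[2])
for gain, page, u in schedule:
    print(f"monitor page {page} just after t={u} (expected gain {gain:.2f})")
```

With these numbers the task for page "b" at *t* = 30 is dropped, since its expected payoff (0.35) is the lowest of the four candidates.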