Resource Allocation in CAM

As noted earlier, we need to distinguish between pages on the
basis of two metrics. One is the nature of page change behavior and
the other is the importance of a page for one or more queries. Page
change behavior is studied during a *tracking* phase and is
characterized by associating a probability of change with every
potential update instant. Next we show the way in which pages can
be ranked by assigning *weights* to them using relevance
measures. These relevance measures are determined for each page for
each query during the *tracking* period.

CAM aims to minimize the weighted importance of the changes that are not reported to users, i.e., to minimize

$$\sum_{i} W_i E_i,$$

where *W*_{i} is the importance weight of the *i*^{th} page and *E*_{i} denotes the expected number of lost changes for the *i*^{th} page.

We make two assumptions about updates. First, inter-update
intervals are independent of each other. Second, an update results
in complete loss of information from previous updates. That is, the
number of lost updates is an indication of the amount of lost
information. While these may not always be true, they give us a
simple way to state the goal to be accomplished. Thus,

$$E_i = \sum_{j} p_{i,j}\,(1 - y_{i,j}),$$

where *p*_{i, j} is the probability that the *i*^{th} page changes at its *j*^{th} potential update instant, and *y*_{i, j} is 1 if a monitoring task is allocated to that instant and 0 otherwise.
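
As a concrete illustration, here is a minimal sketch (our own, not the authors' code) of this expectation for a single page:

```python
# E_i = sum_j p[j] * (1 - y[j]): p[j] is the probability that the page
# changes at its j-th potential update instant, and y[j] is 1 if a
# monitoring task is scheduled at that instant, 0 otherwise.
def expected_lost_changes(p_i, y_i):
    """Expected number of changes to page i that monitoring will miss."""
    return sum(p * (1 - y) for p, y in zip(p_i, y_i))

# Three potential update instants, only the second one monitored:
# the likely changes at instants 1 and 3 are expected to be lost.
loss = expected_lost_changes([0.9, 0.5, 0.2], [0, 1, 0])
print(round(loss, 3))  # → 1.1
```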

The constraint on resources is given by

$$\sum_{i} \sum_{j} y_{i,j} = C,$$

where *C* is the total number of monitoring tasks that can be allocated.

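Since each allocated slot (*i*, *j*) removes exactly *W*_{i}·*p*_{i, j} from the objective, one way to solve this discrete, separable problem is to pick greedily the *C* slots with the largest weighted change probabilities. A minimal sketch (our own illustration, not the paper's algorithm listing):

```python
import heapq

def allocate(W, p, C):
    """Greedily pick the C (page, instant) slots with largest W[i] * p[i][j].

    W[i]    -- importance weight of page i
    p[i][j] -- probability that page i changes at its j-th update instant
    Returns y with y[i][j] in {0, 1} and exactly C entries set to 1.
    """
    y = [[0] * len(row) for row in p]
    slots = ((W[i] * pij, i, j)
             for i, row in enumerate(p)
             for j, pij in enumerate(row))
    for _, i, j in heapq.nlargest(C, slots):
        y[i][j] = 1
    return y

W = [3.0, 1.0]
p = [[0.2, 0.8], [0.9, 0.1]]
print(allocate(W, p, C=2))  # → [[0, 1], [1, 0]]
```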
It is important to point out that the above model can easily accommodate the following extensions. In certain cases, we may have more information about specific changes than in the case described above. For example, if we can measure the change behavior of the *i*^{th} page with respect to the *j*^{th} query, then resources can be allocated even more efficiently. Suppose, for instance, that we learn during the tracking period that a particular news site declares health updates only once, at the start of the day, and for the rest of the day is mainly concerned with political and sports updates. We can then better characterize the change behavior of this page with respect to queries in the sports, medical, and political domains.

Suppose *p*^{k}_{i, j} denotes the probability that the *i*^{th} page changes at the *j*^{th} update instant in a way that is relevant to query *k*. Then *E*_{i}, the weighted expected number of lost changes for the *i*^{th} page, becomes

$$E_i = \sum_{j} (1 - y_{i,j}) \sum_{k} w_{i,k}\, p^{k}_{i,j},$$

where *w*_{i, k} is the relevance weight of the *i*^{th} page for query *k*.
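
The per-query refinement changes only the inner weighting of the earlier sketch; here is a version under the same assumed notation:

```python
# p_i[j][k]: probability that this page changes at instant j in a way
# relevant to query k; w_i[k]: relevance weight of the page for query k.
# Both are assumed to be estimated during the tracking phase.
def weighted_expected_lost_changes(p_i, w_i, y_i):
    """E_i = sum_j (1 - y[j]) * sum_k w_i[k] * p_i[j][k]."""
    return sum((1 - y) * sum(w * q for w, q in zip(w_i, row))
               for y, row in zip(y_i, p_i))

p_i = [[0.6, 0.1],   # instant 0: mostly query-0-relevant changes
       [0.2, 0.7]]   # instant 1: mostly query-1-relevant changes
w_i = [2.0, 1.0]
loss = weighted_expected_lost_changes(p_i, w_i, [1, 0])
print(round(loss, 3))  # → 1.1
```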

If we can extract even more information about changes to the
*i*^{th} page (by measuring, say, not only the probability of change at time *u*_{i, j} but also the average importance of a change at that time), then resources can be allocated better still. For example, suppose we find that a particular research site compiles and announces all of the previous day's research updates daily at 10:00 a.m., but for the rest of the day updates its page only when some new research breakthrough takes place. Then a visit to this page right after 10:00 a.m. is clearly more likely to be fruitful than a visit at any other time.
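
If the tracking phase can also estimate the average importance of a change at each instant (a quantity we will call *I*_{i, j}; this notation is ours, not the paper's), each potentially missed change can be weighted by it. A sketch:

```python
# imp_i[j]: assumed average importance of a change at instant j, estimated
# during tracking -- e.g. high for the known 10:00 a.m. research digest,
# low for sporadic updates during the rest of the day.
def importance_weighted_loss(p_i, imp_i, y_i):
    """Expected importance lost: sum_j (1 - y[j]) * p[j] * imp[j]."""
    return sum((1 - y) * p * m for y, p, m in zip(y_i, p_i, imp_i))

p_i   = [0.95, 0.30]  # change probability: digest instant vs. later
imp_i = [5.0, 1.0]    # the digest carries far more new information
# Spending the single visit on the digest leaves little expected loss.
print(importance_weighted_loss(p_i, imp_i, [1, 0]))  # → 0.3
```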

The resource allocation problems formulated above are discrete, separable, and convex.

**Discrete** - because the variable *y*_{i, j} can take only discrete values. The problem is inherently discrete owing to the discrete nature of monitoring: a monitoring task is either allocated to a page or it is not; there is nothing in between. **Separable** - because the objective function can be expressed in terms of the *y*_{i, j} alone. **Convex** - owing to the convex nature of the objective function.
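
Separability is what makes a slot-by-slot greedy choice exact here: each *y*_{i, j} contributes its term to the objective independently of the others. A brute-force check on a toy instance (our own illustration) confirms this:

```python
from itertools import combinations

# Minimizing the weighted lost changes under a budget of C slots is the
# same as maximizing the total "gain" W[i] * p[i][j] of the C chosen slots.
W = [3.0, 1.0]
p = [[0.2, 0.8], [0.9, 0.1]]
C = 2
gains = [W[i] * pij for i, row in enumerate(p) for pij in row]

greedy = sum(sorted(gains, reverse=True)[:C])        # pick C largest gains
brute = max(sum(c) for c in combinations(gains, C))  # try every C-subset
print(greedy == brute)  # → True
```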

Given the design of the above resource allocation algorithm, the resulting
*y*_{i, j} values are optimal, i.e., they maximize the value of the returned information, when updates are quasi-deterministic, i.e., when they occur at specific time instants and the actual occurrence of a change is associated with a probability.