Dept of Informatics, University of Athens

Panepistimiopolis, 15771

Athens, Greece

Dept of Informatics, University of Athens

Panepistimiopolis, 15771

Athens, Greece

We evaluate the behavior of the Exponential Growth Copying (EGC) model for the web-graph and analyze the effect of individual parameters on its overall effectiveness through simulation. We derive the in and out degree probability distributions of the resulting graphs for various parameter values, measure them against the corresponding empirical power-laws on the real web and suggest appropriate values to improve EGC effectiveness. We conclude that the EGC model provides a very good approximation of the in-degree distribution while the out-degree distribution is plausible only when specific values of EGC parameters are employed.

Web-graph, Models, Simulation.

Valid models of the web are important, both for
creating WWW-like representations where new applications can be
tested (searching, indexing, compression etc.) and for improving
the understanding of content creation on the web, predicting its
evolution and the emergence of important new phenomena. The
web-graph is an abstraction of the WWW: each HTML page is a vertex
and each hyperlink a directed edge, defined by origin vertex (tail)
and destination vertex (head). The out (in) degree of a vertex is
the number of edges having as tail (head) this specific vertex.
Crawls report [1, 2, 4] that “power-laws” exist for in
and out-degrees of web-graph vertices: The probability that a
vertex has in (out) degree *x* is proportional to
*x ^{-k}* for some

Among the various models that have been proposed for web-graph evolution, the Evolving Copying Models are highly regarded as they were explicitly designed to model the WWW. Their authors show that they have a large number of complete bipartite subgraphs, as has been observed in the crawls, whereas several other models do not. Their development was based on very realistic intuitions about the WWW [3][5]. In this paper, we study the behavior of the Exponential Growth Copying model (named henceforth EGC) [3] and analyze the effect of individual parameters on its overall effectiveness. We derive the in and out degree distributions of the resulting graphs for various parameter values and measure them against the reported power-laws for in and out degrees. We draw conclusions about the model behavior by measuring the error distribution between experimental and empirical results, using the mean error as a measure of performance. Finally, we suggest appropriate values for EGC parameters for more “web-like” graphs.

Copying models in general are based on the
intuition [5] that, although some page creators may create content
and link to other sites regardless of the already represented
topics in the web, many will be drawn to existing topics of
interest to them and link to pages within some of these existing
topics. In [3] the intuition for the EGC is extended with the fact
that, due to the exponential growth of the WWW, a page creator will
not be aware of the existence of pages created in the most recent
“epoch” of pages (e.g. the last week or fortnight). At
each time step a new epoch of vertices arrive, the size of which is
a constant fraction of the current graph. Each of these vertices
may be linked only with vertices of previous epochs. The EGC model
is formally described parametrically, using a constant
“growth” factor *p > 0*, the
“self-loop” factor *γ > 1*, the
“tail copy” factor *γ'* in *(0, 1)*
and the “out-degree” factor *d > 0*.

We simulated the operation of the EGC as
follows: the model starts with a small initial graph, a parameter
vector *(p, γ, γ', d)* and the terminating
condition *N* (number of vertices in the graph). Let *n*
denote the current number of vertices in the graph up to the last
epoch and *v* the number of vertices to be added in the next
epoch. The value of *v* is decided by random sampling from the
binomial distribution with parameters *p* and *n*. For
each new vertex the algorithm creates *γ* self-loop
edges; then *d* new edges are created and each one’s
head and tail are set to selected vertices: the head of a new edge
is chosen uniformly among the heads of all existing edges up to the
last epoch (i.e. preferential attachment on in-degree among the
*n* existing vertices) and the tail with probability
*γ'* is one of the existing edges up to the last epoch
(i.e. preferential attachment on out-degree among *n* existing
vertices) while with probability *1 - γ'* the tail is
chosen randomly among the vertices of the new epoch. When all new
edges have been created, the new epoch “arrives” and is
added to the graph; new epochs are generated until *n >
N*.

EGC was implemented in Matlab. Each simulation
run was made for a number of epochs until at least *10,000*
vertices were generated in the resulting graph. Initial conditions
were the same for all runs, specifically one vertex with
*γ* self-loop vertices. We chose characteristic values
for the model parameters *p, γ, γ', d*. For
each parameter vector, *5* independent replications were made.
To ensure reliability, we also made a number of experiments with
*1,000,000* vertices where no significant variation from the
*10,000* vertices results was observed. We thus experimented
with *135* different combinations of *[p, γ,
γ', d]* parameters, to explore a wider range of
parameter values, which, as later presented, have a strong impact
on EGC efficiency. We chose low, intermediate and high values of
*[p, γ, γ', d]*; specifically *{0.05, 0.5,
0.95}* for *p* and *γ'*, *{2, 5, 10}* for
*γ* and *{1, 2, 3, 4, 5}* for *d*. In and
out degrees of all vertices (removing self-loop edges) and the
corresponding degree probability distributions for each run are
then computed, and averaged over the *5* runs of each
parameter vector. Finally, we compare these with the empirical
power-law degree probability distributions by means of error
distributions for each parameter vector.

In-degree probability plots, in all *135*
cases, keep a steady form of “power-law” [Figure 1],
while out-degree probability plots deviate from this form in many
cases [Figures 2, 3]. The examination of the impact of individual
parameters *p, γ, γ', d* on EGC results shows
that *p* does not affect mean errors, while as
*γ* increases, they tend to increase also. As
*γ'* increases, the out-degree error tends to
decrease. Both errors decrease as *d* increases. Analyzing the
impact of *[p, γ, γ', d]* on EGC, we conclude
that parameter *p* does not have any significant importance,
*γ* should be kept as low as possible,
*γ'* should be as high as possible and the greater
*d* is the better. Overall the EGC model provides a very good
approximation of the in-degree distribution. It is not, however,
analogously efficient when approximating the out-degree. It gives
better results when *γ' -> 1*, *γ ->
2* and *d -> 5*. Especially concerning *d*, we
consider it as a drawback of the EGC model, since empirical values
for *d* in the web are between *3* and *4*.

Figure 1. ‘power-law’ in-degree
probability distribution [*γ'= 0.05, p=0.5,
γ=2, d=3*]

Figure 2. ‘no power-law’ out-degree probability distribution [*γ'= 0.05, p=0.5, γ=2, d=3*]

Figure 3. ‘power-law’ out-degree probability distribution
[*γ'= 0.95, p=0.5, γ=2, d=3*]

We have presented a simulation-based evaluation
of the EGC model, starting from an initial graph of *1* vertex
with *γ* self-loop edges, for a wide variety of *[p,
γ, γ', d]* values. We concluded that the
in-degree probability distribution of the resulting graph, for all
parameter vectors, is very close to the empirical distribution of
the WWW (power-law with factor *–2.1*). However, the
out-degree probability distribution displays sensitivity to the
parameter *γ'*, only approximating the power-law with
factor *-2.7* when *γ' = 0.95*. Moreover, both in
and out degree approximations tend to improve with the increase of
*d* over *4*, but this is problematic because this value
in the web is between *3* and *4*. We also concluded that
the value of parameter *p* does not have any significant
impact on the results, while parameter *γ* should be
kept as low as possible.

- A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan,
R. Stata, A. Tomkins, J. Wiener. Graph structure in the web:
Experiments and models.
*Proc. 9th WWW Conf, pp. 309-320, 2000.* - J. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins. The web as a
graph: Measurements, models and methods.
*Proc. Intrnl Conf on Combinatorics & Computing, pp.1-18,1999.* - R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A.
Tomkins, E. Upfal. Stochastic models for the web-graph. Proc.
*41st Annual Symp on Foundations of Computer Science, 2000.* - R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins. Trawling the
web for emerging cyber-communities.
*Proc. 8th WWW Conf, pp. 403-416, 1999.* - R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins. Extracting
large-scale knowledge bases from the web.
*Proc. 25th VLDB Conf, pp. 639-650, 1999.*