We describe some stochastic processes for web-like graph generation. The aim is to understand which of them are principal for creating a graph which has statistical properties closer to the known properties of the Web graph. Model of random digraph where edges are added with probability proportional to out-degree of origin vertex and in-degree of terminus vertex was investigated. In both growing and non growing network cases we find that the model graph is similar to the Web graph in the mean that they have the same set of main properties (power law distribution of vertices in- and out-degree, large clustering coefficient, small diameter, hubs-authorities core, significant number of small bipartite cliques).
Web model, random graph, power-law distribution, clustering, small world, hubs and authorities, HITS.
The World Wide Web can be viewed as a huge directed graph in which a vertex represents a web page (or site) and an edge represents a hyperlink. Recent studies [1, 2] have shown that despite its chaotic appearance, web graph has a remarkable degree of self-organization. The significant properties of the Web as a graph are power-law distributions (PLD) of in- and out-degrees of vertices. That is, the number of web pages having r_{in} links on the page or r_{out} links from the page is proportional to r^{-g} for some constants g_{in}, g_{out} > 2. Furthermore, the average distance between any two web graph vertices is bounded by log N, where N is the number of the vertices in the graph, a fact which is typical for pure random graph. Finally, web pages are highly clustered compared to nodes of a random graph [3]. These two properties make the Web a "Small-World" graph.
In [4] it was shown that the Web graph contains bipartite sub graphs with specific properties, called hub and authorities, where a hub is a page that points to many authorities and an authority is page that is pointed to by many hubs. Each of the sub graphs recognized as core of a web community - set of connected sites with a common content topic. In addition, it was found [1] that there are a lot of bipartite small cliques in the Web graph.
It is interesting to find a stochastic process, which produces a graph with the complete set of the above mentioned statistical properties, i.e. a graph similar to the Web graph. This is important both from theoretical point of view, for Web modeling, and has also a practical sense, for pointing out and utilization of main mechanisms of Web structure self organization.
[5] argued that the PLD of network topology is rooted in two generic mechanisms: incremental growth and preferential attachment. Incremental growth captures the fact that networks are assembled through the addition of new nodes to the system, while preferential attachment encodes the hypotheses that new nodes connect with higher probability to more connected nodes. These models, called scale-free (SF), yield graphs with a PLD in-degree, yet they fail to capture the "community" notion as defined by [1, 4].
The best in the mean of closeness to the Web graph is copying model, introduced in [1]. The idea is: in every time stamp to add a vertex and its outgoing edges, part of which are copied from the existing edges and the rest are terminated in vertex chosen uniformly at random.
In this work we propose alternative models.
We propose a natural extension of SF model in the manner of coping model. Extended scale-free (ESF) model is defined in two principle steps:
(1) Starting with a small number of vertices, at each time step, a new vertex is added and is connected to existing vertex through random number m (< z) of new edges, where the average number of edges per node (z) is constant for a growing graph. The probability that an existing vertex gains an edge is proportional to its in-degree (i.e. to the number of in-edges it currently has.)
(2) Simultaneously, z-m directed edges are distributed among all the vertices in the graph by the following rules: (i) the source of any of these edges is chosen with a probability proportional to their out degree, (ii) the target ends of these edges are attached to vertices chosen with a probability proportional to their in-degree.
The extension of the SF model is in step (2), i.e. in free distribution of a part of the new edges between older vertices with origin vertices chosen with probability proportional to their out-degree.
The model has been simulated up to 50K vertices. In case z~8 we have found that the proposed ESF model generates a network where both in- and out-degree of vertices follow a PLD. We investigated the existence of "communities" structures in this model as defined by [1, 4]. We found a surprising number of bipartite cliques K_{l,p} [1] for different values l and p as well as a collection of good hubs and authorities [4], which indicate existence of the bipartite hub-authorities topology specific for the Web. So EFS graph is similar to the Web graph.
Another stochastic process which yields web-like graph we propose to build on the base of classic random graph [6].
Algorithm for PLD random graph constructing:
1. Start with N unconnected vertices of graph G.
2. Select an origin vertex V_{out} randomly with probability p_{Vout}=a_{out}*r_{out}+b_{out}.
3. Select a terminus vertex V_{in} randomly with probability p_{Vin}=a_{in}*r_{in}+b_{in}.
4. If edge (V_{out} ,V_{in}) is not in G add it.
5. Repeat from step 2 until z*N edges will be added.
This model has only 2 parameters (a and b are dependant due to the normalizations requirement). In case a_{in}, a_{out } = 0 the model is equivalent to well investigated classic random graph [6]. In case b_{in}, b_{out} = 0 the model yields a bipartite clique.
We find that PLD random graph with z~8 is similar to the Web graph.
As it was mentioned above, there are different models which may be interpreted as a Web model. Many other models, which include processes of vertices and edges being deleted and added simultaneously in accordance to some rules, were proposed in several studies (see, for example, [1]). But only the simplest one is really interesting in context of our discussion.
The copying and ESF models are based on the similar stochastic processes and therefore they are actually nearly equivalent. The main difference between PLD random graph and the other two models is that the first is not a growing one. It means that both copying and EFS models yield growing graphs in which rin for the early added vertices is bigger then for the latest. It is interesting to point out that a simple model of growing graph with edges added from new vertex to one chosen uniformly at random has a PLD of rin with g_{in} = 2. Additional mechanisms of a growing graph model are responsible for g_{in} value adjustment and for r_{out} distribution.
In [7] it was shown that in the Web graph there is no correlation between age of site and its in- or out- degree. On the other hand it is clear that the Web in principal is a growing network. These two facts are compatible only if the characteristic time for new site adding is significant bigger then the characteristic time for adding and changing links. So for graph topology modeling purposes we can assume that the Web graph is not a growing one.
We think that investigations of the proposed PLD random graph have a good perspective because it is a meaningful variation of the classic random graph, it yields a Web-like graph and it can be a good tool for simulation of various nature networks.