1. Introduction

There are at least two important reasons for the systematic study of Blogspace:

*Sociological reasons:*Blogspace differs from traditional web pages structurally because blogs represent concatenations of messages, as within newsgroups and bulletin boards, but authored by a single individual. However the more significant differences are more than structural: the culture of Blogspace focuses heavily on local community interactions between a small number (say, between 3 and 20) of bloggers. Members of the informal community might list one another's blogs in a ``blogroll'' and might read, link to, and respond to content on other community members' blogs. Often, these sequences of responses take place during a brief burst of heavy activity as an interesting topic arises, jumps to prominence, and then recedes. Naturally, this leads to the question: can we experimentally observe and model this highly dynamic, temporal community structure?*Technical reasons:*Traditional studies of the web and the web graph make use of a static snapshot derived from a crawl. All such work raises the natural question: what happens over time? A number of works have begun to address this question through creation and analysis of a series of snapshots of the data [ 4 , 15 , 8 , 3 ]. The development of tools and methods to analyze these snapshots is therefore a timely endeavor. However, Blogspace offers an additional technical advantage over such approaches--if data is recrawled with a certain frequency, there is no notion of the precise point in time a page or link was created/updated. In contrast, Blogspace offers us a ready-made view of evolution in*continuous time:*as each blog adds an entry (together with links), there is a time stamp associated with that event. By automatically extracting these time stamps we can piece together a view of Blogspace evolving continuously from the beginning of blog archiving to the present. We should stress that time is absolute (not merely relative as in a sequence of crawls). Our work focuses on connectivity evolution and on temporally concentrated*bursts*(in the sense of Kleinberg [ 11 ]) in this evolution of Blogspace.

- We introduce a combinatorial object we call a
*time graph*(Section 3.1 ) for the study of graphs that evolve in continuous time. We build the*blog graph*--the time graph for Blogspace--by automatically extracting dates from blog page entries ( Section 4 ). - We define a notion of communities in Blogspace and extend Kleinberg's
notion of temporal bursts in a sequence of documents [ 11 ] to sets of blogs and the
links between them, developing a notion of bursty communities of blogs
that are topically
*and*temporally focused ( Section 3 ). - We conduct a series of experiments that develop properties and views of the graph induced by Blogspace as a function of time, showing the development of macroscopic and microscopic community structure, and the evolution of burstiness (Section 5 ).
- We show that Blogspace underwent a transition behavior around the end of 2001, and has been rapidly expanding over the past year, not just in metrics of scale, but also in metrics of community structure and connectedness. This expansion shows no sign of abating, although measures of connectedness must plateau within two years ( Section 5 ).

2. Background

2.1. Preliminaries

2.1.1. Graphs

2.1.3. Bursts

2.2. Blogs

While weblogs had always included a mix of links, commentary, and personal notes, in the post-Blogger explosion increasing numbers of weblogs eschewed this focus on the web-at-large in favor of a sort of short-form journal. These blogs, often updated several times a day, were instead a record of the blogger's thoughts: something noticed on the way to work, notes about the weekend, a quick reflection on some subject or another. Links took the reader to the site of another blogger with whom the first was having a public conversation or had met the previous evening, or to the site of a band he had seen the night before. Full-blown conversations were carried on between three or five blogs, each referencing the other in their agreement or rebuttal of the other's positions.It is precisely this type of intense interaction that is at the core of Blogspace and that we wish to analyze algorithmically. For example, there is a `` blogathon '' once a year in which people blog for 24 hours straight for charity. Sponsors donate money and then during the blogathon, bloggers update their blogs every 30 minutes for an entire day.

2.2.1. Bursty communities of blogs

In a community of blog poets, a burst occurs when one member Firda posts a series of daily poems about other bloggers in the community and includes links to their blogs. This burst occurs from March-April of 2000--for example http://www.wannabegirl.org/2000_03_01_log-archive.php from March 2000 contains poems and links to http://trenchant.org/webloglog , http://www.premiumpolar.com/blog , and http://www.swallowingtacks.com , all members of the community.

In another community (this community may not be suitable for all ages), a blogger Dawn hosts a poll to determine the funniest and sexiest blogger. She conducts interviews with other bloggers in the community, of course listing their sites (see http://up_yours.blogspot.com/2002_05_19_up_yours_archive.html ). She then becomes obsessed with one of the other bloggers Jim, which spurs comments by many others in the community (see http://jimspot.blogspot.com/2002_07_28_jimspot_archive.html ).

3. Approach

- Community extraction ( Section 3.2 ): We extract dense subgraphs from the blog graph; these correspond to all potential communities (whether or not bursty).
- Burst analysis ( Section 3.3 ): Building on the work of Kleinberg [ 11 ] on bursts in event streams, we perform a burst analysis of each subgraph obtained in step 1 to identify and rank bursts in these communities.

3.1. Time graphs

- A set
*V*of nodes where each node has an associated interval*D*(*v*) on the time axis (called the*duration*of*v*). - A set
*E*of edges where each is a triple (*u*,*v*,*t*) where*u*and*v*are nodes in*V*and*t*is a point in time in the interval .

3.2. Algorithms for community extraction

- Compared to the web, blogs are not characterized by the strong distinction between ``authority-type'' and ``hub-type'' pages [ 10 ]. Every node in the blog graph corresponds to a `human being'; this is in contrast with the web where pages can be loosely classified as `people' (the hubs) and `topics' (the authorities).
- In contrast with the web-scale experiments of [ 13 , 12 ], the scale of our work here permits us to operate entirely in memory without streaming the data from disk. As a result, it is feasible for us to seek dense (rather than complete) subgraphs as community signatures.

Unfortunately, finding the densest subgraph in an undirected graph is NP-hard and appears notoriously difficult to even solve approximately [ 7 ]. We therefore resort to heuristics that are simple, efficient, and effective. The blog graphs we deal with are small enough that we can perform all the operations in memory, in contrast to the earlier work of Kumar et al. [ 13 ].

Our algorithm consists of two steps--pruning and expansion. Pruning corresponds to identifying the seed of a community and expansion corresponds to growing the seed to a dense subgraph that forms the signature of a community. We adopt the convention that that a node can participate in at most one community.

3.3. Burst analysis

4. Methodology

In burst analysis, we identified the links in each community as relevant events (as in Section 2.1.3 ) and divided them into chronological batches according to the week each link was added. For each community identified in the previous step, we calculated the number of links created between blog members in the community during each week and the total number of links between pages in the community, to use as input to a two-state automaton. Each state of our automaton corresponds to a different fraction of relevant link generation: a lower rate during calm periods and a higher rate during bursty periods. By adjusting the scaling parameter which determines how much the high rate differs from the low rate, we were able to control the length of typical bursts. We experimented with various values for this parameter, and chose a value which resulted in the majority of bursts ranging from one week to several months.

5. Results

5.1. Analysis of prefix graphs

Size | 3 | 4 | 5 | 6 | 7 | 8 | 9 |

# | 143 | 165 | 79 | 14 | 2 | 1 | 5 |

Figure 4 shows the results of applying the same algorithm to the prefix graphs. The upper figure plots for each time interval the total number of communities in the prefix graph, and the total number of nodes that participate in one of those communities. The lower figure plots for each time interval the fraction of nodes that belong to some community. If this fraction is large, one can view the space of all blogs at that time as consisting as a set of small inter-networking communities, rather than a set of standalone pages. These graph show the same striking pattern as earlier graphs in this section: a period of minimal structure during 1999 and 2000, slow growth in 2001, and then rapid growth in 2002.

To conclude, the degree distributions match earlier observations from many communities, and do not represent a surprise. The analysis of the largest SCC (a macroscopic phenomenon) and of communities (a microscopic phenomenon) does represent a surprise: by both measures, a fundamental change occurred in blogspace approximately one year ago, and we are still experiencing the results of that transition. To assess whether this observed behavior does in fact stem from the sociology of blogspace, we must first study the alternate possibility: namely, that the emergence of a giant component and a strong community structure would occur naturally when the graph reached a certain size and density. We now address this question.

5.2. How random is the blog graph?

Figure 5 and Figure 6 plot the same quantities as Figure 3 and Figure 4 for randomized Blogspace instead of the original blog graph. Since we included the time 0 edges, there is a substantial SCC to begin with. As time progresses, this SCC seems to have a threshold growth as in blog graph case (this is how a random graph would behave as well). For completeness, we also evaluated the growth of the giant SCC without the time 0 edges; initially it was of course much smaller, but it exhibited a similar threshold behavior and became a larger fraction of the overall graph during the last timestep than in Blogspace. However, comparing to Figure 4 , we see that the number of nodes in communities for randomized Blogspace is an order of magnitude smaller than for Blogspace, indicating that the community formation in Blogspace is not simply an emergent property of the growth of the graph. On the other hand, comparing Figure 3 to Figure 5 shows that the SCC in randomized Blogspace grows much faster than in the original blog graph.

So randomized Blogspace actually attains a large strongly connected component faster than Blogspace does; however, it does not attain significant community structure. If bloggers added links to other blogs without reference to topicality, the graph would still become well-connected, but it would not exhibit the striking community focus that characterizes Blogspace.

5.3. Burstiness in blog communities

Interestingly, the increase in number of bursts is not explained by the increase in number of communities alone. Not only have the number of communities in Blogspace been growing over the last year, the average burstiness of each individual community has also been growing. This suggests that the transition behavior we have observed in all our temporal analyses is in fact correlated with a change in the behavior of the bloggers themselves. For whatever reason, perhaps because of the richer set of available communities, or the change in the population of Blogspace, content creators have increased their participation in bursty community activity over the last year, and the trend shows no sign of slowing.

5.4. Anecdotal evidence for the effectiveness of community and burst extraction

**1**- R. Agrawal and R. Srikant.

Fast algorithms for mining association rules in large databases.

In*Proc. 20th Intl. Conf. on Very Large Data Bases*, pages 487-499, 1994. **2**- Z. Bar-Yossef and S. Rajagopalan.

Template detection via data mining and its applications.

In*Proc. 11th Intl. World-Wide Web Conference*, pages 580-591, 2002. **3**- K. Bharat, B. Chang, M. Henzinger, and M. Ruhl.

Who links to whom: Mining linkage between web sites.

In*IEEE International Conference on Data Mining*, 2001. **4**- B. E. Brewington and G. Cybenko.

Keeping up with the changing web.

*Computer*, 33(5):52-58, 2000. **5**- D. Eppstein, Z. Galil, and G. Italiano.

Dynamic graph algorithms.

In*CRC Handbook of Algorithms and Theory of Computation, Chapter 22*. CRC Press, 1997. **6**- P. Erdös and A. Rényi.

On the evolution of random graphs.

*Magy. Tud. Akad. Mat. Kut. Intez. Kozl.*, 5:17-61, 1960. **7**- U. Feige, D. Peleg, and G. Kortsarz.

The dense*k*-subgraph problem.

*Algorithmica*, 29(3):410-421, 2001. **8**- D. Fetterly, M. Manasse, M. Najork, and J. Wiener.

Crawling towards light: A large scale study of the evolution of web pages.

In*Proc. 1st Workshop on Algorithms for the Web*, 2002. **9**- G. W. Flake, S. Lawrence, and C. L. Giles.

Efficient identification of web communities.

In*Proc. 6th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining*, pages 150-160, 2000. **10**- J. Kleinberg.

Authoritative sources in a hyperlinked environment.

*J. ACM*, 46(5):604-632, 2000. **11**- J. Kleinberg.

Bursty and hierarchical structure in streams.

In*Proc. 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining*, 2002. **12**- R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins.

Extracting large scale knowledge bases from the web.

In*Proc. 27th Intl. Conf. on Very Large Data Bases*, pages 639-650, 1999. **13**- R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins.

Trawling the web for cyber communities.

*WWW8/Computer Networks*, 31(11-16):1481-1493, 1999. **14**- T. Mitchell.

*Machine Learning*.

McGraw Hill, 1997. **15**- The Internet Archive .

Copyright is held by the author/owner(s).

WWW2003, May 20-24, 2003,
Budapest, Hungary.

ACM 1-58113-680-3/03/0005