hristope@ceid.upatras.gr

garofala@cti.gr

makri@ceid.upatras.gr

panagis@ceid.upatras.gr

psaras@ceid.upatras.gr

sakkopul@ceid.upatras.gr

tsak@cti.gr

Rio Campus, 26500 Patras, Greece

61 Riga Feraiou Str. 26110 Patras, Greece

In this work two distinct metrics are proposed, which aim to quantify the importance of a web page based on the visits it receives by the users and its location within the website. Subsequently, certain guidelines are presented, which can be used to reorganize the website, taking into account the optimization of these metrics. Finally we evaluate the proposed algorithms using real-world website data and verify that they exhibit more elaborate behavior than a related simpler technique.

Web Metrics, Web Organization, Log File Processing.

User visits' analysis is the first step in any kind of web site evaluation procedure; in order to assist in this, many commercial systems provide statistics about the most visited files and pages. However in [4] is shown that, the number of hits per page, calculated from log file processing, is an unreliable indicator of page popularity. Thus, a refined metric is proposed, which takes into account structural information, and, when using it, certain pages are reorganized leading to the overall improvement to site access. Other researchers have attempted to identify user behavior patterns, Chen et al [2], and to analyze the paths that users follow within a site, Berkhin et al [1]. A very influential recent work is that of Srikant and Yang, who furthermore suggest structural changes to the website, after having identified visit patterns that deviate from the site's initial organization. On the other hand, very little progress has been achieved in providing a software tool or even framework that would assist in automatically applying changes to the web site. Some early steps towards this direction can be found in [3].

This paper introduces two new popularity metrics. The first one differentiates between users coming from within the website and users coming from other websites, while the second one uses a probability model to reassess popularity. Key feature of the new metrics is the higher fidelity of the popularity estimates. We evaluate and examine these metrics by comparing them with the metrics introduced in [4].

The Absolute Accesses (*AA _{i}*) to a specific page

The

It is crucial to observe that a web page is accessed in
four different ways. Firstly it gets accesses within site, secondly directly via
bookmarks, thirdly by incoming links from the outside world and finally by
typing directly, its URL. Under this observation, we can decompose *a _{i}*, into two factors,

We define the following quantities: * d _{i}* is the tree depth of the page

, (2)

where

Hence, * a _{i,in}*
depends on the depth of page

In order to define * a _{i,out}* we denote as

(3)

Equation 3 implies that *a _{i,out}* depends on the number of both bookmarks and
links from outside to page

(4)

It is tempting to model traffic inside a site, using a
random walk approach (e.g. like in [5]) but *
a _{i,in}*, models the ease (or difficulty) of access
that a certain site infrastructure imposes to the user. Thus, a page's relative
weight, should be increased inversely proportional to its access probability. We
consider the site's structure as a directed acyclic graph (DAG), where

Considering a path * W _{j} =
{v_{r},..., v_{t}}* and computing the routing probabilities at
each step, the probability of ending up to

There may be more than one paths leading to

(6)

Considering page *i* as target, the higher
* D _{i}* is the lower

(7)

Our metrics will be refered to as TOP and PROB, for short

In order to evaluate the results of the described algorithms we used the web server log from http://www.ceid.upatras.gr (Computer Engineering & Informatics Dept., University of Patras). We obtained a web log covering 44 days, including 1,320,819 records (hits) and 3596 unique visitors (Feb 2002 - March 2002). After having analyzed the site structure in order to recognize the pages of our interest, we identified a structure of 4 levels and 98 pages. We also implemented pre-processing, parsing, distilling and extracting procedures in order to filter out unwanted raw data from log files and focus only on entries corresponding to pure HTML pages. We implemented our proposed algorithms and metrics, the corresponding pre-processing procedures and the GKM algorithm and metric using the Mathworks Matlab v6.5 language.

The GKM metric provides a rough estimate of the
multiplicative factor *a _{i}*
. As the latter is affected by the number of pages at
the same level, the

This work aims to provide refined metrics, useful techniques and the fundamental basis for high fidelity website reorganization methods and applications. Future steps include the description of a framework that would evaluate the combination of reorganization metrics with different sets of redesign proposals. We also consider as an open issue the definition of an overall website grading method that would quantify the quality and visits of a given site before and after reorganization.

- Berkhin, P., Becher, J.D. & Randall, D.J.
*Interactive path analysis of web site traffic*. Proceedings of KDD 01 pp. 414-419, 2001. - Chen Ming-Syan, Jong Soo Park, & Philip S. Yu.
*Data mining for path traversal patterns in a web environment*. In Proc. of the 16th International Conference on Distributed Computing Systems, pp. 385-392, 1996. - Christopoulou, E., Garofalakis, J, Makris, C.,
Panagis, Y., Sakkopoulos, E. & Tsakalidis, A.
*Automating restructuring of web applications*, poster presentation in ACM HT '02. - Garofalakis, J.D., Kappos, P. & Mourloukos, D.:
*Web Site Optimization Using Page Popularity*. IEEE Internet Computing 3(4): 22-29 (1999) - Kleinberg, J.M.:
*Authoritative Sources in a Hyperlinked Environment*. JACM 46(5): 604-632 (1999) - Srikant, R., Yang, Y.:
*Mining web logs to improve website organization*. In Proceedings of WWW10, pp 430-437, 2001