While the knowledge level of AI problem solving has been captured by numerous libraries of problem-solving methods, a similar effort for web-related knowledge-intensive tasks is still missing. Our focus is on tasks arising in the analysis of website content and structure, such as pornography recognition or extraction of key facts about companies. We identify the core generic tasks and atomic inferences, and illustrate them on instances from the application problems. We suggest how problem-solving models in combination with ontologies could help in checking the consistency of distributed applications, in the context of our web-service-based system named Rainbow. A brief comparison with the IBrow project is included.
Problem-Solving Methods, Information Extraction, Ontology.
While the retrieval of documents in the scope of the whole WWW is dominated by computation-centred methods relying on optimised keyword indexes, web analysis at the level of a website seems to lend itself to inference-centred, knowledge-intensive methods, which would respect the peculiarities of different domains and data structures. However, while knowledge-based methods are declared to be the heart of the future Semantic Web (relying on explicit knowledge annotations), their use for analysis of the current web has so far received limited attention. Yet a gradual semantic 'upgrade' of the current web is probably a more appropriate way of obtaining the Semantic Web than building it from scratch. Knowledge-based systems capable of transforming various forms of (raw) web data into collections of formalised statements should thus be addressed by research in a principled rather than ad hoc manner.
The study of principles of knowledge-based reasoning is centred around Problem-Solving Methods (PSMs): knowledge models describing complex reasoning processes in terms of inferences and knowledge roles (the latter representing connecting points to domain-specific concepts). Comprehensive libraries of PSMs have already arisen in the context of the CommonKADS initiative, while the more recent IBrow (http://www.swi.psy.uva.nl/projects/ibrow) project aims at semi-automatic construction of knowledge-based applications from components described by means of PSMs. Most PSM libraries are, however, oriented towards traditional (AI) problem-solving tasks such as diagnosis or planning. Their extension to substantially new tasks, such as those arising in web information access, is an unresolved problem.
In this paper, we examine the nature of problem solving in website analysis at the knowledge level, formulate the core set of tasks and inferences, and illustrate their use on specific problems.
Based on previous experience, we estimate that the majority of website analysis tasks can be characterised as one of the following: classification, retrieval or extraction. The 'fuel' for all of these tasks consists of web resources of varying granularity, identified by their web location (URL, XPath and the like). The syntactic type of a resource is, for example, 'physical page', 'hyperlink', 'text paragraph', 'image' or 'phrase'. Complex tasks can be decomposed into simpler ones, and at the bottom of the decomposition hierarchy we can identify analogous atomic inferences: classify, retrieve and extract. Let us characterise the tasks and atomic inferences, respectively:
The inputs and outputs of tasks/inferences are as follows. Resource corresponds to concrete resources identified by their web location. Type and class should be mapped onto concepts from a domain ontology, such as 'physical page' and 'company homepage', respectively. Content corresponds to data extracted from resources; in our project, we restrict it to alphanumeric data. Finally, constraints correspond to logical conditions over ontology relations.
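To make these roles more tangible, the following Python sketch models a resource and the three atomic inferences as plain functions. It is only an illustration under stated assumptions: the names `Resource`, `rtype` and `Constraint`, as well as all function signatures, are hypothetical and not taken from Rainbow. Constraints are simplified to predicates over a single resource (rather than logical conditions over ontology relations), and extraction is reduced to a regular-expression match over text supplied by the caller.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Resource:
    """A web resource identified by its web location (hypothetical model)."""
    location: str  # e.g. a URL or an XPath expression
    rtype: str     # syntactic type, e.g. "physical page" or "hyperlink"


# Assumption: a constraint is a simple predicate over one resource.
Constraint = Callable[[Resource], bool]


def classify(resource: Resource, rules: Dict[str, Constraint]) -> Optional[str]:
    """Return the first class whose rule accepts the resource, else None."""
    for class_name, rule in rules.items():
        if rule(resource):
            return class_name
    return None


def retrieve(candidates: Iterable[Resource], rtype: str,
             constraints: List[Constraint]) -> List[Resource]:
    """Return candidates of the requested type that satisfy all constraints."""
    return [r for r in candidates
            if r.rtype == rtype and all(c(r) for c in constraints)]


def extract(text: str, pattern: str) -> List[str]:
    """Return alphanumeric content matched by the pattern in the given text."""
    return re.findall(pattern, text)
```

In this simplification, classification maps a resource to a class, retrieval maps a type plus constraints to a set of resources, and extraction maps a resource's text to content.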
The Rainbow (http://rainbow.vse.cz) project aims at knowledge-based analysis of web content and structure, in particular at the level of a 'website'. Its architecture consists of a closed collection of knowledge-based modules specialised in different types of data and communicating with each other via message exchange using web-service technology. Three application projects are undertaken in this context: recognition of pornographic websites, extraction of key facts about companies from their websites, and identification of the boundary of a logical website.
A high-level scenario for the first task could be described by the following decomposition (in tentative, informal pseudo-code; control structures would have to be specified separately):
classify_site :- classify_site_by_URL, @classify_site_by_structure.
classify_site_by_structure :- retrieve_hub_page_by_topology, classify_page_by_HTML1, retrieve_content_pages, MULTIPLE (classify_page_by_HTML2, @classify_page_by_image).
classify_page_by_image :- retrieve_image, classify_image_by_histogram.
The top-level decomposition assumes that a pornographic site can be recognised either using just its URL or through analysis of the site itself. At the next level, the subtasks of the latter are listed: topology-based retrieval of an internal 'hub' page, HTML-based verification of whether the page is of class 'gallery' (with image fingerprints), collection of 'content' pages, and, finally, verification of whether these pages actually contain pornography. The presence of a single image with hardly any accompanying text in the HTML code is already an indication of pornography (for simplicity, we ignore the common case of a gallery page being directly linked to bitmap images). Image analysis, in turn, computes the proportion of body colour in the colour histogram.
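The two ends of this decomposition can be sketched in Python as follows. This is a minimal illustration, not the Rainbow implementation: the representation of the histogram as a mapping from quantised RGB bins to pixel counts, the set `body_bins`, the default `threshold` of 0.5 and the way the two top-level alternatives are combined are all assumptions made for the example.

```python
from typing import Callable, Dict, Set, Tuple

Colour = Tuple[int, int, int]  # assumption: a quantised RGB histogram bin


def body_colour_proportion(histogram: Dict[Colour, int],
                           body_bins: Set[Colour]) -> float:
    """Share of pixels that fall into bins regarded as body colour."""
    total = sum(histogram.values())
    if total == 0:
        return 0.0
    body = sum(n for colour, n in histogram.items() if colour in body_bins)
    return body / total


def classify_image_by_histogram(histogram: Dict[Colour, int],
                                body_bins: Set[Colour],
                                threshold: float = 0.5) -> bool:
    """Flag the image when the body-colour share reaches the threshold.

    The threshold value is a placeholder, not a figure from the paper.
    """
    return body_colour_proportion(histogram, body_bins) >= threshold


def classify_site(url: str,
                  classify_site_by_url: Callable[[str], bool],
                  classify_site_by_structure: Callable[[str], bool]) -> bool:
    """Top-level task: try the URL first, analyse the site only if needed."""
    return classify_site_by_url(url) or classify_site_by_structure(url)
```

Passing the subtasks in as callables keeps the sketch independent of how the URL-based and structure-based classifiers are actually realised.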
A similar scenario for the second task might be:
extract_facts_from_site :- retrieve_link, @extract_facts_from_page.
extract_facts_from_page :- retrieve_lexical_indicators, extract_values_by_NLP, extract_values_by_HTML.
The page potentially containing the desired type of facts (e.g. contact address, prices or references) must first be found by link (URL and anchor) analysis. Subsequent page-level extraction is based on the presence of generic lexical indicators, and separately exploits the linguistic structure of sentences and the structure of HTML mark-up (for semi-formatted information).
The third task, identification of the boundary of a logical website, has the character of retrieval, and will employ analysis of topology (assumption of internal connectivity) as well as of HTML (assumption of a shared mark-up 'envelope').
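The first two steps of this scenario might look as follows in Python. The sketch rests on several assumptions: links are given as (URL, anchor text) pairs, lexical indicators are plain keywords matched case-insensitively as substrings, and sentences are split naively at punctuation. The function bodies are illustrative only; the NLP-based and HTML-based value extraction steps are not covered.

```python
import re
from typing import List, Tuple

Link = Tuple[str, str]  # assumption: (URL, anchor text)


def retrieve_link(links: List[Link], indicators: List[str]) -> List[Link]:
    """Keep links whose URL or anchor text mentions any indicator."""
    lowered = [w.lower() for w in indicators]
    return [(url, anchor) for url, anchor in links
            if any(w in url.lower() or w in anchor.lower() for w in lowered)]


def retrieve_lexical_indicators(text: str, indicators: List[str]) -> List[str]:
    """Return the sentences of a page that contain at least one indicator."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lowered = [w.lower() for w in indicators]
    return [s for s in sentences if any(w in s.lower() for w in lowered)]
```

The sentences returned by the second function would then be handed to the linguistic and mark-up analysers for the actual value extraction.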
We developed a system of ontologies in DAML+OIL (http://www.daml.org/language) describing the WWW from the perspective of different ways of analysis (HTML, free text, URLs, link topology, ...) as well as in an integrated way, and generically as well as with respect to particular problems. Meta-properties of generic tasks are described by means of very simple task ontologies. The role of ontologies in connection with problem-solving models would be to check the consistency of services committed to these models.
A trivial check could match the input and output of the same task. For example, the above-mentioned task of retrieving the 'hub' page inherits the following feature from the generic retrieval task: the output resource(s) must be instance(s) of the 'toClass' concept of the property 'identified-by' for the concept corresponding to the type of resource in the input request. As the type 'page' is required, the output should be of data type 'URL', as it is the only allowed identifier of pages. More complex checks could involve multiple services and prevent, for example, deadlock or the execution of tasks doomed to fail.
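Such a trivial check can be sketched in Python with the ontology reduced to a nested dictionary. This is a toy stand-in for the DAML+OIL ontologies, not their actual content: only the 'page' entry mirrors the example in the text, the 'text paragraph' entry is a hypothetical addition, and the function names are assumptions.

```python
from typing import Dict, Iterable, Set

# Toy ontology: concept -> property -> allowed 'toClass' concepts.
# Only the 'page' entry follows the text; the second entry is hypothetical.
ONTOLOGY: Dict[str, Dict[str, Set[str]]] = {
    "page": {"identified-by": {"URL"}},
    "text paragraph": {"identified-by": {"XPath"}},
}


def allowed_identifier_types(resource_type: str) -> Set[str]:
    """Look up the 'toClass' concepts of 'identified-by' for a resource type."""
    return ONTOLOGY.get(resource_type, {}).get("identified-by", set())


def check_retrieval_output(requested_type: str,
                           output_types: Iterable[str]) -> bool:
    """Check that every output identifier type is allowed for the request.

    An unknown resource type is treated as inconsistent.
    """
    allowed = allowed_identifier_types(requested_type)
    return bool(allowed) and all(t in allowed for t in output_types)
```

For a request of type 'page', an output typed as 'URL' passes the check, whereas an output typed as 'XPath' would be rejected.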
In the above-mentioned IBrow project, an arbitrary knowledge-based component can be described by a collection of models in the UPML language, each focusing on a different aspect of knowledge: domain, task, method, transformation between these, etc. 'Brokering' tools retrieve the components and consistently configure applications. In contrast, the scope of Rainbow is limited to a closed class of problem-solving tasks: those related to classification, retrieval and content extraction in websites. The collection of modules is closed, too; what changes are their problem-oriented knowledge bases and the concrete model of co-operation. An IBrow component is typically assumed to completely solve (by itself) a problem such as finding the explanation of a complaint in a causal network or (in the document-analysis application) determining the language of a document. In Rainbow, on the other hand, the modules mostly have to co-operate in order to accomplish a meaningful task. It is the different configurations of Rainbow, rather than individual modules, that could be viewed as 'components' in the sense of IBrow and described by problem-solving methods.
Our work represents a modest contribution to research in problem-solving modelling, focused on the domain of web (content and structure) analysis. The rudimentary model we propose distinguishes three abstract tasks/inferences; we demonstrate their instantiation for specific problems. Although the notion of a (generic) problem-solving method has not yet appeared in our work, we expect that such methods could arise in the future by abstraction from a larger collection of specific task structures (not necessarily restricted to problems solved by Rainbow). Such a bottom-up process would mimic the development of 'AI' problem solving two decades ago.
In the opposite direction, we have to find a way to convert the abstract models into operational control structures governing the behaviour of the Rainbow architecture for the sake of a particular application. We are considering adopting the concept of skeletal planning, which seems to respond to the challenges imposed by the heterogeneity of the current web.