In this paper, we present Porqpine, a fully distributed, peer-to-peer collaborative search engine for web pages. The system uses a novel query-based model and collaborative filtering techniques to obtain user-customized results. Porqpine is a multi-agent system that runs on the computers of the user community, and it is intended to complement conventional centralized search engines by providing more accurate and personalized results.
Search, Search Engine, Distributed System, peer-to-peer, p2p, Collaborative Filtering, Multi-agent systems, Agents, Knowledge Sharing, Knowledge Management.
Most web search engines in use (Google, Altavista, etc.) fail to take advantage of the intentions, interests and preferences of their users. Some ways of exploiting user information have been explored in collaborative and knowledge-sharing systems [3,4,6], but far less in the area of web search engines. This is partly due to the highly centralized nature, at least at a conceptual level, of the indexing tasks of search engines.
We present a system that extends current centralized search engines (CSEs) by introducing collaborative filtering mechanisms and a novel scheme that models web pages based on the queries that retrieve them rather than on the content of the pages.
The creation of a permanent user profile makes it possible to filter search results according to user interests and to introduce a degree of personalization into the search process. For example, an advanced Java programmer does not expect to obtain, for the same query terms, the same type of documents as a novice Java programmer. Moreover, user profiles allow automatic query expansion by inserting information about the user's interests, so that more precise and specific results can be expected. If one considers users not only as isolated individuals but also as a community, this social dimension can be exploited to access the expertise of people with similar interests. If two people have similar knowledge and interest profiles, for example two advanced Java programmers, they will probably find the same pages interesting, while those pages may not interest less similar users: novice programmers could find them too technical. The social dimension of the community allows clustering users according to their interests and expertise, focusing the search on interesting information by narrowing the domain of interest.
Another advantage is that our system uses a model of web pages that is not directly based on page contents. Centralized search engines work by analyzing contents and computing the important words in each page. Our system also uses a model based on the most relevant words, but these words are not extracted automatically: they are introduced by human users with high proficiency in their domains of expertise, implicitly, through the queries those users submit either to our system or to external CSEs.
CSEs might already use these methods and techniques to improve their performance. However, under a centralized paradigm, serious problems of privacy, anonymity and scalability would arise. This is why our system relies on a distributed approach, namely a peer-to-peer system.
All nodes in the network, each implemented as a multi-agent system, are identical, and together they form a distributed search engine, much like other peer-to-peer systems (Gnutella, Kazaa or Freenet) used for file sharing. The difference is that what is shared here is not files but the knowledge that the different users have about web pages.
CSEs usually resort to some variation of the vector space model applied to page content, to information associated with other resources, or to link structure in order to index pages. These models are known as "content-based models" and are quite comprehensive, since they associate a set of keywords with each page, extracted from the page content or from the links that point to the page. The selection of relevant keywords relies on information retrieval techniques such as TFIDF (Term Frequency-Inverse Document Frequency). All these processes are performed by crawlers that continuously traverse the web, parsing pages and building their corresponding content models. Our system, by contrast, uses no crawler, although each node could in principle run several local agents devoted to these tasks. A page's model is instead built from the queries that retrieve that page in CSEs. This yields a more accurate model, since only words relevant to people are used. People tend to use the same subset of words very frequently; query redundancy, which is the basis of caches in CSEs [5,7], supports the idea of taking the words of a query as a source of knowledge. For example, when a user issues the query "java rmi", he obtains a set of pages, and the model of each retrieved page is updated with the terms "java" and "rmi". Any other person issuing a query about "java rmi examples" would then obtain a subset of those pages.
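The query-based model described above can be sketched as follows. This is an illustrative reconstruction, not Porqpine's actual implementation: class and method names are assumptions, and the scoring function is a simple normalized term overlap chosen for clarity.

```python
from collections import Counter, defaultdict

class QueryBasedIndex:
    """Illustrative sketch: each page's model is a bag of query terms
    accumulated from the queries that retrieved it (names hypothetical)."""

    def __init__(self):
        self.models = defaultdict(Counter)  # page URL -> term frequencies

    def record_query(self, query, retrieved_pages):
        """Update the model of every retrieved page with the query terms."""
        terms = query.lower().split()
        for page in retrieved_pages:
            self.models[page].update(terms)

    def score(self, page, query):
        """Score a page for a query by overlap with its accumulated model."""
        model = self.models[page]
        total = sum(model.values()) or 1
        return sum(model[t] for t in query.lower().split()) / total

index = QueryBasedIndex()
index.record_query("java rmi", ["http://example.org/rmi-tutorial"])
index.record_query("java rmi examples", ["http://example.org/rmi-tutorial"])
# The page's model now contains: java:2, rmi:2, examples:1
```

Note how the model grows only from human-issued queries: the more often people reach a page with a term, the more that term dominates the page's model, which is exactly the redundancy the caching literature observes.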
However, our system depends, at least initially, on the existence of a large repository of content-based models so that the query-based model can bootstrap. These repositories could be, for instance, those of big CSEs such as Google and Altavista. This is why our system can be considered an extension of them, although the distributed computing power could eventually be used to crawl the web and build content-based models, overcoming this dependence.
Current CSEs always return the same results for the same query regardless of the user, since they have no personalization mechanisms, mainly because of the scalability and privacy problems such mechanisms would raise in a centralized setting. That sort of information, however, adapts well to a decentralized peer-to-peer solution, since the personal profiling information belonging to each user is kept on his own computer.
Each user in the Porqpine system issues (either explicitly through a voting mechanism or implicitly through his actions) a certain rating for each ⟨query, page⟩ tuple.
Each node thus stores the query-based model of each page and the corresponding set of personal assessments for that page in relation to a given query. Each user also has a profile that is implicitly built by the system and identifies the user in a given role. With all these components it is possible to apply a collaborative filtering technique: given a user p and a query q, Porqpine returns pages that have been well rated by people similar to p as a result of queries similar to q. These pages should be interesting for user p, since their rating is computed from the opinions of similar people who (1) have knowledge similar to p's in the same domain, (2) have an analysis ability better than that exhibited by automatic methods, and (3) have similar interests. This is the core of collaborative filtering systems [3,4,6].
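The filtering step can be illustrated with a minimal user-based collaborative filtering sketch in the spirit of the scheme above: other users' ratings for a page are weighted by how similar their profile is to the querying user's. The profile representation (term-frequency dictionaries), the cosine similarity, and all the toy data are assumptions for illustration, not Porqpine's actual formulas.

```python
import math

def cosine(a, b):
    """Cosine similarity between two term-frequency profiles."""
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def predict_rating(user_profile, other_profiles, ratings, page):
    """Weight each other user's rating for `page` by profile similarity."""
    num = den = 0.0
    for other, profile in other_profiles.items():
        if page in ratings.get(other, {}):
            sim = cosine(user_profile, profile)
            num += sim * ratings[other][page]
            den += sim
    return num / den if den else None

# Toy data: term-frequency profiles and per-page ratings in [0, 1].
me = {"java": 3, "rmi": 2}
others = {"alice": {"java": 4, "rmi": 1}, "bob": {"cooking": 5}}
ratings = {"alice": {"http://example.org/rmi-tutorial": 0.9},
           "bob": {"http://example.org/rmi-tutorial": 0.2}}
score = predict_rating(me, others, ratings, "http://example.org/rmi-tutorial")
```

Here bob's rating carries zero weight because his profile shares no terms with the querying user's, so the prediction follows alice's opinion: exactly the "similar people" effect points (1)-(3) describe.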
So far, we have only discussed the information or knowledge contained in each node. This information, however, needs to be shared among all the nodes, and for this the peer-to-peer approach has been chosen. Each node is not only an information producer or consumer but also a router that forwards requests and results. Peer-to-peer systems usually forward requests in such a way that, once the information or file is found, a direct communication is established between the consumer (who queries) and the producer (who serves). Our system does not work this way: results are also sent back to the consumer through the peer-to-peer network, because neither the request nor the result carries information about who actually initiated the request, or who answered with results.
By doing so, anonymity, which is a major concern in our system, is preserved. Each node of the system acts as a "blind proxy"; thus, it is impossible for a single malicious node to find out which node issued the request or which node generated the results.
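The "blind proxy" behaviour can be sketched as follows: each node remembers only which neighbour a request id arrived from and routes results back hop by hop, so no node learns who originated the query or who produced the answer. This is an illustrative reconstruction under assumed message and class names, not Porqpine's protocol.

```python
class Node:
    """Minimal blind-proxy node: forwards requests and relays results
    back along the reverse path, one hop at a time (illustrative)."""

    def __init__(self, name):
        self.name = name
        self.neighbours = []
        self.back_route = {}  # request id -> neighbour it came from
        self.received = []    # results delivered to this node's user

    def send_request(self, req_id, payload):
        self.back_route[req_id] = None  # mark ourselves as the origin
        for n in self.neighbours:
            n.on_request(req_id, payload, origin=self)

    def on_request(self, req_id, payload, origin):
        if req_id in self.back_route:
            return  # already seen this request; avoid loops
        self.back_route[req_id] = origin
        result = self.answer(payload)
        if result is not None:
            origin.on_result(req_id, result)
        for n in self.neighbours:
            if n is not origin:
                n.on_request(req_id, payload, origin=self)

    def on_result(self, req_id, result):
        back = self.back_route.get(req_id)
        if back is None:
            self.received.append(result)  # we originated this request
        else:
            back.on_result(req_id, result)  # relay one hop back

    def answer(self, payload):
        return None  # overridden by nodes that hold matching knowledge

class KnowledgeNode(Node):
    """A node whose local repository can answer some queries."""
    def __init__(self, name, knowledge):
        super().__init__(name)
        self.knowledge = knowledge  # query payload -> result
    def answer(self, payload):
        return self.knowledge.get(payload)

# Three nodes in a line: A - B - C; only C can answer the query.
a, b = Node("A"), Node("B")
c = KnowledgeNode("C", {"java rmi": "http://example.org/rmi-tutorial"})
a.neighbours, b.neighbours, c.neighbours = [b], [a, c], [b]
a.send_request("q1", "java rmi")
```

In this run the result reaches A through B, yet B only knows its immediate neighbours on the path and C only knows B: neither can tell whether its neighbour originated the request or was itself just another proxy.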
Our propagation scheme can be seen as a stochastic breadth-first exploration of the neighbours of a node, in a similar fashion to [1,2]. That is, a message is not sent to all neighbours of a node but only to a certain percentage of them, chosen by means of a "neighbour profile" that represents the knowledge contained in the part of the network reachable through each neighbour (without reflecting the knowledge contained in any individual node). This profile, which is built dynamically by the system, is used as a routing table for message propagation without endangering anonymity. In this way the number of messages generated by the system is kept low. The risk, of course, is that some nodes may never be visited.
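A hedged sketch of such a propagation policy: forward a query only to a fraction of neighbours, preferring those whose neighbour profile best matches the query, with a small chance of exploring elsewhere. The profile format (term-frequency dictionaries), the overlap score, and the exploration probability are all assumptions for illustration.

```python
import random

def choose_neighbours(query_terms, neighbour_profiles, fraction=0.5, rng=random):
    """Pick ~`fraction` of neighbours, biased towards matching profiles."""
    scored = []
    for name, profile in neighbour_profiles.items():
        overlap = sum(profile.get(t, 0) for t in query_terms)
        scored.append((overlap, name))
    scored.sort(reverse=True)
    k = max(1, round(len(scored) * fraction))
    top = [name for _, name in scored[:k]]    # best-matching neighbours
    rest = [name for _, name in scored[k:]]
    if rest and rng.random() < 0.1:           # occasionally explore elsewhere
        top[-1] = rng.choice(rest)
    return top

profiles = {"n1": {"java": 5}, "n2": {"cooking": 3}, "n3": {"java": 1}}
chosen = choose_neighbours(["java"], profiles, fraction=0.67)
```

The deterministic part keeps traffic low by following the knowledge gradient; the exploration step mitigates, but does not eliminate, the stated risk that some nodes are never visited.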
In this paper we have described the Porqpine system, a multi-agent system that builds up a collaborative and truly distributed search engine. The main features of the presented system are: 1) the introduction and usage of a novel model for web pages based on query terms instead of on the content or link structure of a page; 2) the integration of the users' subjective ratings, registered implicitly by the system, in order to filter and rank the results in a collaborative fashion; 3) the avoidance of a central repository, which is replaced by thousands of personal repositories spread among the users' computers. All these advantages are only possible because the information is distributed among all the nodes of the network, which communicate with each other in a peer-to-peer fashion.
The research described in this paper is partly supported by the EC project Agentcities.RTD, reference IST-2000-28385. The opinions expressed in this paper are those of the authors and are not necessarily those of the Agentcities.RTD partners.