Master's Thesis

Abstract. The Internet is one of today's primary information sources, yet the information available through this medium is scattered and is not necessarily present in its original storage format. In fact, information is available as a set of interconnected documents, each with a possibly different terminology. Nevertheless, it is a typical task to return a set of documents that match a given query, as exemplified by the widespread use of web search engines. Such retrieval of documents, however, is only possible through a periodic traversal of the Web.

For the traversal to be effective, that is, for frequently updated documents to be visited regularly, parallelization is indispensable. However, operating a distributed system that supports parallelization across multiple machines is a substantially more complex job than operating a centralized one. Other crucial aspects include easy tracing of job completion and a flexible architecture. It is essential that the system support monitoring of how jobs are processed and that it adapt to various crawling tasks (e.g. traversal of specific domains, language-dependent behavior, reference structure analysis).

To meet the aforementioned demands, the author presents an architecture comprising units loosely coupled by means of well-defined interfaces and communicating with one another by exchanging messages. Loose coupling allows units to run on different machines and caters for extension with new functionality. Communication between units is transparent regardless of whether it is in fact a local or a remote message exchange, and the way units are interconnected is declaratively defined in XML descriptors. In general, the services offered by the underlying framework greatly reduce the complexity that would otherwise be involved in coordinating a distributed system. Unit development may subsequently focus on other aspects of web robot construction.
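To illustrate the declarative interconnection of units, the following is a minimal sketch of what such an XML descriptor might look like. The element and attribute names (`pipeline`, `unit`, `connect`, `host`) are hypothetical and chosen for illustration only; the thesis itself defines the actual descriptor schema.

```xml
<!-- Hypothetical descriptor: wires a downloader unit to a parser unit.
     Units may run on different hosts; the framework routes messages
     transparently, whether the exchange is local or remote. -->
<pipeline name="crawl-example">
  <unit id="downloader" class="robot.DownloaderUnit" host="node-1"/>
  <unit id="parser"     class="robot.ParserUnit"     host="node-2"/>
  <!-- Fetched documents flow from the downloader to the parser. -->
  <connect from="downloader" to="parser"/>
</pipeline>
```

Because the wiring lives in the descriptor rather than in unit code, a unit can be redeployed to another machine, or a new unit inserted into the flow, without recompiling the system.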