Master's Thesis

Abstract. The Internet is one of today's primary information sources, yet the information available through this medium is scattered and not necessarily present in its original storage format. In fact, information is available as a set of interconnected documents, each with a possibly different terminology. Nevertheless, returning a set of documents that match a given query is a typical task, as exemplified by the widespread use of web search engines. Such retrieval of documents, however, is only possible through periodic traversal of the Web.

For the traversal to be effective, that is, for frequently updated documents to be visited regularly, parallelization is indispensable. However, operating a distributed system that supports parallelization across multiple machines is a substantially more complex task than operating a centralized one. Other crucial requirements include easy tracing of job completion and a flexible architecture: the system must support monitoring how jobs are processed and adapt to various crawling tasks (e.g. traversal of specific domains, language-dependent behavior, reference structure analysis).

To meet these demands, the author presents an architecture composed of units that are loosely coupled by means of well-defined interfaces and communicate with one another by exchanging messages. Loose coupling allows units to run on different machines and makes it easy to extend the system with new functionality. Communication between units is transparent regardless of whether it is in fact a local or a remote message exchange, and the way units are interconnected is declaratively defined in XML descriptors. The services offered by the underlying framework greatly reduce the complexity that would otherwise be involved in coordinating a distributed system, so unit development can focus on other aspects of web robot construction.
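As a rough illustration of this unit model, the sketch below shows what such a message-passing contract might look like in C#. All names here (Message, IUnit, ForwardingUnit) are hypothetical and chosen for exposition; the thesis's actual interfaces are defined by the framework and wired together from the XML descriptors.

    using System;

    // Hypothetical message type; the real framework's message format is
    // not reproduced here.
    public class Message
    {
        public string Topic { get; set; }
        public object Payload { get; set; }
    }

    // Every unit implements one narrow interface, so the framework can
    // instantiate units and wire them together from an XML descriptor
    // without knowing their concrete types. Whether a sent message stays
    // in-process or crosses machine boundaries is invisible to the unit.
    public interface IUnit
    {
        string Name { get; }
        void Receive(Message message);
        void Connect(Action<Message> send); // framework injects the outgoing channel
    }

    // A trivial unit that forwards every message it receives.
    public class ForwardingUnit : IUnit
    {
        private Action<Message> send = _ => { };
        public string Name => "forwarder";
        public void Connect(Action<Message> send) => this.send = send;
        public void Receive(Message message) => send(message);
    }

Because a unit only ever sees the Receive/Connect contract, moving it to another machine becomes a deployment decision recorded in the descriptor rather than a code change.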

The way the architecture operates is demonstrated by a reference implementation in which multiple agents connect to a coordinator component. Both the coordinator and the individual agents may themselves consist of units distributed across multiple computers. The coordinator partitions the web and maps each partition to a crawling agent. Agents traverse the hosts and domains they have been assigned, sending external references back to the coordinator. As agents are typically geographically close to the domains they explore, and the ratio of external to internal links is sufficiently low, substantial savings on network traffic can be attained. A graphical user interface helps monitor the system, allowing the operator to query status and adjust behavioral parameters.
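A minimal sketch of such a partitioning step, assuming a simple hash-of-host scheme (the thesis's actual partitioning policy is not detailed here): hashing the host rather than the full URL keeps every page of a site on the same agent, so only cross-site links ever travel back to the coordinator.

    using System;

    public static class Partitioner
    {
        // Map a host to one of agentCount crawling agents using a
        // deterministic hash, so the same host always lands on the
        // same agent.
        public static int AgentFor(Uri url, int agentCount)
        {
            string host = url.Host.ToLowerInvariant();
            int hash = 0;
            foreach (char c in host)
                hash = (hash * 31 + c) & int.MaxValue; // stay non-negative
            return hash % agentCount;
        }
    }

An agent that extracts a link can then compare AgentFor(link, n) against its own index: matching links go to its local frontier, while all others are batched and returned to the coordinator for redistribution.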

The application is implemented on the .NET platform and fully utilizes the thread management and remote communication facilities the platform provides. The framework itself is written in C#, while data operations are delegated to a relational database for the sake of flexibility. Though the particular implementation uses Microsoft's SQL Server, the database could be replaced with a structured file for a further efficiency gain, as the implementation only moderately exploits the power SQL offers.
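For illustration, a database-backed URL frontier along these lines might look as follows. The table name and schema (Frontier with Url and Visited columns) are assumptions made for this sketch, not the thesis's actual schema, and the T-SQL OUTPUT clause used here is specific to SQL Server.

    using System.Data.SqlClient;

    public class Frontier
    {
        private readonly string connectionString;

        public Frontier(string connectionString) =>
            this.connectionString = connectionString;

        // Fetch one unvisited URL and mark it visited in a single
        // round trip; returns null when the frontier is empty.
        public string TryDequeue()
        {
            using var connection = new SqlConnection(connectionString);
            connection.Open();
            using var command = new SqlCommand(
                @"UPDATE TOP (1) Frontier SET Visited = 1
                  OUTPUT inserted.Url
                  WHERE Visited = 0;", connection);
            return command.ExecuteScalar() as string;
        }
    }

Keeping the frontier behind such a narrow interface is what makes the suggested swap to a structured file feasible: only this class would need to change.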

Co-developed with Péter Pallos.

Licensed under the GNU General Public License. The complete project is available for download at SourceForge.

Downloads: thesis (PDF) and application (7z archive; requires the GPL-licensed 7-Zip archiver to unpack).