FRDCSA | internal codebases | Crawler

Project Description

The seeker algorithm is relatively straightforward. Both keywords and URLs are used to seed the search. Keywords are submitted to online search engines to retrieve web pages, through a module which learns effective queries. URLs are spidered. Speculative fetching is performed when a site is expected to be a project URL or a metasite, as classified by the WebKB tools. In this way, a database of project URLs is built up. Next, we use information extraction to populate KBs about software systems, then use these to initiate searches. Eventually we would like to extend this to a set of tactics for retrieving all information related to packaging and systems integration.
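The loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the function name `seek` and the three callables `search`, `fetch_links`, and `classify` are hypothetical stand-ins for the query-learning module, the spider, and the WebKB classifier, and the labels "project" and "metasite" are assumed string values.

```python
from collections import deque
from typing import Callable, Iterable, List


def seek(
    seed_urls: Iterable[str],
    seed_keywords: Iterable[str],
    search: Callable[[str], List[str]],       # hypothetical: keyword -> result URLs
    fetch_links: Callable[[str], List[str]],  # hypothetical: URL -> outgoing links
    classify: Callable[[str], str],           # hypothetical: URL -> "project", "metasite", or other
    max_pages: int = 100,
) -> List[str]:
    """Return the project URLs discovered from keyword and URL seeds."""
    # URL seeds go straight into the frontier.
    frontier = deque(seed_urls)
    # Keyword seeds become URLs by way of a search engine query.
    for keyword in seed_keywords:
        frontier.extend(search(keyword))

    seen = set()
    project_urls = []
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)

        label = classify(url)
        if label == "project":
            project_urls.append(url)
        # Speculative fetching: only pages expected to be project pages
        # or metasites have their links followed.
        if label in ("project", "metasite"):
            frontier.extend(fetch_links(url))
    return project_urls
```

Passing the search, fetch, and classify steps in as callables keeps the sketch independent of any particular search engine or classifier, and lets the loop be exercised against a small in-memory link graph before touching the network.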


  • Use data on which files we have found useful to the FRDCSA, along with their descriptions and other relevant information, to guide a focused crawler toward similar material.
  • Look into using my browser's link history to build a model of interest, and use it to train a focused crawler.
  • Write a better focused crawler.
  • Come up with a new name for the crawler.
  • Sorcerer, -a crawler
  • Combine focused crawler: deb http://combine.it.lth.se/ debian/
  • Ask William Cohen's graduate students about how to write a focused crawler for AI software.
  • Look into a domain-specific web crawler.
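
One way to read the first item in the list above is as a priority queue over candidate links, ordered by how closely the text around each link resembles the descriptions of files already found useful. The sketch below assumes a plain bag-of-words cosine similarity; the class name `FocusedFrontier` and the helpers `tokens` and `cosine` are hypothetical, and a real focused crawler would likely use a trained classifier instead.

```python
import heapq
import math
import re
from collections import Counter


def tokens(text: str) -> Counter:
    """Lowercased word counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class FocusedFrontier:
    """Crawl frontier that returns the most on-topic URL first."""

    def __init__(self, useful_descriptions):
        # The interest profile is the merged vocabulary of known-useful items.
        self.profile = Counter()
        for description in useful_descriptions:
            self.profile.update(tokens(description))
        self._heap = []
        self._count = 0  # tie-breaker so equal scores pop in insertion order

    def push(self, url: str, context_text: str) -> None:
        score = cosine(self.profile, tokens(context_text))
        # heapq is a min-heap, so the score is negated to pop the best first.
        heapq.heappush(self._heap, (-score, self._count, url))
        self._count += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self) -> int:
        return len(self._heap)
```

The same structure would also fit the browser-history idea: the profile could be built from the titles of visited pages rather than from file descriptions.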

This page is part of the FWeb package.
Last updated Sat Oct 26 16:51:03 EDT 2019.