FRDCSA | minor codebases | Spider

Project Description

Spider crawls websites and stores multiple versions of them over time, like the Wayback Machine. It is especially desired that the system be able to deliberate as it spiders, using the NLU capabilities of FRDCSA. Ideally it should work like a web/Unix softbot.
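
As a rough illustration of storing multiple versions over time, the sketch below writes each fetched page under a directory named for its URL, with a UTC timestamp as the file name, and skips pages that have not changed since the last crawl. The directory layout and the names snapshot_path and store_snapshot are assumptions made for this example, not the layout Spider actually uses.

```python
import hashlib
import time
from pathlib import Path
from typing import Optional


def snapshot_path(root: Path, url: str, fetched_at: float) -> Path:
    """Map a URL and fetch time to root/<url-hash>/<UTC timestamp>.html."""
    # Hypothetical layout: one directory per URL, one file per version.
    url_key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(fetched_at))
    return root / url_key / f"{stamp}.html"


def store_snapshot(root: Path, url: str, body: bytes,
                   fetched_at: Optional[float] = None) -> Optional[Path]:
    """Save body as a new version unless it matches the latest stored one."""
    if fetched_at is None:
        fetched_at = time.time()
    path = snapshot_path(root, url, fetched_at)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Timestamped names sort chronologically, so the last one is the newest.
    versions = sorted(path.parent.glob("*.html"))
    if versions and versions[-1].read_bytes() == body:
        return None  # unchanged since the last crawl; keep the old version
    path.write_bytes(body)
    return path
```

Fetching is deliberately left out, so the same function can store pages retrieved by wget, WWW::Mechanize, or any other client. Timestamps have one-second resolution here, which is enough for periodic crawls but would need refining for faster ones.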


  • To do that, we will need a way to verify the spider results.
  • Set up job-search to continuously spider the latest postings from Craigslist, and use them to connect people with jobs.
  • Things to do once connected to the internet: send the URL of the plans to Justin, find out when Mary is going to take me shopping, email Mike asking for a time when we can move the server in and maybe swing by the store, and look into writing a spider for Craigslist shopping items for Broker.
  • One thing is that I haven't been studying formal models enough lately. For example, the pi-calculus looks interesting. We want to map it out and use CLEAR to ensure familiarity. Another thing we could do is spider CiteSeer again, perhaps slowly over time.
  • Add to CLEAR the ability to spider with WWW::Mechanize when wget fails.
  • Add features to CLEAR to enable it to "spider" websites that paginate with "next" links.
  • I just thought of a way that CLEAR could read through hyperlinked books: it could spider them and determine the reading order from the order of the links.
  • Spider other people's webpages at CMU to build a language model, and use it when writing my website.
  • Spider systems through the documentation written about them.
  • A powerful technique, I believe, is locating individuals based on their writings: we can spider writings, locate good people, and then meet with them.
  • Actively spider for sites of interest.
  • Spider Amazon books and pipe the results to OCR.
  • From peer-to-peer networks, make a correspondence between real metadata and file names to learn what the actual data is, then search for new data. Also spider on the basis of known and recommended titles.
  • Hilariously enough, my classification system gave /spider.pl http://www.frdcsa.org/frdcsa/internal/index.html an 88% chance of being a project and an 11% chance of being other, whereas it gave the external projects a solid 100% chance. This suggests that automatic techniques are useful for determining the content of pages. We should therefore use them, for example LSA, to improve the quality of our writing.
  • Some of the things we ought to do: for our FRDCSA release on the web, we should spider to download a lot of software and populate the archive that way, taking software from our different sources. The first version should simply contain the FRDCSA tools and not mention packaging all of this software. Then, when we have more money and are further along, we can release all of the packages and sell it again, perhaps giving a discount to those who purchased the first disk.
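
The notes above about spidering sites that use "next" and ordering the pages of a hyperlinked book could be approached roughly as follows. This is a minimal sketch using only the Python standard library, not CLEAR's implementation: the names NextLinkFinder and follow_next are hypothetical, and it assumes that "next" links are marked either by rel="next" or by link text beginning with "next".

```python
from html.parser import HTMLParser
from typing import Callable, List, Optional
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Record the href of the first <a> whose rel or text says "next"."""

    def __init__(self) -> None:
        super().__init__()
        self.next_href: Optional[str] = None
        self._open_href: Optional[str] = None  # href of the <a> being read
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        self._open_href = attr.get("href")
        self._text = []
        rel = (attr.get("rel") or "").lower().split()
        if "next" in rel and self.next_href is None:
            self.next_href = self._open_href

    def handle_data(self, data):
        if self._open_href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self._open_href is None:
            return
        label = "".join(self._text).strip().lower()
        if label.startswith("next") and self.next_href is None:
            self.next_href = self._open_href
        self._open_href = None


def follow_next(start_url: str, fetch: Callable[[str], str],
                max_pages: int = 50) -> List[str]:
    """Return page URLs in reading order by following "next" links."""
    order: List[str] = []
    seen = set()
    url: Optional[str] = start_url
    # Stop on a missing link, a loop back to a seen page, or the page cap.
    while url and url not in seen and len(order) < max_pages:
        seen.add(url)
        order.append(url)
        finder = NextLinkFinder()
        finder.feed(fetch(url))
        url = urljoin(url, finder.next_href) if finder.next_href else None
    return order
```

The fetch argument is any function that maps a URL to its HTML text, so the traversal can be exercised on an in-memory dictionary of pages before being pointed at a real site. The max_pages cap and the seen set keep a misbehaving site from causing an endless crawl.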

This page is part of the FWeb package.
Last updated Sat Oct 26 16:47:07 EDT 2019.