Workhorse

Workhorse is a system for setting up dedicated servers that produce tagged, analyzed and understood texts and support other linguistic research. For datasets, we have Wikipedia, Gutenberg, and hopefully full-text books from Google Books, all appropriately licensed. We aim to develop a highly annotated, freely available corpus of marked-up texts that have been processed with a wide variety of state-of-the-art systems. We also aim to apply natural language understanding, knowledge base population, and other techniques to the texts to derive useful knowledge.


  • Our system should pass all downloaded and visited webpages through an analysis system, probably by way of a Squid proxy. The analysis can check for things like repo links, formalize the text knowledge using workhorse, etc.
  • Workhorse should understand the difference between a sentence, a paragraph, etc. This information should come from iaec/universal-parser.
  • We need to develop tools to manage the execution of the NLP systems for workhorse more easily.
  • Edit workhorse to save the results of the analysis, not just the KNext output.
  • Develop a tool for recording our decision making on different problems. For instance, in trying to determine where to put the new aloysius system, I ruled out a merger with services.frdcsa.org because that system is currently running on justin, which is insecure. Actually, though, I could move it to workhorse if I could find a way to route it.
  • Use puck with workhorse.
  • Troubleshoot the way any throughput to the workhorse computer messes up the rest of the internet.
  • For workhorse, use GATE as the corpus manager, etc.
  • Buy hard drives and add them to ai.frdcsa.org, workhorse.frdcsa.org and node
  • Set up FWeb2 to display datasets for workhorse, such as KNext processed texts, etc.
  • For the workhorse system: http://www2003.org/cdrom/papers/refereed/p831/p831-dill.html
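
Two of the items above (knowing sentences from paragraphs, and saving the results of every analysis rather than just the KNext output) could be sketched roughly as follows. This is only a minimal illustration, not the workhorse implementation: the splitting here is a naive stand-in for whatever boundaries iaec/universal-parser would supply, and the function names, the JSON layout and the `analyzers` mapping are all hypothetical.

```python
import json
import re
from pathlib import Path


def split_paragraphs(text):
    """Split on blank lines; a stand-in for real parser output (assumption)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    flat = " ".join(paragraph.split())
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def build_record(doc_id, text, analyzers):
    """Run every analyzer on every sentence and keep all of the results.

    `analyzers` maps a hypothetical system name to a callable that takes
    a sentence string and returns something JSON-serializable.
    """
    paragraphs = []
    for p_index, paragraph in enumerate(split_paragraphs(text)):
        sentences = []
        for s_index, sentence in enumerate(split_sentences(paragraph)):
            results = {name: fn(sentence) for name, fn in analyzers.items()}
            sentences.append(
                {"index": s_index, "text": sentence, "results": results}
            )
        paragraphs.append({"index": p_index, "sentences": sentences})
    return {"doc_id": doc_id, "paragraphs": paragraphs}


def save_record(record, out_dir):
    """Write one JSON file per document and return its path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{record['doc_id']}.json"
    path.write_text(
        json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return path
```

A call such as `build_record("demo", text, {"token_count": lambda s: len(s.split())})` followed by `save_record(record, "results")` would leave one file per document, with each analyzer's output stored next to the sentence it came from. Real analyzers (KNext and the others) would replace the toy `token_count` callable.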

