Corpus

Architecture Diagram: GIF
Code: GitHub

Project Description

Corpus will serve as the automatic classification system for UniLang, which is necessary for achieving the desired capability of automatic message routing. The concept of Adjustable Autonomy is relevant here.

Corpus now has a reasonable UI and is classifying messages with reasonable accuracy. We are using rainbow, a Bayesian text classifier. The results are surprisingly good considering how little information would appear to be present in the sentences. However, they are not sufficient. While the classifier usually chooses the correct category, the error rate is still too high, and disambiguating some of the weaker classes will require extra information. Therefore, I am looking to incorporate other sources of classification evidence, based on features recognized by other external codebases.
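
To make the ranking step concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. It is a generic illustration, not rainbow's implementation and not the Corpus code: the NaiveBayes class, the tokenize helper, and the toy training messages are all assumptions made for this example. Only the class labels are borrowed from the scheme on this page.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # messages seen per class
        self.word_counts = defaultdict(Counter)  # per-class word frequencies
        self.vocab = set()

    def train(self, text, label):
        self.class_docs[label] += 1
        for tok in tokenize(text):
            self.word_counts[label][tok] += 1
            self.vocab.add(tok)

    def rank(self, text):
        """Return (label, probability) pairs, most probable first."""
        total_docs = sum(self.class_docs.values())
        tokens = tokenize(text)
        log_scores = {}
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            log_scores[label] = score
        # Normalize in log space so the probabilities sum to 1.
        peak = max(log_scores.values())
        weights = {k: math.exp(v - peak) for k, v in log_scores.items()}
        z = sum(weights.values())
        return sorted(((k, w / z) for k, w in weights.items()),
                      key=lambda kv: -kv[1])


# Hypothetical training messages, invented for this sketch.
TOY_TRAINING = [
    ("Need to pick up groceries today", "verber-task-definition"),
    ("Need to go fix the router ASAP", "verber-task-definition"),
    ("Forgot my keys at home", "observation"),
    ("The server room is cold", "observation"),
    ("rainbow should be wrapped as a module", "icodebase-capability-request"),
]

nb = NaiveBayes()
for text, label in TOY_TRAINING:
    nb.train(text, label)

for label, prob in nb.rank("Forgot my keys"):
    print(f"{label:>40}\t{prob:.6f}")
```

With so few words per message, a single shared token can decide the ranking, which is one reason additional evidence sources are worth combining with the Bayesian score.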

Other features that will be added are as follows: the ability to vet the automatic classifications; a type system; the ability for recipient agents to reject messages, which will help with classification; and mass verification and classification adjustment, followed by message reclassification.

The example below shows a very preliminary classification and the current scheme (ranked by the probability associated with the example message). Note that the classification is exactly correct. The scheme system will be greatly revamped to allow a subsumption hierarchy, and it will focus more on what the actual routing commands are. For instance, rather than "goal", we would have "(Agent: Verber) (new-goal $1)", and rather than just "icodebase-capability-request", we would have "(Agent: MyFRDCSA) (capability-request Verber $1)", i.e., the responsible agent and the corresponding command to be sent.

(((Forgot to pick up pay check - need to go pick that ASAP.)))

                             observation	0.441955
                  verber-task-definition	0.244548
                       complex-statement	0.118441

  0) Finished
* 1) observation
* 2) verber-task-definition
  3) complex-statement
  4) icodebase-solution-to-extant-problem
  5) icodebase-capability-request
  6) event
  7) icodebase-input-data
  8) dream
  9) solution-to-extant-problem
  10) system-request
  11) policy
  12) priority-shift
  13) quote
  14) unclassifiable
  15) intersystem-relation
  16) SOP
  17) funny-annecdote
  18) unilang-client-outgoing-message
  19) goal
  20) icodebase-task
  21) suspicion
  22) not-a-unilang-client-entry
  23) dangling-clause
  24) capability-request
  25) rant
  26) icodebase-resource
  27) propaganda
  28) inspiring-annecdote
  29) shopping-list-item
>
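
The revamped scheme described before this example, in which each class names a responsible agent and a command template, could be sketched roughly as follows. The two routes are the ones quoted in the text. Everything else is an assumption for illustration: the PARENT table, the "long-term-goal" class (which is not in the current scheme), and the quote and route helpers.

```python
# Routes from the two examples in the text: class -> (agent, command template).
ROUTES = {
    "goal": ("Verber", "(new-goal $1)"),
    "icodebase-capability-request": ("MyFRDCSA", "(capability-request Verber $1)"),
}

# Hypothetical subsumption links (child -> parent); not from the source.
PARENT = {
    "long-term-goal": "goal",
}


def quote(message):
    """Wrap the message in double quotes, escaping backslashes and quotes."""
    escaped = message.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def route(label, message):
    """Walk up the hierarchy until a class with a route is found.

    Returns (agent, command), or None when no ancestor has a route.
    """
    seen = set()
    while label is not None and label not in seen:
        seen.add(label)  # guard against accidental cycles in PARENT
        if label in ROUTES:
            agent, template = ROUTES[label]
            return agent, template.replace("$1", quote(message))
        label = PARENT.get(label)
    return None


print(route("goal", "finish corpus"))
print(route("long-term-goal", "port corpus to SQL"))
print(route("rant", "no route for this class"))
```

Falling back to the parent class means a newly added subclass is routable before anyone writes a command for it, which fits the idea of a subsumption hierarchy.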
    

Capabilities

  • Add to corpus the ability to print date markers and such when listing recent unilang messages.
  • For workhorse, use GATE as the corpus manager, etc.
  • Create a corpus mode and have a key for listing recently entered unilang entries.
  • Review the corpus log.
    ("pse-has-property" "106568" "habitual")
  • Scrape a resume corpus from the web, and use it to construct a skills list, and then figure out which skills I possess and add them to my resume.
  • Scrape a resume corpus from craigslist
  • With the cleaned wiki: make the wiki dictionary: make a corpus for various things: make a skillset detector for people based on which words they use: etc.
  • Use a corpus based method to recognize common terms and eliminate them from consideration in Capability::TextAnalysis
  • Check back for a researcher account at the internet archive for processing a large entity corpus - http://www.archive.org/web/researcher/researcher.php
    ("due-date-for-entry" "105528" "3 months")
  • Fix the problem with corpus not responding.
  • Fixed a problem with corpus not responding.
  • Figured out why corpus is not sending responses - it is because C-cpq[0-9] is causing corpus to send a response to it and it's not reading it...
    ("comment" "103930" "maybe")
  • Figure out why there have been delays lately with corpus
  • corpus obviously needs to take advantage of Weka's text mining capabilities!
  • Should use WSD on corpus entries and also run ExtractAbbrev.java on them to get abbreviations like Emacs unilang Agent, run this through termios, and use that to bootstrap translation. Also do anaphora resolution by running it on chunks of aligned texts.
  • Now all we have to do is fix that data and worry about complexity/efficiency issues with corpus doing autoclassification. We also have to add more elegant classification than bayes, as well as talking amongst agents to support better categorization and command execution from our notes.
  • Get a handle on this corpus problem pretty soon. Come up with some quantification for it.
  • corpus can analyze temporal semantics of things like "tomorrow" and use that in relation to the assertion date.
  • Should have the corpus classifier allow multiple schemes for things like classified, non-classified (since if you are only thinking about 1 scheme you might forget these.)
  • It is recording the messages sent back from corpus to unilang-Client, not necessarily relevant.
  • /var/lib/myfrdcsa/codebases/internal/unilang/start audience broker clear corpus cso ELog OpenCyc pse unilang-Client
  • Delete all the incorrect classifications in the sql log when corpus classifier is working.
  • For corpus, there is actually work from AI in belief integration in the Patrick ?Wilson? book
  • corpus can use a similarity weighting based on similarity and use that as evidence of related classifications.
  • Here is a good idea - use existing bug databases found on the internet as a labelled training corpus for the corpus classifier.
  • corpus should know the greater context of statements and thus have access to writings and estimates of when they were made.
  • Let us consider the corpus problem to be multistrategy and complex, and therefore in need of being worked on.
  • corpus can have modules for some of the more complex systems.
  • When corpus classifies texts, it can use the same classifiers, duh.
  • Obviously the classifications from corpus depend on the current state.
  • Slipper for corpus?
  • Could use record linkage detection techniques for corpus as well as for Sorcerer
  • TimBl for corpus?
  • Fex for corpus.../
  • Calculate various statistics on the resume corpus, for instance, how many skills are listed on average, etc.
  • Humorous facts, I have written so many systems now that there are name conflicts between them, for instance "Cyc-Mode" and "corpus manager" are both abbreviated cm.
  • Great - corpus is working well, if a bit slow.
  • Obviously corpus needs to be ported to SQL.
  • (Here is the todo for unilang/corpus/pse)
  • Add duplicate detection to corpus feedback for unilang.
  • Should complete a basic schedule system manually, using corpus to look things up.
  • Just use the language generation feature to create a translation corpus
  • Create our own annotated fallacy corpus.
  • Prototype various nlp applications wrt corpus.
  • Maybe have a template based pattern extractor for corpus events.
  • Convert the GigaWord corpus into a server (if it even runs, I may need more memory... or just load a smaller portion of it.)
  • Setup centralized documentation corpus so that we can illustrate what documents we need to connect.
  • Can create a new perllib::TimeSeries::Segment using corpus::TDT
  • Refactor ASConsole::corpus::DBMail to use ASConsole::corpus::IMAP as a base class.
  • /var/lib/myfrdcsa/codebases/internal/unilang/start corpus ELog unilang-Client
  • Related http://research.microsoft.com/research/nlp/msr_paraphrase.htm to corpus matching.
  • corpus can determine when there is not enough information available for a given classifier to classify an Item, in which case it does various checks, defaulting to asking the user.
  • Need to use corpus to classify email/aim logs for what to do with them.
  • Could classify corpus entries along emotional lines.
  • This thing I've made for Sorcerer sure looks useful, it is being used in Sorcerer, broker, should be used in job-search, and could be used in busroute, clear, corpus, critic, cso, digilib, and verber.
  • /var/lib/myfrdcsa/codebases/internal/unilang/start audience broker clear corpus cso ELog manager OpenCyc pse unilang-Client
  • Code monkey ought to have a corpus of examples for learning of error messages, etc, all marked by their full environmental context.
  • Perhaps have the radar jump corpus -s feature not show up upon recent repeated jumps to the same place
  • Find and extract an enormous text corpus
    ("completed" "81747")
  • do more Web as corpus searches.
  • This is a test for corpus.
  • Stop displaying these corpus Messages.
    ("completed" "66558")
  • Fix problem with corpus -k stopping if you query KBS while it is running.
  • KBS, MySQL:freekbs:default assert ("comment" "47827" "it is currently corpus -s (search)")
  • Get corpus Knowledge View mode working.
    ("completed" "61728")
    ("depends" "61728" "61722")
  • Analyze the beginning of corpus messages for common themes like "Need to " or "Should ..." etc.
  • Fix corpus Router.
  • Maybe want to load unilang entries into an information retrieval engine instead of corpus::Sources.pm
  • Should corpus be renamed since it is so unilang specific?
  • The corpus system for routing unilang messages can use the thinker/notes system.
  • Need to get something similar to corpus::TDT::GetEntries for working with unilang entries.
  • Create a corpus of version strings and file names from sourceforge, and use that to test the version string processing and comparison functions.
  • Maybe corpus should take the perspective that it is being talked to, and look at it that way.
  • Perhaps corpus should tag using freekbs
  • corpus can learn which item to send a particular command by analyzing the messages that start with any of the agents' names followed by a comma.
  • Fix problem with corpus.
  • Add a feature to corpus or whatever, that is able to index various "requirements" like files, and then destroy them.
  • Come up with corpus of items for Mike Stevens to look at.
  • corpus should be sure to do things linearly. That way, when manually classifying things, we can assert a "continuation" between entries if they represent the same topic.
  • Use sentence splitting with corpus.
  • using kissinger corpus, formalize domains they are discussing, and represent communication actions formally, classify them, etc.
  • I realized that the same system that is holding up both corpus and RSR is also holding up gourmet. An ontology editor of sorts.
  • Check whether the corpus of typing lessons for trr is large.
  • This would be as a means to disambiguate items. For instance - "create capabilities management system" - is this done? Well, when corpus is done, it will be, but how does pse know that necessarily.
  • One possible thing to do is write something for unilang that interfaces with the corpus classifier.
  • corpus should probably use ILP tools to learn more rules for better classification
  • Now, we need to finish corpus and actually have planning working.
  • We can record a corpus of bus questions, etc, from pedestrians for use in determining additional busroute functionalities
  • A quick corpus thing would be a command like corpus --listall task
  • The overview of the corpus algorithm is too slow and inaccurate.
  • Areas for improvement with corpus. It's quite clear. The procedure I came up with isn't efficient, but it is easily adaptable to an efficient procedure. Three things: first, don't do n^2 comparisons; rather, only look at reasonably related documents, which will cut down dramatically on run time. Second, make the sharedterm code more efficient. Lastly, profile to see where all the time is being spent.
  • Use Math::PartialOrdering for subsumption hierarchies for corpus and possibly gourmet
  • Demonstrate corpus to Brian.
  • corpus might well take into consideration that misspellings tend to occur just before the user hits return, since often they don't check.
  • Should configure kbfs to start commenting on files. For instance - konik_laird_ilp2004.pdf is related to corpus.
  • Only after building OCR corpus.
  • However now I'm too tired to do this. So I need to do that and other related things, like build a better critic::Classifier type system for corpus, tomorrow. But I must also eat tomorrow.
  • I believe I should get something preliminary going for pse pretty soon - as exported from corpus, for now. So that we can begin to get an agenda in place. From this agenda, and from the interest mapping system, we can start getting activities going.
  • In lieu of a complete solution, we can simply manually verify corpus auto classification results at the end of each cycle.
  • Features to add to corpus - need to add system to determine whether a property was manually or automatically selected.
  • If it's not obvious, unilang will be using corpus to classify the user's entries.
  • Wow, I got corpus working, sort of. That's good news. It is now classifying unilang log messages, and doing a rather good job.
  • Need corpus to list items by class.
  • corpus needs to also store its classes.
  • corpus needs a feature where it can automatically handle complex statements, as well as automatic classification, and lastly, a measure of when something is done being classified. It should also allow reclassification using inherent distinctions present after the addition of new classes.
  • corpus needs to allow renaming of classes.
  • corpus needs to be easier to use (i.e. not have that windowing problem.)
  • Maybe some explanation of the distinction between classes in corpus would be useful.
  • corpus must first chunk, then classify.
  • Okay, FEX can be used with corpus.
  • Maybe corpus could handle formalization of everything - from verber and pse entries to?
  • Come up with a set of targets for corpus. For instance -
  • Other important thinking, I cannot begin work on certain projects, for instance, the meetup client, until corpus is finished. We can represent this as an HTN in opencyc.
  • Also need unilang or corpus to record which messages have already been addressed.
  • It would be nice if corpus classified various types of messages, and came up with a standard way of saying them, then a LM could be created.
  • corpus can concatenate two entries in a row and see how much sense it makes as a test when a connection is suspected.
  • You can use typing speed to (help) classify related thoughts in corpus.
  • I can't wait to get working on mapping corpus events to actions that require verber actions. It would also be nice to move on the area of integration with Cyc, supposing I can ever figure out how to work log files.
  • Of course, make sure to use Conversation perl module in corpus analysis of unilang.xml
  • make a corpus agent that allows interactive classification of existing entries and sends these to the respective clients over unilang.
  • So this command line can be used to satisfy goals. perhaps a mechanism can be implemented which performs a factual query of the corpus to determine which goals a given fact/assertion/literal satisfies or breaks.
  • And then I can use an interactive classifier, perhaps modify doit, to create a corpus of actionable items, and then begin a system to automatically translate these into Cyc planning operations, and interact with Cyc.
  • progress on the thoughts corpus will be slow until the mess with unilang is sorted out.
  • I have written two systems just now. One extracts sources from apt-get.org, the other gathers a corpus of my thoughts files together
  • come up with a standard file format to describe a unilang corpus
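
One of the items above suggests avoiding the n^2 comparison by only looking at reasonably related documents, and another suggests eliminating common terms from consideration. A minimal sketch that combines the two ideas with an inverted index follows. The candidate_pairs function, the max_df cutoff, and the sample entries are assumptions for illustration; the existing sharedterm code is not shown on this page and may work differently.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(docs, max_df=0.5):
    """Return pairs of document ids that share at least one informative term.

    docs maps a document id to its text. Terms that occur in more than
    max_df of the documents are treated as too common to be evidence of
    relatedness (the floor of 2 keeps tiny collections usable).
    """
    index = defaultdict(set)  # term -> ids of documents containing it
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)

    limit = max(2, int(max_df * len(docs)))
    pairs = set()
    for ids in index.values():
        if len(ids) > limit:
            continue  # skip overly common terms
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical entries, invented for this sketch.
entries = {
    1: "fix corpus router",
    2: "corpus router is slow",
    3: "buy milk tomorrow",
    4: "scrape resume",
}
print(sorted(candidate_pairs(entries)))
```

Only the returned pairs would then be passed to the more expensive similarity computation, so the cost depends on how many entries actually share vocabulary rather than on the square of the collection size.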


This page is part of the FWeb package.
Last updated Sat Oct 26 16:51:00 EDT 2019 .