I am sending you a little example of what I've been working on today. It is a system for natural language understanding (implementing tasks like question answering, recognizing textual entailment, etc.). This complements the other tools I already have: for instance, for Question Answering we already have QUAC (OpenEphyra, Aranea), and for Recognizing Textual Entailment we already have the Stanford RTE system. But what I am working on today are tools that are not only open source, but whose internals we also have full access to.

I have a system called Capability::TextAnalysis. It takes the following text (I'm using a short one here just because the intercomputer communication is not finished, and one essential service (Enju) has only been successfully installed on my 32-bit machine):

"This is the first time I have tried this. I wonder how well it will work. Hopefully, well."

Now, here are the results of the Capability::TextAnalysis module:

--------------------------------------------------------------------------------
andrewdo@justin:/var/lib/myfrdcsa/codebases/internal/formalize/scripts$ ./test-capability-text-analysis.pl
$VAR1 = {
          'CoreferenceResolution' => 1,
          'SemanticAnnotation' => 1,
          'TermExtraction' => 1,
          'NounPhraseExtraction' => 1,
          'DateExtraction' => 1,
          'Tokenization' => 1
        };
Doing SemanticAnnotation
Initializing SemanticAnnotation
Retrieving result from cache
Doing NounPhraseExtraction
Initializing NounPhraseExtraction
Retrieving result from cache
Doing DateExtraction
Initializing DateExtraction
Retrieving result from cache
Doing CoreferenceResolution
Initializing CoreferenceResolution
Computing result and adding to cache
Doing TermExtraction
Initializing TermExtraction
Retrieving result from cache
Doing Tokenization
Initializing Tokenization
Retrieving result from cache
$VAR1 = {
          'CoreferenceResolution' => [
            {
              'Ids' => {
                'set_1' => { 'This' => 1, 'this' => 1 },
                'set_0' => { 'I' => 2 }
              },
              'String' => [ '<<>>', 'is', 'the', 'first', 'time', '<<>>', 'have', 'tried', '<<>>', '.', '<<>>', 'wonder', 'how', 'well', 'it', 'will', 'work', '.', 'Hopefully', ',', 'well', '.' ]
            }
          ],
          'SemanticAnnotation' => [
            {
              'CalaisSimpleOutputFormat' => {},
              'Description' => {
                'docDate' => '2009-10-20 21:22:53.593',
                'externalID' => 'testing',
                'externalMetadata' => {},
                'allowDistribution' => 'true',
                'allowSearch' => 'true',
                'docTitle' => {},
                'id' => 'http://id.opencalais.com/2qN2uHitGhWQOoxFCLakKg',
                'about' => 'http://d.opencalais.com/dochash-1/0f786371-90c2-3af6-b178-384a64f0abd0',
                'calaisRequestID' => '68d3dc26-2478-e7c1-1247-4e76a5b68072',
                'submitter' => 'FRDCSA'
              }
            }
          ],
          'TermExtraction' => [ [] ],
          'NounPhraseExtraction' => [ 'first time', 1, 'hopefully', 1, 'time', 1 ],
          'Tokenization' => [ 'This is the first time I have tried this . I wonder how well it will work . Hopefully , well . ' ],
          'DateExtraction' => [ ' This is the first time I have tried this . I wonder how well it will work . Hopefully , well . ' ]
        };
--------------------------------------------------------------------------------

Well, this textual analysis is not really sufficient by itself. What I am working on now is integrating all of those results into a central system. I have written something called FreeLogicForm, which converts text into something called Logic Forms. I've also implemented Nested Formulas in FreeKBS today (although I still need to finish Resolution Style Theorem Proving), and this allows me to take the results from the logic form and assert them into the FreeKBS Knowledge Base.

"This is the first time I have tried this. I wonder how well it will work. Hopefully, well."
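The log above alternates between "Retrieving result from cache" and "Computing result and adding to cache". As a rough illustration of that cache-or-compute pattern (a Python sketch, not the actual Capability::TextAnalysis internals; all names here are hypothetical):

```python
import hashlib

class Capability:
    """One analysis capability (e.g. Tokenization) with a per-input result cache."""

    def __init__(self, name, analyze):
        self.name = name
        self.analyze = analyze      # function: text -> result
        self.cache = {}             # keyed by a hash of the input text

    def run(self, text):
        key = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if key in self.cache:
            print(f"Doing {self.name}\nRetrieving result from cache")
            return self.cache[key]
        print(f"Doing {self.name}\nComputing result and adding to cache")
        result = self.analyze(text)
        self.cache[key] = result
        return result

# A trivial stand-in analyzer: whitespace tokenization.
tokenization = Capability("Tokenization", lambda t: t.split())

text = "This is the first time I have tried this ."
first = tokenization.run(text)    # computed and added to the cache
second = tokenization.run(text)   # served from the cache
```

The point is only that each wrapper computes a result once per input and reuses it afterward, which is why five of the six capabilities above came back from the cache.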
("and" ("this" ?x1) ("be" ?e4 ?x1 ?e6) ("first" ?e6) ("time" ?e6) ("I" ?x2) ("have" ?e5 ?x2 ?e6) ("try" ?e6 ?x2 ?x3) ("this" ?x3))

Sending: KBS, MySQL2:freekbs2:default assert ("and" ("this" ?x1) ("be" ?e4 ?x1 ?e6) ("first" ?e6) ("time" ?e6) ("I" ?x2) ("have" ?e5 ?x2 ?e6) ("try" ?e6 ?x2 ?x3) ("this" ?x3)).

("and" ("I" ?x1) ("wonder" ?e3 ?x1 ?e4) ("how" ?e4) ("well" ?e4) ("it" ?x2) ("work" ?e4 ?x2))

Sending: KBS, MySQL2:freekbs2:default assert ("and" ("I" ?x1) ("wonder" ?e3 ?x1 ?e4) ("how" ?e4) ("well" ?e4) ("it" ?x2) ("work" ?e4 ?x2)).

I am also going to add the following capabilities. It will of course easily combine named entities into the same unit (and have them looked up in the central terminology knowledge management system, called Termios, which is still incomplete). I will also easily add the Calais semantic-annotation named-entity classes; e.g., if Andrew Dougherty were mentioned, you would get the following added to the formula:

Andrew_Dougherty (?x1)
NP_Person (?x1)

I will add coreference resolution, so that the "it" is resolved to the same entity as the "this"; in other words, ?x3 from the first formula and ?x2 from the second will be unified to the same variable, and "it" and "this" will become a reified object, i.e.:

("reification-135325" ?x2)

Lastly, word senses will be disambiguated, so ("try" ?e6 ?x2 ?x3) will become ("try_v_1" ?e6 ?x2 ?x3), which adds additional information to the system. As more accurate WSD, Coreference, Semantic Annotation, etc., systems are released, their results will simply be integrated by adding APIs for them to the standard Capability::WSD, Capability::CoreferenceResolution, etc., wrappers.

This system will form the basis of a system which can (eventually) answer questions like the following.

Given: The laptop was put in the bookbag. Erin checked his baggage on the plane, then flew to Tallahassee.
Question: Which state is the laptop in?
Answer: Florida.
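To make the coreference and WSD rewrites concrete, here is a minimal Python sketch of the two steps just described, treating a logic form as a list of (predicate, args...) tuples. The reification id and sense label come from the text above; the helper functions themselves are hypothetical, not part of FreeLogicForm:

```python
def rename_vars(form, mapping):
    """Replace variables in a logic form (a list of tuples) per mapping."""
    return [tuple(mapping.get(t, t) for t in clause) for clause in form]

def tag_senses(form, senses):
    """Rename predicates to their disambiguated senses, e.g. try -> try_v_1."""
    return [(senses.get(clause[0], clause[0]),) + clause[1:] for clause in form]

# Abbreviated fragments of the two formulas above.
sentence1 = [("this", "?x3"), ("try", "?e6", "?x2", "?x3")]
sentence2 = [("it", "?x2"), ("work", "?e4", "?x2")]

# Coreference: ?x3 (first formula) and ?x2 (second) denote the same entity,
# so both are unified to a single variable.
unified = rename_vars(sentence1, {"?x3": "?x1"}) + \
          rename_vars(sentence2, {"?x2": "?x1"})

# The pronoun predicates are replaced by the reified object.
unified = [("reification-135325", c[1]) if c[0] in ("this", "it") else c
           for c in unified]

# WSD: tag "try" with its disambiguated verb sense.
unified = tag_senses(unified, {"try": "try_v_1"})
```

After these passes, every mention of "this"/"it" carries the same variable and the same reified object, which is exactly what lets later reasoning treat them as one entity.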
The final manual assembly of the above results yields something like this:

("and"
 ("reification-135325" ?x1)
 ("be_v_1" ?e4 ?x1 ?e6)
 ("first_time" ?e6)
 ("reification-135324" ?x2)
 ("have_v_3" ?e5 ?x2 ?e6)
 ("try_v_1" ?e6 ?x2 ?x1)
 ("reification-135325" ?x1)
 ("wonder_v_2" ?e7 ?x2 ?e8)
 ("how" ?e8)
 ("well" ?e8)
 ("reification-135325" ?x1)
 ("work" ?e8 ?x1))

Adding techniques such as theorem proving, event extraction, lexical knowledge, and training over existing corpora of story/question/answer sets will yield such a system. This is of course state of the art. And the best part is that everything here is just one narrow application of everything that we have; there are thousands of additional components to the FRDCSA.

I am working on providing all of this as a web service which can be accessed over REST/SOAP/XMLRPC, etc. Additionally, I will be integrating the Vampire theorem prover, among others, to do the reasoning for FreeKBS, because our resolution-style theorem prover would be reinventing the wheel. At least for now.

There is a set of WordNet synset to Cyc concept mappings, which we can apply to the above Logic Form output to convert it to a more ontological approach. I am very close to importing Cyc into FreeKBS. Additionally, I will support other formalisms, such as RelEx, CAndC, APE, and CELT, and try to integrate all the results into some kind of voting system (where conflicts exist), with augmentation elsewhere. The bottom line is that we should be able to formalize large extents of text. Additionally, I will represent the sentences with classifications such as Speech Act classifications, etc. All of this will feed into a contextual mechanism. With theorem proving we will be able to answer sophisticated questions about text, as well as use Sayer and the HypergraphT (or whatever from OpenCog) for various reasoning tasks.

If this all seems a little useless, that is because I haven't yet written about the motivating cases.
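Assuming the synset-to-Cyc mappings are available as a simple predicate table, applying them to the sense-tagged output could look something like the sketch below. The Cyc concept names here are made up purely for illustration; the real mapping files pair actual WordNet synsets with actual Cyc constants:

```python
# Hypothetical mapping table: WordNet sense tags -> Cyc concepts.
# These concept names are illustrative placeholders, not real Cyc constants.
SYNSET_TO_CYC = {
    "try_v_1": "Attempting",
    "work_v_2": "BehavingAsExpected",
}

def to_cyc(form):
    """Rewrite sense-tagged predicates as Cyc concepts where a mapping exists,
    leaving unmapped predicates unchanged."""
    return [(SYNSET_TO_CYC.get(c[0], c[0]),) + c[1:] for c in form]

form = [("try_v_1", "?e6", "?x2", "?x1"), ("how", "?e8")]
cyc_form = to_cyc(form)
```

Because unmapped predicates pass through untouched, the conversion can be applied incrementally as mapping coverage grows.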
The truth is that this solves wide-ranging problems. For instance, the food ontology for Gourmet can be instantly created, just by processing natural-language books on the culinary arts, as well as recipes.