FRDCSA | internal codebases | CoAuthor

[Project image]

Architecture Diagram: GIF


Project Description

In practice we have implemented various techniques that approximate the methodology mentioned below. One such technique is the determination of text dependency information. (There happens to be a very useful Perl module, Math::PartialOrdering, for dealing with partial orderings.) We used a weighting similar to TF-IDF to extract terms shared mostly between two documents and few others, i.e. $shared->{$t} = 4 * ($doc->{$a}->{$t} * $doc->{$b}->{$t}) / ($tf->{$t} * $tf->{$t}). We hypothesized that if a text A is a prerequisite of a text B, the distribution of first occurrences of shared terms in A would be more equiprobable, whereas in B they would be closer to the front. The reasoning is that if A defines the terms, it introduces them gradually, whereas B presumes knowledge of them and therefore does not hesitate to use them early. Our prerequisite measure is thus simply the difference between the average percent position of first occurrences of shared terms in A and in B. This intentionally simplistic measure seems to have given at least initially usable results.
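As a minimal sketch of the above (in Python rather than the Perl used here; the function and variable names are our own, and the tokenizer is a crude stand-in), the shared-term weighting and the prerequisite measure might look like this:

```python
import re

def tokenize(text):
    """Lowercase word tokens; a rough stand-in for real tokenization."""
    return re.findall(r"[a-z]+", text.lower())

def shared_weight(tf_a, tf_b, tf_total, term):
    """Direct transcription of the Perl weighting above: high when a term
    is frequent in both documents but rare in the corpus overall."""
    return 4 * (tf_a[term] * tf_b[term]) / (tf_total[term] ** 2)

def prerequisite_score(text_a, text_b):
    """Positive score suggests text_a is a prerequisite of text_b:
    shared terms first occur spread throughout A but near the front of B."""
    a, b = tokenize(text_a), tokenize(text_b)
    shared = set(a) & set(b)

    def mean_first_position(tokens):
        first = {}
        for i, tok in enumerate(tokens):
            if tok in shared and tok not in first:
                first[tok] = i / len(tokens)  # percent position of first occurrence
        return sum(first.values()) / len(first)

    return mean_first_position(a) - mean_first_position(b)
```

A text that introduces shared terms gradually scores as a prerequisite of one that uses them all up front.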

This capability will be used by the clear ITS to provide constraints on the reading of large groups of documents, i.e. digilib. The quiz capability will be used to determine familiarity with given material, both for placement and advancement. Consideration will be given to the goal-based justification for reading texts. So, one general functionality of coauthor is to write custom texts for individual learners, based on clear placement/familiarity quizzes and interaction history, over large text collections contained in digilib. Even more applications are possible, such as generating appropriate documentation (overviews, manuals, etc.) from disparate project sources, bootstrapping knowledge formation, intelligence analysis, or organizing various information sources.

The idea of a system that automatically synthesizes book formats is nothing new. We have many choices for how to proceed. We could formalize all writings axiomatically and then use NLG to write the book, following some method for introducing concepts. This is close to what we have done; however, we do not have tools for completely representing these concepts. Initially we have chosen to perform a shallower parse of the materials, since the structures required for a shallow parse are prerequisites of those required for deeper parses.

Initially we focus on the organization of the source materials. Since we would like the process to be initiated entirely by the computer, with at most a complexity-bounded set of questions put to the author, we ensure that the process is otherwise fully automated. Automated essay scoring techniques (LSI) are applied to determine the important sections of the text. The corpus of texts to be integrated is sorted and clustered hierarchically. Sections and chapters are best-fitted, and titles are extracted using a hybrid keyphrase/title recognizer and summarization.
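To make the hierarchical clustering step concrete, here is a small sketch (our own illustration, not the actual coauthor code) of agglomerative clustering over term-frequency vectors, merging the most cosine-similar pair until one cluster remains:

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(c * v.get(t, 0) for t, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def hierarchical_cluster(docs):
    """Agglomerative clustering: repeatedly merge the most similar pair.
    Returns a nested tuple of document indices (a simple dendrogram)."""
    clusters = [(i, tf_vector(d)) for i, d in enumerate(docs)]
    while len(clusters) > 1:
        i, j = max(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]))
        (la, va), (lb, vb) = clusters[i], clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(((la, lb), va + vb))  # merged cluster keeps summed counts
    return clusters[0][0]
```

The nested-tuple output makes the chapter/section grouping explicit: documents about the same topic end up under a common subtree.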

A corpus of related materials is downloaded and weighted according to importance.

We ensure that all terminology is defined before being used, which may involve restructuring the texts. This is done by terminology extraction followed by word sense disambiguation. The author selects among WordNet senses and automatically extracted definitions, using Ted Pedersen's WSD Perl modules. In this way we develop a large dictionary of concepts. We use various network centrality measures to compute the most important concepts. We then present them according to a format obtained by performing the same analysis on related texts and mapping to their formats as represented in DocBook.
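For the centrality step, one concrete option (a sketch under our own assumptions; the text above does not name a specific measure) is PageRank over a term co-occurrence graph, where the highest-ranked terms are treated as the most important concepts:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Simple PageRank over an undirected co-occurrence graph, given as
    {term: set of co-occurring terms}. Assumes a symmetric graph with
    no isolated nodes."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        rank = {
            n: (1 - damping) / len(nodes)
               + damping * sum(rank[m] / len(graph[m]) for m in graph[n])
            for n in nodes
        }
    return rank
```

A term that co-occurs with many other well-connected terms accumulates rank and surfaces as a core concept to define early.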

Some degree of reasoning must be applied in order to sort out conflicting information from these sources.

Spelling and grammar checking is employed at this stage.

We then employ a sentence-level statistical rewrite of the text to enhance readability. We use the Halogen NLG system with a language model trained on books and papers in the domain, weighted according to the importance of the source text. We then use NLU to map the text to the interlingua and regenerate it.
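The ranking side of such a system can be sketched as follows (our own toy illustration of n-gram reranking, not Halogen itself): train a smoothed bigram model on domain text, then pick the candidate realization it scores highest.

```python
import math
from collections import Counter

def bigram_scorer(corpus_sentences, alpha=1.0):
    """Train add-alpha-smoothed bigram counts on the domain corpus and
    return a log-probability scorer for candidate sentences."""
    bigrams, unigrams = Counter(), Counter()
    for s in corpus_sentences:
        toks = ["<s>"] + s.lower().split() + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams) + 1

    def score(sentence):
        toks = ["<s>"] + sentence.lower().split() + ["</s>"]
        return sum(math.log((bigrams[(a, b)] + alpha) /
                            (unigrams[a] + alpha * vocab))
                   for a, b in zip(toks, toks[1:]))
    return score

def best_realization(candidates, score):
    """Pick the candidate realization the language model finds most fluent."""
    return max(candidates, key=score)
```

Given several candidate regenerations of the same content, the model prefers the word order it has seen in the domain corpus.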

Readability analysis as well as Latent Semantic Indexing is applied at this stage.
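As a concrete example of the readability side (our own sketch; the original does not name a specific metric, and the syllable counter below is a rough vowel-group heuristic), the classic Flesch Reading Ease score:

```python
import re

def count_syllables(word):
    """Crude heuristic: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean easier text.
    Formula: 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Scoring a rewritten sentence before and after regeneration gives a cheap check that the statistical rewrite actually improved readability.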

coauthor should be of assistance in writing any kind of documentation (FAQs/Howtos/Use Cases/Man pages/Papers/etc.). It needs an interactive authoring environment that suggests completions, even operating at the rhetorical-structure level. Could it integrate with reasonbase?


  • coauthor should beautify and declassify and generate these paper drafts.
  • Create a publications section of fweb, from coauthor.
  • Put statistical language modelling and correction software into coauthor; that is, software that coauthor could use.
  • Could use coauthor to generate cover letters for the resume.
  • coauthor can help us write Man pages, or any kind of documentation for that matter.
  • coauthor can generate zooming descriptions of various text systems.
  • coauthor can restrict generated texts to conform to the idiom of a select group of texts, to enforce the purity of the language where so desired.
  • We can expand on the dependency extraction capabilities of coauthor by actually using definitional extraction patterns.
  • Note that we have many voys still, thanks to coauthor
  • For any text which coauthor uses to generate a book that is not by a particular author, coauthor can do the citation.
  • Need to get a centrifuser thing going for news texts - add it to coauthor
  • coauthor's dependency system can be used with clear in order to build mental models. Clearly, coauthor and clear are very closely related, which is funny since I didn't even detect that at first. We need to get the coauthor system's ability to generate dependencies between written materials working, so that I can generate a doctrinal hierarchy for people to test proficiency in.
  • Use SPADE with coauthor
  • Add stuff to coauthor to automatically copy out the current book, to label it with the title, and put it in a collection, etc.
  • only publish ideas that you've written before we started on coauthor

This page is part of the FWeb package.
Last updated Sat Oct 26 16:50:56 EDT 2019 .