CoAuthor

Brief: Automated synthesis of books from domain texts

  • In practice we have implemented various techniques that approximate the methodology described below. One such technique is the determination of text dependency information. (There happens to be a very useful Perl module, Math::PartialOrdering, for dealing with partial orderings.) We used a weighting similar to TF-IDF to extract terms that are shared mainly between two documents and few others, i.e. $shared->{$t} = 4 * ($doc->{$a}->{$t} * $doc->{$b}->{$t}) / ($tf->{$t} * $tf->{$t}). We hypothesized that if a text A is a prerequisite of a text B, the first occurrences of the shared terms would be distributed fairly evenly throughout A, whereas in B they would be concentrated toward the front. The reasoning is that if A defines the terms, it introduces them gradually, whereas B presumes that knowledge and therefore does not hesitate to use the terms immediately. Our prerequisite measure is therefore simply the difference between the average percent position of the first occurrences of shared terms in A and in B. This intentionally simplistic measure seems, at least initially, to have given usable results.
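
    To make the measure concrete, here is a minimal Perl sketch of the prerequisite computation. It assumes the texts have already been tokenized into word lists and that the shared-term set has already been selected by the TF-IDF-like weighting above; the subroutine names and data layout are our own for this illustration, not coauthor's actual code.

      use strict;
      use warnings;
      use List::Util qw(sum);

      # Average percent position (0..1) of the first occurrence of each
      # shared term within a tokenized text.
      sub avg_first_occurrence_position {
          my ($tokens, $shared_terms) = @_;
          my %first;
          for my $i (0 .. $#$tokens) {
              my $t = lc $tokens->[$i];
              $first{$t} = $i / @$tokens
                  if exists $shared_terms->{$t} && !exists $first{$t};
          }
          return 0 unless %first;
          return sum(values %first) / scalar(keys %first);
      }

      # A positive value suggests A is a prerequisite of B: the shared
      # terms first appear, on average, later (more spread out) in A
      # than in B, where they cluster near the front.
      sub prerequisite_measure {
          my ($tokens_a, $tokens_b, $shared_terms) = @_;
          return avg_first_occurrence_position($tokens_a, $shared_terms)
               - avg_first_occurrence_position($tokens_b, $shared_terms);
      }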

    This capability will be used by the clear ITS (intelligent tutoring system) to provide constraints on the reading order of large groups of documents, such as the collections in digilib. The quiz capability will be used to determine familiarity with given material, both for placement and for advancement. Consideration will also be given to the goal-based justification for reading particular texts. So one general functionality of coauthor is to write custom texts for individual learners, based on clear placement/familiarity quizzes and interaction history, over the large text collections contained in digilib. Further applications are possible, such as generating appropriate documentation (overviews, manuals, etc.) from disparate project sources, bootstrapping knowledge formation, intelligence analysis, and organizing various information sources.

    The idea of a system that automatically synthesizes books is nothing new, and we have many choices for how to proceed. We could formalize all of the writings axiomatically and then use NLG to write the book, following some method for introducing concepts. This is close to what we have done; however, we do not have tools for completely representing these concepts. Initially we have therefore chosen to perform a shallower parse of the materials, since the structures required for a shallow parse are prerequisites of those required for deeper parses.

    Initially we focus on the organization of the source materials. We would like the process to be initiated entirely by the computer, with at most a complexity-bounded set of questions asked of the author, so the pipeline is kept fully automated. Automated essay scoring techniques (e.g. LSI) are applied to determine the important sections of the text. The corpus of texts to be integrated is sorted and clustered hierarchically; sections and chapters are best-fitted, and titles are extracted using a hybrid keyphrase/title recognizer together with summarization.
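
    As a rough illustration of the clustering step, here is a minimal Perl sketch of single-linkage agglomerative clustering over cosine similarity of term-frequency vectors. The choice of single linkage and the data layout are assumptions made for this example; coauthor's actual clustering may differ.

      use strict;
      use warnings;

      # Cosine similarity between two term-frequency hash refs.
      sub cosine {
          my ($x, $y) = @_;
          my ($dot, $nx, $ny) = (0, 0, 0);
          $dot += $x->{$_} * ($y->{$_} // 0) for keys %$x;
          $nx  += $_ ** 2 for values %$x;
          $ny  += $_ ** 2 for values %$y;
          return ($nx && $ny) ? $dot / sqrt($nx * $ny) : 0;
      }

      # Single-linkage agglomerative clustering: repeatedly merge the two
      # most similar clusters until only $k remain.  %$vec maps document
      # ids to term-frequency vectors; each cluster is a list of doc ids.
      sub cluster_hierarchically {
          my ($vec, $k) = @_;
          my @clusters = map { [$_] } keys %$vec;
          while (@clusters > $k) {
              my ($bi, $bj, $best) = (0, 1, -1);
              for my $i (0 .. $#clusters - 1) {
                  for my $j ($i + 1 .. $#clusters) {
                      for my $da (@{ $clusters[$i] }) {
                          for my $db (@{ $clusters[$j] }) {
                              my $s = cosine($vec->{$da}, $vec->{$db});
                              ($bi, $bj, $best) = ($i, $j, $s) if $s > $best;
                          }
                      }
                  }
              }
              my ($merged) = splice @clusters, $bj, 1;
              push @{ $clusters[$bi] }, @$merged;
          }
          return \@clusters;
      }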

    A corpus of related materials is downloaded and weighted according to importance.
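
    For the importance weighting, one simple possibility (an assumption for this sketch, since the text does not specify a scheme) is to weight each downloaded document by its mean cosine similarity to the core source texts already in the corpus, reusing the cosine subroutine from the clustering sketch above.

      # Importance weight of a downloaded document: mean cosine similarity
      # to the core source texts ($core_vecs is a list of tf vectors).
      sub importance_weight {
          my ($doc_vec, $core_vecs) = @_;
          return 0 unless @$core_vecs;
          my $total = 0;
          $total += cosine($doc_vec, $_) for @$core_vecs;
          return $total / @$core_vecs;
      }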

    We ensure that all terminology is defined before being used, which may involve restructuring the texts. This is done by terminology extraction followed by word sense disambiguation. The author selects the appropriate senses from WordNet senses and automatically extracted definitions, using Ted Pedersen's WSD Perl modules. In this way we develop a large dictionary of concepts. We use various network centrality measures to compute the most important concepts. We then present them according to a format obtained by performing the same analysis on related texts and mapping to their formats as represented in DocBook.
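
    As one example of the centrality step, the following Perl sketch computes degree centrality over a concept co-occurrence graph, linking two concepts when they appear in the same sentence. The graph construction and the choice of degree centrality are assumptions for illustration; other measures (closeness, betweenness, eigenvector) could be substituted.

      use strict;
      use warnings;

      # Build an undirected co-occurrence graph: two concepts are linked
      # if they appear in the same sentence.  $sentences is a list of
      # sentences, each a list of already-extracted concept terms.
      sub cooccurrence_graph {
          my ($sentences) = @_;
          my %edges;
          for my $sent (@$sentences) {
              my %seen;
              $seen{$_} = 1 for @$sent;
              my @concepts = keys %seen;
              for my $i (0 .. $#concepts - 1) {
                  for my $j ($i + 1 .. $#concepts) {
                      $edges{ $concepts[$i] }{ $concepts[$j] } = 1;
                      $edges{ $concepts[$j] }{ $concepts[$i] } = 1;
                  }
              }
          }
          return \%edges;
      }

      # Degree centrality: number of distinct neighbours, normalized by
      # the maximum possible degree; returns concepts, most central first.
      sub rank_by_degree_centrality {
          my ($edges) = @_;
          my $n = scalar keys %$edges;
          my %cent;
          for my $c (keys %$edges) {
              my $degree = scalar keys %{ $edges->{$c} };
              $cent{$c} = $degree / ($n - 1 || 1);
          }
          return [ sort { $cent{$b} <=> $cent{$a} } keys %cent ];
      }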

    Some degree of reasoning must be applied in order to sort out conflicting information from these sources.

    Spelling and grammar checking is employed at this stage.

    We then employ a sentence-level statistical rewrite of the text to enhance readability. We use the HALogen NLG system with a language model trained on books and papers in the domain and weighted according to the importance of the source text. We then use NLU to map each sentence to the interlingua and regenerate it.
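
    As a schematic of the domain language model, here is a minimal Perl sketch of importance-weighted bigram counting. The actual model supplied to HALogen would be built with its own n-gram tooling, so this only illustrates the weighting idea; the record format is an assumption.

      use strict;
      use warnings;

      # Accumulate importance-weighted bigram counts from tokenized domain
      # texts.  $texts is a list of { weight => $w, tokens => [...] }
      # records; each bigram count is incremented by the source text's
      # weight, so more important sources shape the model more strongly.
      sub weighted_bigram_counts {
          my ($texts) = @_;
          my (%bigram, %context);
          for my $text (@$texts) {
              my ($w, $toks) = ($text->{weight}, $text->{tokens});
              for my $i (0 .. $#$toks - 1) {
                  $context{ $toks->[$i] }                   += $w;
                  $bigram{ $toks->[$i] }{ $toks->[$i + 1] } += $w;
              }
          }
          return (\%bigram, \%context);
      }

      # Maximum-likelihood P(next | prev) from the weighted counts
      # (unsmoothed, for simplicity of the sketch).
      sub bigram_prob {
          my ($bigram, $context, $prev, $next) = @_;
          return 0 unless $context->{$prev};
          return ($bigram->{$prev}{$next} // 0) / $context->{$prev};
      }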

    Readability analysis as well as Latent Semantic Indexing is applied at this stage.
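
    For the readability analysis, one standard score is Flesch Reading Ease; the sketch below computes it with a very rough syllable heuristic. The choice of this particular score is an assumption, since the text does not name one.

      use strict;
      use warnings;

      # Very rough syllable estimate: count runs of consecutive vowels.
      sub count_syllables {
          my ($word) = @_;
          my @groups = (lc($word) =~ /[aeiouy]+/g);
          return @groups ? scalar @groups : 1;
      }

      # Flesch Reading Ease:
      #   206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)
      # Higher scores indicate easier text.
      sub flesch_reading_ease {
          my ($text) = @_;
          my @sentences = grep { /\S/ } split /[.!?]+/, $text;
          my @words     = $text =~ /[A-Za-z']+/g;
          return 0 unless @sentences && @words;
          my $syllables = 0;
          $syllables += count_syllables($_) for @words;
          return 206.835
               - 1.015 * (@words / @sentences)
               - 84.6  * ($syllables / @words);
      }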

    coauthor should be of assistance in writing any kind of documentation (FAQs, HOWTOs, use cases, man pages, papers, etc.). It needs an interactive authoring environment that suggests completions, operating even at the rhetorical-structure level. It could also integrate with reasonbase.