FRDCSA | internal codebases | OCRA




Project Description

ocra consists of a document corpus, built by searching Google for PDF files and excluding those that do not contain embedded images. We then run each recognizer (both free and commercial) on this corpus and compare the results. We try to align the sentences of the recognizers' outputs and run edit distance on the individual words, which yields a probability distribution: the probability that a given word is recognized as another word. We combine this scorer with a forest ranker operating on a statistical language model of the entire text corpus.
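
The alignment and scoring step described above can be illustrated with a minimal Python sketch. It is not ocra's actual code: the function names (edit_distance, align_words, confusion_distribution), the max_distance threshold, and the use of the standard library's difflib for word alignment are all assumptions made for illustration. It pairs up words from two transcriptions of the same sentence, keeps pairs that are within a small edit distance of each other, and normalizes the counts into a per-word distribution.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def align_words(ref_words, hyp_words):
    """Pair up words from two transcriptions of the same sentence.

    Assumption: difflib's opcodes stand in for whatever aligner is really used.
    Only equal spans and same-length replaced spans are paired; insertions and
    deletions are skipped.
    """
    pairs = []
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            pairs.extend(zip(ref_words[i1:i2], hyp_words[j1:j2]))
    return pairs


def confusion_distribution(sentence_pairs, max_distance=2):
    """Estimate P(hypothesis word | reference word) from aligned sentence pairs.

    max_distance is a hypothetical cutoff: pairs further apart than this are
    treated as misalignments rather than recognition errors.
    """
    counts = defaultdict(Counter)
    for ref_sentence, hyp_sentence in sentence_pairs:
        for ref_word, hyp_word in align_words(ref_sentence.split(),
                                              hyp_sentence.split()):
            if edit_distance(ref_word, hyp_word) <= max_distance:
                counts[ref_word][hyp_word] += 1
    return {
        ref_word: {hyp: n / sum(seen.values()) for hyp, n in seen.items()}
        for ref_word, seen in counts.items()
    }


if __name__ == "__main__":
    # Made-up transcriptions, for illustration only.
    pairs = [
        ("the quick brown fox", "the qu1ck brown fox"),
        ("the quick dog", "the quick dog"),
    ]
    for word, dist in confusion_distribution(pairs).items():
        print(word, dist)
```

In a real comparison, the "reference" side would have to be chosen somehow, for example one recognizer's output or a consensus across recognizers; the sketch leaves that choice open.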


  • ocra should draw on Tesseract and on Google's newer work on cleaning up recognition results using language models (LMs).

This page is part of the FWeb package.
Last updated Sat Oct 26 16:53:58 EDT 2019.