Obtaining Smarter OCR to Drive your Searches

Simple text files are the backbone of every search index created within a document review platform. When a reviewer executes a search, the results are only as good as those text files, so the text that drives your searches must be reliable. While the overwhelming majority of document review involves electronic data from which text and metadata can be extracted directly, Optical Character Recognition (OCR) is still used when these text sources are not available. Through OCR processing, the recognizable text in each image file is extracted and placed in an individual text file corresponding to the source file.

Specific instances where OCR is needed include:

  • Paper documents scanned to TIF, PDF or JPG files for inclusion in productions.
  • Image files contained within processed data collections that do not have extractable text.
  • Images that have been redacted due to the presence of privileged, confidential and/or proprietary content.

While these documents may represent a small portion of your collection, they may still contain information that is critical to the success of your review. For that reason, it is important to consider the many factors that go into creating reliable OCR text files and what you can do to ensure you get the results you expect.

The output will only be as good as the input, so here are some things to consider when appraising a potential OCR project:

  • 300 DPI is considered the industry standard and is indeed court mandated in many cases, so it is worthwhile to confirm that your input image files meet this threshold. Doing so aids both the efficiency and the accuracy of the OCR.
  • For best results, images should be black & white with minimal blemishes. Most OCR programs can “despeckle” and/or “deskew” images, but this slows processing considerably, and distortions within the text itself cannot be fixed, which results in altered text, additional characters, etc.
  • The OCR engine can also recognize and resolve page orientation, tables and graphs, but the speed and cleanliness of the output text files will be negatively impacted.
  • Handwritten documents, notes, and/or marginalia will not be recognized by OCR. Manual transcription will be necessary for these documents to be included in your index.
  • If your collection contains foreign languages, consider having those languages detected prior to processing. Tasking the OCR engine with recognizing several languages simultaneously will increase processing time and reduce accuracy. Knowing the languages present in your collection will help you avoid these potential issues.
  • Depending on all of the factors enumerated above, OCR cleanup may be an option to consider. This is a manual process where a technician will inspect each text file, compare it to the original image and edit where possible and/or necessary. Since this is a labor-intensive task, cost can be a factor.
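
The considerations above can be turned into a simple pre-OCR triage pass. The following is only a minimal sketch: the PageInfo record, its field names, and the appraise function are hypothetical, and it assumes the resolution, color mode, languages and handwriting flag have already been gathered by some other means (for example, during scanning or processing).

```python
from dataclasses import dataclass, field


@dataclass
class PageInfo:
    """Hypothetical per-image record; all fields are assumed to be known in advance."""
    name: str
    dpi: int
    color_mode: str                      # "bw", "gray" or "color" (illustrative values)
    languages: list = field(default_factory=list)
    handwritten: bool = False


def appraise(page, min_dpi=300):
    """Return a list of warnings for one image, following the checklist above."""
    warnings = []
    if page.dpi < min_dpi:
        warnings.append(f"resolution {page.dpi} DPI is below {min_dpi} DPI")
    if page.color_mode != "bw":
        warnings.append("image is not black & white")
    if len(page.languages) > 1:
        warnings.append("multiple languages present; expect slower, less accurate OCR")
    if page.handwritten:
        warnings.append("handwriting present; manual transcription may be needed")
    return warnings


if __name__ == "__main__":
    pages = [
        PageInfo("contract_001.tif", 300, "bw", ["en"]),
        PageInfo("memo_014.jpg", 200, "color", ["en", "de"], handwritten=True),
    ]
    for page in pages:
        issues = appraise(page)
        print(page.name, "OK" if not issues else "; ".join(issues))
```

Images that come back with warnings are natural candidates for rescanning, language-specific processing, or the manual cleanup described above.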

Keep all of these factors in mind, because each one affects the accuracy of the OCR text. Since that accuracy can vary, as a reviewer you will want to consider using “fuzzy” searching, which returns near hits as well as exact matches, so that potentially important information is not excluded from your searches.
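
To illustrate why fuzzy searching helps, here is a minimal sketch in Python using the standard library's difflib module. It is not how any particular review platform implements fuzzy search. The fuzzy_hits function, the 0.8 threshold, and the sample text are assumptions chosen for illustration. The sample contains a common OCR error, where the letter "m" has been read as "rn".

```python
import re
from difflib import SequenceMatcher


def fuzzy_hits(text, term, threshold=0.8):
    """Return (word, score) pairs for words in text that approximately match term.

    The score is difflib's similarity ratio between 0 and 1; the threshold
    is an illustrative default, not a recommended setting.
    """
    term = term.lower()
    hits = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        score = SequenceMatcher(None, term, word.lower()).ratio()
        if score >= threshold:
            hits.append((word, round(score, 2)))
    return hits


# Sample OCR output in which "indemnification" was misread.
ocr_text = "The indernnification clause was reviewed by counsel."

if __name__ == "__main__":
    print("exact match found:", "indemnification" in ocr_text)
    print("fuzzy hits:", fuzzy_hits(ocr_text, "indemnification"))
```

An exact search for "indemnification" misses this document entirely, while the fuzzy comparison still surfaces the misread word. Lowering the threshold returns more near hits at the cost of more false positives.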
