Double-byte characters in PDFs causing "not well formed" XML parsing errors

Project: RUcore Workflow Management System (WMS)
Version: 7.0
Component: File Upload Module
Category: bug report
Priority: normal
Assigned: pkonin
Status: closed
Description

Hi folks,

Rhonda, Gideon, and Peter have run into an issue with the Optimality Archive Collection: XML parsing errors when uploading a born-digital PDF from which searchable XML is to be extracted.

The source files for this collection are PDF files, for which we do not and cannot obtain the source MS Office Documents. Some of the error messages are as follows:

XML Parsing Error: not well-formed
Location: http://mss3.libraries.rutgers.edu/workarea/rucore00000002165/64747/xml/ocr-1-00064747.xml
Line Number 192, Column 1: ©

XML Parsing Error: not well-formed
Location: http://mss3.libraries.rutgers.edu/workarea/rucore00000002165/64768/xml/ocr-1-00064768.xml
Line Number 467, Column 675: An OT grammar, as defined by Prince and Smolensky (1993, 4ff.), consists of two functions: Gen and Eval7 . Gen generates an exhaustive set of possible candidates for any given input, and Eval evaluates those candidates and selects as the winner the candidate which best satisfies the language’s particular ranking of the universal set of constraints. The standard method of illustrating the workings of Eval with respect to a particular input and constraint (sub-)ranking is by the use of an OT tableau, which illustrates the number of violations of constraints of each candidate of a judiciously chosen subset of the output of Gen. An example tableau is as follows: (1.23) a. b. c. d.

Location: http://mss3.libraries.rutgers.edu/workarea/rucore00000002165/64782/xml/ocr-1-00064782.xml
Line Number 594, Column 45: Class 2 Verbs Narr./ Imp III 8(8-13 ;&amp;8&8;&8;&amp;8&amp;89Indicative I 9(9;(;14

We would normally put these in software.libraries, but before we do, we checked past bugs, and it looks like we've had similar issues before:

http://software.libraries.rutgers.edu/node/1688
http://software.libraries.rutgers.edu/node/1681

We're wondering whether this is the same thing or a new issue.

Comments

#1

Peter supplied the attached PDF that causes the issue.

Isaiah stated in email on May 22:

Regarding this PDF: I suspect the problem starts on page 4, with the mora symbol (which looks like the Greek letter "mu"). While it displays correctly in a PDF viewer, the text, when cut and pasted into a UTF-8 environment, appears as: m.

There's more, though. Starting on page 10, the document discusses the language Assamese, which is an Anglicized term for a word that probably can't be displayed properly here. Again, it appears as it should in the PDF, but when transferred to UTF-8, it comes out as: " ." (You might not see anything in that space… or, you might see gibberish.) Additional words, letters, and content with similar problems are sprinkled throughout the text.

#2

Possible solution:

I have been working on this multi-byte character issue that has cropped up recently. Certain OCR outputs, when migrated to our XML-1 datastream schema, create XML parsing errors. The cause of those errors appears to be two-fold: multi-byte characters from the OCR process, and control codes in the OCR output that break the XML DOM parser.
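For context on the second cause: XML 1.0 forbids nearly all ASCII control characters (only tab, LF, and CR are permitted), so a single stray control code from OCR makes the whole datastream unparseable. A quick illustration in Python (the actual class is PHP; this is just a stand-in parser demo):

```python
import xml.etree.ElementTree as ET

# XML 1.0 permits only tab (0x09), LF (0x0A), and CR (0x0D)
# among the ASCII control characters; anything else, such as
# the bell character below, is fatal to a conforming parser.
bad = "<page>OCR text with a control code: \x07</page>"

try:
    ET.fromstring(bad)
    parsed = True
except ET.ParseError:
    parsed = False

print(parsed)  # False: the bell character breaks the parse
```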

Working with sample XML-1 datastreams provided by Peter and Kalaivani, I developed a script that removes the control codes and does a "best guess" conversion of the multi-byte UTF-8 characters.

The script walks through each character in the XML file and detects its encoding with mb_detect_encoding(). If a character is not a "normal" ASCII code, further analysis is done: ASCII control codes (below 32) are removed, and anything above the "normal" ASCII range is decoded and converted with iconv() to determine a "normal" ASCII replacement, or set of replacements.

An example of the conversion analysis is that the ligature character "ﬀ" (U+FB00) is converted into the two letters "ff". This is helpful in that previously mangled words like "diﬀerence" are now "difference". Notice the double ff is converted.

I wrote a class and placed it in the shared class path on rep-devel.

/mellon/includes/classes/php/string_cleaner/

If we are satisfied with the results, I suggest we add this to the pipeline process in WMS.

I also wrote a web interface to the tool, which can be found here:

http://rep-test.libraries.rutgers.edu/cmmSand/multibyte/

#3

Status: active » test

WMS (for rucore) is using this class now. Test when release 7.0 moves to rep-devel/test site. -YY

#4

Assigned to: yuyang » pkonin

I am assigning this bug to Peter to test.

#5

Status: test » fixed

#6

Status: fixed » closed

Closing.
