Unicode searching with amberfish

Project: RUcore/NJDH/Partner Portal Search
Version: 5.0
Component: Searching
Category: feature request
Priority: normal
Assigned: triggs
Status: closed
Description

Li Sun and Grace requested a means of searching for Chinese and other non-Roman characters entered into objects as Unicode strings. The following email to Ron Jantz demonstrates the issue and the solution we agreed upon:
From: triggs@rutgers.edu
Subject: test examples for scenario
Date: February 26, 2009 5:26:28 PM EST
To: rjantz@rci.rutgers.edu

This is the original object with the chars grepped out (nice grep;-):

fedora@lefty64:/mellon/htdocs/dlr/EDIT/TESTOBJECTS> egrep "书法家" /mellon/data/objects/2009/0223/16/06/rutgers-lib_24235
<dc:title>书法家: shu fa jia</dc:title>
<dc:title>书法家: shu fa jia test</dc:title>
<mods:title>书法家</mods:title>
<mods:title>书法家</mods:title>
<foxml:datastreamVersion ID="SMAP1.0" LABEL="书法家: SMAP1" CREATED="2009-02-23T16:06:08.000Z" MIMETYPE="text/xml">
<foxml:datastreamVersion ID="DJVU-11.0" LABEL="书法家: DJVU-11" CREATED="2009-02-23T16:06:08.000Z" MIMETYPE="image/x.djvu">
<foxml:datastreamVersion ID="PDF-11.0" LABEL="书法家: PDF-11" CREATED="2009-02-23T16:06:08.000Z" MIMETYPE="application/pdf">
<foxml:datastreamVersion ID="ARCH1.0" LABEL="书法家: ARCH1" CREATED="2009-02-23T16:06:08.000Z" MIMETYPE="application/x-tar">

I have two other versions of the file on hand (as well as the original): one with the characters converted to numeric character entities and the other filtered with vis -h. A grep of dc:title shows what we're dealing with:

fedora@lefty64:/mellon/htdocs/dlr/EDIT/TESTOBJECTS> egrep dc:title *rutgers-lib:24235.xml
rutgers-lib:24235.xml: <dc:title>书法家: shu fa jia</dc:title>
unient-rutgers-lib:24235.xml: <dc:title>&#20070;&#27861;&#23478;: shu fa jia</dc:title>
unient-rutgers-lib:24235.xml: <dc:title>&#20070;&#27861;&#23478;: shu fa jia test</dc:title>
vis-rutgers-lib:24235.xml: <dc:title>{e4}{b9}{a6}{e6}{b3}{95}{e5}{ae}{b6}: shu fa jia</dc:title>
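
The relationship between the three forms of the title can be reproduced with a short Python sketch. It is illustrative only: the function names are hypothetical, and it mimics the output shown above rather than the actual tools that produced the files. The entities are the decimal Unicode code points, while the {xx} form is the UTF-8 byte sequence of the same characters.

```python
# Illustrative sketch: reproduces the two converted forms of the title
# seen in the grep output. Neither function is part of vis or amberfish.

def to_numeric_entities(text: str) -> str:
    """Replace each non-ASCII character with a decimal character entity."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


def to_brace_hex(text: str) -> str:
    """Replace each non-ASCII UTF-8 byte with {xx}; ASCII bytes pass through."""
    return "".join(
        chr(b) if b < 128 else f"{{{b:02x}}}" for b in text.encode("utf-8")
    )


title = "书法家: shu fa jia"
print(to_numeric_entities(title))  # &#20070;&#27861;&#23478;: shu fa jia
print(to_brace_hex(title))         # {e4}{b9}{a6}{e6}{b3}{95}{e5}{ae}{b6}: shu fa jia
```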

I create an amberfish index of these three files called fafafa:

fedora@lefty64:/mellon/htdocs/dlr/EDIT/TESTOBJECTS> ls *rutgers-lib:24235.xml | af -i -d fafafa -C -t xml --phrase -F -v
/mellon/htdocs/dlr/EDIT/TESTOBJECTS/rutgers-lib:24235.xml
/mellon/htdocs/dlr/EDIT/TESTOBJECTS/unient-rutgers-lib:24235.xml
/mellon/htdocs/dlr/EDIT/TESTOBJECTS/vis-rutgers-lib:24235.xml

Search using the unicode characters does not work:

fedora@lefty64:/mellon/htdocs/dlr/EDIT/TESTOBJECTS> af -s -d fafafa --query 书法家
fedora@lefty64:/mellon/htdocs/dlr/EDIT/TESTOBJECTS>

Search with those same characters translated by vis -h does work:

af -s -d fafafa --query e4 b9 a6 e6 b3 95 ae b6
+ 100 fafafa 3 0 /mellon/htdocs/dlr/EDIT/TESTOBJECTS/vis-rutgers-lib:24235.xml 0 17733
fedora@lefty64:/mellon/htdocs/dlr/EDIT/TESTOBJECTS>

The Scenario:

The user inputs Unicode (UTF-8) into WMS, probably by pasting from a text, or by typing on a computer that has the fonts installed.

The object is ingested into Fedora, which will accept the UTF-8 but not other types of hex, ensuring some reliability.

The object is indexed in the normal way, with vis -h filtering any hex codes before they are indexed by amberfish. This is done to protect against bad XML data, but will work on the good codes as well. The converted ASCII hex representations are separated by {} characters, which amberfish considers punctuation, and are thus indexed as separate tokens or "words".

To search this, the user enters Unicode (UTF-8) in the search box, which becomes the search query. We already munge the search query quite a bit, but we will now also pass it through vis -h to convert any hex data to the ASCII equivalents, which are then used as keywords for searching. The Fedora id is returned as with any search. The only thing we need to do is add the vis filter to the user search text.

Jeffery

The vis filter is now simply applied routinely to user queries (as it already has been to the "search objects", to protect against "bad" characters that would bother the xerces XML parser used by the amberfish indexer). This has no effect on ordinary Roman search queries (so no change in normal behavior), but quietly prepares queries containing Unicode to hit in the search indexes. The system has been tested on lefty64 with both Chinese and Arabic texts.
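
The reason the same filter works on both sides can be sketched in Python. This is a rough model under stated assumptions, not the production code: `brace_hex` imitates the {xx} output seen in the vis-filtered file, and `tokens` assumes the indexer treats every character other than an ASCII letter or digit as a separator. Both names are hypothetical.

```python
import re


def brace_hex(text: str) -> str:
    """Imitate the filtered form: non-ASCII UTF-8 bytes become {xx}."""
    return "".join(
        chr(b) if b < 128 else f"{{{b:02x}}}" for b in text.encode("utf-8")
    )


def tokens(text: str) -> list[str]:
    """Assumed tokenizer: split on anything that is not an ASCII letter or digit."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]


# Index side: the stored title is filtered, then tokenized.
indexed = tokens(brace_hex("书法家: shu fa jia"))

# Query side: the user's text goes through the same filter.
cjk_query = tokens(brace_hex("书法家"))
roman_query = tokens(brace_hex("shu fa"))

print(cjk_query)                              # ['e4', 'b9', 'a6', 'e6', 'b3', '95', 'e5', 'ae', 'b6']
print(all(t in indexed for t in cjk_query))   # True
print(roman_query)                            # ['shu', 'fa'] (unchanged by the filter)
```

Because the query and the indexed text pass through the same conversion, the byte tokens line up, and a query made only of ASCII characters comes out exactly as it went in.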

Comments

#1

Here is an example of how the search works in RUcore. The search term is "DEBUG:书法家".

/usr/local/bin/af -s -d /mellon/htdocs/dlr/INDEX/nnfedora-rucore00000000010 --query '(/.../mods:titleInfo/_c/.../"e4" & /.../mods:titleInfo/_c/.../"b9" & /.../mods:titleInfo/_c/.../"a6" & /.../mods:titleInfo/_c/.../"e6" & /.../mods:titleInfo/_c/.../"b3" & /.../mods:titleInfo/_c/.../"95" & /.../mods:titleInfo/_c/.../"e5" & /.../mods:titleInfo/_c/.../"ae" & /.../mods:titleInfo/_c/.../"b6")' 2>>/mellon/htdocs/dlr/TMP/af.err | /mellon/htdocs/dlr/EDIT/getpid.pl
orderby is 'ORDER BY dcTitle, dcDate'
searching DLR Portal
totalhits is 1
select token,path,dcTitle,dcCreator,dcDate from fedsearchfields where token IN ('rutgers-lib:24235') ORDER BY dcTitle, dcDate LIMIT 0, 10
Displaying 1 result
1 received pathname is '100 dbase 10 0 /mellon/data/objects/2009/0223/16/06/rutgers-lib_24235 0 100'
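
The shape of the amberfish query in this example can be reproduced with a small Python sketch. How RUcore actually assembles the query is not shown in this ticket, so the function below is a hypothetical reconstruction from the logged command: each byte token is wrapped in the mods:titleInfo field path, and the clauses are joined with &. It takes only the characters of the search term and ignores the "DEBUG:" prefix.

```python
import re


def brace_hex(text: str) -> str:
    """Imitate the filtered form: non-ASCII UTF-8 bytes become {xx}."""
    return "".join(
        chr(b) if b < 128 else f"{{{b:02x}}}" for b in text.encode("utf-8")
    )


def title_query(term: str, field: str = "mods:titleInfo") -> str:
    """Build an AND query over the field, one clause per token (hypothetical helper)."""
    words = re.findall(r"[0-9A-Za-z]+", brace_hex(term))
    clauses = [f'/.../{field}/_c/.../"{w}"' for w in words]
    return "(" + " & ".join(clauses) + ")"


print(title_query("书法家"))
```

For the three characters above this yields nine clauses, one per UTF-8 byte, in the same order as the --query argument in the logged command.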

#2

Status: fixed » closed
