Solr : Cannot search for entities if they are represented as numeric codes in Metadata or XML-1

Project:RUcore SOLR Searching and Indexing
Version:7.4
Component:Code
Category:bug report
Priority:normal
Assigned:triggs
Status:postponed
Description

If an entity is represented as a code in the XML, it is not searchable unless you search on that code. A user would not do that; they would search on the entity itself.

An example is the following foreign character: 漢 is what is displayed on the screen and what a user would search for. The following entity code is what is stored in the XML: &#28450;

The following solr-php-client issue describes how to fix this. They recommend decoding the source before adding it to the index.

http://code.google.com/p/solr-php-client/issues/detail?id=30
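
A minimal sketch of the "decode before indexing" idea, in Python for illustration only (the RUcore indexer itself is not written this way, and the field names `id` and `fulltext` are assumptions, not the real schema). It uses the standard library's `html.unescape`, which resolves decimal, hexadecimal, and named character references:

```python
import html


def decode_entities(text: str) -> str:
    """Replace character references with the characters they stand for."""
    return html.unescape(text)


def build_index_doc(object_id: str, raw_text: str) -> dict:
    # Hypothetical document shape; the real Solr schema is not shown in this issue.
    return {"id": object_id, "fulltext": decode_entities(raw_text)}


doc = build_index_doc("example:1", "&#28450;")
print(doc["fulltext"])  # 漢
```

With the text decoded before it reaches the index, a search for the character itself can match, which is the behaviour a user would expect.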

Comments

#1

Status:active» test

Reindex a few objects on rep-devel and try searching for Unicode text. I made a minor change to solrfilter-api.cgi: the XML-1 characters are first protected, and distinguished from any illegal control codes or OCR artifacts, by conversion to numeric entities, and these are converted safely back by the Solr indexer itself. I was testing a French text, rutgers-lib:24190. Search for: siècle récit puisse avancer (or simply "récit puisse avancer", which appears only in the full text).

#2

Status:test» active

I looked in the XML-1 datastream for that record, and the full-text example "...le récit puisse avancer, le..." is not using an HTML entity code to store the "é". I was expecting to see in the XML data "...le r&#233;cit puisse avancer, le..."

Since the character in question isn't being stored as an HTML entity code, I am not sure whether Solr/Lucene is indexing it as a decoded character or as an HTML entity code.

#3

Yes. It is stored as the Unicode character in the XML-1 datastream. In the process of creating the search XML it is first converted to &#233; (along with all potentially dangerous characters) but then passed to Solr as "é". The bug was that an unnecessary call to a UTF-8 conversion was eating characters like these. They are now being safely passed to the indexer.
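
The protect-then-restore round trip described above can be sketched roughly as follows. This is an illustration in Python, not the code in solrfilter-api.cgi; the function names `protect` and `restore` are made up for the example, and it only handles the characters XML 1.0 forbids plus non-ASCII text (it does not escape `&` or `<`).

```python
import re

# Control codes that XML 1.0 does not allow (tab, LF and CR are legal).
_ILLEGAL_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")

# Decimal (&#233;) or hexadecimal (&#xE9;) numeric character references.
_NUMERIC_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")


def protect(text: str) -> str:
    """Drop illegal control codes, then turn non-ASCII characters into numeric entities."""
    cleaned = _ILLEGAL_XML.sub("", text)
    return cleaned.encode("ascii", "xmlcharrefreplace").decode("ascii")


def restore(text: str) -> str:
    """Convert numeric entities back into the characters they represent."""
    def _replace(match: re.Match) -> str:
        ref = match.group(1)
        code_point = int(ref[1:], 16) if ref[0] == "x" else int(ref)
        return chr(code_point)

    return _NUMERIC_REF.sub(_replace, text)


protected = protect("le r\u00e9cit\x0c puisse avancer")
print(protected)           # le r&#233;cit puisse avancer
print(restore(protected))  # le récit puisse avancer
```

The point of the intermediate ASCII form is that any later byte-level conversion step has nothing left to mangle; only the final `restore` step reintroduces the real characters.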

#4

Do you have another example where the Unicode character is stored as an entity in the XML-1 datastream? In this example, rutgers-lib:24190, the XML-1 datastream has the characters stored as "é" and not as the entity code. I need this for testing, thanks.

#5

10601 is an early ETD with numeric entities in the XML. 13586 is a Chronicle OCR text with a slew of bad characters double-escaped in the original (e.g. what appears as &amp;quot; is &#x26;quot;). I took out the numeric encoding and we'll have to see if it breaks anything in a wider indexing. It seems to work now for both Unicode texts and texts with entities.
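
To illustrate why double-escaped text such as that in 13586 behaves differently, here is a small Python sketch (an illustration only, not part of the indexer; `unescape_fully` and its `max_passes` limit are hypothetical). A single decoding pass only peels off one layer of escaping:

```python
import html


def unescape_fully(text: str, max_passes: int = 3) -> str:
    """Decode repeatedly until the text stops changing or the pass limit is hit."""
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


once = html.unescape("&#x26;quot;")
print(once)                           # &quot;
print(unescape_fully("&#x26;quot;"))  # "
```

Repeated decoding is a blunt tool: it would also collapse text that is meant to contain a literal "&amp;", so it is probably only appropriate for sources known to be double-escaped.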

#6

Version:6-x» 7.0

I have only started looking at 10601. I could not find this dissertation through the RUcore development server search. Looking at the object, I see no portals are associated with it. I did look at the XML-1 datastream for 10601 and see a number of entity references, such as &#xE0;. These are hexadecimal codes, and I was looking to test numeric codes, such as &#224;. Does 13586 fit this need? If not, can you please give me an object to test that can be found through the development RUcore search and that has numeric entities, not hexadecimal entities, in the XML-1? I am also changing this to a 7.0 issue.
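
For reference, decimal and hexadecimal references to the same code point are equivalent once the XML has been parsed. The short Python check below uses a made-up sample document (not a real RUcore datastream) to show this with the standard library parser:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same character as a decimal reference,
# a hexadecimal reference, and a literal character.
SAMPLE = "<doc><dec>&#224;</dec><hex>&#xE0;</hex><lit>\u00e0</lit></doc>"

root = ET.fromstring(SAMPLE)
values = {child.tag: child.text for child in root}

print(values["dec"] == values["hex"] == values["lit"])  # True
```

The two forms would only behave differently if the datastream is handled as a raw string somewhere before parsing, for example by a pattern that recognises one notation and not the other. Whether that happens in the RUcore indexing path is not established in this issue.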

#7

Project:RUcore dlr/EDIT» RUcore SOLR Searching and Indexing
Version:7.0» 7.0

Moving from dlr/EDIT to RUcore SOLR Searching and Indexing.

#8

Version:7.0» 7-x

13586 is a Paterson Chronicle text with a lot of already-escaped entity references in the underlying XML, e.g., &amp;#202;, which appears to the Solr engine as &#202;. It can't really be considered typical or be used to test this.

#9

Version:7-x» 7.2

Jeffery,

What is the status of this bug?

KA

#10

I'm not sure, frankly. I think we need a new, specially set up test object with a well-defined search scenario.

#11

Version:7.2» 7.4

Moving to R7.4.

#12

Status:active» postponed
