Solr throwing errors when handling unicode characters

Project:RUcore API's
Version:7.0
Component:Search API
Category:bug report
Priority:normal
Assigned:chadmills
Status:closed
Description

In the PHP error log I found this example:

********
[05-Mar-2012 15:35:46] PHP Warning: SolrClient::query() [<a href='solrclient.query'>solrclient.query</a>]: in /home/httpd/html/rucore/api/search/lib/class.query.solr.php on line 256
[05-Mar-2012 15:35:46] PHP Fatal error: Uncaught exception 'SolrClientException' with message 'Unsuccessful query request : Response Code 400. <html>
********

In this example, Solr returned response code 400 for the following search:

********
<title>Error 400 org.apache.lucene.queryParser.ParseException: Cannot parse 'metadata:(fran?is le nev? AND portalkey:(NJDH) AND relation:(+ACIR OR +ALMBHNL OR +ALMBHNL_ImmigrationAzzieFamily OR +ALMBHNL_JosephYannarelli OR +ALMBHNL_PassaicStrike1926 OR +ALMBHNL_PeriodRooms OR +ALMBHNL_UnionTWUA OR +AMLBH_OralHistories OR +EHCHSPHOTO OR +EHCN OR +HHP OR +JCDC OR +JerseyShore OR +John_Albok_Collection_145 OR +JrMus OR +NJDH OR +NJHS OR +NJSL OR +NJSO1876 OR +NPLRNG OR +NewJerseyNeighborhoodsCollection OR +RULGriffis OR +Roosevelt OR +SBFarms OR +SPCOLMAPS OR +WPUChronicle OR +WarwiFlood OR +rucore00000001018 OR +rucore00000001026 OR +rucore00000001036 OR +rucore00000001042 OR +szhis004) NOT type:(collection)': Encountered "&lt;EOF&gt;" at line 1, column 638.
********

I then checked the Apache access_log for the same timestamp and found the following:

********
128.6.218.103 - - [05/Mar/2012:15:35:46 -0500] "POST /api/search/query/?output=xml&style=css&numresults=10&start=&q1=fran%E7ois+le+nev%E9&q1field=object&q1bool=&q2=&q2field=&orderby=relevance&c%5B0%5D=ACIR&c%5B1%5D=ALMBHNL&c%5B2%5D=ALMBHNL_ImmigrationAzzieFamily&c%5B3%5D=ALMBHNL_JosephYannarelli&c%5B4%5D=ALMBHNL_PassaicStrike1926&c%5B5%5D=ALMBHNL_PeriodRooms&c%5B6%5D=ALMBHNL_UnionTWUA&c%5B7%5D=AMLBH_OralHistories&c%5B8%5D=EHCHSPHOTO&c%5B9%5D=EHCN&c%5B10%5D=HHP&c%5B11%5D=JCDC&c%5B12%5D=JerseyShore&c%5B13%5D=John_Albok_Collection_145&c%5B14%5D=JrMus&c%5B15%5D=NJDH&c%5B16%5D=NJEDL&c%5B17%5D=NJHS&c%5B18%5D=NJSL&c%5B19%5D=NJSO1876&c%5B20%5D=NPLRNG&c%5B21%5D=NewJerseyNeighborhoodsCollection&c%5B22%5D=RULGriffis&c%5B23%5D=Roosevelt&c%5B24%5D=SBFarms&c%5B25%5D=SPCOLMAPS&c%5B26%5D=WPUChronicle&c%5B27%5D=WarwiFlood&c%5B28%5D=rucore00000001002&c%5B29%5D=rucore00000001018&c%5B30%5D=rucore00000001026&c%5B31%5D=rucore00000001036&c%5B32%5D=rucore00000001042&c%5B33%5D=szhis004&key=NJDH HTTP/1.1" 500 -
********

I tried reproducing the query and found that "q1=fran%E7ois" was tripping up Solr. I could only reproduce the error by querying the search API directly, not through the search interface. When I searched for "françois" from the search interface, the term was submitted as "q1=fran%C3%A7ois" (the UTF-8 encoding of "ç") rather than "q1=fran%E7ois" (the single-byte ISO-8859-1 encoding), and that is what appeared in the Apache access log. That search also returned two records. The faulty queries are not logged by the statistics package because the error occurs before the statistics would be logged.
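To illustrate the difference between the two query strings, here is a minimal Python sketch (not code from the search API itself) that percent-encodes the same term under ISO-8859-1 and under UTF-8. It also shows what happens when the ISO-8859-1 form is decoded as if it were UTF-8; that a UTF-8 decode happens somewhere between the API and Solr is an assumption, not something I have confirmed.

```python
from urllib.parse import quote, unquote

term = "françois"

# Percent-encode the same term under two different character encodings.
latin1_encoded = quote(term, encoding="latin-1")  # form seen in the failing request
utf8_encoded = quote(term, encoding="utf-8")      # form sent by the search interface

print(latin1_encoded)
print(utf8_encoded)

# Decoding the Latin-1 form as UTF-8 cannot recover the "ç": the lone
# 0xE7 byte is not valid UTF-8, so errors="replace" substitutes U+FFFD.
mangled = unquote(latin1_encoded, encoding="utf-8", errors="replace")
print(mangled)
```

The lost character in the last step resembles the "fran?is" that shows up in the Solr parse error above, although the exact point where the character is dropped in our stack is still to be determined.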

I will look at a few other errors, but I suspect they are caused by a similar issue. Moving forward, I will try to reproduce this through the search interface; once successful, I can find out what Solr doesn't like about the query. At that point we can either do a Unicode re-encoding of the search string or something else. In general, I will also investigate beefing up the fault tolerance at that point in the query process to reduce the number of errors thrown to the PHP error log, and possibly to the end user.
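As a rough idea of what more tolerant input handling could look like, the following Python sketch decodes a percent-encoded search term as UTF-8 and falls back to ISO-8859-1 when the bytes are not valid UTF-8. The function name decode_search_term is hypothetical, and the Latin-1 fallback is an assumption about what non-UTF-8 clients are sending; the real fix would live in the PHP query class.

```python
from urllib.parse import unquote_to_bytes


def decode_search_term(raw: str) -> str:
    """Decode a percent-encoded query value, preferring UTF-8."""
    # In form-encoded data "+" stands for a space; convert it before unquoting.
    data = unquote_to_bytes(raw.replace("+", " "))
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: input that is not valid UTF-8 is ISO-8859-1.
        # Latin-1 maps every byte to a character, so this cannot fail.
        return data.decode("latin-1")


print(decode_search_term("fran%E7ois+le+nev%E9"))
print(decode_search_term("fran%C3%A7ois"))
```

Trying UTF-8 first matters because any byte sequence decodes "successfully" as Latin-1, so the order cannot be reversed.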

Comments

#1

Version:6-x» 7.0

#2

Project:RUcore/NJDH/Partner Portal Search» RUcore API's
Version:7.0» 7.0
Component:API - Search API» Search API
Status:active» test

Corrected this issue by applying utf8_decode() to the submitted search string before doing deeper analysis. To test, search using UTF-8 encoded strings such as 'françois'.

#3

Status:test» active

Still throws an error in this search:

http://rep-test.libraries.rutgers.edu/rucore/search/results.php?key=scholarship&q1=%C3%A7&q1field=object&q1bool=AND&q2=&q2field=object&rtype[]=&orderby=relevance&numresults=10

which is a search for "ç"

#4

Status:active» fixed

Removed the utf8_decode() call on the submitted string. Searches appear to be more accurate now and I do not get any Solr errors. Test by searching with a term such as 'françois'.
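One possible reading of why dropping the conversion helped, assuming Solr expects UTF-8 input: PHP's utf8_decode() converts UTF-8 text to ISO-8859-1, so a valid UTF-8 "ç" from the search interface would be reduced to a single byte that is no longer valid UTF-8. The Python sketch below imitates that conversion; utf8_to_latin1 is a hypothetical stand-in, not the PHP function itself.

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Rough analogue of PHP's utf8_decode(): UTF-8 bytes in, Latin-1 bytes out."""
    # Characters outside Latin-1 become "?", as they do in PHP.
    return data.decode("utf-8").encode("latin-1", errors="replace")


query = "ç".encode("utf-8")        # two bytes, as sent by the search interface
converted = utf8_to_latin1(query)  # a single Latin-1 byte

# Check whether the converted bytes can still be read as UTF-8.
try:
    converted.decode("utf-8")
    still_valid_utf8 = True
except UnicodeDecodeError:
    still_valid_utf8 = False

print(query, converted, still_valid_utf8)
```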

#5

Status:fixed» closed
