SOAR SC&I collection and sub collection indexing issue

Project:RUcore SOLR Searching and Indexing
Category:bug report

Kalaivani just alerted me to what we think is an indexing issue with all of the SC&I department's collections and materials. None appear to be indexed at this point through the SOAR portal. When we looked up one of the deposits, which originated on 10/29/2015, in dlr/EDIT and went to view the Solr index, there was an error. The error is specifically with:

<schooldeptfacet>School of Communication and Information (SC&I)</schooldeptfacet>

The ampersand isn't being stored as an entity (&amp;).

Since this schooldeptfacet value should be the same for all SC&I material, we think this is the cause. The resource on production where we saw this is rutgers-lib:47900.
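For reference, the failure can be reproduced outside the indexer. The sketch below is not the indexer's code (that is PHP); it only uses Python's standard XML parser to show that the element quoted above is rejected with a bare ampersand and accepted once the ampersand is written as &amp;. The helper name is_well_formed is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Element as it appears in the broken index output (bare ampersand).
bad = "<schooldeptfacet>School of Communication and Information (SC&I)</schooldeptfacet>"
# Same element with the ampersand stored as an entity.
good = bad.replace("SC&I", "SC&amp;I")


def is_well_formed(xml_text):
    """Return True if the standard parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(bad))
print(is_well_formed(good))
# The parsed text of the good element contains the plain "&" again.
print(ET.fromstring(good).text)
```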



Assigned to:triggs» chadmills
Status:active» test

I think I have it fixed on rep-dev and rep-test. I edited an object on rep-test to have the same MODS as the problem sample on mss3, immediately recreated the problem on rep-test, and then applied a couple of lines that allow the indexer to function as expected. The object on rep-test is rutgers-lib:203873 (originally one of my test objects). In case we decide we need to push through this fix right away, I've packaged it in a tar file on rep-test:
triggs@rep-devel:/mellon/htdocs/dlr/EDIT> tar tvf /mellon/cvsroot/solrfilter-api-8-1-patch.tar
-rwxr-xr-x triggs/developers 34366 2016-07-19 16:25 solrfilter-api.php
-rwxr-xr-x triggs/developers 38855 2016-07-19 16:26 INT/solrfilter-api.php
To run the patch, download to mss3, cd to the dlr/EDIT directory and type
tar xvf /path/to/tarfile/solrfilter-api-8-1-patch.tar


Version:8.x» 8.1.1
Assigned to:chadmills» triggs


Can you give me some background on the code change? It would help me know how exhaustive my testing needs to be. Were you able to isolate the fix to the solr index field in question or is the change broader and can include the way data is created for other index fields as well?



I would also like to know if this fix addresses all special characters. Department names may have apostrophes, commas, etc.


This fixes the issue of the unescaped & in department titles that came up in the PHP version of the CGI script. I'm not sure about apostrophes and commas. I have not been aware of such issues, but if there are some I suspect they should be described in a separate issue.


Status:test» active

Jeffery, we should find a universal solution for this issue, not one that escapes only "&". I can't identify all the characters individually; it could be an apostrophe, a comma, a parenthesis, or some other character.


Along with #5 I am still not sure if this fix is just for the department name solr field or is the same conversion/logic being applied to all solr fields.


Escaping the "&" will in fact protect any other valid entity references in such strings (e.g., &apos;, &quot;, etc.). It reproduces more exactly the behavior of the CGI xpath that did protect & characters. The hanging & is what is illegal XML.


I am still not sure if this fix is just for the department name solr field or is the same conversion/logic being applied to all solr fields. I just need to know the extent of what I need to test.


The other fields drawn directly from the MODS are not a problem. It only seems to happen with the second xpath of MODS for these special fields. The fix addresses these secondary xpath fields to match the behavior we got automatically from the CGI xpath module.


Can you give me another field I can look at on test that will behave the same way? I can then sign off on this.



203873 on rep-test has the same MODS as the object on production. The SC&I string occurs in various other places, including allmods and the regular modsname, with &amp;. These fields are escaped normally. The secondary xpath in schooldeptfacet_st is now escaped as well. As for the &apos;, these as usual appear as &amp;apos; in the debug output (and also in schooldeptfacet_st), but are converted to searchable "'" strings just before the text is passed to Solr for indexing.


So if a stray ampersand appears in another field's values, it will be escaped/converted to &amp; as well? My concern is that this only targets the specific field where the initial issue was found, but the problem value, the ampersand, could occur elsewhere and we would start playing whack-a-mole.


The stray ampersand is the only thing within the data of XML elements that can cause this kind of trouble. The problem happened in the secondary xpath using PHP, which did not escape such characters as the Perl xpath module did. This takes care of any odd ampersands (e.g. AT&T., &c., Jack & Jill) that typically occur. If a real entity gets caught here (e.g., &apos; as &amp;apos;), it gets picked up and re-rendered as before in the final stage before indexing. This is a big hammer that clobbers all such moles at once.
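A rough sketch of the two stages described above, written in Python purely for illustration. The real change lives in solrfilter-api.php and is not reproduced here; the function names and the two-pass decoding are assumptions about how "escape everything, then re-render before indexing" could work, not the actual implementation.

```python
import re

# The five entity references predefined by XML.
XML_ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "apos": "'", "quot": '"'}


def escape_ampersands(value):
    """Stage 1 (assumed): escape every ampersand, even one that starts an entity."""
    return value.replace("&", "&amp;")


def resolve_entities(value):
    """Replace predefined entity references with their characters, one pass."""
    return re.sub(r"&(amp|lt|gt|apos|quot);",
                  lambda m: XML_ENTITIES[m.group(1)], value)


def text_for_index(escaped):
    """Final stage (assumed): undo the blanket escape, then render real entities."""
    return resolve_entities(resolve_entities(escaped))


for raw in ["AT&T", "Jack & Jill", "Children&apos;s services"]:
    escaped = escape_ampersands(raw)
    print(raw, "->", escaped, "->", text_for_index(escaped))
```

Under these assumptions a stray ampersand comes back unchanged after the final stage, while a caught entity such as &apos; passes through as &amp;apos; and ends up as a plain apostrophe.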


Priority:normal» critical

Since it is still not clear to me that this fix is applied to all values being indexed, I edited 203873 on rep-test. Specifically:

<mods:subject authority="local">
<mods:topic>Assertive community treatment &</mods:topic>

After updating the MODS I selected "View Solr/Lucene and Object XML". An error is now thrown on screen in my browser.

This page contains the following errors:
error on line 351 at column 50: xmlParseEntityRef: no name
Below is a rendering of the page up to the first error.

The index of the resource looks incomplete.

<add><doc><field name="id">rutgers-lib:203873</field><field name="fedoraid">rutgers-lib_203873</field></doc></add>


OK. Try it again now. It wasn't looking for the &amp;< condition. Now it gets even this:
<field name="title"> &Justifying medication decisions in mental health care &c.: psychiatrists’ accounts for treatment recommendations &</field> goes to
<field name="title"> &amp;Justifying medication decisions in mental health care &amp;c.: psychiatrists’ accounts for treatment recommendations &amp;</field>
with or without the space at the beginning of the mods:title field.
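The before and after strings above can be checked with a small Python sketch. It is not the patched PHP; solr_field is a made-up helper that escapes the text with the standard library and wraps it in a field element, so the result can be compared with the output quoted above and parsed back.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def solr_field(name, value):
    """Build one Solr <field> element with escaped text (hypothetical helper)."""
    return "<field name={}>{}</field>".format(quoteattr(name), escape(value))


# Title text from the test object, with ampersands at the start, middle and end.
title = (" &Justifying medication decisions in mental health care &c.: "
         "psychiatrists’ accounts for treatment recommendations &")

field = solr_field("title", title)
print(field)

# The escaped field parses cleanly and yields the original text.
print(ET.fromstring(field).text == title)
```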


Status:active» fixed


Version:8.1.1» 8.1.2


Confirmed on staging.


Status:fixed» closed
