Values with double-quotes are not encoded in the MODS metadata properly and/or not stored in the index properly

Project:RUcore SOLR Searching and Indexing
Category:bug report

Values with double-quotes are not encoded in the MODS metadata properly and/or not stored in the index properly. "Cape May Diamonds" is one example. This causes some facet values that contain double-quotes to yield 0 results. Please see associated issue "".



Assigned to:Anonymous» triggs


Hi Marty,

I'm not sure from this description (or the one in 3085) what the problem is. Can you give me a URL showing the problem? And explain what you would like to see? Are you selecting a tokenized text field for faceting? Are the quotes baked into a string field?



The facet I set up was for Topic (Subject). The "Cape May diamonds" string is in the following record:
"" (title is "Unearthing New Jersey Vol. 2, No. 2").
It does have the double-quotes in the metadata field.

Chad may be able to give you a better explanation than I can of how the double-quotes affect faceting code.


In particular look at the topic field...

Subject (ID = SUBJ7); (authority = NJEDL)
"Cape May diamonds"

In the index field for topic ('topic', I believe), the double-quotes are literal double-quotes and not the expected encoded &quot;.

When submitting a search to Solr, the fact that they are not encoded breaks the search query.


So if these quotes were escaped as &quot; entities they would be OK? Is there any reason they are needed? Would it be OK to strip " from the Solr index field?


I don't think they should be stripped.

If the value stored was...

&quot;Cape May diamonds&quot;

instead of...

"Cape May diamonds"

We would be in business. That is because the query we form places quotes around the value. Presenting Solr with

""Cape May diamonds"" yields no results

"&quot;Cape May diamonds&quot;" yields results

Also, I have no idea how the values go from WMS into the repository. It might be worthwhile to create a record in WMS with a topic value containing quotes and see whether they are escaped/converted.
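
To make the quoting problem above concrete, here is a minimal sketch. It assumes the faceting code builds a query of the form field:"value"; the function names are hypothetical and not taken from the RUcore code. It also shows backslash-escaping, which is the Solr standard query parser's own way of putting a quote inside a phrase; that is a different approach from storing &quot; in the index as suggested above.

```python
def naive_phrase(field, value):
    # Assumed behaviour of the faceting code: wrap the raw value in double quotes.
    return f'{field}:"{value}"'


def escaped_phrase(field, value):
    # Backslash-escape backslashes first, then double quotes, so the quotes
    # inside the value no longer terminate the phrase.
    safe = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{safe}"'


stored = '"Cape May diamonds"'
print(naive_phrase("topic", stored))    # topic:""Cape May diamonds""
print(escaped_phrase("topic", stored))  # topic:"\"Cape May diamonds\""
```

The first form is the one that yields no results; whether the second form matches depends on how the topic field is analyzed, which I have not checked.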


It seems a pretty rare phenomenon. These are the only objects on test with a real " character in topic:
triggs@rep-devel:/mellon/cgi-bin/solr> egrep -lr "topic>\"" /repository/data/objects/*
The actual data is even more troublesome (no closing "'s, etc.):
triggs@rep-devel:/mellon/cgi-bin/solr> egrep -r "topic>\"" /repository/data/objects/*
/repository/data/objects/2009/0428/13/01/rutgers-lib_18548: <mods:topic>"Pesticide Free Zones" (PFZ)</mods:topic>
/repository/data/objects/2009/0428/13/01/rutgers-lib_18517: <mods:topic>"H.R. 5872"</mods:topic>
/repository/data/objects/2009/0428/13/01/rutgers-lib_18584: <mods:topic>"Working Dogs for Conservation Foundation</mods:topic>
/repository/data/objects/2009/0428/13/01/rutgers-lib_18584: <mods:topic>"
/repository/data/objects/2009/0428/13/00/rutgers-lib_18477: <mods:topic>"
/repository/data/objects/2009/0428/13/50/rutgers-lib_19887: <mods:topic>"Wanaque South Project</mods:topic>
/repository/data/objects/2009/0428/13/50/rutgers-lib_19887: <mods:topic>"
/repository/data/objects/2009/0428/13/50/rutgers-lib_19886: <mods:topic>"
/repository/data/objects/2010/0602/13/46/rutgers-lib_25418: <mods:topic>"University Extension</mods:topic>
/repository/data/objects/2010/0602/13/46/rutgers-lib_25419: <mods:topic>"University Extension</mods:topic>
/repository/data/objects/2012/1218/10/55/rutgers-lib_200833: <mods:topic>"Fra"</mods:topic>
/repository/data/objects/2012/1218/10/55/rutgers-lib_200833: <mods:topic>"Fra</mods:topic>
/repository/data/objects/2012/1218/10/55/rutgers-lib_200833: <mods:topic>"Fra's"</mods:topic>

empty quote example:
triggs@rep-devel:/mellon/cgi-bin/solr> egrep -3 -r "topic>\"" /repository/data/objects/*
/repository/data/objects/2009/0428/13/01/rutgers-lib_18584: <mods:topic>"
/repository/data/objects/2009/0428/13/01/rutgers-lib_18584-Endangered and Nongame Speciews Program (ENSP)</mods:topic>
/repository/data/objects/2009/0428/13/01/rutgers-lib_18584- </mods:subject>


I wonder why 18589 isn't showing in this list, but has a real quote in the index.


I wondered about that too. Actually, it seems to have quotes in the mods itself:
<mods:subject ID="SUBJ7" authority="NJEDL">
<mods:topic>&quot;Cape May diamonds&quot;</mods:topic>
but these get converted by xslt on the way to Solr. I'm looking into controlling the xsl output escaping.
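
A small sketch of why 18589 is missed by the egrep yet ends up with a real quote downstream: any XML parser decodes &quot; to a literal " when it reads the element text. This uses Python's standard library rather than our XSLT pipeline, and the xmlns:mods declaration is added here only so the snippet parses on its own.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# The two lines quoted above, with a namespace declaration added (assumption).
snippet = (
    f'<mods:subject xmlns:mods="{MODS_NS}" ID="SUBJ7" authority="NJEDL">'
    "<mods:topic>&quot;Cape May diamonds&quot;</mods:topic>"
    "</mods:subject>"
)

# The raw file has no literal quote right after "topic>", so the egrep misses it.
raw_has_literal_quote = 'topic>"' in snippet

# After parsing, the entity has been decoded to a real double-quote character.
subject = ET.fromstring(snippet)
topic_text = subject.find(f"{{{MODS_NS}}}topic").text

print(raw_has_literal_quote)  # False
print(topic_text)             # "Cape May diamonds"
```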


Assigned to:triggs» chadmills
Status:active» test

This is quite a tricky one. &quot; is one of two special entities (the other is &apos;) that are known to confound the disable-output-escaping="yes" instruction in XSLT. I had to escape it as &amp;quot; before the XSL transformation and unescape it afterwards just before posting to Solr. I reindexed rutgers-lib:18589, so the "Cape May Diamonds" example can be tested.
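
Roughly, the round trip looks like the sketch below. It is illustrative only: the function names are hypothetical, and the XSL transformation itself is replaced by a pass-through stand-in, since the point is just the protect/restore steps around it.

```python
PROTECTED = "&amp;quot;"


def protect_quot(xml_text):
    # Before the transformation: hide &quot; behind a doubly escaped form.
    return xml_text.replace("&quot;", PROTECTED)


def fake_transform(xml_text):
    # Stand-in for the real XSLT step (assumption: it leaves the protected
    # form untouched).
    return xml_text


def restore_quot(transformed):
    # After the transformation, just before posting to Solr: bring &quot; back.
    return transformed.replace(PROTECTED, "&quot;")


source = "<mods:topic>&quot;Cape May diamonds&quot;</mods:topic>"
protected = protect_quot(source)
restored = restore_quot(fake_transform(protected))
print(protected)  # <mods:topic>&amp;quot;Cape May diamonds&amp;quot;</mods:topic>
print(restored)   # <mods:topic>&quot;Cape May diamonds&quot;</mods:topic>
```

One caveat with this approach: a source value that already contains &amp;quot; would be indistinguishable from a protected one after the first step.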


Assigned to:chadmills» triggs
Status:test» active

Sorry to say this didn't work as expected. I was wrong. I think you were right about just stripping the quotes altogether. Could you back out of this change and strip them instead?



Assigned to:triggs» chadmills
Status:active» test

OK. I don't think my earlier work was wasted - escaping the &quot; entities before restoring them. This had the effect of targeting them, and with one small change I now zap them instead of restoring them. I reindexed the test object.
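
In sketch form, the change amounts to replacing the protected entity with nothing instead of with &quot;. The function name and the sample field element are illustrative assumptions, not the actual indexing code.

```python
PROTECTED = "&amp;quot;"


def strip_protected_quot(transformed):
    # Same target as the earlier restore step, but the entity is dropped
    # rather than turned back into &quot;.
    return transformed.replace(PROTECTED, "")


# Hypothetical example of a field as it might look just before posting to Solr.
doc_field = '<field name="topic">&amp;quot;Cape May diamonds&amp;quot;</field>'
stripped = strip_protected_quot(doc_field)
print(stripped)  # <field name="topic">Cape May diamonds</field>
```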



Status:test» fixed

Thanks. I found the record and checked how it was indexed. I can confirm the quotes were dropped. After a full reindex is run, other values should have their quotes dropped as well.


Status:fixed» closed
