Indexing values with diacritics

Project: RUcore SOLR Searching and Indexing
Version: 8.1
Component: Code
Category: feature request
Priority: normal
Assigned: chadmills
Status: closed
Description

Values with diacritics are indexed properly as entered. However, a corresponding value in which each character with a diacritic is replaced by its unaccented equivalent also needs to be indexed. An example is the author name "Böröcz, József".

It should be indexed twice:

1) Böröcz, József
2) Borocz, Jozsef

Comments

#1

We will probably need to edit the schema.xml file and add one of the two filters listed after the definition below to both the index and query analyzers of the fieldType definition for "text_general":
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
or
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/> (or preserveOriginal="true" to index both the original and the folded versions).
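
A minimal sketch of where the second option could sit in the index analyzer, assuming the rest of the "text_general" definition stays as shown above and that the installed Solr version supports the preserveOriginal attribute. Token filters go after the tokenizer; putting the folding filter after LowerCaseFilterFactory is an assumption here, not a requirement. With preserveOriginal="true", the accented token and its folded form should both be indexed at the same position:

```xml
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- assumed placement: fold after lowercasing; keeps "böröcz" and adds "borocz" -->
  <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
</analyzer>
```

The query analyzer would take the same filter line, so that a search typed with or without diacritics can match either indexed form.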

I am not authorized to edit schema.xml on rep-dev at the moment, so I will need help with this, as well as with restarting Solr once the edits are made.

#2

Please see this issue:
https://software.libraries.rutgers.edu/node/2020 (comment #12)

You should have called for the password of the solr user; with it you should be able
to update the Solr schema and restart Solr on rep-dev.

If you don't know the solr username/password, please give me a call
and I will reset it.

#3

Thanks Dave. I'll call you for the password.

#4

Assigned to: triggs » chadmills
Status: active » test

This is ready to test on rep-dev. A search for Borocz, Jozsef should pull up one object where Böröcz, József has been added to a title. I wound up using <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>. Thanks to Dave for his help with the Solr server.
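
For reference, a sketch of how that line would typically be placed, assuming the "text_general" definition quoted in comment #1 (the actual schema on rep-dev may differ). charFilter elements have to come before the tokenizer in each analyzer, and the mapping file is assumed to sit in the core's conf directory alongside schema.xml:

```xml
<analyzer type="index">
  <!-- char filters rewrite the raw text before tokenization,
       so only the folded form (e.g. "Borocz") reaches the index -->
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
  <!-- same mapping at query time, so "Böröcz" and "Borocz" reduce to the same term -->
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

Analysis only affects the indexed terms, so stored values should still display with their diacritics. Objects indexed before the schema change would need to be reindexed to become findable by the unaccented form.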

#5

I'll test it once it is on test, provided there is something to test with.

#6

Right now it is only on dev. There is one object there with the name.

#7

Assigned to: chadmills » triggs
Status: test » active

Please confirm there is something to test with on the test system.

#8

Status: active » test

Geoff Wood entered an item with diacritics (Peña) and confirmed he was able to find it both ways (Peña or Pena). You can use this to test. We can also add more terms during testing today.

#9

Assigned to: triggs » chadmills
Status: test » fixed

Thanks! Tested.

#10

I've added "Böröcz, József" to the title of a test object. Either "Böröcz" or "Borocz" should now find it on rep-test.

#11

Status: fixed » closed
