Searches not retrieving terms with diacritics & words with diacritics filing oddly

There are actually three problems here:

1. If you look in the SOAR collection tree under School of Arts and Sciences, then under Sociology, you'll see that the first entry, which should read Böröcz, József, does not display correctly

2. The name does appear correctly in the SOAR Author browse, although I'm not sure it's filing correctly (not sure if that vowel is supposed to file differently when it has a diacritic)

3. The name cannot be retrieved by searching for borocz or jozsef

Marking as critical as it is a public service issue.



These are three separate issues.

1) I corrected the way the name is displayed in the collection tree. To fix this I had to edit the collection label in dlr/EDIT. The application that originally created the collection (WMS, I figure) might have an issue with handling these characters in collection names. I have created an issue in WMS regarding this.

2) I take it that "filing" refers to the alphabetical order. Since the second letter in the author's name is "ö", it is not treated as an "o" when ordering. It is a character outside basic ASCII with a higher code value, so it appears after "z" in the translation tables used for ordering. It is the nature of the beast, I believe. Let me know if I am off track with this.
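To illustrate the ordering behavior described in 2), here is a minimal Python sketch. It is not the code SOAR uses for its browse lists; the two extra names and the fold() helper are made up for the example. It compares a plain code-point sort, where "ö" lands after "z", with a sort on a key that has the diacritics stripped.

```python
import unicodedata

def fold(text):
    """Hypothetical helper: decompose, then drop combining marks (ö -> o)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Only the first name comes from this ticket; the other two are invented.
raw_names = ["Bzdak, Anna", "Böröcz, József", "Borden, Lee"]

# Normalize to composed form so "ö" is the single code point U+00F6.
names = [unicodedata.normalize("NFC", n) for n in raw_names]

# Plain sort compares code points: "o" (U+006F) < "z" (U+007A) < "ö" (U+00F6).
by_code_point = sorted(names)

# Sorting on the folded, case-insensitive key files "ö" together with "o".
by_folded = sorted(names, key=lambda n: fold(n).casefold())

print(by_code_point)  # ['Borden, Lee', 'Bzdak, Anna', 'Böröcz, József']
print(by_folded)      # ['Borden, Lee', 'Böröcz, József', 'Bzdak, Anna']
```

Whether the folded order is the right filing order is a cataloging decision; the sketch only shows why the current order comes out the way it does.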

3) That is an issue with indexing and how values with diacritics should be handled. From your report, you expect the name "Böröcz, József" to be indexed twice: once as "Böröcz, József" and once as "Borocz, Jozsef". Right now, if you search separately for "József" or "Böröcz", you will get the expected results. To make the unaccented searches work as well, any diacritic found during indexing needs to be translated to the corresponding non-diacritic character. I'll add this as a feature request for 8.1 in the Solr indexing project.
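The double indexing described in 3) can be sketched in plain Python as follows. This is only an illustration of the idea, not the Solr configuration or the project's indexing code: the record id "rec1" and the helpers fold(), build_index() and search() are hypothetical. In Solr itself this kind of folding is usually handled by an analysis filter such as ASCIIFoldingFilterFactory, but the exact setup for this project is not assumed here.

```python
import re
import unicodedata
from collections import defaultdict

def fold(text):
    """Hypothetical helper: strip diacritics and lowercase (József -> jozsef)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def build_index(records):
    """Index every word of each author name under its original and folded form."""
    index = defaultdict(set)
    for record_id, author in records.items():
        # Compose first so accented letters stay inside a single \w+ token.
        composed = unicodedata.normalize("NFC", author)
        for token in re.findall(r"\w+", composed):
            index[token.casefold()].add(record_id)  # e.g. "böröcz"
            index[fold(token)].add(record_id)       # e.g. "borocz"
    return index

def search(index, query):
    """Look the query up both as typed and with diacritics removed."""
    as_typed = unicodedata.normalize("NFC", query).casefold()
    return index.get(as_typed, set()) | index.get(fold(query), set())

# "rec1" is a made-up identifier; the author name is the one from this ticket.
index = build_index({"rec1": "Böröcz, József"})

print(search(index, "borocz"))  # {'rec1'}
print(search(index, "József"))  # {'rec1'}
print(search(index, "smith"))   # set()
```

Keeping the original form in the index alongside the folded one means the searches that work today ("József", "Böröcz") keep working, while "jozsef" and "borocz" start matching too.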

Let me know if I have misread any of these.



Just to expand on values with diacritics: would it be an option to store an alternate author name in the record as well? An author with the name "Böröcz, József" could then also have an alternate author name defined by the cataloger, in this case "Borocz, Jozsef".

I think this might be a worthwhile improvement to the cataloging, provided it doesn't violate any rules, of course. Once the alternate name is added, it could improve indexing by other services as well, since the value would then be part of the resource.


