Collection names with special characters; diacritics

Project:RUcore API's
Component:Search API
Category:bug report

On production, a scholarly deposit collection was created for the author "Böröcz, József". The way the collection title was entered in the dlr collection tables made the author's name unreadable when displayed. The only way to fix it was to correct the entry directly in the dlr collection tables. The author's name above could be used to find out where in the process the encoding of author names gets corrupted.

Since WMS created the collection, I am filing the issue against this application, but the problem may be somewhere else.



Assigned to:yuyang» triggs

Jeffery, I am re-assigning this to you. After you have fixed it in dlr/EDIT, let me know and I can test WMS (though I think WMS does handle this). -YY


Assigned to:triggs» yuyang

Yang, I only just saw this. (I would have thought I'd have been sent an email about it.) I suspect the dlrcollections database is set to a character set other than UTF-8. It's interesting that the dlr display shows the odd characters since that page declares UTF-8, but if you click the title to edit, it displays properly in a page that doesn't declare a specific character set. I think we might fix this by changing the character set of the database (though Dave would have to do this and it would not be part of a release) or else I could take away the UTF-8 declaration, but I prefer not to do that sort of workaround. I'm reassigning this to you only so that you will be sure to see it.
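The symptom described above (garbled on the page that declares UTF-8, readable on the page that declares nothing) is what one would expect if the stored bytes are Latin-1. The following is only a sketch of that hypothesis in Python, not code from dlr/EDIT; it assumes the browser falls back to a Latin-1-like encoding when no character set is declared.

```python
# Sketch only: simulate how the same stored bytes render under two page encodings.
name = "Böröcz, József"

# Assumption: the database hands back the title as Latin-1 bytes.
stored_bytes = name.encode("latin-1")

# Page that declares UTF-8: the bytes are not valid UTF-8, so the accented
# letters come out as replacement characters.
on_utf8_page = stored_bytes.decode("utf-8", errors="replace")

# Page with no declared character set (assumed Latin-1 fallback): reads correctly.
on_undeclared_page = stored_bytes.decode("latin-1")

print(on_utf8_page)
print(on_undeclared_page)
```

If the stored bytes were UTF-8 instead, the two results would be reversed, so checking which page looks right is a quick way to guess the stored encoding.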


Hmmm. On rep-dev at least the dlrcollections character set is UTF-8:
mysql> show variables like "character_set_database";
| Variable_name | Value |
| character_set_database | utf8 |
1 row in set (0.00 sec)
On rep-test, however, the character set is latin1:
mysql> show variables like "character_set_database";
| Variable_name | Value |
| character_set_database | latin1 |
1 row in set (0.00 sec)
I don't know what it is on mss3. It's a bit odd that these should be different.


Assigned to:yuyang» dhoover

Yang, I tried editing collections on rep-dev (UTF-8 db) and rep-test (latin1 db), adding the string "Böröcz, József" to the titles of my test collections there. On both these machines, "Böröcz, József" displayed as expected. I added it to my collection on production (temporarily) and reproduced the bug. I don't know what the character_set_database variable on production is set to, but I suspect it is something other than UTF-8 or latin1. Something Swedish perhaps, the mysql default? I think maybe we should reassign this to Dave so that he can have a look and change the dlrcollections character_set_database to UTF-8 or at least let us know what it is.


Assigned to:dhoover» chadmills
Status:active» test

According to Jeffery (#4), dlrEDIT seems to function as expected. Please test for 8.1. -YY


I am not sure how to test this. I created an author collection using the NetID, and WMS created the collection without the accents. When I look up the user to get the NetID, his record also comes up without the accents.

After creating the collection without accents in WMS, I was able to copy the name with accents from the original report and paste it in WMS and the record saved fine. So I am not sure what the problem is.


Assigned to:chadmills» ananthan
Status:test» active

So I think the test should be to create a collection on test with the name/label: "Böröcz, József"

Once that collection is created in WMS we need to see that the collection name/label displays correctly in the "Change Collection" pop-up and can be searched using the "Change Collection" search function.

Once that is cleared, we can create an object under that collection and ingest it. Then we can see how the collection appears in the search tree and check that the item is searchable by the collection name.

I created a sub-collection under RBDIL Analytics on rep-test. It is marked (M). It displays fine in the "Change Collection" list, but when I search for the name, the search widget reports that every collection in the list matches. I don't know how to ingest the collection, so I am stopped there and need some help ingesting it. There is definitely a problem in the "Change Collection" window when searching for a collection name that contains diacritics.


Status:active» test

I tested with the same person: Borocz, Jozsef M.
When I looked up his NetID in Sakai, his name was displayed as: Borocz, Jozsef M.

Also, WMS creates the collection name as: Borocz, Jozsef M.

We'll have to ask Dave to check the LDAP record to see how his name appears.

Searching for a collection name with diacritics is not working as expected. I will file a new bug for 8-x.


The name appears without diacritics in LDAP, but the published name includes the diacritics, and J.B. signs his emails with the diacritics. Per our SOAR procedures, we use the form of the name that the author uses in his or her publications.
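To make the relationship between the LDAP form and the published form concrete, here is a minimal Python sketch that folds diacritics for comparison only. The helper name strip_diacritics is hypothetical and is not part of WMS or LDAP; the published form would still be the one stored and displayed.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Hypothetical helper: drop combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

published = "Böröcz, József"   # form used in the author's publications
ldap_form = "Borocz, Jozsef"   # form returned by the NetID lookup

# The two forms compare equal once diacritics are folded away.
print(strip_diacritics(published) == ldap_form)
```

Letters with no canonical decomposition, such as "ø", pass through this helper unchanged, so it is not a complete folding scheme.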


In this case, we have to change the collection name, and perhaps the author's name, in WMS before ingesting. I am able to change the collection name to one with diacritics, and WMS saves it without any problems.

Searching for this collection still does not work properly; there is a separate issue for this.


Assigned to:ananthan» chadmills


I ingested the collection. It was missing a date, but I was able to ingest it.


Project:RUcore Workflow Management System (WMS)» RUcore API's
Version:8.1» 8.1
Component:Collection Management» Search API

After ingesting the resource for the collection, the name didn't appear correctly in the search interface. It looks like the search API and the collection hierarchy class were running the collection label value through utf8_encode(), which expects an ISO-8859-1 value.

The dlr.collections table label field is encoded as utf8_general_ci while all of the other fields are latin1_swedish_ci which makes no sense. I think this was a change that happened recently.

Removing the utf8_encode() call for the label/name makes the collection label/name appear correctly on test now. Moving this to a different project, namely the search API. Since the rest of the table is latin1_swedish_ci, I am still using utf8_encode() for the other values from that table.
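The effect of running an already-UTF-8 label through utf8_encode() can be sketched in Python. The function php_utf8_encode below is a hypothetical stand-in that mimics PHP's utf8_encode() (treat every input byte as ISO-8859-1 and re-encode as UTF-8); it is not the search API code.

```python
def php_utf8_encode(data: bytes) -> bytes:
    """Stand-in for PHP's utf8_encode(): interpret bytes as ISO-8859-1, emit UTF-8."""
    return data.decode("latin-1").encode("utf-8")

label = "Böröcz, József"

# Label column already stored as UTF-8: encoding again double-encodes it.
double_encoded = php_utf8_encode(label.encode("utf-8")).decode("utf-8")

# Latin-1 columns: the same call is the correct conversion.
converted = php_utf8_encode(label.encode("latin-1")).decode("utf-8")

print(double_encoded)  # BÃ¶rÃ¶cz, JÃ³zsef
print(converted)       # Böröcz, József
```

This matches the fix described above: skip the conversion for the UTF-8 label field and keep it for the fields that are still Latin-1.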

Test by looking at the advanced search on the RUcore test site; main/root portal. Navigate to the bottom of a fully extended tree to see the collection.


Entered this collection name in WMS: noöne élève educación entrée pâté maître ä ø

I was able to find the collection by keyword search, trying "noöne" and "élève"; entering the entire collection name does not find it.

Ingested it and clicking on the link displays the collection name correctly.

Created and ingested a resource with diacritics.

Diacritics displayed correctly.


I had to add the new collection to the 'dlr' tree and reindex. A search using "noöne" yielded a result.

Submitting a search using "élève" also found the resource.


Status:test» fixed

I'll mark this as fixed for now


Status:fixed» closed
