Hello Jivers,
We're having an issue with the alphabetical ordering of Chinese characters. Each Chinese character has a corresponding romanization (a way of writing the word in the Latin alphabet) called pinyin. When listing things in Chinese (for example, the phone book in a mobile phone), names beginning with Chinese characters are ordered according to their pinyin in regular A-Z order. So, for example, the character 我 is pronounced "wo" and would come after the character 你, pronounced "ni".
However, when Clearspace lists things in Chinese (users, for example), this alphabetical ordering isn't being obeyed. I created two users with the names listed above, added them as friends, and then ordered my friends list by name. The user named "我我我" (wo) shows up before "你你你" (ni) in the friends listing.
I'm wondering if, since Chinese is expressed using multibyte characters (two English characters per Chinese character), Clearspace might be ordering Chinese characters according to their multibyte equivalent instead of their pinyin equivalent.
My team tells me that Java is capable of ordering Chinese according to its pinyin. What would it take for us to enable this functionality within Clearspace's listings, that is, the lists of content, people, etc. that can be alphabetically ordered?
Thanks as always!
-Rob-
Hey Rob,
In Clearspace we use Java's String.compareTo() method. Unfortunately, this method compares strings lexicographically, which means the string whose characters have the lowest Unicode values sorts first. As far as I can tell, there is no way to have Java convert Chinese characters to pinyin before comparing them. The list appears out of order because, as you mentioned, Java is sorting the names based on their multibyte Unicode values.
I can file this as a bug for you if you'd like. Unfortunately, sorting Chinese correctly would take a fairly involved code change, so I'm not sure how long it would take for this fix to be implemented.
Hi Sean,
Thanks for getting back to us. Java does have objects and methods to sort Chinese phonetically. The java.text.RuleBasedCollator class is equipped to handle such collation, for example:
java.util.Arrays.sort(test, (RuleBasedCollator) Collator.getInstance(Locale.CHINA))
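To make the difference concrete, here is a minimal sketch that sorts the same three single-character names both ways. The class and method names (PinyinSortSketch, sortByCodeUnit, sortByChineseCollator) are made up for illustration, and the sample characters 我 (wo), 你 (ni) and 爱 (ai) are written as Unicode escapes so the file compiles regardless of source encoding. It assumes a JDK that ships the Chinese collation data; as far as I know those rules follow pinyin order for common simplified characters only, so rarer characters may sort differently.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class PinyinSortSketch {

    /** Returns a copy sorted by String.compareTo(), i.e. by UTF-16 code unit value. */
    static String[] sortByCodeUnit(String[] names) {
        String[] copy = names.clone();
        Arrays.sort(copy);
        return copy;
    }

    /** Returns a copy sorted with the JDK's collation rules for Locale.CHINA. */
    static String[] sortByChineseCollator(String[] names) {
        String[] copy = names.clone();
        // Collator implements Comparator<Object>, so it can be passed directly.
        Arrays.sort(copy, Collator.getInstance(Locale.CHINA));
        return copy;
    }

    public static void main(String[] args) {
        // wo (U+6211), ni (U+4F60), ai (U+7231)
        String[] names = {"\u6211", "\u4F60", "\u7231"};
        System.out.println("code unit order: " + Arrays.toString(sortByCodeUnit(names)));
        System.out.println("collator order:  " + Arrays.toString(sortByChineseCollator(names)));
    }
}
```

The cast to RuleBasedCollator in the one-liner above is not needed just for sorting, since Arrays.sort only requires a Comparator.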
How substantial of a modification would be required for us to replace the method Clearspace is currently using to sort with a method based on the above mentioned object?
Thanks,
-Robert-
Hey Robert,
Which page were you looking at when using sorting? The admin console? Could you give me the URL you're visiting so that I can be sure to look at the correct portion of our code?
Hi Sean,
Rob has gone home for the weekend, so I don't know the exact pages, but we are concerned about anywhere customers and users have access in the community:
- people lists (a-z etc)
- blogs lists
- groups lists
We need the users to be able to sort these lists to find people. 99% of the users of this community will have a Chinese name, blogs will be in Chinese, and group names will also be in Chinese.
I would say the admin panel is lower priority, as we can use that in English until there is a fix. Our main concern is that the community will be unusable (and unlaunchable) if people, blogs and groups cannot even be sorted.
I hope this helps.
Hey Chad,
Sorry for the back and forth here, I should have asked this earlier. Could you tell me which version of Clearspace you're currently using or plan to go live on?
No problem. We are working with 2.5.3 at the moment.
Hey Chad,
I've traced through our source for the use cases you mentioned above: people lists, blog lists, and group lists. The ordering of these elements in a list is done in a different location for each feature.
For people lists, we use a Lucene SortComparator. Lucene is the indexing API we use to index all of our content. SortComparatorSource is an interface with a single method, newComparator, which returns an object that can be used to compare two elements. For an example of this interface being implemented, have a look at the class named SortComparator in the Lucene source. We use Lucene's default SortComparator, which relies on String.compareTo() and, as I mentioned above, doesn't handle Chinese characters correctly. To get the people list sorting correctly, you'll need to implement your own PeopleAction that extends the default PeopleAction. The main thing you'd want to change is the getSortOrder() method: return your own Lucene Sort object that uses a SortComparator based on RuleBasedCollator, instead of String.compareTo(), to compare.
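The core of such a comparator can be sketched with the JDK alone, by turning each term into a java.text.CollationKey once and then sorting the keys. This is only an illustration under stated assumptions: it deliberately uses no Lucene or Clearspace classes, and the names CollationKeySketch, toSortKey and sortTerms are hypothetical. Wiring the key into a real Lucene comparator and into getSortOrder() is not shown.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class CollationKeySketch {

    // Assumption: the JDK's built-in rules for Locale.CHINA are good enough here.
    private static final Collator ZH = Collator.getInstance(Locale.CHINA);

    /** Hypothetical hook: maps one indexed term to a Comparable sort key. */
    static CollationKey toSortKey(String term) {
        return ZH.getCollationKey(term);
    }

    /** Sorts terms by their collation keys and returns the terms in that order. */
    static List<String> sortTerms(List<String> terms) {
        List<CollationKey> keys = new ArrayList<CollationKey>();
        for (String term : terms) {
            keys.add(toSortKey(term));
        }
        // CollationKey implements Comparable<CollationKey>, so no Comparator is needed.
        Collections.sort(keys);
        List<String> sorted = new ArrayList<String>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }
}
```

Collation keys trade memory for speed: each term is run through the collation rules once, and every later comparison is a plain byte-wise compare, which matters when the same names are compared many times during a sort.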
For blog lists, we also use Lucene, but in a different way. We use the DbSearchQueryManager to ultimately query Lucene. The object you'll want to modify is the SearchQueryResultRelevenceComparator. By default, this class's compare() method uses String.compareTo() to compare objects. You'll want to modify it to use your RuleBasedCollator.
Finally, group lists don't actually use Lucene; they query the database directly. That means the sort order is ultimately determined by SQL. So in order for groups to be sorted by pinyin, your database needs to be able to sort by pinyin in its ORDER BY clause. I'm not sure whether any database supports this, but that is a different discussion. If you need to modify the query that returns these items, it can be found in SocialGroupDAOImpl on line 500.
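As a rough sketch of that kind of change, the comparator below delegates to a locale-aware Collator where the default would call String.compareTo(). Result is a hypothetical stand-in for whatever object the real comparator receives, and CollatedCompareSketch and byTitle are made-up names; none of them are Clearspace classes.

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

public class CollatedCompareSketch {

    /** Hypothetical stand-in for a search result; not a Clearspace class. */
    static final class Result {
        final String title;

        Result(String title) {
            this.title = title;
        }
    }

    /** Compares results by title with a Collator instead of String.compareTo(). */
    static Comparator<Result> byTitle(Locale locale) {
        final Collator collator = Collator.getInstance(locale);
        return new Comparator<Result>() {
            @Override
            public int compare(Result a, Result b) {
                return collator.compare(a.title, b.title);
            }
        };
    }
}
```

Passing the locale in, rather than hard-coding Locale.CHINA, leaves room for a multi-locale instance to pick the collation per user.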
Sean,
Thank you for your very detailed research and response.
But I find it surprising that you are suggesting that we do this work. We expect Jive Software to provide a full localization, which means that all functions of the English version already work in the local language, in this case Chinese.
Hey Chad,
I've actually filed this as a bug on our end. I just thought you'd like to know where the customizations would take place, in case you didn't want to wait for them to be implemented in a bug-fix release.
Hey Chad,
Just wanted to add a bit more information on the bug I filed. The ID for this issue is CS-9813, and I've requested that it be fixed for CS 2.5.5, which is scheduled to be released Dec. 15th.
If you have any other questions let me know.
Thanks Sean.
Hey Chad,
One of our core engineers believes he has a fix for some of these issues; however, he'd rather not check the changes in until he can validate them. Unfortunately, no one here speaks or reads Chinese, so we're unable to validate. Would it be possible for you to provide us with a list of names in Chinese sorted by Unicode value (as our instance does now) as well as sorted by pinyin (the way it should be)? This would help us validate the changes we've made.
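A check like this can be automated once a hand-sorted reference list exists. The sketch below is only an illustration: SortValidationSketch and isOrderedBy are hypothetical names, and the three names in main are placeholders (ai, ni, wo, written as Unicode escapes), not real data from any list.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class SortValidationSketch {

    /** True if every adjacent pair in the reference list is in order under the comparator. */
    static boolean isOrderedBy(List<String> reference, Comparator<? super String> order) {
        for (int i = 1; i < reference.size(); i++) {
            if (order.compare(reference.get(i - 1), reference.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Placeholder reference list in pinyin order: ai, ni, wo.
        List<String> expected = Arrays.asList("\u7231", "\u4F60", "\u6211");
        System.out.println("matches String.compareTo(): "
                + isOrderedBy(expected, Comparator.<String>naturalOrder()));
        System.out.println("matches Locale.CHINA collator: "
                + isOrderedBy(expected, Collator.getInstance(Locale.CHINA)));
    }
}
```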
The changes that we're trying to test only affect the sorting of content based on Lucene. If the content is to be sorted by the database (as mentioned above), the only way to handle that currently is to set the DB to the proper locale and collation. The problem with this approach is that on a multi-locale instance, all users will be affected. We realize this is a less than ideal solution, but the scope of the issue is so large that we cannot make these changes in a point release.
Hi Sean -
Thanks so much for keeping on this with us. I've attached an Excel spreadsheet listing six Chinese usernames and their corresponding romanization, in their original ordering from Clearspace. Two more columns show the correct order (alphabetical by pronunciation), with the corresponding romanization.
I hope this helps your engineers! Thanks, as always!
-Rob-
Hey Rob,
Using the list you attached, our engineer was able to validate his fix. He's checked those fixes into 2.5.5, which will be released Dec. 15th. However, these fixes won't cover all cases where items are currently sorted lexicographically. Specifically, the cases that require the DB to do the sorting are ones where a fix was too large in scope for a point release. If this site is set to be a single-locale site, setting the collation on the database might provide the correct sorting order. However, this order would be incorrect for any characters that are not Chinese.
© Copyright 2000–2009 Jive Software. All rights reserved.