Sorting Chinese words by pinyin

redguardtoo · May 24, 2020, 13:19

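xuchunyang · May 24, 2020, 14:00

Does Unicode have a ready-made solution? Unicode should already provide data for both stroke counts and pinyin. Here is a blog post explaining Unicode + ICU4C; if the Emacs API is not enough, you could try ICU.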

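cireu · May 24, 2020, 16:11

Chinese characters in Unicode seem to be ordered as in the Kangxi Dictionary (by radical and stroke count), which is pretty much useless for ordinary people.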

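redguardtoo · May 25, 2020, 01:06

Agreed. Unicode has nothing to do with pinyin. That said, the code's performance could be improved further with Unicode: in my-chinese-compare I use a string as the key when searching the hash table, and using the Unicode code point as the key would actually be faster. But I'm too lazy to write it.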
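
To illustrate the idea of keying the lookup on the code point instead of a string, here is a minimal sketch in C. It is not the my-chinese-compare code: the table contents, the struct, and the function names are all hypothetical, and a sorted array with binary search stands in for the hash table mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table entry: a Unicode code point and its toneless pinyin. */
struct pinyin_entry {
    uint32_t code_point;
    const char *pinyin;
};

/* Illustrative data only; kept sorted by code point so bsearch works. */
static const struct pinyin_entry pinyin_table[] = {
    { 0x5B59, "sun"  },  /* 孙 */
    { 0x674E, "li"   },  /* 李 */
    { 0x8D75, "zhao" },  /* 赵 */
    { 0x94B1, "qian" },  /* 钱 */
};

/* bsearch callback: compares an integer key against a table entry. */
static int
entry_cmp(const void *key, const void *elem)
{
    assert(key != NULL && elem != NULL);
    uint32_t cp = *(const uint32_t *)key;
    const struct pinyin_entry *e = elem;
    return (cp > e->code_point) - (cp < e->code_point);
}

/* Returns the pinyin for a code point, or NULL if it is not in the table. */
const char *
pinyin_lookup(uint32_t cp)
{
    const struct pinyin_entry *e = bsearch(
        &cp, pinyin_table,
        sizeof(pinyin_table) / sizeof(pinyin_table[0]),
        sizeof(pinyin_table[0]), entry_cmp);
    return e ? e->pinyin : NULL;
}

/* Orders two code points by pinyin. Characters missing from the table
   sort after known ones; ties fall back to code point order. */
int
pinyin_compare(uint32_t a, uint32_t b)
{
    const char *pa = pinyin_lookup(a);
    const char *pb = pinyin_lookup(b);
    if (pa && pb) {
        int r = strcmp(pa, pb);
        if (r != 0)
            return r;
    } else if (pa) {
        return -1;
    } else if (pb) {
        return 1;
    }
    return (a > b) - (a < b);
}
```

With this table, `pinyin_compare(0x674E, 0x8D75)` is negative because "li" sorts before "zhao". The point of the integer key is that each lookup compares numbers rather than hashing or comparing strings.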

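xuchunyang · May 25, 2020, 04:05

The order of Unicode code points is certainly of no use here, but Unicode is a complex standard. I'm sure it has pinyin data for Chinese characters and collation rules for text (including CJK), so there should be ready-made solutions for sorting by pinyin, by stroke count, and so on. After all, sorting is a very common need: an operating system sorting application names, a file manager sorting files, a spreadsheet sorting by people's names.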

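xuchunyang · May 27, 2020, 16:00

Unicode's string sorting algorithm is the Unicode Collation Algorithm, and ICU is one implementation of it. Unicode is too complicated: I spent a few hours on the algorithm and still couldn't make sense of it, so I started by writing a test program that sorts by pinyin and by stroke count: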

$ ./sort zh@collation=pinyin 赵钱孙李
李钱孙赵
$ ./sort zh@collation=stroke 赵钱孙李
孙李赵钱

/* sort.c --- Unicode sorting */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unicode/ucol.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>

int
main(int argc, char* argv[])
{
    if (argc < 3 || strcmp(argv[1], "--help") == 0) {
        fprintf(stderr, "usage: %s COLLATION TEXT\n", argv[0]);
        exit(EXIT_FAILURE);
    }

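    /* Open a collator for the requested locale, e.g. "zh@collation=pinyin". */
    UErrorCode status = U_ZERO_ERROR;
    UCollator *coll = ucol_open(argv[1], &status);
    if (U_FAILURE(status)) {
        fprintf(stderr, "ucol_open: %s\n", u_errorName(status));
        exit(EXIT_FAILURE);
    }

    /* UTF-16 never needs more code units than UTF-8 needs bytes,
       so strlen + 1 (for the terminating NUL) is a safe capacity. */
    size_t n = strlen(argv[2]) + 1;
    UChar* s = malloc(n * sizeof(*s));
    if (s == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }
    int32_t sn;
    u_strFromUTF8(s, (int32_t)n, &sn, argv[2], -1, &status);
    if (U_FAILURE(status)) {
        fprintf(stderr, "u_strFromUTF8: %s\n", u_errorName(status));
        exit(EXIT_FAILURE);
    }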

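    /* Bubble sort, one UTF-16 code unit at a time. Characters outside
       the BMP (surrogate pairs) are not handled. */
    for (int i = sn-1; i >= 1; i--)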
        for (int j = 0; j < i; j++) {
            if (ucol_strcoll(coll, &s[j], 1, &s[j+1], 1) == UCOL_GREATER) {
                UChar tmp = s[j];
                s[j] = s[j+1];
                s[j+1] = tmp;
            }
        }
    ucol_close(coll);

    /* u_fputs appends a newline after the string. */
    UFILE *out = u_get_stdout();
    u_fputs(s, out);
    free(s);
    return 0;
}