A simple, usable way to select Chinese words?

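xuchunyang · May 21, 2017, 12:03

I think double-click word selection in Google Chrome and the three-finger lookup (Dictionary) on macOS both work quite well (I mean for Chinese content, obviously). So although Chinese word segmentation is said to be hard, that doesn't stop us from building simple applications on it, such as word selection. Word selection is a guess at what the user is about to select, so wrong guesses are entirely expected. Either way, having it beats not having it. I just made a first attempt at a word-selection command: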

;; For testing, put point inside the sentence below and run M-x mark-chinese-word:
;; 如何实现中文取词

(require 'cl-lib)

(defvar mark-chinese-word--words '("如何" "实现" "中文" "取词")
  "Tiny sample dictionary of known words.")

(defun mark-chinese-word--substrings (string nth)
  "Return all substrings of STRING that contain the character at index NTH.
Each element has the form (SUBSTRING BEFORE . AFTER), where BEFORE is the
number of characters taken before index NTH and AFTER the number taken
from index NTH onward."
  (let ((before 0)
        (before-bound (1+ nth))
        after
        (after-bound (1+ (- (length string) nth)))
        result)
    (while (< before before-bound)
      (setq after 1)
      (while (< after after-bound)
        (push (cons (substring string (- nth before) (+ nth after))
                    (cons before after))
              result)
        (cl-incf after))
      (cl-incf before))
    result))

;; (mark-chinese-word--substrings "中文取词" 2)
;; => (("中文取词" 2 . 2) ("中文取" 2 . 1) ("文取词" 1 . 2) ("文取" 1 . 1) ("取词" 0 . 2) ("取" 0 . 1))

(defun mark-chinese-word ()
  "Mark a Chinese word at point."
  (interactive)
  (let* ((bounds (bounds-of-thing-at-point 'word))
         (str (and bounds
                   (buffer-substring-no-properties (car bounds) (cdr bounds))))
         (nth (and bounds (- (point) (car bounds))))
         ;; First candidate substring that is a known word, or nil.
         (word (and str
                    (cl-loop for s in (mark-chinese-word--substrings str nth)
                             when (member (car s) mark-chinese-word--words)
                             return s))))
    (if word
        (progn (set-mark (- (point) (cadr word)))
               (goto-char (+ (point) (cddr word))))
      ;; No dictionary hit: fall back to selecting a single character.
      (set-mark (point))
      (forward-char 1))))
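
The command above takes the first candidate substring that appears in the dictionary. As a rough sketch of one alternative, the helper below prefers the longest dictionary word covering the position, and works on a plain string so it can be tried without a buffer. The name `my-chinese-word-bounds-in-string` is made up for this illustration and is not part of the code above.

```elisp
(require 'cl-lib)

(defun my-chinese-word-bounds-in-string (string index words)
  "Return (START . END) of the longest word in WORDS covering INDEX in STRING.
INDEX is a zero-based character offset.  Return nil when no word matches.
Hypothetical helper, written only for illustration."
  (let ((len (length string))
        best)
    ;; Try every substring [START, END) that contains the character at INDEX.
    (cl-loop for start from 0 to index do
             (cl-loop for end from (1+ index) to len do
                      (when (and (member (substring string start end) words)
                                 (or (null best)
                                     (> (- end start)
                                        (- (cdr best) (car best)))))
                        (setq best (cons start end)))))
    best))

;; (my-chinese-word-bounds-in-string "如何实现中文取词" 5
;;                                   '("如何" "实现" "中文" "取词"))
;; => (4 . 6)
```

Like the original, this tries every substring around the position, which is fine for the short runs of characters that `thing-at-point` usually returns.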

What does everyone think?

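frapples · May 22, 2017, 03:40

Funny, I was thinking about this just yesterday. I remember finding an extension called chinese-at-point on your GitHub, right? But it calls jieba through the shell for segmentation, and it is very sluggish on my laptop.

The main difficulty is probably getting the segmentation algorithm right.

Then I wondered whether the pymacs extension could be used to call the API of Python's jieba directly from elisp, to improve performance.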

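xuchunyang · May 22, 2017, 04:53

Yes. After I finished the chinese-word-at-point package I hardly ever used it myself, because it really is slow, and I found wrapping Python code rather difficult.

It should be enough to write a thin wrapper around jieba that stays running and exchanges data over STDIN and STDOUT, instead of starting Python and loading the dictionary for every segmentation request, which is probably where most of the time goes. Emacs then just talks to it as a subprocess. That way Emacs doesn't need to know anything about Python.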
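
A minimal sketch of the Emacs side of this idea, under stated assumptions: the wrapper script `segment_server.py` is hypothetical, and it is assumed to read one sentence per line on stdin and print the words separated by spaces on one line of stdout, flushing after every line. All `my-segmenter-` names are made up for illustration; only `make-process`, `process-send-string` and `accept-process-output` are real Emacs functions (Emacs 25 or later).

```elisp
(defvar my-segmenter-command '("python3" "segment_server.py")
  "Hypothetical command for a long-running segmentation wrapper.
It is assumed to answer each input line with one line of
space-separated words.")

(defvar my-segmenter--process nil
  "The running wrapper process, or nil.")

(defun my-segmenter--ensure ()
  "Return the live wrapper process, starting it only if necessary."
  (unless (process-live-p my-segmenter--process)
    (setq my-segmenter--process
          (make-process :name "my-segmenter"
                        :buffer (generate-new-buffer " *my-segmenter*")
                        :command my-segmenter-command
                        :connection-type 'pipe
                        :coding 'utf-8-unix
                        :noquery t
                        :sentinel #'ignore)))
  my-segmenter--process)

(defun my-segmenter-segment (sentence)
  "Send SENTENCE to the wrapper and return the list of words it prints."
  (let ((proc (my-segmenter--ensure)))
    (with-current-buffer (process-buffer proc)
      (erase-buffer)
      (process-send-string proc (concat sentence "\n"))
      ;; Block until one complete line of output has arrived
      ;; (or the process has died).
      (while (and (process-live-p proc)
                  (not (progn (goto-char (point-min))
                              (search-forward "\n" nil t))))
        (accept-process-output proc 1))
      (goto-char (point-min))
      (split-string (buffer-substring (point-min) (line-end-position))
                    " " t))))
```

Python and the dictionary are loaded once, when the process starts; every later call only pays for one write and one read on the pipe. A real version would probably also want a timeout and some error handling.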

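frapples · May 22, 2017, 05:01

Right, pymacs seems to interact with Python in much the same way: it starts a Python process and sends code to it.

I also found another package on GitHub:

This one is quite interesting. It uses the cjieba library, mixes C and elisp, and has to be compiled before use. I haven't tried how it performs yet.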

frapples · May 22, 2017, 05:08

Just tried it, very good! It's fast, nice. Installation is just a bit of a hassle.

xuchunyang · May 22, 2017, 05:10

This one is nice. It follows the same idea I had in mind: build a wrapper, then call it through an Emacs subprocess.

tumashu · May 22, 2017, 11:08

pyim already ships a ready-made tool for this, so there's no need to tinker at all.

LdBeth · May 22, 2017, 11:20

Elisp is slow.

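xuchunyang · May 22, 2017, 11:26

If the segmentation library provides a C API, then besides a subprocess you could also try the Dynamic Module support in Emacs 25, which should also make a difference to performance.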

frapples · May 22, 2017, 15:50

Haha, I hadn't noticed that. pyim's algorithm is fairly simple, though.

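frapples · May 22, 2017, 15:53

But for an extension written in C, I feel this shouldn't be the bottleneck. Just a feeling, I can't really say why...

What I'd really like is for Emacs to be able to call Python very well, because Python has many powerful libraries, which would make up for the shortage of libraries that elisp has as a niche language.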

hitswint · May 23, 2017, 01:21

Indeed, pyim-cwords-at-point works fairly well; it just has to load a fairly large dictionary when it is first enabled.

tumashu · May 23, 2017, 03:54

pyim's default dictionary isn't that large either: a text file of 100,000 lines.
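
For a sense of what loading a word list of that size involves, here is a rough sketch that reads a plain text file into a hash table, so that each lookup is a single `gethash` instead of a `member` scan over a list. It assumes a hypothetical file with one word per line, which is not pyim's actual dictionary format, and `my-load-word-table` is a name invented for this example.

```elisp
(defun my-load-word-table (file)
  "Read FILE into a hash table whose keys are words, and return the table.
FILE is assumed to be UTF-8 text with one word per line; this is a
hypothetical format chosen for illustration, not pyim's."
  (let ((table (make-hash-table :test #'equal)))
    (with-temp-buffer
      (let ((coding-system-for-read 'utf-8))
        (insert-file-contents file))
      ;; Split into lines, trim surrounding whitespace, drop empty lines.
      (dolist (word (split-string (buffer-string) "\n" t "[ \t\r]+"))
        (puthash word t table)))
    table))

;; Usage, with a hypothetical file name:
;; (defvar my-word-table (my-load-word-table "~/words.txt"))
;; (gethash "中文" my-word-table)
```

The table is built once and then kept in a variable, so the loading cost is only paid the first time.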