A simple Chinese `word object` built on jieba segmentation

jiahut · May 24, 2024 09:26

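Background

I previously used the functions provided by pyim-cstring-utils for Chinese word segmentation, but its pinyin-based segmentation algorithm is sometimes inaccurate;
Inspired by vim's text objects, I want each Chinese word to be one word object;
Cursor movement is a high-frequency, everyday operation, so performance matters a lot (especially on Windows);

After some searching, I made a few improvements on top of the community project [Emacs Chinese word segmentation tool based on jieba](emacs-chinese-word-segmentation) and put together a first implementation of a jieba-based word object.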

Main features:

Chinese word object
The w/e keys automatically skip blank lines

The implementation is in cns-evil.el
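
To make the second feature concrete, here is a minimal sketch of what "skipping blank lines" after a word motion can look like. It is not the code in cns-evil.el. The `my/` function names are hypothetical, and the built-in `forward-word` only stands in for a segmentation-aware motion.

```elisp
;; Hypothetical helpers for illustration; not taken from cns-evil.el.

(defun my/skip-blank-lines-forward ()
  "If only whitespace remains on the current line, move to later non-blank text."
  ;; Advance line by line while the rest of the line (from point) is blank.
  (while (and (not (eobp))
              (looking-at-p "[ \t]*$"))
    (forward-line 1))
  ;; Skip indentation so point lands on the first non-blank character.
  (skip-chars-forward " \t"))

(defun my/forward-word-skip-blank-lines (&optional count)
  "Move forward COUNT words, then skip any blank lines that follow."
  (interactive "p")
  ;; `forward-word' is a stand-in for a word motion that knows Chinese words.
  (forward-word (or count 1))
  (my/skip-blank-lines-forward))
```

With this, a motion that ends at the end of a line continues to the first non-blank text on a later line instead of stopping on an empty line.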

Configuration and usage

(use-package emacs
  :after evil
  :config
  (add-to-list 'load-path "c:/emacs-chinese-word-segmentation")
  (setq cns-prog "c:/emacs-chinese-word-segmentation/cnws.exe")
  (setq cns-dict-directory "c:/emacs-chinese-word-segmentation/cppjieba/dict")

  (setq cns-recent-segmentation-limit 20) ; default is 10
  (setq cns-debug nil) ; disable debug output, default is t
  (require 'cns nil t)
  (when (featurep 'cns)
    (add-hook 'find-file-hook 'cns-auto-enable))
  (require 'cns-evil))
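
Note that `find-file-hook` only runs for buffers that visit a file. If a buffer complains that the segmentation process is not running and asks you to enable `cns-mode` first (the message quoted further down this thread), one option is to turn the mode on explicitly, either with M-x cns-mode or from a hook. A minimal sketch, assuming `cns-mode` is the minor-mode command named in that message; `text-mode-hook` is only an example choice:

```elisp
;; Sketch only; assumes the configuration above has already been evaluated.
(with-eval-after-load 'cns
  ;; Picking text-mode-hook is an assumption for illustration;
  ;; use whichever major-mode hooks you actually edit Chinese text in.
  (add-hook 'text-mode-hook #'cns-mode))
```

If the message still appears after the mode is on, check that `cns-prog` points at the executable you actually built.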

LuciusChen · May 24, 2024 11:48

If you are on a Mac, I'd suggest using this for Chinese word segmentation instead: GitHub - roife/emt: Emacs macOS Tokenizer, tokenizing CJK words with macOS's built-in NLP tokenizer.

pinacle2000 · May 25, 2024 01:11

I'm on Windows and have compiled cnws. When I use it, I get this message:

if: Chinese word segmentation process is not running, enable ‘cns-mode’ first

driftcrow · June 4, 2024 01:49

I was just looking for something like this, but compiling it on Windows is a real hassle. Could you provide a prebuilt package?