Modifying Helm to support Chinese pinyin search

zbelial · May 20, 2023, 02:40

EDIT:

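Modifying helm-find-files this way causes problems.

How to match Helm candidates by pinyin comes up fairly often. I recently moved some of my workflow from ivy to helm, so I read through the helm code to see how best to add support. Without further ado, here is the code.

The changes to helm are as follows: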

(defvar helm-pattern-transformer-alist nil
  "An alist of regex building functions for each source.

Each key is the source name.

Each value is a function that should take a string and return a
valid regex or a regex sequence.")

;; For now this reimplements the function; advice would also work.
(defun helm-process-pattern-transformer (pattern source)
  "Execute pattern-transformer attribute function(s) on PATTERN in SOURCE.
A function registered under the source's name in
`helm-pattern-transformer-alist' takes precedence."
  (let* ((name (assoc-default 'name source))
         (transformer (assoc-default name helm-pattern-transformer-alist)))
    (helm-aif (or transformer
                  (assoc-default 'pattern-transformer source))
        (helm-apply-functions-from-source source it pattern)
      pattern)))
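
The comment above mentions that advice could be used instead of redefining the function. A minimal sketch of that variant follows. It assumes each alist value is a single function (so it calls it with plain `funcall` rather than `helm-apply-functions-from-source`), and the name `my-helm-pattern-transformer-advice` is made up for illustration. The same caveat about helm-find-files presumably applies.

```elisp
;; Sketch only: the same per-source lookup, installed as :around advice.
(defvar helm-pattern-transformer-alist nil
  "Alist mapping a Helm source name to a pattern-transforming function.")

(defun my-helm-pattern-transformer-advice (orig-fn pattern source)
  "Transform PATTERN with the function registered for SOURCE's name.
Fall back to ORIG-FN when nothing is registered for that name.
This function name is hypothetical, not part of Helm."
  (let* ((name (assoc-default 'name source))
         (fn (assoc-default name helm-pattern-transformer-alist)))
    (if (functionp fn)
        (funcall fn pattern)
      (funcall orig-fn pattern source))))

;; Install once Helm's core is loaded; undo with `advice-remove'.
(with-eval-after-load 'helm-core
  (advice-add 'helm-process-pattern-transformer
              :around #'my-helm-pattern-transformer-advice))
```

Because the original function stays in place, removing the advice restores stock Helm behavior without restarting Emacs.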

Usage:

(require 'pyim)

;; Convert the input into a pinyin regexp.
(defun helm-pyim-to-utf8 (str)
  (cond ((equal 0 (length str))
         str)
        (t
         (pyim-cregexp-build str))))

;; With the two lines below, the buffer list and the recentf list support pinyin search.
(add-to-list 'helm-pattern-transformer-alist '("Buffers" . helm-pyim-to-utf8))
(add-to-list 'helm-pattern-transformer-alist '("Recentf" . helm-pyim-to-utf8))

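It is simple, flexible, and a small change. Whether it has side effects remains to be seen with use.

Of course, if something goes wrong, you can set helm-pattern-transformer-alist to empty and helm will be unaffected.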

twlz0ne · May 20, 2023, 09:57

I don't know whether helm has changed structurally since then.

You can take a look at the changes I made back then and the problems I ran into:

zbelial · May 20, 2023, 12:22

Thanks!

I had seen that thread before but didn't read it closely; I'll go back and study it. It seems I underestimated how complex helm is.

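zbelial · May 22, 2023, 10:57

I asked the Helm maintainer, and using pattern-transformer is the right way to support pinyin. There is no need to modify helm's code as in my first post, though; the following works instead: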

(cl-defmethod helm-setup-user-source ((source helm-source-buffers))
  (setf (slot-value source 'pattern-transformer) 'helm-pyim-to-utf8))

(cl-defmethod helm-setup-user-source ((source helm-moccur-class))
  (setf (slot-value source 'pattern-transformer) 'helm-pyim-to-utf8))

(with-eval-after-load 'helm-bookmark
  (helm-set-attr 'pattern-transformer #'helm-pyim-to-utf8 helm-source-bookmarks))

(with-eval-after-load 'helm-for-files
  (helm-set-attr 'pattern-transformer #'helm-pyim-to-utf8 helm-source-recentf))

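helm-find-files has more complex internal logic and cannot be changed directly this way; the maintainer still needs to look into it. The discussion is here.

By the way, the helm maintainer is really nice: patient and friendly.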

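twlz0ne · May 23, 2023, 01:05

A few things to watch out for with pinyin search:

1. Pinyin regexps easily exceed Emacs's regexp size limit:

(invalid-regexp "Regular expression too big")

2. Supporting fuzzy matching gives a better experience (but it multiplies the length of the regexp, which triggers problem 1):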
```
(cl-assert (string-match (xxx-build-pinyin-regexp "cs") "测个试"))
```

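Some ways to deal with oversized regexps:

1. Modify the Emacs source to enlarge the buffer size^[1]

/* This is not an arbitrary limit: the arguments which represent offsets
into the pattern are two bytes long.  So if 2^15 bytes turns out to
be too small, many things would have to change.  */

# define MAX_BUF_SIZE (1 << 15)

2. Limit the size of the regexp^[2]

a) @tumashu's brute-force approach: after generating the expression for each Chinese character, try the regexp once, until the "too big" error occurs.

b) The manual accounting I use: work out how many bytes each additional (matching/excluding) group costs, how many bytes the first letter or character costs, how many bytes each subsequent character costs, and so on. When necessary, skip characters that have many duplicates or homophones.

3. Convert the Chinese characters in helm candidates to pinyin, so that no pinyin regexp has to be generated

I did consider this at first, but since helm is already slow, I dropped the idea.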
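
The brute-force approach in a) above could look roughly like the sketch below. It is not pyim's actual code: both function names are made up for illustration, and each fragment is assumed to be a complete regexp for one input character.

```elisp
(defun my-regexp-valid-p (regexp)
  "Return t if REGEXP compiles without an `invalid-regexp' error."
  (condition-case nil
      (progn (string-match-p regexp "") t)
    (invalid-regexp nil)))

(defun my-concat-regexps-until-too-big (fragments)
  "Concatenate regexp FRAGMENTS in order, stopping before the first failure.
Each fragment is assumed to be a complete regexp for one character.
Return the longest prefix of the concatenation that still compiles."
  (let ((result ""))
    (catch 'done
      (dolist (fragment fragments)
        (let ((candidate (concat result fragment)))
          (if (my-regexp-valid-p candidate)
              (setq result candidate)
            ;; Adding this fragment breaks the regexp: keep what we have.
            (throw 'done nil)))))
    result))
```

Every step recompiles the whole regexp built so far, so the cost grows quadratically with the input length. That is usually acceptable for short search patterns.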

^[1] regex-emacs.c\src - emacs.git - Emacs source repository

^[2] How to keep a regexp from blowing up?

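zbelial · May 23, 2023, 01:25

Thanks.

I ran into the size-limit error too. I generate the regexp with pyim, and later found that the pyim function I use takes a parameter specifying which frequency level of Chinese characters to use (so you can keep uncommon characters out of the generated expression), and that the pyim function itself already applies the brute-force limit you described. After that change I have hardly hit the error again.

Helm sources also have a candidate-transformer attribute, which corresponds to the third method you mentioned: converting Chinese characters to pinyin. I tried it but didn't get it to work. I still think this route is feasible; its cost depends on the number of candidates, and if I can get it working I'll see how it performs. The upside is that it eliminates the size-limit problem entirely and doesn't interfere with fuzzy matching.

I used to find helm slow as well, until I realized it was because I had set helm-input-idle-delay too high. After lowering it, the slowness is barely noticeable and perfectly acceptable to me.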

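tumashu · May 23, 2023, 02:23

pyim sorts Chinese characters into four levels by how commonly they are used. If the generated regexp is too long, it drops the least common level and tries to generate the regexp again, recursing until it fits. It isn't perfect, but it works in most cases.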
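
The level-dropping idea can be sketched generically as below. This is not pyim's implementation: it uses a loop rather than recursion, `my-build-regexp-with-fallback` is a made-up name, and BUILDER stands in for any function that takes a string and a level and returns a regexp.

```elisp
(defun my-build-regexp-with-fallback (builder string max-level)
  "Call BUILDER with STRING and a level, from MAX-LEVEL down to 1.
Return the first result that compiles as a regexp.  If no level
produces a valid regexp, return STRING quoted as a literal regexp."
  (let ((level max-level)
        regexp)
    (while (and (>= level 1) (not regexp))
      (let ((candidate (funcall builder string level)))
        ;; A regexp that is too big signals `invalid-regexp' when used.
        (when (condition-case nil
                  (progn (string-match-p candidate "") t)
                (invalid-regexp nil))
          (setq regexp candidate)))
      (setq level (1- level)))
    (or regexp (regexp-quote string))))
```

Falling back to the literal string means a search still works when even the most restricted level is too large; it simply stops matching by pinyin.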

tumashu · May 23, 2023, 02:25

twlz0ne:

/* This is not an arbitrary limit: the arguments which represent offsets
into the pattern are two bytes long.  So if 2^15 bytes turns out to
be too small, many things would have to change.  */

# define MAX_BUF_SIZE (1 << 15)

I wonder whether this change could be submitted to emacs.git?

tumashu · May 23, 2023, 02:26

Handling it this way runs into the polyphonic-character problem, which is also quite a headache.

zbelial · May 23, 2023, 03:47

I used level 2, and it seems to be enough.
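
A sketch of what using level 2 might look like. It assumes, based on the description earlier in the thread, that the optional second argument of pyim-cregexp-build is the character frequency level; check the docstring in your installed pyim before relying on it. The name `my-helm-pyim-level-2` is made up for illustration.

```elisp
(require 'pyim)

;; Assumption: the optional second argument of `pyim-cregexp-build'
;; selects the character frequency level.
(defun my-helm-pyim-level-2 (str)
  "Return a pinyin regexp for STR built with character level 2.
Return STR unchanged when it is empty."
  (if (equal str "")
      str
    (pyim-cregexp-build str 2)))
```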

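~~If you created the file yourself, you would certainly use the reading you are familiar with, so I don't think it's a big problem.~~ (I got this wrong at first: it has nothing to do with familiarity; it depends on the conversion logic.)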

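zbelial · May 23, 2023, 04:27

This really does seem to be a headache. I used pyim-pymap-cchar2py-get, and the pinyin it returns is indeed a bit "unexpected":

The pinyin for 的 is di, and the pinyin for 不 is dun.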

tumashu · May 23, 2023, 11:34

Try pyim-cstring-to-pinyin.

zbelial · May 24, 2023, 07:40

Nice! This one fits better; it returns all the readings of a polyphonic character.

twlz0ne · May 25, 2023, 01:42

I tried converting the candidates to pinyin and found none of the performance problems I had worried about.

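zbelial · May 25, 2023, 02:19

I also think performance should be fine, with no big problems. Personally, most of my files and directories have English names, so if I check for Chinese before converting, very few candidates need conversion. Also, I usually have at most a few hundred candidates, so it shouldn't be an issue. Running helm-occur on a large file with Chinese content might be a problem, but my gut feeling is that it would be no worse than converting the pattern to a regexp.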
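
The "check for Chinese first, then convert" step could be sketched as a pure helper like the one below. The function names are made up, TO-PINYIN is a placeholder for a real converter such as the pyim-cstring-to-pinyin suggested earlier in the thread, and the Han check only covers the U+4E00..U+9FFF block. Wiring this into a Helm source is deliberately left out, since that part is not settled in this thread.

```elisp
(defun my-contains-han-p (string)
  "Return t if STRING contains a character in U+4E00..U+9FFF."
  (and (string-match-p "[\u4e00-\u9fff]" string) t))

(defun my-add-pinyin-to-candidates (candidates to-pinyin)
  "Return CANDIDATES with pinyin appended to the display of Chinese ones.
Candidates without Chinese characters are returned unchanged.  Others
become (DISPLAY . REAL) pairs, where DISPLAY is the original string
followed by a space and the result of calling TO-PINYIN on it."
  (mapcar (lambda (candidate)
            (if (my-contains-han-p candidate)
                (cons (concat candidate " " (funcall to-pinyin candidate))
                      candidate)
              candidate))
          candidates))
```

Only candidates that contain Chinese pay the conversion cost, which matches the observation above that such candidates are usually few.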

However, I still haven't gotten candidate-transformer to work, and there is some helm logic I haven't figured out, so for now I'm sticking with pattern-transformer.

tumashu · May 25, 2023, 03:46

The pyim-cstring-to-pinyin function does this check internally.

dingyi342 · May 25, 2023, 03:50

github.com/oantolin/orderless

A note on Helm (opened by bcardoso, 12:24AM - 22 Jan 22 UTC; closed 10:11PM - 23 Dec 22 UTC)

Hi. In the README "Related packages" section, it might be useful to add that …one can use `orderless` as a completion style _for_ Helm if you set the `helm-completion-style` variable to `'emacs` and the `completion-styles` variable to `'(orderless)`. Although orderless does not seem to work on _every_ Helm buffer (and I don't know why), it does on most Helm buffers. Therefore it can be an useful alternate completion config for Helm users. Thanks for all your work on orderless :)

You can have helm use orderless, and orderless supports pinyin.

twlz0ne · May 25, 2023, 10:16

Good idea.

But this setup didn't work for me; I'm not sure which steps I'm missing:

$ emacsq.sh -P helm,orderless,pinyinlib --eval "\
  (progn
    ;; --- 8< ---
    ;; @credit https://emacs-china.org/t/vertico/17913/3
    (defun completion--regex-pinyin (str)
      (message \"==> str: %S\" str)
      (orderless-regexp (pinyinlib-build-regexp-string str)))
    (add-to-list 'orderless-matching-styles 'completion--regex-pinyin)
    ;; --- >8 ---
    (setq completion-styles '(orderless))
    (setq helm-completion-style 'emacs)
    (global-set-key (kbd \"M-x\") 'helm-M-x)
    (global-set-key (kbd \"C-x b\") 'helm-mini)
    (global-set-key (kbd \"C-x C-f\") 'helm-find-files)
    (helm-mode 1)
    (switch-to-buffer \"*Messages*\"))" -nw

dingyi342 · May 25, 2023, 13:49

See the docstring of helm-completion-styles; there are restrictions. The following works for me; other helm- commands probably need some changes.

(setq helm-completion-style 'emacs)
(let ((completing-read-function #'helm--completing-read-default)
      (candidates '("如果" "测试" "变成" "谢谢")))
  (completing-read "test: " candidates))

twlz0ne · May 25, 2023, 22:39

So the support is limited. Making helm-mini and the like support orderless is probably not easy either.