For searching Chinese text: rg > grep > ag/ack?

xuchunyang · November 11, 2017, 20:28

I haven't tested this thoroughly; these are just two observations:

rg supports Chinese encodings such as GBK; grep/ag/ack do not

$ rg --encoding gb18030 宝玉.妙玉 Red_Mansions_Anasoft_A_CHS_GBK_txt.txt
2261:  宝玉和妙玉陪笑道

# Being able to search raw bytes [1] doesn't really count as support
$ LC_ALL=C grep -P '\xB1\xA6\xD3\xF1..\xC3\xEE\xD3\xF1' Red_Mansions_Anasoft_A_CHS_GBK_txt.txt | iconv -f GB18030
  宝玉和妙玉陪笑道
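
The difference between the two commands above can be sketched in Python. This is only an illustration, not how rg works internally: the sample line is assumed to be the one from the GBK file, and the variable names are made up for the example.

```python
import re

# The sample line as it would be stored in the GB18030/GBK file (assumed content).
line_bytes = "  宝玉和妙玉陪笑道".encode("gb18030")

# Transcoding approach: decode first, then match on characters.
decoded_match = re.search("宝玉.妙玉", line_bytes.decode("gb18030"))

# Raw-byte approach: the pattern has to be spelled in bytes, and the single
# character in the middle needs two dots because it occupies two bytes here.
baoyu = re.escape("宝玉".encode("gb18030"))   # bytes B1 A6 D3 F1
miaoyu = re.escape("妙玉".encode("gb18030"))  # bytes C3 EE D3 F1
raw_match = re.search(baoyu + b".." + miaoyu, line_bytes)

print(decoded_match.group(0))
print(raw_match.group(0).decode("gb18030"))
```

With decoding, the pattern can be written the way it is read; with raw bytes, the user has to know the byte width of every character the dot should cover.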

With UTF-8, the . in rg and grep regular expressions matches a whole Chinese character; in ag/ack it does not

$ grep 宝玉.妙玉 Red_Mansions_Anasoft_A_CHS_UTF8_txt.txt
  宝玉和妙玉陪笑道
$ rg 宝玉.妙玉 Red_Mansions_Anasoft_A_CHS_UTF8_txt.txt
2261:  宝玉和妙玉陪笑道

$ ag 宝玉.妙玉 Red_Mansions_Anasoft_A_CHS_UTF8_txt.txt
$ ack 宝玉.妙玉 Red_Mansions_Anasoft_A_CHS_UTF8_txt.txt

# ag/ack need three dots (in UTF-8 a Chinese character usually takes 3 bytes)
$ ag 宝玉...妙玉 Red_Mansions_Anasoft_A_CHS_UTF8_txt.txt
2261:  宝玉和妙玉陪笑道
$ ack 宝玉...妙玉 Red_Mansions_Anasoft_A_CHS_UTF8_txt.txt
  宝玉和妙玉陪笑道
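
The three-dot behavior is what one would expect from a matcher that treats . as "one byte" rather than "one character". A small Python sketch of that distinction follows; it assumes nothing about ag's or ack's internals and only contrasts character-level and byte-level matching on the same text.

```python
import re

text = "宝玉和妙玉陪笑道"
data = text.encode("utf-8")

# Each of these characters takes three bytes in UTF-8.
widths = [len(ch.encode("utf-8")) for ch in text]

# Character-aware matching: one dot covers one Chinese character.
char_match = re.search("宝玉.妙玉", text)

# Byte-oriented matching: one dot covers one byte, so it takes three.
one_dot = re.search("宝玉.妙玉".encode("utf-8"), data)
three_dots = re.search("宝玉...妙玉".encode("utf-8"), data)

print(widths, bool(char_match), bool(one_dot), bool(three_dots))
```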

Links

rg (ripgrep) GitHub - BurntSushi/ripgrep: ripgrep recursively searches directories for a regex pattern while respecting your gitignore
GNU Grep Grep - GNU Project - Free Software Foundation
ag (the_silver_searcher) GitHub - ggreer/the_silver_searcher: A code-searching tool similar to ack, but faster.
ack https://beyondgrep.com/
Dream of the Red Chamber text http://www.speedy7.com/cn/stguru/gb2312/redmansions.htm (labeled as GBK, but actually GB18030)

[1] In Emacs, use C-u C-x = (what-cursor-position) to look up a character's code, or use:

(defun unicode->other-encoding (char coding)
  (mapconcat
   (lambda (x) (format "%02X" x))
   (encode-coding-char char coding)
   " "))
     => unicode->other-encoding

(unicode->other-encoding ?宝 'gb18030)
     => "B1 A6"

In Emacs Lisp, the integer that represents a character is its Unicode code point.
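
For readers without Emacs at hand, the same lookup can be approximated in Python. This is a sketch only; the function name char_to_hex_bytes is made up for the example and is not part of any library.

```python
def char_to_hex_bytes(char: str, coding: str) -> str:
    """Return the bytes of char in the given encoding as spaced uppercase hex."""
    return " ".join(f"{b:02X}" for b in char.encode(coding))


print(char_to_hex_bytes("宝", "gb18030"))  # B1 A6, as in the Emacs Lisp example
print(char_to_hex_bytes("宝", "utf-8"))    # three bytes in UTF-8
print(f"U+{ord('宝'):04X}")               # the character's Unicode code point
```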

ashfinal · November 12, 2017, 01:30

rg is good: cross-platform, and fast too.

GBK seems to be a subset of GB18030, so usually there is no need to distinguish the two strictly.

Chris · November 12, 2017, 10:31

rg doesn't work well on Windows, at least not in Emacs's shell-mode.

twlz0ne · March 13, 2018, 08:38

I just discovered that ag is surprisingly unreliable.

When there is a match, it looks normal:

When there is no match, it goes haywire:

xuchunyang · March 13, 2018, 10:53

$ ag 'lsp-[^\-]+-enable'

Perhaps it's because ag's regular expressions match across lines by default; adding --nomultiline fixes it

   --[no]multiline
          Match regexes across newlines. Enabled by default.
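
Why a negated character class is affected by this can be shown with a small Python sketch. It is not ag itself, and the sample buffer is hypothetical; it only demonstrates that [^\-] also matches a newline, so a match can start on one line and end on another unless the search is done line by line.

```python
import re

pattern = re.compile(r"lsp-[^\-]+-enable")

# Hypothetical buffer: no single line contains a full match.
text = "(require 'lsp-mode)\n(setq foo 1)\n(auto-enable)\n"

# Searching the whole buffer: [^\-]+ happily consumes the newlines.
whole = pattern.search(text)

# Searching line by line, which is roughly what disabling multiline amounts to.
per_line = [line for line in text.splitlines() if pattern.search(line)]

print(repr(whole.group(0)))
print(per_line)
```

Excluding the newline explicitly, for example with [^\-\n]+, is another way to keep such a pattern on one line.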

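twlz0ne · March 13, 2018, 16:12

It is indeed caused by the multiline option. Still, having no limit on the span is a problem in itself: a match can stretch across any number of lines.

Although multiline has its pitfalls, it is still better to leave it on, because turning it off exposes a real bug: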

amosbird · March 16, 2018, 11:28

It would be great if rg supported pinyin search someday.

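seagle0128 · March 19, 2018, 16:39

ag has Unicode problems. For searching Chinese, rg and pt are more reliable; both are cross-platform and reasonably fast. rg is the fastest; pt uses a bit more memory.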

seagle0128 · March 19, 2018, 16:40

It works well for me on Windows. Just download the native Windows build, the one compiled with VS; it is quite fast.

iab · March 20, 2018, 01:35

rg is written in Rust, which supports Unicode natively.

LdBeth · March 20, 2018, 05:26

Here's something you may not know: Unicode and Go come from the same group of people.

seagle0128 · May 13, 2018, 14:22

rg can be set up to support pinyin search; just add pinyinlib.

amosbird · May 15, 2018, 03:37

GitHub - BurntSushi/ripgrep: ripgrep recursively searches directories for a regex pattern while respecting your gitignore I couldn't find any such feature there.

seagle0128 · May 15, 2018, 07:01

It can be done inside Emacs with rg and pinyinlib. See my configuration for reference: GitHub - seagle0128/.emacs.d: Centaur Emacs - A Fancy and Fast Emacs Configuration
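
The general idea behind this kind of pinyin search is to expand each typed letter into a character class and hand the resulting regex to the search tool. A minimal Python sketch follows, assuming a tiny hand-written table: INITIALS and pinyin_initials_to_regex are hypothetical names, and pinyinlib's real tables and functions are not reproduced here.

```python
import re

# Hypothetical, tiny initial-letter table; a real one covers thousands of characters.
INITIALS = {"b": "宝", "y": "玉", "m": "妙", "h": "和"}


def pinyin_initials_to_regex(query: str) -> str:
    """Expand each letter into a class matching the letter itself or any
    known character whose pinyin starts with that letter."""
    parts = []
    for ch in query:
        hanzi = INITIALS.get(ch.lower(), "")
        parts.append("[" + re.escape(ch) + hanzi + "]")
    return "".join(parts)


regex = pinyin_initials_to_regex("byhmy")
print(regex)
print(bool(re.search(regex, "宝玉和妙玉陪笑道")))
```

Since the expansion happens before the search tool sees the pattern, the tool only needs to handle Unicode character classes correctly, which would explain why no such feature appears in ripgrep itself.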

whatacold · July 12, 2018, 08:05

Being able to specify the encoding with rg --encoding is really great.

loveminimal · March 22, 2019, 09:13

On Windows 7, I can't search for Chinese text with rg inside Emacs. Any help?

seagle0128 · March 22, 2019, 10:29

I'm not using Windows 7 at the moment, but I can confirm that rg supports Chinese. It works without problems on Windows 10.