[Tips + plugin promotion] Chinese word count: `advance-words-count.el`

LdBeth · March 28, 2017, 09:31

(defvar word-count-rule-chinese "\\cc"
  "A regexp string to match Chinese characters.")

(defvar word-count-rule-nonespace "[^[:space:]]"
  "A regexp string to match non-space characters.")

(defvar word-count-rule-ansci "[A-Za-z0-9][A-Za-z0-9[:punct:]]*"
  "A regexp string to match ASCII words.
A word starts with an alphanumeric and may continue with punctuation.")

(defun special-words-count (start end regexp)
  "Count the word from START to END with REGEXP."
  (let ((count 0))
    (save-excursion
      (goto-char start)
      (while (and (< (point) end) (re-search-forward regexp end t))
        (setq count (1+ count))))
    count))

;;;###autoload
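(defun Chinese-word-count (&optional beg end)
  "Chinese user preferred word count.
If BEG or END is nil, or BEG = END, count the whole buffer."
  ;; Pass the region only when it is active; otherwise count everything.
  (interactive (if (use-region-p)
                   (list (region-beginning) (region-end))
                 (list nil nil)))
  (let* ((whole (or (null beg) (null end) (= beg end)))
         ;; Order the bounds so the search always runs forward.
         (min (if whole (point-min) (min beg end)))
         (max (if whole (point-max) (max beg end)))
         (list (mapcar (lambda (r)
                         (special-words-count min max r))
                       (list
                        word-count-rule-chinese
                        word-count-rule-nonespace
                        word-count-rule-ansci))))
    ;; Labels, in order: word count, characters (without spaces),
    ;; characters (with spaces), non-Chinese words, Chinese characters.
    (message "字数:%d,字符数（不计空格）:%d,字符数（计空格）:%d,非中文单词:%d,中文:%d"
             (+ (car list) (car (last list)))
             (cadr list)
             (- max min)
             (car (last list))
             (car list))))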

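I hacked this together based on an older post.

The original turned on a minor mode that showed the word count in the mode line. Because it relies on brute-force matching, it had some performance problems, so I rewrote it as an interactive command: with no active region it counts the whole buffer, and with an active region it counts only the region.

I'm still revising it and plan to turn it into a package.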

LdBeth · March 28, 2017, 16:34

After reading around a bit, I suspect Word's word count works on roughly the same principle.

jixiuf · March 29, 2017, 02:29

Doesn't M-x count-words do the job?

tumashu · March 29, 2017, 02:37

Please make it a package.

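LdBeth · March 29, 2017, 03:51

On closer inspection, though, it does not count Chinese characters.

You did remind me of something, though: a line count is quite useful too.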

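LdBeth · March 29, 2017, 03:53

I plan to add a few more fancy features.

For example, I haven't tried Japanese or Korean yet. I also expect the efficiency to be poor; I suspect it will choke on a few tens of thousands of characters.

I'd like to try using ^L (form feed) as a marker and counting section by section to improve efficiency.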
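The ^L idea might be sketched roughly as follows. This is only an illustration, not code from the package: `my-count-matches-in` and `my-count-by-page` are hypothetical names, and the sketch only shows the partitioning step. The efficiency gain would come from caching each page's count and recounting only the page that changed, which is not shown here.

```elisp
;; Hypothetical helpers for illustration; not part of advance-words-count.el.

(defun my-count-matches-in (start end regexp)
  "Return the number of matches for REGEXP between START and END."
  (let ((count 0))
    (save-excursion
      (goto-char start)
      (while (re-search-forward regexp end t)
        (setq count (1+ count))))
    count))

(defun my-count-by-page (regexp)
  "Return a list of REGEXP match counts, one per ^L-delimited page."
  (let ((counts '())
        (page-start (point-min)))
    (save-excursion
      (goto-char (point-min))
      ;; Each form feed closes one page.
      (while (search-forward "\f" nil t)
        (push (my-count-matches-in page-start (match-beginning 0) regexp)
              counts)
        (setq page-start (point)))
      ;; Whatever follows the last form feed is the final page.
      (push (my-count-matches-in page-start (point-max) regexp) counts))
    (nreverse counts)))
```

The total for the buffer is then `(apply #'+ (my-count-by-page "\\cc"))`.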

jixiuf · March 29, 2017, 04:47

message "C:%d,c:%d,nc:%d,an:%d,cc:%d"

Write the labels in Chinese; other people won't understand these.

LdBeth · March 29, 2017, 06:23

(defvar words-count-rule-chinese "\\cc"
  "A regexp string to match Chinese characters.")

(defvar words-count-rule-nonespace "[^[:space:]]"
  "A regexp string to match non-space characters.")

(defvar words-count-rule-ansci "[A-Za-z0-9][A-Za-z0-9[:punct:]]*"
  "A regexp string to match ASCII words.
A word starts with an alphanumeric and may continue with punctuation.")

(defvar words-count-regexp-list
  (list words-count-rule-chinese
        words-count-rule-nonespace
        words-count-rule-ansci)
  "A list for the regexp used in `advance-words-count'.")

(defvar words-count-message-func 'message--words-count
  "The function used to format message in `advance-words-count'.")

(defun special--words-count (start end regexp)
  "Count the word from START to END with REGEXP."
  (let ((count 0))
    (save-excursion
      (save-restriction
        (goto-char start)
        (while (and (< (point) end) (re-search-forward regexp end t))
          (setq count (1+ count)))))
    count))

(defun message--words-count (list start end &optional arg)
  "Display the word count message.
Using the LIST passed from `advance-words-count'. START & END are
required to call extra functions, see `count-lines' &
`count-words'. When ARG is specified, display a verbose buffer."
  ;; Pass the text through "%s" so a literal % in it is never re-expanded.
  (message
   "%s"
   (format
    (if arg
        "
-----------~*~ Words Count ~*~----------
 Word Count .................... %d
 Characters (without Space) .... %d
 Characters (all) .............. %d
 Number of Lines ............... %d
 ASCII Words ................... %d
%s
========================================
"
      "Wc:%d,Ns:%d,Al:%d,Ln:%d,An:%d,%s")
    (+ (car list) (car (last list)))
    (cadr list)
    (- end start)
    (count-lines start end)
    (car (last list))
    (concat
     (unless (= 0 (car list))
       (format (if arg
                   " Chinese Chars ................. %d\n"
                 "Zh:%d,")
               (car list)))
     (format (if arg
                 " English Words ................. %d\n"
               "En:%d")
             (count-words start end))))))

;;;###autoload
(defun advance-words-count (beg end &optional arg)
  "Chinese user preferred word count.
If BEG or END is nil, count the whole buffer. If called interactively,
use minibuffer to display the messages. The optional ARG will be
passed to `message--words-count'.

See also `special--words-count'."
  (interactive (if (use-region-p)
                   (list (region-beginning)
                         (region-end)
                         current-prefix-arg)
                 (list nil nil current-prefix-arg)))
  (let* ((min (or beg (point-min)))
         (max (or end (point-max)))
         ;; One count per rule, in the order of `words-count-regexp-list'.
         (list (mapcar
                (lambda (r) (special--words-count min max r))
                words-count-regexp-list)))
    (if (called-interactively-p 'any)
        (message--words-count list min max arg)
      list)))

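The super advanced version. Calling the command with a universal-argument prefix displays a nicely formatted summary; without it you get a single line. With a few tweaks to the extension interface it could be turned into a package.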

LdBeth · March 29, 2017, 07:22

That's it for now.

I feel I need to brush up on regular expressions.

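LdBeth · March 30, 2017, 11:50

I added a feature: you can now use (setq words-count-message-display 'pos-tip) to show the result in a popup.

I haven't tested whether it works on builds other than the Mac Port, though, because the Mac Port I use seems to have extra configuration for pos-tip.

Also, to avoid ambiguity, the display now says CJK Chars instead of Chinese Chars.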

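xuchunyang · March 30, 2017, 14:12

I've never understood what \cc means by "Chinese characters", or rather, what range of characters it matches. It also matches full-width punctuation, and it seems to match Japanese as well. If that isn't clear, you might as well write the regular expression by hand.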

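LdBeth · March 30, 2017, 14:50

*** "\cC" matches any character belonging to category C. For example, "\cc" matches Chinese characters, "\cg" matches Greek characters, and so on. To see the known categories, use "M-x describe-categories".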

http://m.blog.csdn.net/article/details?id=8067424

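\cc matches Chinese characters, and it does indeed also match the kanji in Japanese text and full-width symbols. So for a precise count you would need a more complex regexp.

Since individual requirements may differ, I used defcustom to leave hooks for modification and extension. You can customize the display style and add counting rules as needed. I'll try to make it as extensible as I can.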
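As a rough illustration of what a hand-written, user-customizable rule could look like, here is a minimal sketch. All `my-words-count-` names are hypothetical and are not the package's actual options. It assumes that counting only the CJK Unified Ideographs block (U+4E00 to U+9FFF) is acceptable, which excludes kana, hangul, and full-width punctuation but still cannot tell Chinese hanzi from Japanese kanji, since they share code points.

```elisp
;; Hypothetical option and function names; the real package may differ.
(defgroup my-words-count nil
  "Illustrative options for a CJK-aware word counter."
  :group 'convenience)

(defcustom my-words-count-han-regexp "[\u4e00-\u9fff]"
  "Regexp matching one character in the CJK Unified Ideographs block.
Covers U+4E00..U+9FFF only, so kana, hangul, and full-width
punctuation are not counted."
  :type 'regexp
  :group 'my-words-count)

(defun my-words-count-han (start end)
  "Return the number of Han characters between START and END."
  ;; `how-many' returns the match count when called from Lisp.
  (how-many my-words-count-han-regexp start end))
```

A user who wants a wider or narrower definition can change the regexp through `M-x customize-group` without touching the counting function.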

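twlz0ne · March 30, 2017, 17:09

There's probably no way to tell them apart: the same Han character may be written differently in different countries, but it has only one code point.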

LdBeth · March 31, 2017, 00:28

Then I found that \cc matches all Japanese characters, and \cj (which matches Japanese) matches most Chinese characters. I think I'm doomed.

twlz0ne · March 31, 2017, 01:39

Change Chinese to CJK and it no longer feels out of place.

lukertty · March 31, 2017, 04:21

Hurry up and submit it to MELPA. I quite like this feature; when writing a paper I run a count after each paragraph. The function I use is:

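(defun wc-non-ascii (&optional start end)
  "Report lines, non-ASCII characters, and total characters.
Count between START and END; these default to the active region,
or to the whole buffer when no region is active."
  (interactive)
  (let ((start (or start (if (use-region-p) (region-beginning) (point-min))))
        (end (or end (if (use-region-p) (region-end) (point-max)))))
    (message "lines: %3d non ascii words: %3d chars: %3d"
             (count-lines start end)
             ;; Passing the bounds makes narrowing and moving point unnecessary.
             (count-matches "[^[:ascii:]]" start end)
             (- end start))))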

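twlz0ne · March 31, 2017, 05:51

For a file that mixes several languages (East Asian ones, in this case), counting Han characters and non-Han characters separately doesn't seem quite right. Splitting the count by language into Chinese and non-Chinese is even more of an impossible task, since Chinese hanzi, Japanese kanji, Korean hanja, Vietnamese chữ Hán… share a large number of code points.

I suggest not drawing the distinction that finely; separating CJK from non-CJK is enough.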

https://zh.wikipedia.org/wiki/中日韩统一表意文字

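LdBeth · March 31, 2017, 11:38

I plan to push it to MELPA at v1.0.0. The basic structure hasn't stabilized yet.

Your example did give me an idea, though, and I've now written a macro that makes this kind of configuration easy.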

(defmacro words-count-define-func (name message rules &optional bind regexp)
  "Define the function used to format the strings displayed.
NAME    = Function's name.
MESSAGE = A string used to display.
RULES   = A list of forms whose values fill in MESSAGE.
BIND    = A boolean, if true, bind the function to
          `words-count-message-func'.
REGEXP  = A list of regexp to call, if not specified, use
          `words-count-regexp-list'."
  `(progn
     (defun ,name (cons &optional arg)
       "Format a string to be shown for `message--words-count'.
Using the CONS passed from `advance-words-count'. See
`count-lines' & `count-words'. When ARG is specified, display
verbosely."
       (let* ((start (car cons))
              (end (cdr cons))
              ;; Rebind the rule list only when REGEXP was supplied.
              ,@(when regexp
                  `((words-count-regexp-list ,regexp)))
              (list (advance-words-count start end)))
         (format ,message ,@rules)))
     (when ,bind
       (setq words-count-message-func #',name))))

Example:

(words-count-define-func users-prefix-func
    "lines: %3d non ascii words: %3d chars: %3d"
    ((count-lines start end)
     (car list)
     (- end start))
    t
    (list "[^[:ascii:]]"))

This way advance-words-count can produce the same result as your example. Lisp macros really are powerful.

jixiuf · March 31, 2017, 13:16

The built-in Emacs command is called count-words, not words-count. I'd suggest following that convention when naming things.

jixiuf · March 31, 2017, 13:17

It would be even better if there were an optional feature to show the count in the mode line.
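One way such a mode-line display could be approached is sketched below. It is not the package's implementation: every `my-wc-` name is hypothetical, the Han-only regexp is an assumption chosen for brevity, and the buffer is recounted in full on each idle tick, so the performance concern raised earlier in the thread still applies to large buffers.

```elisp
;; Hypothetical names; a sketch, not part of advance-words-count.el.
(defvar-local my-wc-mode-line-string ""
  "Text shown in the mode line while `my-wc-mode' is enabled.")

(defvar my-wc-idle-timer nil
  "Idle timer that refreshes the count; shared by all buffers.")

(defun my-wc-refresh ()
  "Recount the current buffer if `my-wc-mode' is enabled in it."
  (when (bound-and-true-p my-wc-mode)
    (setq my-wc-mode-line-string
          (format " Han:%d"
                  (how-many "[\u4e00-\u9fff]" (point-min) (point-max))))
    (force-mode-line-update)))

(define-minor-mode my-wc-mode
  "Show a Han character count in the mode line, refreshed when idle."
  :lighter my-wc-mode-line-string
  (if my-wc-mode
      (progn
        ;; Create the shared timer once; it fires after 1 second of idleness.
        (unless my-wc-idle-timer
          (setq my-wc-idle-timer
                (run-with-idle-timer 1 t #'my-wc-refresh)))
        (my-wc-refresh))
    (setq my-wc-mode-line-string "")))
```

Counting on an idle timer rather than after every keystroke keeps typing responsive. The sketch never cancels the timer; a fuller version would call `cancel-timer` once no buffer has the mode enabled.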