Invisible characters, show yourselves!

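For example, copy the snippet to the right, >>> ​​hugo​​ ​​new​​ ​​about.md <<<, and paste it into Emacs: by eye you cannot tell that anything is off.

(Include the >>> and <<< when copying and delete them after pasting, so that the invisible characters are sure to be copied.)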

  1. Save the text to a file with utf-8 encoding, then reopen the file with find-file-literally to reveal the invisible characters

    \342\200\213\342\200\213hugo\342\200\213\342\200\213 \342\200\213\342\200\213new\342\200\213\342\200\213 \342\200\213\342\200\213about.md
    

    Each \XXX above is one byte written in octal, e.g. \342 is octal 342, decimal 226, hexadecimal e2

    The sequence \342\200\213 keeps repeating; in hexadecimal it is e2 80 8b

  2. View the bytes with hexl-mode

    00000000: e280 8be2 808b 6875 676f e280 8be2 808b  ......hugo......
    00000010: 20e2 808b e280 8b6e 6577 e280 8be2 808b   ......new......
    00000020: 20e2 808b e280 8b61 626f 7574 2e6d 64     ......about.md
    

Reopen the file with find-file and move the cursor onto one of the invisible characters.

C-x = (that is, what-cursor-position) reports the character as U+200B,

and M-x describe-char RET gives its official name: ZERO WIDTH SPACE
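
The same lookup can be done from Lisp. A small sketch, assuming only the built-in `get-char-code-property`; the command name `my/describe-char-briefly` is hypothetical:

```elisp
(defun my/describe-char-briefly (&optional pos)
  "Return a string like \"U+200B ZERO WIDTH SPACE\" for the char at POS.
POS defaults to point.  Return nil when there is no char at POS."
  (interactive)
  (let ((char (char-after (or pos (point)))))
    (when char
      (let ((info (format "U+%04X %s" char
                          ;; Some chars (e.g. control chars) have no name.
                          (or (get-char-code-property char 'name)
                              "<no name>"))))
        ;; Echo the result only when invoked via M-x or a key binding.
        (when (called-interactively-p 'interactive)
          (message "%s" info))
        info))))
```

With point on the invisible character, `M-x my/describe-char-briefly` should echo the code point and name in one line.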

A quick Google search turned up:

Commonly abbreviated ZWSP.

This character is intended for invisible word separation and for line break control;

It has no width, but its presence between two characters does not prevent increased letter spacing in justification.

I only noticed the problem because the command would not run when pasted straight into the command line:

$ ​​hugo​​ ​​new​​ ​​about.md
'​​hugo​​' is not recognized as an internal or external command,
operable program or batch file.

The fix: use elisp to replace the invisible characters

;; from http://xahlee.info/emacs/emacs/elisp_unicode_replace_invisible_chars.html
(defun xah-replace-invisible-char ()
  "Query replace some invisible Unicode chars.

The chars to be searched are:
 ZERO WIDTH NO-BREAK SPACE    (65279, #xfeff)
 ZERO WIDTH SPACE             (codepoint 8203, #x200b)
 RIGHT-TO-LEFT MARK           (8207, #x200f)
 RIGHT-TO-LEFT OVERRIDE       (8238, #x202e)
 LEFT-TO-RIGHT MARK           (8206, #x200e)
 OBJECT REPLACEMENT CHARACTER (65532, #xfffc)

Search begins at cursor position. (respects `narrow-to-region')

URL `http://xahlee.info/emacs/emacs/elisp_unicode_replace_invisible_chars.html'
Version 2018-09-07"
  (interactive)
  (query-replace-regexp "\ufeff\\|\u200b\\|\u200f\\|\u202e\\|\u200e\\|\ufffc" ""))
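
The command above asks for confirmation at every match. For cleaning a pasted command in one go, a non-interactive variant may be handier. This is only a sketch over the same six characters; the `my/` names are invented for illustration:

```elisp
(defconst my/invisible-chars-regexp
  "[\ufeff\u200b\u200f\u202e\u200e\ufffc]"
  "Regexp matching one of the chars handled by `xah-replace-invisible-char'.")

(defun my/strip-invisible-chars (string)
  "Return a copy of STRING with the invisible chars removed."
  (replace-regexp-in-string my/invisible-chars-regexp "" string))

(defun my/strip-invisible-chars-region (beg end)
  "Delete the invisible chars between BEG and END without asking."
  (interactive "r")
  (save-excursion
    (save-restriction
      ;; Narrowing keeps the search inside the region even as text shrinks.
      (narrow-to-region beg end)
      (goto-char (point-min))
      (while (re-search-forward my/invisible-chars-regexp nil t)
        (replace-match "")))))

(my/strip-invisible-chars "\u200b\u200bhugo\u200b\u200b new")
;; => "hugo new"
```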

Alternatively, highlight the invisible characters (glyphless characters) directly in the buffer

“Glyphless characters” are characters which are displayed in a special way, e.g., as a box containing a hexadecimal code, instead of being displayed literally.

These include characters which are explicitly defined to be glyphless, as well as characters for which there is no available font (on a graphical display), and characters which cannot be encoded by the terminal’s coding system (on a text terminal).

(defun w/see-you ()
  "Highlight ZERO WIDTH chars in all buffers."
  (interactive)
  (let ((charnames (list "BYTE ORDER MARK"
                         "ZERO WIDTH NO-BREAK SPACE"
                         "ZERO WIDTH SPACE"
                         "RIGHT-TO-LEFT MARK"
                         "RIGHT-TO-LEFT OVERRIDE"
                         "LEFT-TO-RIGHT MARK"
                         "OBJECT REPLACEMENT CHARACTER"

                         "ZERO WIDTH JOINER"
                         "ZERO WIDTH NON-JOINER")))
    (set-face-background 'glyphless-char "RoyalBlue1")
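    (dolist (name charnames)
      ;; see info node "info:elisp#Glyphless Chars" for available values
      (let ((char (char-from-name name)))
        ;; Skip names Emacs does not know: a nil range would set the
        ;; char-table's default value instead of a single character.
        (when char
          (set-char-table-range glyphless-char-display char "fuck"))))))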
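
Besides a literal string, the "Glyphless Chars" info node documents a few symbolic display methods, `hex-code` among them. A minimal sketch that shows the three zero-width characters U+200B to U+200D as boxes with their code points (the function name is hypothetical):

```elisp
(defun my/show-zero-width-as-hex ()
  "Display U+200B..U+200D as boxes containing their hex codes."
  (interactive)
  ;; A cons range (FROM . TO) covers ZERO WIDTH SPACE, ZERO WIDTH
  ;; NON-JOINER and ZERO WIDTH JOINER in a single call.
  (set-char-table-range glyphless-char-display
                        '(#x200b . #x200d)
                        'hex-code))
```

Compared with a fixed label, the hex code makes it possible to tell the characters apart at a glance.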


To browse other Unicode characters: M-x list-unicode-display RET (a command from the third-party list-unicode-display package)

Some additional notes:

Insert a Unicode character such as → (U+2192) by name (RIGHTWARDS ARROW):

     M-x insert-char RET or C-x 8 RET, then type RIGHTWARDS ARROW and press RET

Insert the same character by its hexadecimal code point (2192):

     M-x insert-char RET or C-x 8 RET, then type 2192 and press RET
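
Both routes are also available from Lisp. A short sketch, assuming a reasonably recent Emacs that provides `char-from-name`; the command name is made up:

```elisp
;; By name: `char-from-name' maps a Unicode name to its code point.
(char-from-name "RIGHTWARDS ARROW")     ; => 8594, i.e. #x2192

(defun my/insert-rightwards-arrow ()
  "Insert U+2192 RIGHTWARDS ARROW at point."
  (interactive)
  (insert (char-from-name "RIGHTWARDS ARROW")))

;; By code point: a character is just an integer, so evaluating
;; (insert #x2192) in a buffer inserts the same arrow.
```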

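Possible improvements:

  1. Strip these invisible characters automatically when copying
  2. When saving a file, check for invisible characters and warn about them (but NOT silently remove/replace them, maybe you are taking notes :slight_smile: )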
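
A rough sketch of both ideas, not a finished package. It assumes Emacs 28 or later for `kill-transform-function`, reuses the six characters from the query-replace command, and all `my/` names are hypothetical:

```elisp
(defconst my/zw-regexp "[\ufeff\u200b\u200f\u202e\u200e\ufffc]"
  "Regexp matching one invisible char (same set as the command above).")

(defun my/strip-zw-from-kill (string)
  "Return STRING without invisible chars, for `kill-transform-function'."
  (replace-regexp-in-string my/zw-regexp "" string))

(defun my/count-zw-chars ()
  "Return the number of invisible chars in the whole current buffer."
  (save-excursion
    (save-restriction
      (widen)
      (goto-char (point-min))
      (let ((count 0))
        (while (re-search-forward my/zw-regexp nil t)
          (setq count (1+ count)))
        count))))

(defun my/warn-zw-before-save ()
  "Report invisible chars in the buffer, but leave them in place."
  (let ((count (my/count-zw-chars)))
    (when (> count 0)
      (message "%s: %d invisible char(s) found" (buffer-name) count))))

;; Improvement 1: clean every string on its way into the kill ring.
(setq kill-transform-function #'my/strip-zw-from-kill)
;; Improvement 2: only report on save, never modify the buffer.
(add-hook 'before-save-hook #'my/warn-zw-before-save)
```

One caveat: the echo-area message can be overwritten by the "Wrote ..." message right after saving, so `display-warning` might be a better fit in practice.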

References:

  1. Unicode Character 'ZERO WIDTH SPACE' (U+200B)
  2. terminal emacs - Zero width space shows as underscore - Emacs Stack Exchange
  3. Glyphless Chars (GNU Emacs Lisp Reference Manual)
  4. Emacs: Unicode Tutorial
  5. Emacs: Replace Invisible Unicode Chars
  6. encoding - remove <200b> character from text file - Super User

These invisible chars can even be used to plant backdoors in code


A TUI is naturally immune to this kind of invisible character


:rofl: :rofl:


Not quite: on a TTY, Emacs still honors glyphless-char-display and still does bidirectional text processing


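What does the "glyphless" in the name glyphless-display-mode mean? It is hard to spell and hard to remember.

I got bitten by a zero-width character today: a path I copied carried an invisible U+202A character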

glyph+less

Thanks, so it is a compound word, and glyph means the visual shape of a character. That makes it much easier to understand