不用零宽空格在 org-mode 中标记中文的办法

VagrantJoker · 2022 年9 月 22 日 10:26

关于在 org-mode 中标记中文，Org-mode 中文行内格式化的问题中已经讨论很多了，主流做法是使用零宽空格，最近新出的包 org-extra-emphasis 使用的也是这种做法。不过我个人不是很喜欢这种做法，原因有二：

零宽空格是看不见的，有可能删不干净，虽然 org-extra-emphasis 可以显示零宽空格，但是没必要引入一个包。
插入比较麻烦，无法很好地与 smartparens 等包结合，需要进一步 hack 。

思前想后，我觉得可以使用 font-lock 来隐藏中文前后的空格，先上效果图

输入：

注意英文两边的空格不会隐藏

Peek 2022-09-22 18-04

删除：

标记删除的时候会显示空格

Peek 2022-09-22 17-53

html 导出效果：

pdf 导出效果：

我觉得这种做法好处有三：

输入模式统一，中文和英文下标记目标文本的方式是一样的。
可以方便地使用 smartparens 等包。
不需要依赖，代码量小。

代码

(font-lock-add-keywords 'org-mode
                        '(("\\cc\\( \\)[/+*_=~][^a-zA-Z0-9/+*_=~\n]+?[/+*_=~]\\( \\)?\\cc?"
                           (1 (prog1 () (compose-region (match-beginning 1) (match-end 1) ""))))
                          ("\\cc?\\( \\)?[/+*_=~][^a-zA-Z0-9/+*_=~\n]+?[/+*_=~]\\( \\)\\cc"
                           (2 (prog1 () (compose-region (match-beginning 2) (match-end 2) "")))))
                        'append)

导出需要过滤空格：

(with-eval-after-load 'ox
  (defun eli-strip-ws-maybe (text _backend _info)
    (let* ((text (replace-regexp-in-string
                  "\\(\\cc\\) *\n *\\(\\cc\\)"
                  "\\1\\2" text));; remove whitespace from line break
           ;; remove whitespace from `org-emphasis-alist'
           (text (replace-regexp-in-string "\\(\\cc\\) \\(.*?\\) \\(\\cc\\)"
                                           "\\1\\2\\3" text))
           ;; restore whitespace between English words and Chinese words
           (text (replace-regexp-in-string "\\(\\cc\\)\\(\\(?:<[^>]+>\\)?[a-z0-9A-Z-]+\\(?:<[^>]+>\\)?\\)\\(\\cc\\)"
                                           "\\1 \\2 \\3" text)))
      text))
  (add-to-list 'org-export-filter-paragraph-functions #'eli-strip-ws-maybe))

注意，导出部分中的正则存在一些不完善的地方，建议根据自己的情况自己配置相应的正则。

Hxppdv · 2022 年9 月 22 日 11:08

伟大的工作！

虽然我看到标题上写到了标记中文，但是零宽空格方案不仅能适用于中文标记，我想问的在于实际上这个方案能实现这种情况吗？比如我要标记某个单词的一部分，例如 *ha*ppy，使用零宽空格则能实现，你这个方案能很好实现吗？

使用自带的 whitespace 包其实就能实现显示零宽空格。

无论如何，这都是伟大的实现，感谢您。

VagrantJoker · 2022 年9 月 22 日 11:14

理论上能用正则匹配的都可以，目前只写了通用的正则，特殊需求可以自己写 (你不能对只有六行的代码有什么太高的期待

这个同时也会显示其他空格，平时基本不会开。

VagrantJoker · 2022 年9 月 22 日 15:35

里面提到的方法还是有用的（后面谁说没有用的

至于这个问题

修改下函数org-do-emphasis-faces就行了

(defun org-do-emphasis-faces (limit)
    "Run through the buffer and emphasize strings."
    (let ((quick-re (format "\\([%s]\\|^\\)\\([~=*/_+]\\).*?[~=*/_+]"
    		                (car org-emphasis-regexp-components))))
      (catch :exit
        (while (re-search-forward quick-re limit t)
          (let* ((marker (match-string 2))
                 (verbatim? (member marker '("~" "="))))
            (when (save-excursion
    	            (goto-char (match-beginning 0))
    	            (and
    	             ;; Do not match table hlines.
    	             (not (and (equal marker "+")
    		                   (org-match-line
    		                    "[ \t]*\\(|[-+]+|?\\|\\+[-+]+\\+\\)[ \t]*$")))
    	             ;; Do not match headline stars.  Do not consider
    	             ;; stars of a headline as closing marker for bold
    	             ;; markup either.
    	             (not (and (equal marker "*")
    		                   (save-excursion
    		                     (forward-char)
    		                     (skip-chars-backward "*")
    		                     (looking-at-p org-outline-regexp-bol))))
    	             ;; Match full emphasis markup regexp.
    	             (looking-at (if verbatim? org-verbatim-re org-emph-re))
    	             ;; Do not span over paragraph boundaries.
    	             (not (string-match-p org-element-paragraph-separate
    				                      (match-string 2)))
    	             ;; Do not span over cells in table rows.
    	             (not (and (save-match-data (org-match-line "[ \t]*|"))
    		                   (string-match-p "|" (match-string 4))))))
              (pcase-let ((`(,_ ,face ,_) (assoc marker org-emphasis-alist))
    		              (m (if org-hide-emphasis-markers 4 2)))
                (font-lock-prepend-text-property
                 (match-beginning m) (match-end m) 'face face)
                (when verbatim?
    	          (org-remove-flyspell-overlays-in
    	           (match-beginning 0) (match-end 0))
    	          (remove-text-properties (match-beginning 2) (match-end 2)
    				                      '(display t invisible t intangible t)))
                (add-text-properties (match-beginning 2) (match-end 2)
    			                     '(font-lock-multiline t org-emphasis t))
                (when (and org-hide-emphasis-markers
    		               (not (org-at-comment-p)))
    	          (add-text-properties (match-end 4) (match-beginning 5)
    			                       '(invisible t))
    	          (add-text-properties (match-beginning 3) (match-end 3)
    			                       '(invisible t)))
                (throw :exit t))))))))

VagrantJoker · 2022 年9 月 23 日 01:47

不需要零宽空格了

  (setq org-emphasis-regexp-components '("-[:space:]('\"{[:nonascii:][:alpha:]"
                                         "-[:space:].,:!?;'\")}\\[[:nonascii:][:alpha:]"
                                         "[:space:]"
                                         "."
                                         1))
  (org-set-emph-re 'org-emphasis-regexp-components org-emphasis-regexp-components)
  (org-element-update-syntax)

这样就可以了（原帖后面为什么会歪的这么离谱

rua · 2022 年9 月 23 日 02:42

会有这种使用场景吗，如果是 markup 两边的话应该会加上空格吧（然后 face-lock 隐藏空格

如果是连续 markup 的话放在一个 *上标和下标的匹配* 就好了吧

VagrantJoker · 2022 年9 月 23 日 02:49

这个是 org-mode 高亮错误，正常应该只高亮 “上标” 和 “匹配”

原帖讨论的是不用加空格就能高亮的方法

willbchang · 2022 年9 月 24 日 03:42

楼主太给力了！解决了用 org-mode 的一大痛点！个人认为这个应该合到上游去。

fountainer · 2022 年9 月 24 日 05:54

赞，我只用了导出过滤空格的代码，源码中空格我不是很在意。

fountainer · 2022 年9 月 25 日 12:27

这个过滤空格的函数，可能还有没有考虑的地方。再org文件中写公式\( 项目 \rightarrow 测试 \)，导出的latex是\(项目\rightarrow测试\)，\rightarrow和测试之间空格都被过滤了，rightarrow测试被视为一个宏，造成无法生成正确的pdf。如果不使用这段过滤代码，导出的latex代码为\(项目 \rightarrow 测试\)，可以生成正确的pdf。

VagrantJoker · 2022 年9 月 25 日 12:40

更新了，你可以试一下，也建议尝试一下 4 楼的方案。

fountainer · 2022 年9 月 25 日 12:51

但是，好像又不能过滤标记周围的空格了。导出的pdf标记的中文两边又多出了空格。4楼好像没有我需要的，我只需要导出过滤空格。

VagrantJoker · 2022 年9 月 25 日 13:06

(defun eli-strip-ws-maybe (text _backend _info)
    (let* ((text (replace-regexp-in-string
                  "\\(\\cc\\) *\n *\\(\\cc\\)"
                  "\\1\\2" text));; remove whitespace from line break
           ;; remove whitespace from `org-emphasis-alist'
           (text (replace-regexp-in-string "\\(\\cc\\) \\(.*?\\) \\(\\cc\\)"
                                           "\\1\\2\\3" text))
           ;; restore whitespace between English words and Chinese words
           (text (replace-regexp-in-string "\\(\\cc\\)\\(\\(?:<[^>]+>\\)?[a-z0-9A-Z-]+\\(?:<[^>]+>\\)?\\)\\(\\cc\\)"
                                           "\\1 \\2 \\3" text))
           (text (replace-regexp-in-string "\\(\\cc\\) ?\\(\\\\[^{}()]*?\\)\\(\\cc\\)"
                                           "\\1 \\2 \\3" text)))
      text))

这个应该可以了

fountainer · 2022 年9 月 25 日 13:09

我试了一下，中文标记两边的空格好像还是存在。

VagrantJoker · 2022 年9 月 25 日 13:12

我在 emacs -Q 下测试过没问题，请排除你的配置问题

给个样例文本看看

fountainer · 2022 年9 月 25 日 13:18

#+title: Test

#+LATEX_CLASS: beamer
#+OPTIONS: H:2 toc:t num:t
#+startup: beamer
#+latex_header: \usepackage{xeCJK}
#+latex_header: \xeCJKsetup{CJKmath=true}

* 第一节
** 测试

这是 *一个* 测试。这是一个公式 \( 项目 \rightarrow 测试 \)。

最怕emacs -Q测试，先看看是不是我配制问题吧。

VagrantJoker · 2022 年9 月 25 日 13:39

这是 emacs -Q 下导出结果，标记一直没有问题，只是行内公式根据个人写法不同导出结果不一致，已更新上面的函数。

我希望你每次回复的时候能注意一下预填充文本，反馈前先用 emacs -Q 测试是一个好习惯

fountainer · 2022 年9 月 25 日 13:42

好的，谢谢，那我慢慢找问题吧。顺便贴一下我的导出吧。

\begin{document}

\maketitle
\begin{frame}{Outline}
\tableofcontents
\end{frame}


\section{第一节}
\label{sec:orgb20d632}
\begin{frame}[label={sec:orga84fe28}]{测试}
这是 \alert{一个} 测试。这是一个公式 \(项目 \rightarrow 测试\)。
\end{frame}
\end{document}

fountainer · 2022 年9 月 25 日 13:52

这个可以啦，我复制你最新改的函数的时候，忘了把 (with-eval-after-load 'ox ...)这些加进去。感谢！

fountainer · 2022 年9 月 26 日 12:05

再报告一个问题

#+title: Test

#+LATEX_CLASS: beamer
#+OPTIONS: H:2 toc:t num:t
#+startup: beamer
#+latex_header: \usepackage{xeCJK}
#+latex_header: \xeCJKsetup{CJKmath=true}

* 第一节
** 测试

这是一个 *简单* 的 *小* 测试。这是一个公式 \( 项目 \rightarrow 测试 \)。

导出的latex是

\begin{document}

\maketitle
\begin{frame}{Outline}
\tableofcontents
\end{frame}


\section{第一节}
\label{sec:org4df4d97}
\begin{frame}[label={sec:orgff7d1eb}]{测试}
这是一个\alert{简单}的 \alert{小} 测试。这是一个公式\(项目 \rightarrow 测试\)。
\end{frame}
\end{document}

第二个标记“小”字两边的空格没有被过滤掉。