Call for design ideas on a (Chinese-friendly) inline markup syntax for Org Mode

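I gave it a try. Export works fine, but the display in the Org buffer inside Emacs may be slightly off. (The upper image is an Emacs screenshot; the lower one is the HTML shown in a browser.)

I revised it. I edited the source files directly, so this is not a complete patch that can be applied as is. The change to org-element.el stays the same.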

org-element.el

@@ -3293,7 +3293,8 @@
       (unless (bolp) (forward-char -1))
       (let ((opening-re
              (rx-to-string
-              `(seq (or line-start (any space ?- ?\( ?' ?\" ?\{))
+              `(seq (or line-start (any space ?- ?\( ?' ?\" ?\{)
+                        (category can-break))
                     ,mark
                     (not space)))))
         (when (looking-at-p opening-re)
@@ -3304,6 +3305,7 @@
                     (not space)
                     (group ,mark)
                     (or (any space ?- ?. ?, ?\; ?: ?! ?? ?' ?\" ?\) ?\} ?\\ ?\[)
+                        (category can-break)
                         line-end)))))
             (when (re-search-forward closing-re nil t)
               (let ((closing (match-end 1)))

org.el

@@ -3758,10 +3758,10 @@
 	 (body (if (<= nl 0) body
 		 (format "%s*?\\(?:\n%s*?\\)\\{0,%d\\}" body body nl)))
 	 (template
-	  (format (concat "\\([%s]\\|^\\)" ;before markers
-			  "\\(\\([%%s]\\)\\([^%s]\\|[^%s]%s[^%s]\\)\\3\\)"
-			  "\\([%s]\\|$\\)") ;after markers
-		  pre border border body border post)))
+	  (format (concat "\\(\\(?:[%s]\\|\\c|\\)\\|^\\)"  ; Group 1: Pre chars OR CJK OR BOL
+		          "\\(\\([%%s]\\)\\([^%s]\\|[^%s]%s[^%s]\\)\\3\\)"
+		          "\\(\\(?:[%s]\\|\\c|\\)\\|$\\)") ; Group 5/6: Post chars OR CJK OR EOL
+	          pre border border body border post)))
       (setq org-emph-re (format template "*/_+"))
       (setq org-verbatim-re (format template "=~")))))
 
@@ -5226,7 +5226,7 @@
 
 (defun org-do-emphasis-faces (limit)
   "Run through the buffer and emphasize strings."
-  (let ((quick-re (format "\\([%s]\\|^\\)\\([~=*/_+]\\)"
+  (let ((quick-re (format "\\(\\(?:[%s]\\|\\c|\\)\\|^\\)\\([~=*/_+]\\)"
 			  (car org-emphasis-regexp-components))))
     (catch :exit
       (while (re-search-forward quick-re limit t)

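As you can see, +下划线+ is highlighted, but the space-delimited +却没有+ is not, so there is still a small issue. Export is fine. (This seems to be Org-mode's original default behavior, so never mind.)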

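Attached is Gemini's take on how to make the change:

Gemini

The intent of this patch is to let Chinese characters (which usually belong to the can-break category) act as boundaries for Org-mode emphasis markers (such as *bold* and /italic/), just like spaces or punctuation.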
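To see what the can-break category covers, a quick check like the one below may help. It is a minimal sketch that assumes Emacs' standard category table; my/breakable-p and my/breakable-re are hypothetical names, not part of Org or of the patch.

```elisp
;; Hypothetical helper, not part of Org or of the patch.
(defun my/breakable-p (char)
  "Return t if CHAR is in the `|' (can-break) category.
The answer depends on the category table of the current buffer."
  (and (string-match-p "\\c|" (string char)) t))

;; `rx' spells the same regexp construct as (category can-break).
(defconst my/breakable-re (rx (category can-break)))

(my/breakable-p ?中)   ; a Han character
(my/breakable-p ?\s)   ; an ASCII space
(my/breakable-p ?a)    ; an ASCII letter
```

With the standard table, the first call should return t and the other two nil, which is why the category has to be added next to the existing delimiters instead of being used on its own.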

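From the standpoint of allowing Chinese characters as delimiters, the implementation of this patch in lisp/org.el has a serious regression. It may well make Chinese text work, but it breaks the existing standard behavior.

1. In the change to the org-set-emph-re function: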

-	  (format (concat "\\([%s]\\|^\\)" ;before markers
-			  "\\(\\([%%s]\\)\\([^%s]\\|[^%s]%s[^%s]\\)\\3\\)"
-			  "\\([%s]\\|$\\)") ;after markers
-		  pre border border body border post)))
+         (let ((pre "\\c|") (post "\\c|"))    ;; <--- the problem is here
+	    (format (concat "\\(%s\\|^\\)" ;before markers
+			    "\\(\\([%%s]\\)\\([^%s]\\|[^%s]%s[^%s]\\)\\3\\)"
+			    "\\(%s\\|$\\)") ;after markers
+		    pre border border body border post))))
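  • The problem: a let forcibly rebinds the pre and post variables to \\c| (the Emacs regexp construct that matches characters in the can-break category, which usually includes Chinese characters).
  • Consequences:
    • The original pre and post are built from the user setting org-emphasis-regexp-components and normally contain spaces, parentheses, quotes, and other ASCII punctuation.
    • This change discards those standard delimiters entirely.
    • A space normally does not belong to the \\c| category. So the standard English form word *bold* word (with a space before the *) is no longer recognized as bold, because a space does not match \\c|.
    • Likewise, parenthesized forms such as (*bold*) stop working unless the opening parenthesis is defined as can-break.

In short: to support Chinese, it lost support for English (spaces and standard punctuation).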
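The difference between replacing and extending the pre-match can be sketched with two simplified regexps. This is only an illustration: the my/ names are hypothetical, the ASCII delimiter string mirrors the default pre value shown elsewhere in this thread, and the check looks only at the opening side of a * marker.

```elisp
;; Hypothetical names; a simplified model of the opening-side check only.
(defconst my/ascii-pre "-[:space:]('\"{"
  "ASCII pre-match characters, mirroring Org's default shown in the thread.")

(defun my/pre-replaced ()
  "Pre-match group that accepts only can-break characters or line start."
  "\\(\\c|\\|^\\)")

(defun my/pre-extended (pre)
  "Pre-match group that accepts PRE characters, can-break characters, or line start."
  (format "\\(\\(?:[%s]\\|\\c|\\)\\|^\\)" pre))

(defun my/opens-emphasis-p (pre-re text)
  "Return t if some `*' in TEXT follows PRE-RE and precedes a non-space."
  (and (string-match-p (concat pre-re "\\*[^[:space:]]") text) t))

;; Replacement loses the ASCII case; extension keeps both.
(my/opens-emphasis-p (my/pre-replaced) "word *bold* word")
(my/opens-emphasis-p (my/pre-extended my/ascii-pre) "word *bold* word")
(my/opens-emphasis-p (my/pre-replaced) "中文*粗体*中文")
(my/opens-emphasis-p (my/pre-extended my/ascii-pre) "中文*粗体*中文")
```

Under the standard category table, only the first call should fail to find an opening marker.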

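2. Inconsistent logic between lisp/org.el and lisp/org-element.el

  • The change to lisp/org-element.el (parser layer) is correct: it adds (category can-break) alongside the existing boundary characters.
  • The change to lisp/org.el (font-lock layer) is wrong: it uses replacement logic, so the boundary must be a can-break character.

As a result, the parser treats some text as bold (for example word *bold*) while font-lock does not highlight it as bold, which leaves Org-mode in an internally inconsistent state.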
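One rough way to probe such a mismatch is to ask both layers about the same text. The sketch below assumes Org is installed; my/emphasis-views is a hypothetical helper, and matching org-emph-re against the whole string only approximates org-do-emphasis-faces, which also applies extra checks for headlines and tables.

```elisp
(require 'org)
(require 'org-element)

;; Hypothetical helper for comparing the parser with the font-lock regexp.
(defun my/emphasis-views (text)
  "Return (PARSER . FONT-LOCK) for the first `*' in TEXT, or nil if none.
PARSER is t when `org-element-context' reports a bold object there.
FONT-LOCK is t when `org-emph-re' matches somewhere in TEXT."
  (with-temp-buffer
    (org-mode)
    (insert text)
    (goto-char (point-min))
    (when (search-forward "*" nil t)
      (goto-char (match-beginning 0))
      (cons (eq (org-element-type (org-element-context)) 'bold)
            (and (string-match-p org-emph-re text) t)))))

(my/emphasis-views "word *bold* word")
```

If the two values in the returned pair differ for some input, the buffer display and the exported output will disagree for that input.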

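3. User customization is ignored

Org-mode lets users customize which characters may act as boundaries through the variable org-emphasis-regexp-components. By hard-coding (let ((pre "\\c|")...) in org.el, this patch makes the user's setting of that variable completely ineffective for boundary detection.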
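For context, this is roughly how that variable is customized in unpatched Org. The sketch is an assumption-laden illustration, not something proposed in this thread: my/org-add-nonascii-emphasis-borders is a hypothetical name, and using the [:nonascii:] character class is just one possible choice. Since pre and post are wrapped in a bracket expression, a category such as \c| cannot be expressed there. As the variable's docstring warns, this only changes fontification, not parsing or export.

```elisp
(require 'org)

;; Hypothetical helper; affects font-lock only, not the parser.
(defun my/org-add-nonascii-emphasis-borders ()
  "Allow any non-ASCII character around emphasis markers when fontifying.
Only string components are changed; a nil component is left untouched."
  (let ((components (copy-sequence org-emphasis-regexp-components)))
    (dolist (i '(0 1))                  ; 0 is pre, 1 is post
      (when (stringp (nth i components))
        ;; Append the class so a leading "-" stays first and literal.
        (setf (nth i components)
              (concat (nth i components) "[:nonascii:]"))))
    ;; Store the new value and rebuild `org-emph-re' / `org-verbatim-re'.
    (org-set-emph-re 'org-emphasis-regexp-components components)))

(my/org-add-nonascii-emphasis-borders)
```

A patch that hard-codes pre and post would silently bypass a setting like this one.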


Actually, I think we could introduce a forced inline markup syntax, the way AsciiDoc does.

There is an updated patch now. Could everyone help test it?

From 8ec306d2015d00a7158098fd76b979248d6b7e07 Mon Sep 17 00:00:00 2001
Message-ID: <8ec306d2015d00a7158098fd76b979248d6b7e07.1766228647.git.yantar92@posteo.net>
From: Ihor Radchenko <yantar92@posteo.net>
Date: Sat, 20 Dec 2025 11:58:16 +0100
Subject: [PATCH] WIP: Org markup: Allow Unicode punctuation and breakable
symbols around emphasis

* lisp/org-element.el (org-element-category-table): Define custom
category table adding opening/closing punctuation, opening/closing
quotes, dashes, and auxiliary punctuation.
(org-element--parse-generic-emphasis): Extend allowed characters
around emphasis to generic opening/closing punctuation, quote
punctuation, dash-likes, and auxiliary ,-like punctuation.  Also,
allow breakable characters, like Chinese/Japanese symbols for
languages that do not use spaces.
* lisp/org.el (org-mode): Setup category table.
(org-emphasis-regexp-components): Allow pre/post to be nil to follow
the new defaults.  Change the default values of pre/past to nil.
(org-set-emph-re):
(org-do-emphasis-faces):
(org-emphasize): Fall back to parser defaults when pre/past in
`org-emphasis-regexp-components' is nil.
---
lisp/org-element.el | 50 +++++++++++++++++++++++++++++--
lisp/org.el         | 71 ++++++++++++++++++++++++++++++++++-----------
2 files changed, 102 insertions(+), 19 deletions(-)

diff --git a/lisp/org-element.el b/lisp/org-element.el
index 0b51b4524..54df11d91 100644
--- a/lisp/org-element.el
+++ b/lisp/org-element.el
@@ -3323,6 +3323,38 @@ ;;; Objects

;;;; Bold

+(defvar org-element-category-table
+  (let ((category-table (copy-category-table))
+        (uniprop-table (unicode-property-table-internal 'general-category)))
+    ;; Define categories
+    (define-category ?{ "Opening punctuation" category-table)
+    (define-category ?} "Closing punctuation" category-table)
+    (define-category ?\[ "Initial quote" category-table)
+    (define-category ?\] "Final quote" category-table)
+    (define-category ?- "Dash" category-table)
+    (define-category ?, "Other punctuation" category-table)
+    ;; Map characters to categories according to their general-category
+    (map-char-table
+     (lambda (key val)
+       (pcase val
+         ('Ps (modify-category-entry key ?{ category-table))
+         ('Pe (modify-category-entry key ?} category-table))
+         ('Pi (modify-category-entry key ?\[ category-table))
+         ('Pf (modify-category-entry key ?\] category-table))
+         ('Pd (modify-category-entry key ?- category-table))
+         ('Po (modify-category-entry key ?, category-table))))
+     uniprop-table)
+    category-table)
+  "Category table for Org buffers.
+The table defines additional Unicode categories:
+- ?{ for opening punctuation
+- ?} for closing punctuation
+- ?[ for opening quote
+- ?] for closing quote
+- ?- for dash-like
+- ?, for other punctuation.
+These categories are necessary for parsing emphasis.")
+
(defun org-element--parse-generic-emphasis (mark type)
  "Parse emphasis object at point, if any.

@@ -3336,7 +3368,14 @@ (defun org-element--parse-generic-emphasis (mark type)
      (unless (bolp) (forward-char -1))
      (let ((opening-re
             (rx-to-string
-              `(seq (or line-start (any space ?- ?\( ?' ?\" ?\{))
+              `(seq (or line-start space
+                        ;; opening punctuation
+                        (category ?{) (category ?\[)
+                        ;; dashes, other punctuation
+                        (category ?-) (category ?,)
+                        ;; Chinese, Japanese, and other breakable
+                        ;; characters
+                        (category ?|))
                    ,mark
                    (not space)))))
        (when (looking-at-p opening-re)
@@ -3346,7 +3385,14 @@ (defun org-element--parse-generic-emphasis (mark type)
                  `(seq
                    (not space)
                    (group ,mark)
-                    (or (any space ?- ?. ?, ?\; ?: ?! ?? ?' ?\" ?\) ?\} ?\\ ?\[)
+                    (or space
+                        ;; closing punctuation
+                        (category ?}) (category ?\])
+                        ;; dashes, other punctuation
+                        (category ?-) (category ?,)
+                        ;; Chinese, Japanese, and other breakable
+                        ;; characters
+                        (category ?|)
                        line-end)))))
            (when (re-search-forward closing-re nil t)
              (let ((closing (match-end 1)))
diff --git a/lisp/org.el b/lisp/org.el
index 910c075cd..720c8abf6 100644
--- a/lisp/org.el
+++ b/lisp/org.el
@@ -3858,10 +3858,22 @@ (defun org-set-emph-re (var val)
	 (body (if (<= nl 0) body
		 (format "%s*?\\(?:\n%s*?\\)\\{0,%d\\}" body body nl)))
	 (template
-	  (format (concat "\\([%s]\\|^\\)" ;before markers
+          ;; See `org-element--parse-generic-emphasis'
+	  (format (concat "\\(%s\\)" ;before markers
			  "\\(\\([%%s]\\)\\([^%s]\\|[^%s]%s[^%s]\\)\\3\\)"
			  "\\([%s]\\|$\\)") ;after markers
-		  pre border border body border post)))
+		  (if pre (format "[%s]\\|^" pre)
+                    (rx (or line-start space
+                            (category ?{) (category ?\[)
+                            (category ?-) (category ?,)
+                            (category ?|))))
+                  border border body border
+                  (if post (format "[%s]\\|$" post)
+                    (rx (or space
+                            (category ?}) (category ?\])
+                            (category ?-) (category ?,)
+                            (category ?|)
+                            line-end))))))
      (setq org-emph-re (format template "*/_+"))
      (setq org-verbatim-re (format template "=~")))))

@@ -3869,7 +3881,7 @@ (defun org-set-emph-re (var val)
;; set this option proved cumbersome.  See this message/thread:
;; https://orgmode.org/list/[email protected]
(defvar org-emphasis-regexp-components
-  '("-[:space:]('\"{" "-[:space:].,:!?;'\")}\\[" "[:space:]" "." 1)
+  '(nil nil "[:space:]" "." 1)
  "Components used to build the regular expression for FONTIFYING emphasis.
WARNING: This variable only affects visual fontification, but does not
change Org markup.  For example, it does not affect how emphasis markup
@@ -3882,7 +3894,9 @@ (defvar org-emphasis-regexp-components
specify what is allowed/forbidden in each part:

pre          Chars allowed as prematch.  Beginning of line will be allowed too.
+             nil means use parser defaults.
post         Chars allowed as postmatch.  End of line will be allowed too.
+             nil means use parser defaults.
border       The chars *forbidden* as border characters.
body-regexp  A regexp like \".\" to match a body character.  Don't use
             non-shy groups here, and don't allow newline here.
@@ -5127,6 +5141,9 @@ (define-derived-mode org-mode outline-mode "Org"
    (org-install-agenda-files-menu))
  (setq-local outline-regexp org-outline-regexp)
  (setq-local outline-level 'org-outline-level)
+  (require 'org-element)
+  (defvar org-element-category-table) ; org-element.el
+  (set-category-table org-element-category-table)
  ;; Initialize cache.
  (org-element-cache-reset)
  (when (and org-element-cache-persistent
@@ -5402,8 +5419,14 @@ (defsubst org-rear-nonsticky-at (pos)

(defun org-do-emphasis-faces (limit)
  "Run through the buffer and emphasize strings."
-  (let ((quick-re (format "\\([%s]\\|^\\)\\([~=*/_+]\\)"
-			  (car org-emphasis-regexp-components))))
+  (let ((quick-re (format "\\(%s\\)\\([~=*/_+]\\)"
+			  (if (car org-emphasis-regexp-components)
+                              (format "[%s]\\|^" (car org-emphasis-regexp-components))
+                            ;; See `org-element--parse-generic-emphasis'
+                            (rx (or line-start space
+                                    (category ?{) (category ?\[)
+                                    (category ?-) (category ?,)
+                                    (category ?|)))))))
    (catch :exit
      (while (re-search-forward quick-re limit t)
	(let* ((marker (match-string 2))
@@ -5413,24 +5436,24 @@ (defun org-do-emphasis-faces (limit)
		  (and
		   ;; Do not match table hlines.
		   (not (and (equal marker "+")
-			     (org-match-line
-			      "[ \t]*\\(|[-+]+|?\\|\\+[-+]+\\+\\)[ \t]*$")))
+			   (org-match-line
+			    "[ \t]*\\(|[-+]+|?\\|\\+[-+]+\\+\\)[ \t]*$")))
		   ;; Do not match headline stars.  Do not consider
		   ;; stars of a headline as closing marker for bold
		   ;; markup either.
		   (not (and (equal marker "*")
-			     (save-excursion
-			       (forward-char)
-			       (skip-chars-backward "*")
-			       (looking-at-p org-outline-regexp-bol))))
+			   (save-excursion
+			     (forward-char)
+			     (skip-chars-backward "*")
+			     (looking-at-p org-outline-regexp-bol))))
		   ;; Match full emphasis markup regexp.
		   (looking-at (if verbatim? org-verbatim-re org-emph-re))
		   ;; Do not span over paragraph boundaries.
		   (not (string-match-p org-element-paragraph-separate
-					(match-string 2)))
+				      (match-string 2)))
		   ;; Do not span over cells in table rows.
		   (not (and (save-match-data (org-match-line "[ \t]*|"))
-			     (string-match-p "|" (match-string 4))))))
+			   (string-match-p "|" (match-string 4))))))
	    (pcase-let ((`(,_ ,face ,_) (assoc marker org-emphasis-alist))
			(m (if org-hide-emphasis-markers 4 2)))
	      (font-lock-prepend-text-property
@@ -5495,12 +5518,26 @@ (defun org-emphasize (&optional char)
    (setq string (concat s string s))
    (when beg (delete-region beg end))
    (unless (or (bolp)
-		(string-match (concat "[" (nth 0 erc) "\n]")
-			      (char-to-string (char-before (point)))))
+		(string-match
+                 (if (nth 0 erc) (concat "[" (nth 0 erc) "\n]")
+                   ;; See `org-element--parse-generic-emphasis'
+                   (rx (or space
+                           (category ?{) (category ?\[)
+                           (category ?-) (category ?,)
+                           (category ?|)
+                           "\n")))
+		 (char-to-string (char-before (point)))))
      (insert " "))
    (unless (or (eobp)
-		(string-match (concat "[" (nth 1 erc) "\n]")
-			      (char-to-string (char-after (point)))))
+		(string-match
+                 ;; See `org-element--parse-generic-emphasis'
+                 (if (nth 1 erc) (concat "[" (nth 1 erc) "\n]")
+                   (rx (or space
+                           (category ?}) (category ?\])
+                           (category ?-) (category ?,)
+                           (category ?|)
+                           "\n")))
+		 (char-to-string (char-after (point)))))
      (insert " ") (backward-char 1))
    (insert string)
    (and move (backward-char 1))))
--
2.50.1
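The patch above fills its category table from the Unicode general-category property. To check how an individual character would be classified, something like the following may be handy. It is a sketch only: my/emphasis-boundary-class is a hypothetical helper that looks up one character at a time with get-char-code-property, whereas the patch walks the whole property table.

```elisp
;; Hypothetical helper mirroring the general-category mapping in the patch.
(defun my/emphasis-boundary-class (char)
  "Return the boundary class the patch would assign to CHAR, or nil."
  (pcase (get-char-code-property char 'general-category)
    ('Ps 'opening-punctuation)
    ('Pe 'closing-punctuation)
    ('Pi 'initial-quote)
    ('Pf 'final-quote)
    ('Pd 'dash)
    ('Po 'other-punctuation)))

(my/emphasis-boundary-class ?「)   ; CJK opening bracket
(my/emphasis-boundary-class ?」)   ; CJK closing bracket
(my/emphasis-boundary-class ?。)   ; ideographic full stop
(my/emphasis-boundary-class ?中)   ; a Han character, general category Lo
```

Han characters fall outside these punctuation classes, which is why the patch additionally allows the can-break category.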

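Note the difference between the HTML export result and how the Org buffer renders.

Of course, I feel this problem cannot be avoided as long as Org's highlighting still relies on regular expressions…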


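Without the patch:

With the patch:

Judging from the export results, the main difference between the two is this:

For Chinese content in the pattern XaXbXcX, the unpatched version treats aXbXc as a single unit, while the patched version treats it as three parts: XaX, b, and XcX. For English content, nothing changes before and after the patch.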
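The XaXbXcX behavior can be reproduced with a simplified scanner. This is a sketch under stated assumptions: my/bold-bodies is a hypothetical function, it handles only the * marker, and it accepts only whitespace, line boundaries, and (optionally) can-break characters as boundaries, so it is much cruder than the real parser.

```elisp
;; Hypothetical, simplified model of the emphasis scan for `*' only.
(defun my/bold-bodies (text breakable-ok)
  "Return the bold bodies found in TEXT, in order.
When BREAKABLE-OK is non-nil, can-break characters also count as
boundaries, as in the patch."
  (let* ((extra (if breakable-ok "\\|\\c|" ""))
         (re (concat "\\(?:^\\|[[:space:]]" extra "\\)"
                     ;; Body: non-space, then lazily as little as possible.
                     "\\*\\([^[:space:]]\\(?:.*?[^[:space:]]\\)??\\)\\*"
                     "\\(?:$\\|[[:space:]]" extra "\\)"))
         (start 0)
         bodies)
    (while (string-match re text start)
      (push (match-string 1 text) bodies)
      ;; Resume right after the closing mark, so the character that
      ;; followed it can also precede the next opening mark.
      (setq start (1+ (match-end 1))))
    (nreverse bodies)))

(my/bold-bodies " *甲*乙*丙* " nil)  ; unpatched behavior
(my/bold-bodies " *甲*乙*丙* " t)    ; patched behavior
(my/bold-bodies " *a*b*c* " t)       ; English text
```

In this model the first call yields one body spanning 甲*乙*丙, the second yields 甲 and 丙 separately, and the English input yields a*b*c either way.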

I also noticed something odd:

*+ab+ is rendered as bold, even though the * is unpaired.


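Latest progress:

Ihor has fixed the issues above. You can try this patch: Re: [BUG] Issues with Chinese (and potentially Japanese) text inline mar

However, inline markup can still be unintuitive. In unpatched Org-mode, the following content is rendered as shown below: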

That said, I tried
冰淇凌*。 (Hello *world* foo.
And with the new patch
"*。 (Hello *world*" is bold.

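With the patch:

Notice that once the patch is applied, the * at the start of the original second line is taken as the beginning of bold text, because CJK and similar characters are now allowed in front of a *. One way to fix this is to add a ZWS (zero width space) after the first *.

The fourth line can be handled the same way.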
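The ZWS workaround can be scripted along these lines. This is a sketch with hypothetical names (my/zws, my/escape-first-marker, my/insert-zws); it only inserts the character, and whether that suppresses the markup depends on Org treating U+200B as whitespace, as described above.

```elisp
;; Hypothetical helpers for the ZWS workaround.
(defconst my/zws (string #x200B)
  "ZERO WIDTH SPACE (U+200B) as a one-character string.")

(defun my/escape-first-marker (text marker)
  "Return TEXT with a ZWS inserted right after the first MARKER.
TEXT is returned unchanged when MARKER does not occur."
  (if (string-match (regexp-quote marker) text)
      (concat (substring text 0 (match-end 0))
              my/zws
              (substring text (match-end 0)))
    text))

(defun my/insert-zws ()
  "Insert a zero width space at point."
  (interactive)
  (insert my/zws))

(my/escape-first-marker "冰淇凌*。 (Hello *world* foo." "*")
```

The returned string looks identical on screen but carries an invisible character after the first *, which is part of why this workaround is hard to spot when reading the source.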

Ihor is currently thinking about whether there is a more intuitive approach that avoids relying on ZWS.


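Implementing [[fmt:bold][text]] with tp.el, written by a member of our chat group, works nicely: it can be displayed as a link with the markup hidden, and on export it is resolved into a different format for each backend.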
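A link-based approach of this kind could be sketched with Org's own link API. To be clear, this is not tp.el's implementation, which is not shown in this thread: the fmt link type, my/fmt-link-export, and the output formats are all assumptions made for illustration.

```elisp
(require 'org)
(require 'ox)
(require 'ox-html)
(require 'ox-latex)

;; Hypothetical export function for a hypothetical "fmt" link type.
(defun my/fmt-link-export (path desc backend &optional _info)
  "Export a fmt: link. PATH names the style and DESC is the text."
  (let ((text (or desc path)))
    (cond
     ((not (equal path "bold")) text)   ; unknown styles pass through
     ((org-export-derived-backend-p backend 'html)
      (format "<b>%s</b>" text))
     ((org-export-derived-backend-p backend 'latex)
      (format "\\textbf{%s}" text))
     (t (format "*%s*" text)))))

(org-link-set-parameters
 "fmt"
 :export #'my/fmt-link-export
 ;; Show bold links in the `bold' face, other styles as normal links.
 :face (lambda (path) (if (equal path "bold") 'bold 'org-link)))
```

Because the link brackets delimit the text explicitly, no boundary characters are needed around it, which is what makes this attractive for unspaced CJK text.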


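Links are a fairly good solution.

I discussed this with Ihor two months ago and there has been no follow-up since. So far I still have not thought of a good way to fix the unintuitive markup. The earlier patch fixed highlighting for unspaced Chinese text but introduced unintuitive highlighting in exchange.

Whether or not the patch is applied, the highlighting problems you run into can be worked around with ZWS, but the way ZWS is used without the patch may actually be the more intuitive one.

@yantar92, do you have any thoughts now? I probably do not :rofl:, but I will consider implementing the new @{XXX} syntax that we discussed and never implemented.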

Link is not a good solution because link description has limits on what can be inside in terms of markup. For example, line breaks are not allowed inside link descriptions. There are also issues with escaping ‘[’ and ‘]’. (We have discussed this on the mailing list in the past).

As for unintuitive highlighting, it is indeed a problem, but not a new problem. We have similar issues in English. So, I do not see it as a show stopper (although it would be nice to come up with something better; but alas).

For @{…} markup, it would be nice even if my proposed patch is in place - one of my hopes with this syntax was addressing edge cases with markup. The idea is that @{…} can be used in rare occasions when we have ambiguity in the normal markup.




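Makes sense. As a stopgap I think it is enough, but it is far from general.

Hopefully we (and possibly others) will find time to implement it in a future version of Org-mode :slight_smile: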

We have an incomplete patch implementing this feature. One can start from there and finish the work. See the branch and my comments on that patch in Re: Experimental public branch for inline special blocks - Ihor Radchenko


