[Discussion] One possible way to improve the English reading experience

Haha, not me. Words that appear often get remembered just from repeated exposure; if a word never shows up again, there was no need to memorize it anyway, so nothing is lost.

@Jousimies That said, if you have a good Anki integration workflow, do share it with everyone; someone might need it. :smirk:

@ginqi7 On your end, when you use dictionary-overlay-jump-prev(next)-unknown-word, does hl-line-mode ever end up out of sync with the cursor? ↓↓

Even calling (websocket-bridge-call-buffer "jump_next_unknown_word") directly, I see the same desync; I'm not sure what causes it.

(setq dictionary-overlay-translators '("local" "darwin" "sdcv" "web"))

After using it for a while: putting local and darwin (the native dictionary) first immediately raises translation quality, with far fewer garbled and mistranslated entries from the web backend.

Today I tried dictionary-overlay-install on Windows:

Collecting six==1.16.0
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting snowballstemmer==2.2.0
  Using cached snowballstemmer-2.2.0-py2.py3-none-any.whl (93 kB)
Collecting tokenizers==0.13.2
  Using cached tokenizers-0.13.2.tar.gz (359 kB)
  Installing build dependencies: started
  Installing build dependencies: still running...
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting websocket-bridge-python==0.0.2
  Using cached websocket_bridge_python-0.0.2-py3-none-any.whl
Collecting websockets==10.4
  Using cached websockets-10.4-cp311-cp311-win_amd64.whl (101 kB)
Building wheels for collected packages: tokenizers
  Building wheel for tokenizers (pyproject.toml): started
  Building wheel for tokenizers (pyproject.toml): finished with status 'error'
  error: subprocess-exited-with-error
  
  Building wheel for tokenizers (pyproject.toml) did not run successfully.
  exit code: 1
  
  [51 lines of output]
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-cpython-311
  creating build\lib.win-amd64-cpython-311\tokenizers
  copying py_src\tokenizers\__init__.py -> build\lib.win-amd64-cpython-311\tokenizers
  creating build\lib.win-amd64-cpython-311\tokenizers\models
  copying py_src\tokenizers\models\__init__.py -> build\lib.win-amd64-cpython-311\tokenizers\models
  creating build\lib.win-amd64-cpython-311\tokenizers\decoders
  copying py_src\tokenizers\decoders\__init__.py -> build\lib.win-amd64-cpython-311\tokenizers\decoders
  creating build\lib.win-amd64-cpython-311\tokenizers\normalizers
  copying py_src\tokenizers\normalizers\__init__.py -> build\lib.win-amd64-cpython-311\tokenizers\normalizers
  creating build\lib.win-amd64-cpython-311\tokenizers\pre_tokenizers
  copying py_src\tokenizers\pre_tokenizers\__init__.py -> build\lib.win-amd64-cpython-311\tokenizers\pre_tokenizers
  creating build\lib.win-amd64-cpython-311\tokenizers\processors
  copying py_src\tokenizers\processors\__init__.py -> build\lib.win-amd64-cpython-311\tokenizers\processors
  creating build\lib.win-amd64-cpython-311\tokenizers\trainers
  copying py_src\tokenizers\trainers\__init__.py -> build\lib.win-amd64-cpython-311\tokenizers\trainers
  creating build\lib.win-amd64-cpython-311\tokenizers\implementations
  copying py_src\tokenizers\implementations\base_tokenizer.py -> build\lib.win-amd64-cpython-311\tokenizers\implementations
  copying py_src\tokenizers\implementations\bert_wordpiece.py -> build\lib.win-amd64-cpython-311\tokenizers\implementations
  copying py_src\tokenizers\implementations\byte_level_bpe.py -> build\lib.win-amd64-cpython-311\tokenizers\implementations
  copying py_src\tokenizers\implementations\char_level_bpe.py -> build\lib.win-amd64-cpython-311\tokenizers\implementations
  copying py_src\tokenizers\implementations\sentencepiece_bpe.py -> build\lib.win-amd64-cpython-311\tokenizers\implementations
  copying py_src\tokenizers\implementations\sentencepiece_unigram.py -> build\lib.win-amd64-cpython-311\tokenizers\implementations
  copying py_src\tokenizers\implementations\__init__.py -> build\lib.win-amd64-cpython-311\tokenizers\implementations
  creating build\lib.win-amd64-cpython-311\tokenizers\tools
  copying py_src\tokenizers\tools\visualizer.py -> build\lib.win-amd64-cpython-311\tokenizers\tools
  copying py_src\tokenizers\tools\__init__.py -> build\lib.win-amd64-cpython-311\tokenizers\tools
  copying py_src\tokenizers\__init__.pyi -> build\lib.win-amd64-cpython-311\tokenizers
  copying py_src\tokenizers\models\__init__.pyi -> build\lib.win-amd64-cpython-311\tokenizers\models
  copying py_src\tokenizers\decoders\__init__.pyi -> build\lib.win-amd64-cpython-311\tokenizers\decoders
  copying py_src\tokenizers\normalizers\__init__.pyi -> build\lib.win-amd64-cpython-311\tokenizers\normalizers
  copying py_src\tokenizers\pre_tokenizers\__init__.pyi -> build\lib.win-amd64-cpython-311\tokenizers\pre_tokenizers
  copying py_src\tokenizers\processors\__init__.pyi -> build\lib.win-amd64-cpython-311\tokenizers\processors
  copying py_src\tokenizers\trainers\__init__.pyi -> build\lib.win-amd64-cpython-311\tokenizers\trainers
  copying py_src\tokenizers\tools\visualizer-styles.css -> build\lib.win-amd64-cpython-311\tokenizers\tools
  running build_ext
  running build_rust
  error: can't find Rust compiler
  
  If you are using an outdated pip version, it is possible a prebuilt wheel is available for this package but pip is not able to install from it. Installing from the wheel would avoid the need for a Rust compiler.
  
  To update pip, run:
  
      pip install --upgrade pip
  
  and then retry package installation.
  
  If you did intend to build this package from source, try installing a Rust compiler from your system package manager and ensure it is on the PATH during installation. Alternatively, rustup (available at https://rustup.rs) is the recommended way to download and update the Rust compiler toolchain.
  [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

I separately tried upgrading pip and installing Rust, to no avail.

Looks like the tokenizers installation failed.

Try running pip install tokenizers on its own and see?

A question for @ginqi7: in my personal fork I added one rule: a token of five or more letters that contains no vowels is not considered a word. I wonder whether this can be written more concisely, and whether it could cause performance problems in large buffers?

import re

# in_or_stem_in and known_words come from the dictionary-overlay source.
def new_word_p(word: str) -> bool:
    if len(word) < 3:
        return False
    # Five or more letters and no vowels: treat as an abbreviation, not a word.
    if re.match(r"\b[^aeiou]{5,}\b", word, re.M | re.I):
        return False
    if re.search(r"[^a-z]", word, re.M | re.I):
        return False
    return not in_or_stem_in(word, known_words)

The \b shouldn't be needed: the words produced by splitting never contain whitespace. Performance shouldn't be a concern either; these are all matches against fairly short words.
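For reference, a minimal sketch with the \b anchors dropped and the patterns compiled once (which also settles the large-buffer performance worry); in_or_stem_in and known_words are assumed from the dictionary-overlay source, not defined here:

import re

VOWEL = re.compile(r"[aeiou]", re.I)      # does the token contain a vowel?
NON_ALPHA = re.compile(r"[^a-z]", re.I)   # any non-letter character?

def new_word_p(word: str) -> bool:
    if len(word) < 3:
        return False
    # Five or more letters with no vowel at all: abbreviation, not a word.
    if len(word) >= 5 and not VOWEL.search(word):
        return False
    if NON_ALPHA.search(word):
        return False
    return not in_or_stem_in(word, known_words)

For purely alphabetic tokens the behavior is the same as the regex version: a word of five or more letters with no vowel at all is rejected.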

Let me keep experimenting.

Another question: does Google Translate have a quota limit? I'm getting the message below, and I can't tell whether my network is genuinely bad (connectivity where I've been staying lately really is poor) or whether I've been throttled:

[Dictionary-overlay]web-translate error, check your network. or run (websocket-bridge-app-open-buffer 'dictionary-overlay) see the error details. [5 times]

Update: after running (websocket-bridge-app-open-buffer 'dictionary-overlay) I get: [screenshot]

You can check the messages in the Python log. It's probably not throttling; more likely your network is just flaky. That error text is a hint I emit from my own code.

Turns out git clone in the terminal is down too, so it must be my network.

One more question :grin: what does the "ran out of input" event refer to?

Odd, I've never run into that. No idea what's going on :rofl:

I plan to keep the log buffer visible for a day and see what other events I haven't come across yet.

Then you may want to manually comment out the print in run_and_log on the Python side, so it doesn't print too much.
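If run_and_log follows the usual websocket-bridge pattern (an assumption; check the actual body in the source), the change is just one commented line:

async def run_and_log(awaitable):
    # Sketch only: the real run_and_log in dictionary-overlay may differ.
    try:
        result = await awaitable
        # print(result)  # the chatty per-call print, commented out
    except Exception as e:
        print(e)  # keep errors visible so real failures still surface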


I still suspect Google Translate has temporarily throttled me; I'll watch it for a while.

One observation, though: if the web lookup fails, the new word only goes into the hash-table but no overlay is created? (As in the screenshot above: auto jump found the cursor position automatically, so the hash-table clearly works, but no overlay appears.)

You can get throttled even outside the GFW? :rofl:

The intended behavior is: if the lookup returns no result, skip the word. But I haven't actually looked at what happens in practice when a lookup comes back empty.
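On the Python side the intent would look roughly like this sketch; translate_word, add_overlay, and render_word are hypothetical names, not the actual dictionary-overlay API:

from typing import Optional

async def translate_word(word: str) -> Optional[str]:
    """Hypothetical stand-in for the translator chain; None on failure."""
    ...

def add_overlay(word: str, translation: str) -> None:
    """Hypothetical stand-in for the call that draws the overlay in Emacs."""
    ...

async def render_word(word: str) -> None:
    translation = await translate_word(word)
    if not translation:
        return  # lookup failed: word stays in the hash table, no overlay drawn
    add_overlay(word, translation)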

I think I may have hit a daily quota limit? I don't know whether such a mechanism even exists, and I have no hard evidence, haha. But over the past few days I've been pouring vocabulary into my knownwords.txt and opening hundreds of Wall Street Journal articles in elfeed, so maybe I queried a little too much? All I can do is wait 24h and check again.

When you get a chance, could you go offline and test whether rendering fails without a network? Or comment it out ↓

(setq dictionary-overlay-translators '("local" "darwin" "sdcv"
                                       ;; "web"
                                       ))

I've tried commenting it out, and that works.

Then I'm in for a long debugging journey... :cry:

Posting my personal configuration: I mainly use it for reading news in elfeed-entry-mode and eww. As it happens, manateelazycat updated popweb-dict just yesterday with a macro for quickly creating new dictionaries, so I wired it in right away. (Read, don't copy: it contains unpublished private configuration, so copying it verbatim will throw errors.)

First, popweb:

(use-package popweb
  :commands (popweb-org-roam-link-show
             popweb-latex-mode)
  :straight nil
  :config
  (require 'popweb-latex) (add-hook 'latex-mode-hook #'popweb-latex-mode)
  (require 'popweb-org-roam-link)
  (require 'popweb-url)
  (require 'popweb-dict)

  ;; NOTE 2022-12-05: personal API, local static html as url, demonstration only.
  (popweb-dict-create
   "youglish-api"
   (concat
    "file:///"
    (let ((temp-file (concat path-cache-dir "popweb/tmp.html")))
      (with-temp-file temp-file
        (insert-file-contents
         (concat path-emacs-dir "lisp/popweb-dict-yg-js.html"))
        (goto-char (point-min))
        (silenzio (replace-regexp "query" (concat word " :r"))))
      temp-file))
   "")

  ;; NOTE 2022-12-05: WIP
  (popweb-dict-create
   "forvo"
   "https://forvo.com/search/%s/en_usa/"
   (concat
    "window.scrollTo(0, 0); "
    "document.getElementsByTagName('html')[0].style.visibility = 'hidden'; "
    "document.getElementsByClassName('main_search')[0].style.visibility = 'visible'; "
    "document.getElementsByClassName('main_section')[0].style.visibility = 'visible'; "
    ;; "document.getElementsByClassName('left-content col')[0].style.visibility = 'visible'; "
    ;; "document.getElementsByTagName('header')[0].style.display = 'none'; "
    ;; "document.getElementsByClassName('contentPadding')[0].style.padding = '10px';"
    ))

  ;; NOTE 2022-12-05: far from perfect
  (popweb-dict-create
   "mw"
   "https://www.merriam-webster.com/dictionary/%s"
   (concat
    "window.scrollTo(0, 0); "
    "document.getElementsByTagName('html')[0].style.visibility = 'hidden'; "
    "document.getElementsByClassName('left-sidebar')[0].style.visibility = 'visible'; " ; ✓
    "document.getElementsByClassName('redesign-container')[0].style.visibility = 'visible'; "
    )))

Then dictionary-overlay:

(use-package dictionary-overlay
  :commands (dictionary-overlay-render-buffer)
  :straight nil
  :custom-face
  (dictionary-overlay-unknownword ((t :inherit font-lock-keyword-face)))
  (dictionary-overlay-translation ((t :inherit font-lock-comment-face)))
  :config
  (dictionary-overlay-start)
  
  (setq dictionary-overlay-translators '("local" "darwin" "sdcv" "web")
        dictionary-overlay-recenter-after-mark-and-jump 10
        dictionary-overlay-user-data-directory (concat path-emacs-private-dir "dictionary-overlay-data/")
        dictionary-overlay-just-unknown-words nil
        dictionary-overlay-auto-jump-after '(mark-word-known
                                             ;; mark-word-unknown
                                             render-buffer))

  (use-package popweb :straight nil :demand)

  (defvar dictionary-overlay-lookup-prefix-map
    (let ((map (make-sparse-keymap)))
      (define-key map (kbd "y") #'popweb-dict-youdao-pointer)
      (define-key map (kbd "u") #'popweb-dict-youglish-pointer)
      (define-key map (kbd "o") #'popweb-dict-forvo-pointer)
      (define-key map (kbd "k") #'popweb-dict-youglish-api-pointer)
      (define-key map (kbd "m") #'popweb-dict-mw-pointer)
      (define-key map (kbd "b") #'popweb-dict-bing-pointer)
      map)
    "Keymap for 3rd party dictionaries.")

  (defun dictionary-overlay-lookup-prefix-map ()
    "Transient keymap for fast lookup with different dictionaries."
    (interactive)
    (set-transient-map dictionary-overlay-lookup-prefix-map))

  :general
  (dictionary-overlay-map
   "p" nil
   "n" nil
   "m" nil
   "M" nil
   ;; --
   "j" #'dictionary-overlay-jump-prev-unknown-word
   "k" #'dictionary-overlay-lookup-prefix-map
   "l" #'dictionary-overlay-jump-next-unknown-word
   "L" (lambda () (interactive) (websocket-bridge-app-open-buffer 'dictionary-overlay))
   "o" #'dictionary-overlay-mark-word-smart
   "O" #'dictionary-overlay-mark-word-smart-reversely
   ;; --
   "a" #'dictionary-overlay-mark-word-unknown
   "." #'dictionary-overlay-jump-out-of-overlay
   "r" #'popweb-restart-process)
  (elfeed-show-mode-map
   "a" #'dictionary-overlay-mark-word-unknown
   "r" #'dictionary-overlay-restart
   "." #'dictionary-overlay-render-buffer)
  (eww-mode-map
   "a" #'dictionary-overlay-mark-word-unknown
   "r" #'dictionary-overlay-restart
   "." #'dictionary-overlay-render-buffer))

Looking through recent vocabulary, I noticed quite a few numbers in knownwords, probably from marking whole buffers as known; there are also some personal names.

Could there be a separate txt for numbers, personal names, place names and the like, filled via something like a dictionary-overlay-special command? Or at least ignore Arabic numerals?
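For the Arabic-numeral part at least, a small pre-filter would already help; this is a sketch only (looks_special is a hypothetical helper, and real name/place detection would need more context than a single token):

def looks_special(token: str) -> bool:
    """True for tokens that should never enter knownwords.
    Only the unambiguous case is handled: pure digit strings."""
    return token.isdigit()

# Example: filter tokens before they are marked as known.
tokens = ["2022", "Smith", "overlay"]
print([t for t in tokens if not looks_special(t)])  # ['Smith', 'overlay']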

Thanks for sharing; looks like I need to go learn use-package and straight.

btw: what does everyone use to quickly move to and select a word for marking?