How do I save web page content into org-mode?

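zhixing · July 19, 2017, 08:30

As the title says: how do I save web page content into org-mode?

Web pages often contain a lot of useful content, especially content with links and tables. How can I save it cleanly into org-mode?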

smallst · July 19, 2017, 08:38

Firefox has an org-mode capture extension; I'm not sure if that's what you want. At the moment I only use it to bookmark web pages into org-agenda.

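zhixing · July 19, 2017, 08:49

I've used it, but I didn't find it very useful: it only copied the link, so I uninstalled it and haven't used it since.

I'd like to quickly save a passage of text together with its links, and also save some tables. Is there a good way to do that?

Are you still using that extension? Does it work well? Do you capture whole pages, or only parts of a page?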

smallst · July 19, 2017, 09:20

Whole pages... and I just realized I haven't used it in a long time either, lol

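xuchunyang · July 19, 2017, 09:58

Some packages use Pandoc¹ to convert an entire HTML page to Org mode.

You could try getting the HTML of just the selected region and converting that to Org mode, for example by copying it with the browser's developer tools and then converting it with Pandoc: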

$ pbpaste | pandoc -f html -t org
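
Since `pbpaste` is the macOS clipboard command, the same conversion can also be wrapped in a small script so the HTML can come from anywhere. The following is only a minimal Node.js sketch: the function name `htmlToOrg` and its `run` parameter are hypothetical, and it assumes `pandoc` is installed and on the `PATH`. The only Pandoc flags used are the `-f html -t org` from the command above.

```javascript
const { spawnSync } = require("child_process");

// Convert an HTML fragment to Org text by piping it through pandoc.
// `run` defaults to spawnSync; it is a parameter only so the function
// can be exercised without pandoc installed (hypothetical design choice).
function htmlToOrg(html, run = spawnSync) {
  const result = run("pandoc", ["-f", "html", "-t", "org"], {
    input: html,       // fed to pandoc on stdin
    encoding: "utf8",  // return stdout/stderr as strings
  });
  if (result.error) throw result.error; // e.g. pandoc not found
  if (result.status !== 0) {
    throw new Error(`pandoc exited with status ${result.status}: ${result.stderr}`);
  }
  return result.stdout;
}

// Possible usage:
//   console.log(htmlToOrg('<p>See <a href="https://orgmode.org">Org</a></p>'));
```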

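There are probably also ready-made browser extensions, bookmarklets and the like that do the conversion in JavaScript; you can look into them depending on your needs.

¹ Pandoc's conversion from HTML to Org mode is not perfect.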

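zhixing · July 19, 2017, 10:48

There's no need to convert the whole page, is there? That's too much trouble; a lot of unnecessary elements come along with it.

OneNote and Word handle this fairly well, and OneNote more or less counts as plain text too, doesn't it?

These are all plain-text formats, so shouldn't this be easy? It seems not many people need it.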

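xuchunyang · July 19, 2017, 11:08

Whether it's necessary depends on your needs. What do you mean by "trouble"?

No idea; I've never used OneNote and I don't know much about Word either.

I don't understand what "easy" or "not easy" refers to. Implementing it, or using it?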

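zhixing · July 19, 2017, 12:56

By "trouble" I mean that a lot of content and links I don't need end up in the org file, when all I want out of a long text may be one short passage.

HTML and Org are both plain-text formats. Org can easily be exported to HTML, but not the other way around (maybe there is little documentation, or I just haven't found it). Personally I think converting HTML to Org should be easy to implement.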

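xuchunyang · July 19, 2017, 14:57

If you don't need it, just don't use it. There's no need to explain that it's unnecessary and troublesome.

The Pandoc approach mentioned above can do it. If you think it's easy, give it a try.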

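zhixing · July 19, 2017, 15:26

Right, I wasn't trying to argue that it's unnecessary or troublesome. I simply have no use for that feature, so I mentioned it. Mainly I wanted to ask whether anyone has a better method. I didn't mean to complain; I just didn't express myself clearly.

I tried Pandoc and it does work. I looked at the source of the package you mentioned and at the command you gave; implementing this isn't hard. I suppose the feature is just a niche one.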

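Voleking · July 20, 2017, 08:06

Actually, it can be done: highlight the content you want to convert and save it to Org with the extension mentioned above. That said, Pandoc's HTML to Org mode conversion is only passable.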

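zhixing · July 20, 2017, 09:06

How exactly do I set that up?

I looked at their code, and the implementation is simple.

I can also do it by hand with Pandoc, but I don't know Lisp well, so I don't know how to modify the code.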

Voleking · July 20, 2017, 13:53

The README explains how to save via a bookmark:

This bookmarklet captures what is currently selected in the browser. Or if nothing is selected, it just captures the page’s URL and title.

javascript:location.href%20=%20'org-protocol:///capture-html?template=w&url='%20+%20encodeURIComponent(location.href)%20+%20'&title='%20+%20encodeURIComponent(document.title%20||%20"[untitled%20page]")%20+%20'&body='%20+%20encodeURIComponent(function%20()%20{var%20html%20=%20"";%20if%20(typeof%20window.getSelection%20!=%20"undefined")%20{var%20sel%20=%20window.getSelection();%20if%20(sel.rangeCount)%20{var%20container%20=%20document.createElement("div");%20for%20(var%20i%20=%200,%20len%20=%20sel.rangeCount;%20i%20<%20len;%20++i)%20{container.appendChild(sel.getRangeAt(i).cloneContents());}%20html%20=%20container.innerHTML;}}%20else%20if%20(typeof%20document.selection%20!=%20"undefined")%20{if%20(document.selection.type%20==%20"Text")%20{html%20=%20document.selection.createRange().htmlText;}}%20var%20relToAbs%20=%20function%20(href)%20{var%20a%20=%20document.createElement("a");%20a.href%20=%20href;%20var%20abs%20=%20a.protocol%20+%20"//"%20+%20a.host%20+%20a.pathname%20+%20a.search%20+%20a.hash;%20a.remove();%20return%20abs;};%20var%20elementTypes%20=%20[['a',%20'href'],%20['img',%20'src']];%20var%20div%20=%20document.createElement('div');%20div.innerHTML%20=%20html;%20elementTypes.map(function(elementType)%20{var%20elements%20=%20div.getElementsByTagName(elementType[0]);%20for%20(var%20i%20=%200;%20i%20<%20elements.length;%20i++)%20{elements[i].setAttribute(elementType[1],%20relToAbs(elements[i].getAttribute(elementType[1])));}});%20return%20div.innerHTML;}());
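
One step in the bookmarklet above is easy to miss: before sending the selection, it rewrites every `href` on `a` elements and every `src` on `img` elements to an absolute URL, so the links still work once they are saved outside the page. The bookmarklet does this with a temporary `a` element. The sketch below shows the same idea using the standard `URL` constructor instead, which is a swapped-in technique and not what the README code does; the name `toAbsolute` is hypothetical.

```javascript
// Resolve a possibly relative link against the URL of the page it came from.
// Returns the input unchanged if it cannot be parsed as a URL.
function toAbsolute(href, pageUrl) {
  try {
    return new URL(href, pageUrl).href;
  } catch (err) {
    return href;
  }
}

// Possible usage inside a browser, on a detached copy of the selection:
//   for (const a of container.getElementsByTagName("a")) {
//     a.setAttribute("href", toAbsolute(a.getAttribute("href"), location.href));
//   }
```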

However, I usually only save a piece of text, and for that you can use the protocol directly:

javascript:location.href='org-protocol://capture?template=l&url='+encodeURIComponent(location.href)+'&title='+encodeURIComponent(document.title%20%7C%7C%20%22%5Buntitled%20page%5D%22)+'&body='+encodeURIComponent(window.getSelection())
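
Written out for readability, the shorter bookmarklet above does roughly the following. This is only a sketch: `buildCaptureUrl` is a hypothetical helper name, and the template key `l` is taken from the bookmarklet and has to match a capture template in your own Emacs configuration.

```javascript
// Build an org-protocol capture URL from its parts.
// Every value is percent-encoded, as in the bookmarklet above.
function buildCaptureUrl({ template = "l", url, title, body = "" }) {
  const params = [
    ["template", template],
    ["url", url],
    ["title", title || "[untitled page]"], // same fallback as the bookmarklet
    ["body", body],
  ];
  const query = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return `org-protocol://capture?${query}`;
}

// In a browser bookmarklet this would be used roughly as:
//   location.href = buildCaptureUrl({
//     url: location.href,
//     title: document.title,
//     body: String(window.getSelection()),
//   });
```

Note that `window.getSelection()` converted to a string yields plain text only, which is why this variant loses links and tables while the capture-html bookmarklet keeps them.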

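zmonster · July 23, 2017, 05:54

My approach is to save the content I need to Evernote with the Evernote browser extension, then use geeknote to sync it to Dropbox periodically and convert it to org-mode format. Sometimes I still need to tidy up the formatting by hand.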

BoRhap · January 28, 2021, 13:36

See the method in the first answer here: Org mode - Parsing rich HTML directly when pasting?
It works on both macOS and Linux; on Linux you need to recompile xclip.

osascript gets the HTML text from the clipboard. It is hex encoded, so
perl converts the hex to a string
We could convert that HTML to Org directly with pandoc, but the HTML is full of complicated tags and therefore produces a ton of Org code. In order to simplify the HTML to the minimal set of tags needed to capture the formatting, I
Convert the HTML to json, and then
Convert the json to Org (these two steps simplify the HTML).
Replace non-standard spaces with standard ones.
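
Two of the steps quoted above are simple text transformations and can be sketched on their own. The following Node.js sketch is not the code from that answer: `hexToText` and `normalizeSpaces` are hypothetical names, it assumes the hex payload has already been extracted from the `osascript` output, and treating U+00A0 and U+202F as the "non-standard spaces" is an assumption.

```javascript
// Decode a hex string such as "3c703e" into UTF-8 text (Node.js Buffer API).
// Buffer.from stops silently at the first invalid hex pair.
function hexToText(hex) {
  return Buffer.from(hex, "hex").toString("utf8");
}

// Replace non-breaking spaces (U+00A0, U+202F) with ordinary spaces.
// Which characters count as "non-standard" is an assumption here.
function normalizeSpaces(text) {
  return text.replace(/[\u00a0\u202f]/g, " ");
}

// Possible usage, with the pandoc conversion happening in between:
//   const html = hexToText(hexPayload);
//   ... convert html to Org with pandoc ...
//   const org = normalizeSpaces(pandocOutput);
```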