:PROPERTIES:
:ID: f68dfc34-5349-42d1-8074-6c4be231a69b
:END:
#+title: org-web-tools
#+filetags: :ORG:EMACS:

org-web-tools is a toolbox for downloading web pages and working with them in Org mode. The package provides several commands and helper functions, listed below. The external program [[id:75ea690d-deee-4592-ae99-1c2385c208fb][pandoc]] is required to convert the HTML pages to Org files.

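Because the conversion depends on pandoc, it can help to check early that the program is reachable from Emacs. The following is only a minimal sketch: the warning text is made up for this note and is not part of org-web-tools.

#+begin_src emacs-lisp
  ;; Sketch: warn at startup when pandoc cannot be found.
  ;; `executable-find' searches `exec-path' and returns nil if nothing matches.
  (unless (executable-find "pandoc")
    (display-warning
     'org-web-tools
     "pandoc was not found in `exec-path'; HTML pages cannot be converted to Org"
     :warning))
#+end_src
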
* Commands
+-------------------------------------------+----------------------------------------+
|org-web-tools-insert-link-for-url          |Insert an Org-mode link to the URL in   |
|                                           |the clipboard or kill-ring. Downloads   |
|                                           |the page to get the HTML title.         |
+-------------------------------------------+----------------------------------------+
|org-web-tools-insert-web-page-as-entry     |Insert the web page for the URL in the  |
|                                           |clipboard or kill-ring as an Org-mode   |
|                                           |entry, as a sibling heading of the      |
|                                           |current entry.                          |
+-------------------------------------------+----------------------------------------+
|org-web-tools-read-url-as-org              |Display the web page for the URL in the |
|                                           |clipboard or kill-ring as Org-mode text |
|                                           |in a new buffer, processed with         |
|                                           |eww-readable.                           |
+-------------------------------------------+----------------------------------------+
|org-web-tools-convert-links-to-page-entries|Convert all URLs and Org links in       |
|                                           |current Org entry to Org headings, each |
|                                           |containing the web page content of that |
|                                           |URL, converted to Org-mode text and     |
|                                           |processed with eww-readable. This should|
|                                           |be called on an entry that solely       |
|                                           |contains a list of URLs or links.       |
+-------------------------------------------+----------------------------------------+
|org-web-tools-archive-attach               |Download archive of page at URL and     |
|                                           |attach with org-attach. If CHOOSE-FN is |
|                                           |non-nil (interactively, with universal  |
|                                           |prefix), prompt for the archive function|
|                                           |to use. If VIEW is non-nil              |
|                                           |(interactively, with two universal      |
|                                           |prefixes), view the archive immediately |
|                                           |after attaching. (See also org-board).  |
+-------------------------------------------+----------------------------------------+
|org-web-tools-archive-view                 |Open Zip file archive of web            |
|                                           |page. Extracts to a temp directory and  |
|                                           |opens with                              |
|                                           |browse-url-default-browser. Note, the   |
|                                           |extracted files are left on-disk in the |
|                                           |temp directory.                         |
+-------------------------------------------+----------------------------------------+

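The commands listed in the table can be bound to keys for quicker access. This is only a sketch: the =C-c w= prefix and the individual keys are arbitrary choices made for this note, and it assumes the package is installed so that the commands are autoloaded.

#+begin_src emacs-lisp
  ;; Sketch: hypothetical key bindings, active in Org buffers only.
  (with-eval-after-load 'org
    (define-key org-mode-map (kbd "C-c w l") #'org-web-tools-insert-link-for-url)
    (define-key org-mode-map (kbd "C-c w e") #'org-web-tools-insert-web-page-as-entry)
    (define-key org-mode-map (kbd "C-c w r") #'org-web-tools-read-url-as-org)
    (define-key org-mode-map (kbd "C-c w a") #'org-web-tools-archive-attach))
#+end_src
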
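* Troubleshooting

The attach command (~org-web-tools-archive-attach~) does not work out of the box because the default wget options are set incorrectly. The workaround is to replace the ~--execute robots=off~ entry in both wget option lists with ~-e robots=off~:
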
#+begin_src emacs-lisp
  (use-package org-web-tools
    :ensure t
    :config
    ;; Remove the long-form option from both wget option lists.
    (setq org-web-tools-archive-wget-options
          (delete "--execute robots=off" org-web-tools-archive-wget-options))
    (setq org-web-tools-archive-wget-html-only-options
          (delete "--execute robots=off" org-web-tools-archive-wget-html-only-options))

    ;; Add the short form in its place.
    (add-to-list 'org-web-tools-archive-wget-options "-e robots=off")
    (add-to-list 'org-web-tools-archive-wget-html-only-options "-e robots=off"))
#+end_src

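Even with this change, the attach command still does not work when called without a prefix argument. Call it with a single universal prefix instead: type ~C-u~ once, then the command. You are then prompted for the archive function and can choose between an HTML-only archive and a tar archive that includes the page resources.
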
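To avoid typing the prefix each time, the prefixed call can be wrapped in a small command. This is a sketch only; the name =my/org-web-tools-archive-attach-choose= is hypothetical and not part of org-web-tools.

#+begin_src emacs-lisp
  (defun my/org-web-tools-archive-attach-choose ()
    "Call `org-web-tools-archive-attach' as if typed with a single C-u prefix."
    (interactive)
    ;; '(4) is the raw prefix argument produced by one C-u.
    (let ((current-prefix-arg '(4)))
      (call-interactively #'org-web-tools-archive-attach)))
#+end_src
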
* Functions
These internal functions are used by the commands above and may be useful for building your own commands.

+--------------------------------------+------------------------------+
|org-web-tools--dom-to-html            |Return parsed HTML DOM as an  |
|                                      |HTML string. Note: This is an |
|                                      |approximation and is not      |
|                                      |necessarily correct HTML      |
|                                      |(e.g. IMG tags may be rendered|
|                                      |with a closing “</img>” tag). |
+--------------------------------------+------------------------------+
|org-web-tools--eww-readable           |Return “readable” part of HTML|
|                                      |with title.                   |
+--------------------------------------+------------------------------+
|org-web-tools--get-url                |Return content for URL as     |
|                                      |string.                       |
+--------------------------------------+------------------------------+
|org-web-tools--html-title             |Return title of HTML page.    |
+--------------------------------------+------------------------------+
|org-web-tools--html-to-org-with-pandoc|Return string of HTML         |
|                                      |converted to Org with         |
|                                      |Pandoc. When SELECTOR is      |
|                                      |non-nil, the HTML is filtered |
|                                      |using esxml-query SELECTOR and|
|                                      |re-rendered to HTML with      |
|                                      |org-web-tools--dom-to-html,   |
|                                      |which see.                    |
+--------------------------------------+------------------------------+
|org-web-tools--url-as-readable-org    |Return string containing Org  |
|                                      |entry of URL’s web page       |
|                                      |content. Content is processed |
|                                      |with eww-readable and         |
|                                      |Pandoc. Entry will be a       |
|                                      |top-level heading, with       |
|                                      |article contents below a      |
|                                      |second-level “Article”        |
|                                      |heading, and a timestamp in   |
|                                      |the first-level entry for     |
|                                      |writing comments.             |
+--------------------------------------+------------------------------+
|org-web-tools--demote-headings-below  |Demote all headings in buffer |
|                                      |so the highest level is below |
|                                      |LEVEL.                        |
+--------------------------------------+------------------------------+
|org-web-tools--get-first-url          |Return URL in clipboard, or   |
|                                      |first URL in the kill-ring, or|
|                                      |nil if none.                  |
+--------------------------------------+------------------------------+
|org-web-tools--read-url               |Return a URL by searching at  |
|                                      |point, then in clipboard, then|
|                                      |in kill-ring, and finally     |
|                                      |prompting the user.           |
+--------------------------------------+------------------------------+
|org-web-tools--read-org-bracket-link  |Return (TARGET . DESCRIPTION) |
|                                      |for Org bracket LINK or next  |
|                                      |link on current line.         |
+--------------------------------------+------------------------------+
|org-web-tools--remove-dos-crlf        |Remove all DOS CRLF (^M) in   |
|                                      |buffer.                       |
+--------------------------------------+------------------------------+
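
As an illustration of building your own command from these functions, the following sketch echoes the HTML title of the first URL found in the clipboard or kill-ring. The command name =my/org-web-tools-show-title= is hypothetical, and the argument lists (no argument for =org-web-tools--get-first-url=, a URL string for =org-web-tools--get-url=, an HTML string for =org-web-tools--html-title=) are assumptions based on the descriptions in the table. Since the double-dash names mark internal functions, they may change between package versions.

#+begin_src emacs-lisp
  (require 'org-web-tools)

  (defun my/org-web-tools-show-title ()
    "Echo the HTML title of the first URL in the clipboard or kill-ring."
    (interactive)
    (let ((url (org-web-tools--get-first-url)))
      (if (null url)
          (message "No URL found in clipboard or kill-ring")
        ;; Download the page, then extract its title from the HTML string.
        (let* ((html (org-web-tools--get-url url))
               (title (org-web-tools--html-title html)))
          (message "%s: %s" url (or title "(no title)"))))))
#+end_src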