org-web-tools
A toolbox for downloading web pages into Org mode. The package provides several commands and helper functions. The external program pandoc is required to convert the HTML pages to Org files.
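Since the conversion depends on pandoc, it can help to check for it at startup. The following is a minimal sketch, assuming pandoc is expected on Emacs's exec-path; the helper name my/pandoc-available-p is hypothetical and not part of org-web-tools.

```elisp
;; Hypothetical helper; not part of org-web-tools.
(defun my/pandoc-available-p ()
  "Return t if a pandoc executable is found on `exec-path', else nil."
  (and (executable-find "pandoc") t))

;; Warn early instead of failing later during HTML-to-Org conversion.
(unless (my/pandoc-available-p)
  (message "pandoc not found; org-web-tools needs it to convert HTML to Org"))
```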
Commands
| Command | Description |
|---------+-------------|
| org-web-tools-insert-link-for-url | Insert an Org-mode link to the URL in the clipboard or kill-ring. Downloads the page to get the HTML title. |
| org-web-tools-insert-web-page-as-entry | Insert the web page for the URL in the clipboard or kill-ring as an Org-mode entry, as a sibling heading of the current entry. |
| org-web-tools-read-url-as-org | Display the web page for the URL in the clipboard or kill-ring as Org-mode text in a new buffer, processed with eww-readable. |
| org-web-tools-convert-links-to-page-entries | Convert all URLs and Org links in current Org entry to Org headings, each containing the web page content of that URL, converted to Org-mode text and processed with eww-readable. This should be called on an entry that solely contains a list of URLs or links. |
| org-web-tools-archive-attach | Download archive of page at URL and attach with org-attach. If CHOOSE-FN is non-nil (interactively, with universal prefix), prompt for the archive function to use. If VIEW is non-nil (interactively, with two universal prefixes), view the archive immediately after attaching. (See also org-board). |
| org-web-tools-archive-view | Open Zip file archive of web page. Extracts to a temp directory and opens with browse-url-default-browser. Note, the extracted files are left on-disk in the temp directory. |
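The commands above are usually called with M-x, but they can also be bound to keys. The following is a sketch only: the "C-c w" prefix and the individual keys are hypothetical choices, not defaults of the package, and it assumes org-web-tools is installed so that the commands can be loaded.

```elisp
;; Key choices below are hypothetical; pick any free keys you like.
(with-eval-after-load 'org
  ;; These commands insert into or act on the current Org buffer.
  (define-key org-mode-map (kbd "C-c w l") #'org-web-tools-insert-link-for-url)
  (define-key org-mode-map (kbd "C-c w e") #'org-web-tools-insert-web-page-as-entry)
  (define-key org-mode-map (kbd "C-c w a") #'org-web-tools-archive-attach))

;; This command opens a new buffer, so a global binding is convenient.
(global-set-key (kbd "C-c w r") #'org-web-tools-read-url-as-org)
```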
Troubleshooting
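The attach command (org-web-tools-archive-attach) does not work out of the box because the default wget options are set incorrectly: they contain the string "--execute robots=off", which the following configuration replaces with "-e robots=off":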
(use-package org-web-tools
  :ensure t
  :config
  ;; Remove the option string that breaks the wget call.
  (setq org-web-tools-archive-wget-options
        (delete "--execute robots=off" org-web-tools-archive-wget-options))
  (setq org-web-tools-archive-wget-html-only-options
        (delete "--execute robots=off" org-web-tools-archive-wget-html-only-options))
  ;; Add the short form of the same option instead.
  (add-to-list 'org-web-tools-archive-wget-options "-e robots=off")
  (add-to-list 'org-web-tools-archive-wget-html-only-options "-e robots=off"))
Even with this fix, the plain attach command still cannot be used. It only works when called with a universal prefix: type C-u once, then the command. You are then prompted for the archive function and can choose either HTML-only or tar with resources.
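To avoid typing the prefix every time, the prefixed call can be wrapped in a small command. This is a sketch under the assumption that the workaround described above applies to your setup; the command name my/org-web-tools-archive-attach-choose is hypothetical.

```elisp
;; Hypothetical wrapper; not part of org-web-tools.
(defun my/org-web-tools-archive-attach-choose ()
  "Call `org-web-tools-archive-attach' as if invoked with one C-u.
This makes the command prompt for the archive function to use."
  (interactive)
  ;; '(4) is the raw prefix argument produced by a single C-u.
  (let ((current-prefix-arg '(4)))
    (call-interactively #'org-web-tools-archive-attach)))
```

Binding `current-prefix-arg` around `call-interactively` is the usual way to simulate a prefix argument, so the wrapped command behaves as if C-u had been typed.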
Functions
These are used in the commands above and may be useful in building your own commands.
| Function | Description |
|----------+-------------|
| org-web-tools--dom-to-html | Return parsed HTML DOM as an HTML string. Note: This is an approximation and is not necessarily correct HTML (e.g. IMG tags may be rendered with a closing “</img>” tag). |
| org-web-tools--eww-readable | Return “readable” part of HTML with title. |
| org-web-tools--get-url | Return content for URL as string. |
| org-web-tools--html-title | Return title of HTML page. |
| org-web-tools--html-to-org-with-pandoc | Return string of HTML converted to Org with Pandoc. When SELECTOR is non-nil, the HTML is filtered using esxml-query SELECTOR and re-rendered to HTML with org-web-tools--dom-to-html, which see. |
| org-web-tools--url-as-readable-org | Return string containing Org entry of URL’s web page content. Content is processed with eww-readable and Pandoc. Entry will be a top-level heading, with article contents below a second-level “Article” heading, and a timestamp in the first-level entry for writing comments. |
| org-web-tools--demote-headings-below | Demote all headings in buffer so the highest level is below LEVEL. |
| org-web-tools--get-first-url | Return URL in clipboard, or first URL in the kill-ring, or nil if none. |
| org-web-tools--read-url | Return a URL by searching at point, then in clipboard, then in kill-ring, and finally prompting the user. |
| org-web-tools--read-org-bracket-link | Return (TARGET . DESCRIPTION) for Org bracket LINK or next link on current line. |
| org-web-tools--remove-dos-crlf | Remove all DOS CRLF (^M) in buffer. |
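As an illustration of building your own command from these functions, here is a minimal sketch that copies a page title to the kill-ring. It assumes, based on the descriptions above, that org-web-tools--get-first-url takes no arguments, that org-web-tools--get-url takes a URL string, and that org-web-tools--html-title takes the HTML string; check the signatures in your installed version, since these are internal functions. The command name my/copy-page-title is hypothetical.

```elisp
(require 'org-web-tools)

;; Hypothetical command; argument lists of the helpers are assumed
;; from their descriptions and may differ between versions.
(defun my/copy-page-title ()
  "Copy the HTML title of the first URL in the clipboard or kill-ring."
  (interactive)
  (let ((url (org-web-tools--get-first-url)))
    (if (null url)
        (user-error "No URL found in clipboard or kill-ring")
      (let* ((html (org-web-tools--get-url url))
             ;; Fall back to the URL itself if the page has no title.
             (title (or (org-web-tools--html-title html) url)))
        (kill-new title)
        (message "Copied title: %s" title)))))
```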