:PROPERTIES:
:ID:       f68dfc34-5349-42d1-8074-6c4be231a69b
:END:
#+title: org-web-tools
#+filetags: :ORG:EMACS:

A toolbox for downloading web pages into Org. The package provides several commands and helper functions. It requires the external program [[id:75ea690d-deee-4592-ae99-1c2385c208fb][pandoc]] to convert HTML pages to Org files.

* Commands

+-------------------------------------------+----------------------------------------+
|org-web-tools-insert-link-for-url          |Insert an Org-mode link to the URL in   |
|                                           |the clipboard or kill-ring. Downloads   |
|                                           |the page to get the HTML title.         |
+-------------------------------------------+----------------------------------------+
|org-web-tools-insert-web-page-as-entry     |Insert the web page for the URL in the  |
|                                           |clipboard or kill-ring as an Org-mode   |
|                                           |entry, as a sibling heading of the      |
|                                           |current entry.                          |
+-------------------------------------------+----------------------------------------+
|org-web-tools-read-url-as-org              |Display the web page for the URL in the |
|                                           |clipboard or kill-ring as Org-mode text |
|                                           |in a new buffer, processed with         |
|                                           |eww-readable.                           |
+-------------------------------------------+----------------------------------------+
|org-web-tools-convert-links-to-page-entries|Convert all URLs and Org links in       |
|                                           |current Org entry to Org headings, each |
|                                           |containing the web page content of that |
|                                           |URL, converted to Org-mode text and     |
|                                           |processed with eww-readable. This should|
|                                           |be called on an entry that solely       |
|                                           |contains a list of URLs or links.       |
+-------------------------------------------+----------------------------------------+
|org-web-tools-archive-attach               |Download archive of page at URL and     |
|                                           |attach with org-attach. If CHOOSE-FN is |
|                                           |non-nil (interactively, with universal  |
|                                           |prefix), prompt for the archive function|
|                                           |to use. If VIEW is non-nil              |
|                                           |(interactively, with two universal      |
|                                           |prefixes), view the archive immediately |
|                                           |after attaching. (See also org-board).  |
+-------------------------------------------+----------------------------------------+
|org-web-tools-archive-view                 |Open Zip file archive of web            |
|                                           |page. Extracts to a temp directory and  |
|                                           |opens with                              |
|                                           |browse-url-default-browser. Note, the   |
|                                           |extracted files are left on-disk in the |
|                                           |temp directory.                         |
+-------------------------------------------+----------------------------------------+

* Troubleshooting

The attach command does not work out of the box because the wget option variables are set incorrectly. The following configuration fixes this by replacing the "--execute robots=off" entry with "-e robots=off" in both option lists:

#+begin_src emacs-lisp
(use-package org-web-tools
  :ensure t
  :config
  ;; Remove the long-form entry from both wget option lists ...
  (setq org-web-tools-archive-wget-options
        (delete "--execute robots=off" org-web-tools-archive-wget-options))
  (setq org-web-tools-archive-wget-html-only-options
        (delete "--execute robots=off" org-web-tools-archive-wget-html-only-options))
  ;; ... and add the short form in its place.
  (add-to-list 'org-web-tools-archive-wget-options "-e robots=off")
  (add-to-list 'org-web-tools-archive-wget-html-only-options "-e robots=off"))
#+end_src

Even with this fix, the plain attach command cannot be used. Only the variant with a single C-u prefix works: press C-u once, then run the command. It then prompts for the archive function, where either HTML-only or tar with resources can be chosen.

* Functions

These are used in the commands above and may be useful in building your own commands.

+--------------------------------------+------------------------------+
|org-web-tools--dom-to-html            |Return parsed HTML DOM as an  |
|                                      |HTML string. Note: This is an |
|                                      |approximation and is not      |
|                                      |necessarily correct HTML      |
|                                      |(e.g. IMG tags may be rendered|
|                                      |with a closing “</img>” tag). |
+--------------------------------------+------------------------------+
|org-web-tools--eww-readable           |Return “readable” part of HTML|
|                                      |with title.                   |
+--------------------------------------+------------------------------+
|org-web-tools--get-url                |Return content for URL as     |
|                                      |string.                       |
+--------------------------------------+------------------------------+
|org-web-tools--html-title             |Return title of HTML page.    |
+--------------------------------------+------------------------------+
|org-web-tools--html-to-org-with-pandoc|Return string of HTML         |
|                                      |converted to Org with         |
|                                      |Pandoc. When SELECTOR is      |
|                                      |non-nil, the HTML is filtered |
|                                      |using esxml-query SELECTOR and|
|                                      |re-rendered to HTML with      |
|                                      |org-web-tools--dom-to-html,   |
|                                      |which see.                    |
+--------------------------------------+------------------------------+
|org-web-tools--url-as-readable-org    |Return string containing Org  |
|                                      |entry of URL’s web page       |
|                                      |content. Content is processed |
|                                      |with eww-readable and         |
|                                      |Pandoc. Entry will be a       |
|                                      |top-level heading, with       |
|                                      |article contents below a      |
|                                      |second-level “Article”        |
|                                      |heading, and a timestamp in   |
|                                      |the first-level entry for     |
|                                      |writing comments.             |
+--------------------------------------+------------------------------+
|org-web-tools--demote-headings-below  |Demote all headings in buffer |
|                                      |so the highest level is below |
|                                      |LEVEL.                        |
+--------------------------------------+------------------------------+
|org-web-tools--get-first-url          |Return URL in clipboard, or   |
|                                      |first URL in the kill-ring, or|
|                                      |nil if none.                  |
+--------------------------------------+------------------------------+
|org-web-tools--read-url               |Return a URL by searching at  |
|                                      |point, then in clipboard, then|
|                                      |in kill-ring, and finally     |
|                                      |prompting the user.           |
+--------------------------------------+------------------------------+
|org-web-tools--read-org-bracket-link  |Return (TARGET . DESCRIPTION) |
|                                      |for Org bracket LINK or next  |
|                                      |link on current line.         |
+--------------------------------------+------------------------------+
|org-web-tools--remove-dos-crlf        |Remove all DOS CRLF (^M) in   |
|                                      |buffer.                       |
+--------------------------------------+------------------------------+
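
The helper functions listed above can be combined into small custom commands. The following is a minimal sketch and not part of the package: the command name my/org-web-tools-insert-title-list-item is hypothetical, and the sketch assumes that org-web-tools--get-url takes a URL string and that org-web-tools--html-title takes the HTML string it returns. By Emacs Lisp convention, double-dash names mark internal helpers, so their signatures may change between releases.

#+begin_src emacs-lisp
(require 'org-web-tools)

;; Hypothetical example command; the name is not defined by org-web-tools.
(defun my/org-web-tools-insert-title-list-item ()
  "Insert a list item with the title and link of a web page.
The URL is taken from the clipboard or the kill-ring."
  (interactive)
  (let ((url (org-web-tools--get-first-url)))
    ;; `org-web-tools--get-first-url' returns nil when no URL is found.
    (unless url
      (user-error "No URL found in clipboard or kill-ring"))
    ;; Assumption: `org-web-tools--get-url' returns the page content as a string.
    (let* ((html (org-web-tools--get-url url))
           ;; Fall back to the URL itself if no title can be extracted.
           (title (or (org-web-tools--html-title html) url)))
      (insert (format "- %s :: [[%s]]\n" title url)))))
#+end_src

With a URL in the clipboard, M-x my/org-web-tools-insert-title-list-item would insert one description-list item at point. The command downloads the page synchronously, so it may block briefly on slow connections.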