org-roam/20230522132904-org_web_tools.org
2025-11-05 09:18:11 +01:00

:PROPERTIES:
:ID: f68dfc34-5349-42d1-8074-6c4be231a69b
:END:
#+title: org-web-tools
#+filetags: :ORG:EMACS:
Toolbox for downloading web pages into Org. The package provides several commands and helper functions, listed below. It needs the external program [[id:75ea690d-deee-4592-ae99-1c2385c208fb][pandoc]] to convert HTML pages to Org files.
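Since the conversion depends on pandoc, a missing executable is worth catching early. A minimal sketch, assuming the program is installed under the name pandoc and should be found on the exec-path. The warning text is invented for illustration and is not emitted by org-web-tools itself.
#+begin_src emacs-lisp
;; Sketch: warn at startup if pandoc is not on exec-path.
;; The message text is illustrative, not something org-web-tools prints.
(unless (executable-find "pandoc")
  (display-warning 'org-web-tools
                   "pandoc not found; HTML pages cannot be converted to Org"
                   :warning))
#+end_src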
* Commands
+-------------------------------------------+----------------------------------------+
|org-web-tools-insert-link-for-url          |Insert an Org-mode link to the URL in   |
|                                           |the clipboard or kill-ring. Downloads   |
|                                           |the page to get the HTML title.         |
+-------------------------------------------+----------------------------------------+
|org-web-tools-insert-web-page-as-entry     |Insert the web page for the URL in the  |
|                                           |clipboard or kill-ring as an Org-mode   |
|                                           |entry, as a sibling heading of the      |
|                                           |current entry.                          |
+-------------------------------------------+----------------------------------------+
|org-web-tools-read-url-as-org              |Display the web page for the URL in the |
|                                           |clipboard or kill-ring as Org-mode text |
|                                           |in a new buffer, processed with         |
|                                           |eww-readable.                           |
+-------------------------------------------+----------------------------------------+
|org-web-tools-convert-links-to-page-entries|Convert all URLs and Org links in       |
|                                           |current Org entry to Org headings, each |
|                                           |containing the web page content of that |
|                                           |URL, converted to Org-mode text and     |
|                                           |processed with eww-readable. This should|
|                                           |be called on an entry that solely       |
|                                           |contains a list of URLs or links.       |
+-------------------------------------------+----------------------------------------+
|org-web-tools-archive-attach               |Download archive of page at URL and     |
|                                           |attach with org-attach. If CHOOSE-FN is |
|                                           |non-nil (interactively, with universal  |
|                                           |prefix), prompt for the archive function|
|                                           |to use. If VIEW is non-nil              |
|                                           |(interactively, with two universal      |
|                                           |prefixes), view the archive immediately |
|                                           |after attaching. (See also org-board).  |
+-------------------------------------------+----------------------------------------+
|org-web-tools-archive-view                 |Open Zip file archive of web            |
|                                           |page. Extracts to a temp directory and  |
|                                           |opens with                              |
|                                           |browse-url-default-browser. Note, the   |
|                                           |extracted files are left on-disk in the |
|                                           |temp directory.                         |
+-------------------------------------------+----------------------------------------+
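The commands in the table can be bound to keys for quicker access. A minimal sketch, assuming the package is installed so that the commands are autoloaded. The C-c w prefix and the individual keys are arbitrary choices, not defaults of org-web-tools.
#+begin_src emacs-lisp
;; Sketch: the "C-c w" prefix and the letters are arbitrary, pick your own.
(with-eval-after-load 'org
  (define-key org-mode-map (kbd "C-c w l") #'org-web-tools-insert-link-for-url)
  (define-key org-mode-map (kbd "C-c w e") #'org-web-tools-insert-web-page-as-entry)
  (define-key org-mode-map (kbd "C-c w r") #'org-web-tools-read-url-as-org)
  (define-key org-mode-map (kbd "C-c w a") #'org-web-tools-archive-attach))
#+end_src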
* Troubleshooting
The attach command does not work out of the box because the default wget options are set incorrectly. The fix is to replace ~--execute robots=off~ with ~-e robots=off~ in both option lists:
#+begin_src emacs-lisp
(use-package org-web-tools
  :ensure t
  :config
  ;; Remove the long-form robots option from both wget option lists ...
  (setq org-web-tools-archive-wget-options
        (delete "--execute robots=off" org-web-tools-archive-wget-options))
  (setq org-web-tools-archive-wget-html-only-options
        (delete "--execute robots=off" org-web-tools-archive-wget-html-only-options))
  ;; ... and add the short form in its place.
  (add-to-list 'org-web-tools-archive-wget-options "-e robots=off")
  (add-to-list 'org-web-tools-archive-wget-html-only-options "-e robots=off"))
#+end_src
Even so, the plain attach command still cannot be used. Only the variant with a single C-u prefix works (press C-u once, then run the command). You can then choose between the HTML-only archive and the tar archive with resources.
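To avoid typing the prefix every time, the working variant can be wrapped in its own command. A minimal sketch: the name my/org-web-tools-archive-attach-choose is hypothetical, and the wrapper only simulates a single C-u before calling the real command.
#+begin_src emacs-lisp
;; Hypothetical wrapper: behaves like typing C-u before the attach command.
(defun my/org-web-tools-archive-attach-choose ()
  "Call `org-web-tools-archive-attach' as if with one universal prefix."
  (interactive)
  (let ((current-prefix-arg '(4)))  ; '(4) is the value a single C-u produces
    (call-interactively #'org-web-tools-archive-attach)))
#+end_src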
* Functions
These are used in the commands above and may be useful in building your own commands.
+--------------------------------------+------------------------------+
|org-web-tools--dom-to-html            |Return parsed HTML DOM as an  |
|                                      |HTML string. Note: This is an |
|                                      |approximation and is not      |
|                                      |necessarily correct HTML      |
|                                      |(e.g. IMG tags may be rendered|
|                                      |with a closing “</img>” tag). |
+--------------------------------------+------------------------------+
|org-web-tools--eww-readable           |Return “readable” part of HTML|
|                                      |with title.                   |
+--------------------------------------+------------------------------+
|org-web-tools--get-url                |Return content for URL as     |
|                                      |string.                       |
+--------------------------------------+------------------------------+
|org-web-tools--html-title             |Return title of HTML page.    |
+--------------------------------------+------------------------------+
|org-web-tools--html-to-org-with-pandoc|Return string of HTML         |
|                                      |converted to Org with         |
|                                      |Pandoc. When SELECTOR is      |
|                                      |non-nil, the HTML is filtered |
|                                      |using esxml-query SELECTOR and|
|                                      |re-rendered to HTML with      |
|                                      |org-web-tools--dom-to-html,   |
|                                      |which see.                    |
+--------------------------------------+------------------------------+
|org-web-tools--url-as-readable-org    |Return string containing Org  |
|                                      |entry of URL's web page       |
|                                      |content. Content is processed |
|                                      |with eww-readable and         |
|                                      |Pandoc. Entry will be a       |
|                                      |top-level heading, with       |
|                                      |article contents below a      |
|                                      |second-level “Article”        |
|                                      |heading, and a timestamp in   |
|                                      |the first-level entry for     |
|                                      |writing comments.             |
+--------------------------------------+------------------------------+
|org-web-tools--demote-headings-below  |Demote all headings in buffer |
|                                      |so the highest level is below |
|                                      |LEVEL.                        |
+--------------------------------------+------------------------------+
|org-web-tools--get-first-url          |Return URL in clipboard, or   |
|                                      |first URL in the kill-ring, or|
|                                      |nil if none.                  |
+--------------------------------------+------------------------------+
|org-web-tools--read-url               |Return a URL by searching at  |
|                                      |point, then in clipboard, then|
|                                      |in kill-ring, and finally     |
|                                      |prompting the user.           |
+--------------------------------------+------------------------------+
|org-web-tools--read-org-bracket-link  |Return (TARGET . DESCRIPTION) |
|                                      |for Org bracket LINK or next  |
|                                      |link on current line.         |
+--------------------------------------+------------------------------+
|org-web-tools--remove-dos-crlf        |Remove all DOS CRLF (^M) in   |
|                                      |buffer.                       |
+--------------------------------------+------------------------------+
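As an illustration of building your own command from these helpers, here is a minimal sketch. The command name my/org-web-tools-insert-titled-link is hypothetical. The argument lists are inferred from the descriptions in the table, not checked against the package source. The double-dash functions are internal and may change between versions, and org-link-make-string assumes Org 9.3 or later.
#+begin_src emacs-lisp
(require 'org)
(require 'org-web-tools)

;; Hypothetical command; helper signatures are assumed from the table above.
(defun my/org-web-tools-insert-titled-link ()
  "Insert an Org link to a URL, described by the page's HTML title."
  (interactive)
  (let* ((url (org-web-tools--read-url))      ; point, clipboard, kill-ring, prompt
         (html (org-web-tools--get-url url))  ; page content as a string
         (title (org-web-tools--html-title html)))
    ;; Fall back to the bare URL if no title was found.
    (insert (org-link-make-string url (or title url)))))
#+end_src
This does roughly what org-web-tools-insert-link-for-url already offers, so it is mainly useful as a starting point for variations.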