brain initiation

2025-11-05 09:18:11 +01:00
commit 933aa8a985
191 changed files with 6203 additions and 0 deletions

@@ -0,0 +1,133 @@
:PROPERTIES:
:ID: f68dfc34-5349-42d1-8074-6c4be231a69b
:END:
#+title: org-web-tools
#+filetags: :ORG:EMACS:
A toolbox for downloading web pages into Org. The package provides several commands and helper functions. It needs the external program [[id:75ea690d-deee-4592-ae99-1c2385c208fb][pandoc]] to convert the HTML pages to Org files.
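Since the conversion depends on pandoc, it can help to check at startup that Emacs can find it. The following is a minimal sketch, not part of org-web-tools itself; it assumes pandoc should be reachable through ~exec-path~ and only emits a warning if it is not.
#+begin_src emacs-lisp
  ;; Warn early if the pandoc executable cannot be found.
  ;; `executable-find' returns nil when the program is not in `exec-path'.
  (unless (executable-find "pandoc")
    (display-warning 'org-web-tools
                     "pandoc not found; converting HTML to Org will probably fail"))
#+end_src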
* Commands
| Command | Description |
|---------+-------------|
| org-web-tools-insert-link-for-url | Insert an Org-mode link to the URL in the clipboard or kill-ring. Downloads the page to get the HTML title. |
| org-web-tools-insert-web-page-as-entry | Insert the web page for the URL in the clipboard or kill-ring as an Org-mode entry, as a sibling heading of the current entry. |
| org-web-tools-read-url-as-org | Display the web page for the URL in the clipboard or kill-ring as Org-mode text in a new buffer, processed with eww-readable. |
| org-web-tools-convert-links-to-page-entries | Convert all URLs and Org links in current Org entry to Org headings, each containing the web page content of that URL, converted to Org-mode text and processed with eww-readable. This should be called on an entry that solely contains a list of URLs or links. |
| org-web-tools-archive-attach | Download archive of page at URL and attach with org-attach. If CHOOSE-FN is non-nil (interactively, with universal prefix), prompt for the archive function to use. If VIEW is non-nil (interactively, with two universal prefixes), view the archive immediately after attaching. (See also org-board). |
| org-web-tools-archive-view | Open Zip file archive of web page. Extracts to a temp directory and opens with browse-url-default-browser. Note, the extracted files are left on-disk in the temp directory. |
* Troubleshooting
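The command org-web-tools-archive-attach does not work out of the box because the default wget options are set incorrectly. The following configuration replaces the "--execute robots=off" entry with "-e robots=off" in both option lists: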
#+begin_src emacs-lisp
  (use-package org-web-tools
    :ensure t
    :config
    ;; Remove the "--execute robots=off" entry from both wget option lists.
    (setq org-web-tools-archive-wget-options
          (delete "--execute robots=off" org-web-tools-archive-wget-options))
    (setq org-web-tools-archive-wget-html-only-options
          (delete "--execute robots=off" org-web-tools-archive-wget-html-only-options))
    ;; Add the short form "-e robots=off" to both lists instead.
    (add-to-list 'org-web-tools-archive-wget-options "-e robots=off")
    (add-to-list 'org-web-tools-archive-wget-html-only-options "-e robots=off"))
#+end_src
Even so, the plain attach command still cannot be used; only the variant with a single universal prefix works (type C-u once, then run the command). It then prompts for the archive function, where either the HTML-only variant or the tar archive with resources can be chosen.
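To avoid typing the prefix every time, the call can be wrapped in a small command. This is only a sketch: the name my/org-web-tools-archive-attach-choose is hypothetical, and it assumes that org-web-tools-archive-attach reads the prefix argument in its interactive form, as its description in the table above suggests.
#+begin_src emacs-lisp
  ;; Hypothetical wrapper, not part of org-web-tools.
  (defun my/org-web-tools-archive-attach-choose ()
    "Run `org-web-tools-archive-attach' as if called with a single C-u."
    (interactive)
    ;; '(4) is the raw prefix argument that a single C-u produces.
    (let ((current-prefix-arg '(4)))
      (call-interactively #'org-web-tools-archive-attach)))
#+end_src
The wrapper can then be run with M-x or bound to a key of your choice.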
* Functions
These are used in the commands above and may be useful in building your own commands.
| Function | Description |
|----------+-------------|
| org-web-tools--dom-to-html | Return parsed HTML DOM as an HTML string. Note: This is an approximation and is not necessarily correct HTML (e.g. IMG tags may be rendered with a closing “</img>” tag). |
| org-web-tools--eww-readable | Return “readable” part of HTML with title. |
| org-web-tools--get-url | Return content for URL as string. |
| org-web-tools--html-title | Return title of HTML page. |
| org-web-tools--html-to-org-with-pandoc | Return string of HTML converted to Org with Pandoc. When SELECTOR is non-nil, the HTML is filtered using esxml-query SELECTOR and re-rendered to HTML with org-web-tools--dom-to-html, which see. |
| org-web-tools--url-as-readable-org | Return string containing Org entry of URL's web page content. Content is processed with eww-readable and Pandoc. Entry will be a top-level heading, with article contents below a second-level “Article” heading, and a timestamp in the first-level entry for writing comments. |
| org-web-tools--demote-headings-below | Demote all headings in buffer so the highest level is below LEVEL. |
| org-web-tools--get-first-url | Return URL in clipboard, or first URL in the kill-ring, or nil if none. |
| org-web-tools--read-url | Return a URL by searching at point, then in clipboard, then in kill-ring, and finally prompting the user. |
| org-web-tools--read-org-bracket-link | Return (TARGET . DESCRIPTION) for Org bracket LINK or next link on current line. |
| org-web-tools--remove-dos-crlf | Remove all DOS CRLF (^M) in buffer. |
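
As an illustration of building your own command from these helpers, here is a minimal sketch. The command name my/org-web-tools-insert-title is hypothetical, and the argument lists are assumptions based on the descriptions above (org-web-tools--get-url taking a URL, org-web-tools--html-title taking an HTML string); check them against the installed version of the package, since double-dash functions are internal and may change.
#+begin_src emacs-lisp
  (require 'org-web-tools)

  ;; Hypothetical command, not part of org-web-tools.
  (defun my/org-web-tools-insert-title ()
    "Insert the HTML title of the first URL in the clipboard or kill-ring."
    (interactive)
    (let ((url (org-web-tools--get-first-url)))
      (unless url
        (user-error "No URL found in clipboard or kill-ring"))
      ;; Assumed signatures: (org-web-tools--get-url URL) returns the page
      ;; content as a string, (org-web-tools--html-title HTML) its title.
      (let ((title (org-web-tools--html-title (org-web-tools--get-url url))))
        ;; Fall back to the URL itself if no title could be extracted.
        (insert (or title url)))))
#+end_src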