brain initiation
133
20230522132904-org_web_tools.org
Normal file
@@ -0,0 +1,133 @@
:PROPERTIES:
:ID: f68dfc34-5349-42d1-8074-6c4be231a69b
:END:
#+title: org-web-tools
#+filetags: :ORG:EMACS:

A toolbox for downloading web pages into Org. The package provides several commands and helper functions. It requires the external program [[id:75ea690d-deee-4592-ae99-1c2385c208fb][pandoc]] to convert the HTML pages to Org files.

* Commands
+-------------------------------------------+----------------------------------------+
|org-web-tools-insert-link-for-url          |Insert an Org-mode link to the URL in   |
|                                           |the clipboard or kill-ring. Downloads   |
|                                           |the page to get the HTML title.         |
+-------------------------------------------+----------------------------------------+
|org-web-tools-insert-web-page-as-entry     |Insert the web page for the URL in the  |
|                                           |clipboard or kill-ring as an Org-mode   |
|                                           |entry, as a sibling heading of the      |
|                                           |current entry.                          |
+-------------------------------------------+----------------------------------------+
|org-web-tools-read-url-as-org              |Display the web page for the URL in the |
|                                           |clipboard or kill-ring as Org-mode text |
|                                           |in a new buffer, processed with         |
|                                           |eww-readable.                           |
+-------------------------------------------+----------------------------------------+
|org-web-tools-convert-links-to-page-entries|Convert all URLs and Org links in       |
|                                           |current Org entry to Org headings, each |
|                                           |containing the web page content of that |
|                                           |URL, converted to Org-mode text and     |
|                                           |processed with eww-readable. This should|
|                                           |be called on an entry that solely       |
|                                           |contains a list of URLs or links.       |
+-------------------------------------------+----------------------------------------+
|org-web-tools-archive-attach               |Download archive of page at URL and     |
|                                           |attach with org-attach. If CHOOSE-FN is |
|                                           |non-nil (interactively, with universal  |
|                                           |prefix), prompt for the archive function|
|                                           |to use. If VIEW is non-nil              |
|                                           |(interactively, with two universal      |
|                                           |prefixes), view the archive immediately |
|                                           |after attaching. (See also org-board).  |
+-------------------------------------------+----------------------------------------+
|org-web-tools-archive-view                 |Open Zip file archive of web            |
|                                           |page. Extracts to a temp directory and  |
|                                           |opens with                              |
|                                           |browse-url-default-browser. Note, the   |
|                                           |extracted files are left on-disk in the |
|                                           |temp directory.                         |
+-------------------------------------------+----------------------------------------+

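The commands in the table above are easier to reach with key bindings. The following is a minimal sketch; the key sequences under =C-c w= are an arbitrary assumption and not defaults shipped with org-web-tools, so adjust them to your own setup.

#+begin_src emacs-lisp
;; Hypothetical key bindings for some of the commands listed above.
;; The "C-c w" prefix is an assumption; pick any free prefix you like.
(with-eval-after-load 'org
  (define-key org-mode-map (kbd "C-c w l") #'org-web-tools-insert-link-for-url)
  (define-key org-mode-map (kbd "C-c w e") #'org-web-tools-insert-web-page-as-entry)
  (define-key org-mode-map (kbd "C-c w r") #'org-web-tools-read-url-as-org)
  (define-key org-mode-map (kbd "C-c w a") #'org-web-tools-archive-attach))
#+end_src
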
* Troubleshooting
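
Out of the box, the attach command (org-web-tools-archive-attach) does not work because the default wget options are set incorrectly. The workaround is to remove the "--execute robots=off" entry from both option lists and add "-e robots=off" in its place:
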
#+begin_src emacs-lisp
(use-package org-web-tools
  :ensure t
  :config
  ;; Remove the "--execute robots=off" entry from both wget option lists ...
  (setq org-web-tools-archive-wget-options
        (delete "--execute robots=off" org-web-tools-archive-wget-options))
  (setq org-web-tools-archive-wget-html-only-options
        (delete "--execute robots=off" org-web-tools-archive-wget-html-only-options))

  ;; ... and add the short form "-e robots=off" instead.
  (add-to-list 'org-web-tools-archive-wget-options "-e robots=off")
  (add-to-list 'org-web-tools-archive-wget-html-only-options "-e robots=off"))
#+end_src

Even with this fix, the plain attach command still cannot be used. Only the variant with a single universal prefix works: type C-u once, then run the command. It then prompts for the archive function, where either the HTML-only option or the tar archive with resources can be chosen.

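To avoid typing C-u every time, the prefixed call can be wrapped in a small command. This is only a sketch: the name =my/org-web-tools-archive-attach-choose= is hypothetical, and it assumes that binding =current-prefix-arg= to =(4)= behaves the same as a single C-u typed interactively.

#+begin_src emacs-lisp
(require 'org-web-tools)

;; Hypothetical wrapper, not part of org-web-tools.
(defun my/org-web-tools-archive-attach-choose ()
  "Run `org-web-tools-archive-attach' as if called with a single C-u.
The command then prompts for the archive function to use."
  (interactive)
  ;; '(4) is the raw prefix argument produced by one C-u.
  (let ((current-prefix-arg '(4)))
    (call-interactively #'org-web-tools-archive-attach)))
#+end_src
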
* Functions
These are used in the commands above and may be useful in building your own commands.

+--------------------------------------+------------------------------+
|org-web-tools--dom-to-html            |Return parsed HTML DOM as an  |
|                                      |HTML string. Note: This is an |
|                                      |approximation and is not      |
|                                      |necessarily correct HTML      |
|                                      |(e.g. IMG tags may be rendered|
|                                      |with a closing “</img>” tag). |
+--------------------------------------+------------------------------+
|org-web-tools--eww-readable           |Return “readable” part of HTML|
|                                      |with title.                   |
+--------------------------------------+------------------------------+
|org-web-tools--get-url                |Return content for URL as     |
|                                      |string.                       |
+--------------------------------------+------------------------------+
|org-web-tools--html-title             |Return title of HTML page.    |
+--------------------------------------+------------------------------+
|org-web-tools--html-to-org-with-pandoc|Return string of HTML         |
|                                      |converted to Org with         |
|                                      |Pandoc. When SELECTOR is      |
|                                      |non-nil, the HTML is filtered |
|                                      |using esxml-query SELECTOR and|
|                                      |re-rendered to HTML with      |
|                                      |org-web-tools--dom-to-html,   |
|                                      |which see.                    |
+--------------------------------------+------------------------------+
|org-web-tools--url-as-readable-org    |Return string containing Org  |
|                                      |entry of URL’s web page       |
|                                      |content. Content is processed |
|                                      |with eww-readable and         |
|                                      |Pandoc. Entry will be a       |
|                                      |top-level heading, with       |
|                                      |article contents below a      |
|                                      |second-level “Article”        |
|                                      |heading, and a timestamp in   |
|                                      |the first-level entry for     |
|                                      |writing comments.             |
+--------------------------------------+------------------------------+
|org-web-tools--demote-headings-below  |Demote all headings in buffer |
|                                      |so the highest level is below |
|                                      |LEVEL.                        |
+--------------------------------------+------------------------------+
|org-web-tools--get-first-url          |Return URL in clipboard, or   |
|                                      |first URL in the kill-ring, or|
|                                      |nil if none.                  |
+--------------------------------------+------------------------------+
|org-web-tools--read-url               |Return a URL by searching at  |
|                                      |point, then in clipboard, then|
|                                      |in kill-ring, and finally     |
|                                      |prompting the user.           |
+--------------------------------------+------------------------------+
|org-web-tools--read-org-bracket-link  |Return (TARGET . DESCRIPTION) |
|                                      |for Org bracket LINK or next  |
|                                      |link on current line.         |
+--------------------------------------+------------------------------+
|org-web-tools--remove-dos-crlf        |Remove all DOS CRLF (^M) in   |
|                                      |buffer.                       |
+--------------------------------------+------------------------------+

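As an illustration of building your own command from these helpers, the sketch below fetches a page and shows its title in the echo area. The command name =my/org-web-tools-show-title= is hypothetical. The call signatures are assumptions read off the descriptions in the table: =org-web-tools--read-url= is called without arguments, =org-web-tools--get-url= takes a URL, and =org-web-tools--html-title= takes the HTML string. Because these are internal functions (double dash), they may change between package versions.

#+begin_src emacs-lisp
(require 'org-web-tools)

;; Hypothetical example command, not part of org-web-tools.
(defun my/org-web-tools-show-title (url)
  "Show the HTML title of the page at URL in the echo area."
  ;; Look for a URL at point, in the clipboard, in the kill-ring,
  ;; and finally prompt for one.
  (interactive (list (org-web-tools--read-url)))
  (let* ((html (org-web-tools--get-url url))        ; page content as a string
         (title (org-web-tools--html-title html)))  ; assumed nil if the page has no title
    (message "%s" (or title "(no title found)"))))
#+end_src

Call it with =M-x my/org-web-tools-show-title= while a URL is in the clipboard or at point.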