org-roam/20230522132904-org_web_tools.org
2025-11-05 09:18:11 +01:00

org-web-tools

A toolbox for downloading web pages and turning them into Org content. The package provides several commands and helper functions. It needs the external program pandoc to convert HTML pages to Org files.
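Because the conversion depends on pandoc, it can help to check early that Emacs can find it. A minimal sketch, assuming pandoc should be on exec-path; the helper name my/org-web-tools-pandoc-available-p is hypothetical and not part of org-web-tools:

```elisp
;; Hypothetical helper (not shipped with org-web-tools).
(defun my/org-web-tools-pandoc-available-p ()
  "Return the full path of the pandoc executable, or nil if it is not found."
  (executable-find "pandoc"))

;; Warn at startup instead of failing later during a conversion.
(unless (my/org-web-tools-pandoc-available-p)
  (message "pandoc not found on exec-path; HTML to Org conversion will fail"))
```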

Commands

| Command | Description |
|---------+-------------|
| org-web-tools-insert-link-for-url | Insert an Org-mode link to the URL in the clipboard or kill-ring. Downloads the page to get the HTML title. |
| org-web-tools-insert-web-page-as-entry | Insert the web page for the URL in the clipboard or kill-ring as an Org-mode entry, as a sibling heading of the current entry. |
| org-web-tools-read-url-as-org | Display the web page for the URL in the clipboard or kill-ring as Org-mode text in a new buffer, processed with eww-readable. |
| org-web-tools-convert-links-to-page-entries | Convert all URLs and Org links in current Org entry to Org headings, each containing the web page content of that URL, converted to Org-mode text and processed with eww-readable. This should be called on an entry that solely contains a list of URLs or links. |
| org-web-tools-archive-attach | Download archive of page at URL and attach with org-attach. If CHOOSE-FN is non-nil (interactively, with universal prefix), prompt for the archive function to use. If VIEW is non-nil (interactively, with two universal prefixes), view the archive immediately after attaching. (See also org-board.) |
| org-web-tools-archive-view | Open Zip file archive of web page. Extracts to a temp directory and opens with browse-url-default-browser. Note, the extracted files are left on-disk in the temp directory. |

Troubleshooting

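The attach command (org-web-tools-archive-attach) does not work out of the box because the wget options are set incorrectly. The workaround is to replace the "--execute robots=off" entry with "-e robots=off" in both option lists: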

(use-package org-web-tools
  :ensure t
  :config
  ;; Remove the "--execute robots=off" entry from both wget option lists.
  (setq org-web-tools-archive-wget-options
        (delete "--execute robots=off" org-web-tools-archive-wget-options))
  (setq org-web-tools-archive-wget-html-only-options
        (delete "--execute robots=off" org-web-tools-archive-wget-html-only-options))

  ;; Add the short form "-e robots=off" in its place.
  (add-to-list 'org-web-tools-archive-wget-options "-e robots=off")
  (add-to-list 'org-web-tools-archive-wget-html-only-options "-e robots=off"))

Even with this fix, the attach command cannot be used without a prefix argument. It only works when called with a single universal prefix: type C-u once, then the command. You are then prompted for the archive function and can choose between HTML-only and tar with resources.
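To avoid typing C-u every time, the prefix can be supplied from a small wrapper command. This is only a sketch: the name my/org-web-tools-archive-attach-choose is hypothetical, and it assumes org-web-tools-archive-attach is available (installed and autoloaded) when the wrapper is invoked.

```elisp
;; Hypothetical wrapper (not shipped with org-web-tools).
(defun my/org-web-tools-archive-attach-choose ()
  "Call `org-web-tools-archive-attach' as if invoked with a single C-u.
The attach command then prompts for the archive function to use."
  (interactive)
  ;; '(4) is the raw prefix argument produced by one C-u.
  (let ((current-prefix-arg '(4)))
    (call-interactively #'org-web-tools-archive-attach)))
```

Binding current-prefix-arg around call-interactively is the usual way to simulate a prefix argument, so the attach command itself stays unmodified.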

Functions

These are used in the commands above and may be useful in building your own commands.

| Function | Description |
|----------+-------------|
| org-web-tools--dom-to-html | Return parsed HTML DOM as an HTML string. Note: This is an approximation and is not necessarily correct HTML (e.g. IMG tags may be rendered with a closing “</img>” tag). |
| org-web-tools--eww-readable | Return “readable” part of HTML with title. |
| org-web-tools--get-url | Return content for URL as string. |
| org-web-tools--html-title | Return title of HTML page. |
| org-web-tools--html-to-org-with-pandoc | Return string of HTML converted to Org with Pandoc. When SELECTOR is non-nil, the HTML is filtered using esxml-query SELECTOR and re-rendered to HTML with org-web-tools--dom-to-html, which see. |
| org-web-tools--url-as-readable-org | Return string containing Org entry of URL's web page content. Content is processed with eww-readable and Pandoc. Entry will be a top-level heading, with article contents below a second-level “Article” heading, and a timestamp in the first-level entry for writing comments. |
| org-web-tools--demote-headings-below | Demote all headings in buffer so the highest level is below LEVEL. |
| org-web-tools--get-first-url | Return URL in clipboard, or first URL in the kill-ring, or nil if none. |
| org-web-tools--read-url | Return a URL by searching at point, then in clipboard, then in kill-ring, and finally prompting the user. |
| org-web-tools--read-org-bracket-link | Return (TARGET . DESCRIPTION) for Org bracket LINK or next link on current line. |
| org-web-tools--remove-dos-crlf | Remove all DOS CRLF (^M) in buffer. |
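As an illustration of building your own command from these helpers, the sketch below inserts the title of the page whose URL is in the clipboard or kill-ring. The helpers are internal, so they are spelled with a double dash in the package (for example org-web-tools--get-url) and may change between versions. The command name my/org-web-tools-insert-page-title is hypothetical, and falling back to the URL when no title is found is an assumption of this sketch.

```elisp
;; Hypothetical command (not shipped with org-web-tools).
(defun my/org-web-tools-insert-page-title ()
  "Insert the HTML title of the URL found in the clipboard or kill-ring."
  (interactive)
  ;; Load the package on first use; the helpers below are internal functions.
  (require 'org-web-tools)
  (let* ((url (or (org-web-tools--get-first-url)
                  (user-error "No URL found in clipboard or kill-ring")))
         (html (org-web-tools--get-url url))       ; page content as a string
         (title (org-web-tools--html-title html))) ; title of the HTML page
    ;; Fall back to the URL itself if the page has no title.
    (insert (or title url))))
```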