Archival Tools

From Bibliotheca Anonoma
Revision as of 17:46, 16 October 2016 by Antonizoon (talk | contribs) (Created page with "== Complete Website Archival == === Wget === Outputs plain HTML. <pre> wget -mbc -np "http://aya.shii.org" \ --convert-links \ --adjust-extension \ --page-requisit...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Complete Website Archival

Wget

Outputs plain HTML.

wget -mbc -np "http://aya.shii.org" \
   --convert-links \
   --adjust-extension \
   --page-requisites --no-check-certificate --restrict-file-names=nocontrol \
   -e robots=off \
   --waitretry 5 \
   --timeout 60 \
   --tries 5 \
   --wait 1 \
   --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
  • `-k` or `-K` - keep original file as `*.orig` after `--adjust-extension`. Otherwise, when remirroring with wget, each page will be redownloaded.

Wget WARC

Outputs in WARC format, ready for upload to the Internet Archive.

wget -mbc -np "http://aya.shii.org" \
   --page-requisites --no-check-certificate --restrict-file-names=nocontrol   -e robots=off \
   --waitretry 5 \
   --timeout 60 \
    --tries 5 \
   --wait 1 \
   --warc-file=aya.shii.org \
   --warc-cdx \
   --warc-max-size=1G \
   --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"

Youtube-dl

youtube-dl "http://www.youtube.com/playlist?list=PL3634152194A90D8B&feature=mh_lolz" \
   --write-thumbnail \
   --write-description \
   --write-info-json \
   --write-annotations \
   --write-sub \
   --all-subs \
   --add-metadata \
   --embed-subs \
   --restrict-filenames \
   --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"