Archival Tools: Difference between revisions

Revision as of 15:02, 19 October 2016

Complete Website Archival

Wpull accepts most of the same options as wget, so in the wget commands below you can usually replace `wget` with `wpull` and the command will still work.

Wget

Outputs plain HTML.

wget -mbc -np "http://aya.shii.org" \
   --convert-links \
   --adjust-extension \
   --page-requisites --no-check-certificate --restrict-file-names=nocontrol \
   -e robots=off \
   --waitretry 5 \
   --timeout 60 \
   --tries 5 \
   --wait 1 \
   --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"

`-K` (`--backup-converted`), used together with `-k` (`--convert-links`) - keeps the original file as `*.orig` when links are converted. Otherwise, when remirroring with wget, each converted page will be redownloaded.

Wget WARC

Outputs in WARC format, ready for upload to the Internet Archive.

wget -mbc -np "http://aya.shii.org" \
   --page-requisites --no-check-certificate --restrict-file-names=nocontrol \
   -e robots=off \
   --waitretry 5 \
   --timeout 60 \
   --tries 5 \
   --wait 1 \
   --warc-file=aya.shii.org \
   --warc-cdx \
   --warc-max-size=1G \
   --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
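
To check what a run like the one above produced, the records in a WARC file can be counted by type. This is a minimal sketch and not part of the original instructions. It assumes the default gzip-compressed output (with `--warc-max-size` set, wget numbers the files, for example `aya.shii.org-00000.warc.gz`) and a grep that supports `-a` (GNU and BSD grep do). The function name `count_warc_records` is made up for illustration. The count is approximate, since a payload line that happens to start with `WARC-Type:` is counted too.

```shell
# count_warc_records FILE.warc.gz
# Print how many records of each WARC-Type the file contains.
count_warc_records() {
    # gzip -dc : decompress to stdout (handles the per-record gzip members)
    # tr -d    : WARC headers end in CRLF, so drop the carriage returns
    # grep -a  : treat the stream as text even though payloads may be binary
    gzip -dc "$1" | tr -d '\r' | grep -a '^WARC-Type: ' | sort | uniq -c
}
```

Usage: `count_warc_records aya.shii.org-00000.warc.gz`. A mirror that worked should show a large number of `response` records.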

Archiving a list of URLs

Copying a list of links from a web page (or a Pastebin paste) and downloading them all with Wpull is incredibly useful. Here is how to do it yourself.

Install the Firefox add-on “Copy All Links”.
Go to the web page (or Pastebin paste, preferably viewed as a raw paste so that other navigation links don’t interfere), right-click, and choose “Copy All Links” -> “Current Tab” -> “All Links”.
- To find Pastebin pastes with many links, restrict your Google search to pastebin.com and add a subject and the word “links”, like this: site:pastebin.com <subject> links
Head to Put Text In Alphabetical Order to sort the links alphabetically and remove duplicates. Check “Use a line break separator”, and under “Removal Options” uncheck “Remove Punctuation, and Brackets”. Leave “Remove Duplicates” checked, since there is no need to download a web page twice.
After you click “Alphabetize Text”, copy and paste your link list into a text file and save it.
Open a Linux terminal, type this command (based on [this command](https://github.com/chfoo/wpull)), and run it:
```
wpull -i TEXTFILE \
   --page-requisites \
   --no-robots \
   --no-check-certificate \
   --tries 3 \
   --timeout 60 \
   --delete-after \
   --warc-file WARCNAME \
   --warc-max-size=4294967296 \
   --database DATABASE.db \
   --output-file OUTPUT.log \
   --user-agent "Scraper v1.0"
```
- `--youtube-dl` - (Optional) add this if there are videos you want to download. Please download video-hosting links in a separate run, because this argument causes problems when used while downloading ordinary web pages.
- `--warc-append` - (Optional) add this if a WARC download was interrupted and you want to resume it.
Lastly, install [internetarchive](https://github.com/jjjake/internetarchive). The linked page explains how to install and use the program.
After installing internetarchive, use this command to upload, and you are finished:

ia upload <identifier> <file/foldername> \
--metadata="title:<title>" \
--metadata="subject:<tag>;<tag>;etc..."
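
The sorting and duplicate-removal step described above can also be done in the terminal instead of on a website. This is a minimal sketch under stated assumptions, not the method this page prescribes: the copied links are assumed to have been pasted one per line into a file, and the function name `clean_links` and the file name `links-raw.txt` are hypothetical.

```shell
# clean_links FILE
# Print the http(s) URLs found in FILE, one per line,
# sorted alphabetically and with duplicates removed.
clean_links() {
    # tr -d '\r' : drop carriage returns from lists pasted on Windows
    # grep -E    : keep only lines that start with http:// or https://
    # sort -u    : sort the lines and drop exact duplicates
    tr -d '\r' < "$1" | grep -E '^https?://' | sort -u
}
```

Usage: `clean_links links-raw.txt > TEXTFILE`, then pass `TEXTFILE` to `wpull -i` as in the command above. Lines that are not links are dropped, so this only approximates the website's behaviour.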

Youtube-dl

youtube-dl "http://www.youtube.com/playlist?list=PL3634152194A90D8B&feature=mh_lolz" \
   --write-thumbnail \
   --write-description \
   --write-info-json \
   --write-annotations \
   --write-sub \
   --all-subs \
   --add-metadata \
   --embed-subs \
   --restrict-filenames \
   --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
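
When there are many video URLs rather than one playlist, a small wrapper can feed them from a file and remember what has already been fetched. This is only a sketch, assuming youtube-dl's `--batch-file`, `--download-archive`, and `--ignore-errors` options. The function name `archive_videos`, the `YTDL` variable, and the file name `downloaded.txt` are hypothetical and not from this page.

```shell
# YTDL can be overridden, e.g. YTDL=echo for a dry run that only prints the command.
YTDL="${YTDL:-youtube-dl}"

# archive_videos URLFILE
# Download every URL listed in URLFILE (one per line), keeping metadata.
archive_videos() {
    # --batch-file       : read the URLs from the given file
    # --download-archive : record finished video IDs so reruns skip them
    # --ignore-errors    : continue with the next URL when one fails
    "$YTDL" \
        --batch-file "$1" \
        --download-archive downloaded.txt \
        --ignore-errors \
        --write-thumbnail \
        --write-description \
        --write-info-json \
        --restrict-filenames
}
```

Usage: `archive_videos urls.txt`. Rerunning the same command after an interruption skips the videos already listed in `downloaded.txt`.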

