Archival Tools: Difference between revisions

From Bibliotheca Anonoma
No edit summary
No edit summary
Line 76: Line 76:
   --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
   --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
</pre>
</pre>
== AWS Mass Scraping ==
Just like empty cargo containers returning to China, incoming bandwidth is free on Servers. Like, unlimited, at massive speeds. It's just outgoing that costs money.
Thus, this man leveraged it to scrape 10TB of data from domain names at light speed. Of course, the domain names actually stored were just in in the range of gigabytes.
http://blog.waleson.com/2016/01/parsing-10tb-of-metadata-26m-domains.html

Revision as of 15:32, 19 October 2016

Complete Website Archival

For wget, you can just replace the command with wpull and it will work.

Wget

Outputs plain HTML.

wget -mbc -np "http://aya.shii.org" \
   --convert-links \
   --adjust-extension \
   --page-requisites --no-check-certificate --restrict-file-names=nocontrol \
   -e robots=off \
   --waitretry 5 \
   --timeout 60 \
   --tries 5 \
   --wait 1 \
   --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
  • `-k` or `-K` - keep original file as `*.orig` after `--adjust-extension`. Otherwise, when remirroring with wget, each page will be redownloaded.

Wget WARC

Outputs in WARC format, ready for upload to the Internet Archive.

wget -mbc -np "http://aya.shii.org" \
   --page-requisites --no-check-certificate --restrict-file-names=nocontrol   -e robots=off \
   --waitretry 5 \
   --timeout 60 \
    --tries 5 \
   --wait 1 \
   --warc-file=aya.shii.org \
   --warc-cdx \
   --warc-max-size=1G \
   --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"

Archiving a list of URLs

Since I find it incredibly useful to be able to copy a list of links from a web page (or a Pastebin paste) to download with Wpull, I wanted others to know how to do this themselves.

  1. You’ll need the Firefox add-on “Copy All Links”. Install it.
  2. After installing it, go to the web page (or Pastebin paste–preferably as a raw paste so other nav links don’t interfere), right click, highlight “Copy All Links” -> “Current Tab” -> “All Links”.
    • To easily find Pastebin pastes with numerous links, restrict your Google Search to pastebin.com (add a subject and the word “links”). Like this: site:pastebin.com <subject> links.
  3. Head to Put Text In Alphabetical Order to organize the text alphabetically (and remove duplicate links). Check “Use a line break separator” and uncheck “Remove Punctuation, and Brackets”, under “Removal Options”. Leave “Remove Duplicates” checked, as it’s unnecessary to download a webpage twice.
  4. After you click “Alphabetize Text”, copy+paste your link list into a text file and save it.
  5. Open Linux terminal, type this command ( based on this command ), and run:
    • wpull -i TEXTFILE --page-requisites --no-robots --no-check-certificate --tries 3 --timeout 60 --delete-after --warc-file WARCNAME --warc-max-size=4294967296 --database DATABASE.db --output-file OUTPUT.log --user-agent "Scraper v1.0"
    • --youtube-dl - (Optional) add this if there are videos you want to download. Please download video hosting links standalone because it will give problems if you use this argument while downloading ordinary web pages.
    • --warc-append - (Optional) add this if a WARC stopped downloading and you want to resume.
  6. And lastly, you will need to install [internetarchive](https://github.com/jjjake/internetarchive). This link explains how to use and install the program.
  7. After installing internetarchive, use this command to upload, and you are finished:
ia upload <identifier> <file/foldername> \
--metadata="title:<title>" \
--metadata="subject:<tag>;<tag>;etc...

Youtube-dl

youtube-dl "http://www.youtube.com/playlist?list=PL3634152194A90D8B&feature=mh_lolz" \
   --write-thumbnail \
   --write-description \
   --write-info-json \
   --write-annotations \
   --write-sub \
   --all-subs \
   --add-metadata \
   --embed-subs \
   --restrict-filenames \
   --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"

AWS Mass Scraping

Just like empty cargo containers returning to China, incoming bandwidth is free on Servers. Like, unlimited, at massive speeds. It's just outgoing that costs money.

Thus, this man leveraged it to scrape 10TB of data from domain names at light speed. Of course, the domain names actually stored were just in in the range of gigabytes.

http://blog.waleson.com/2016/01/parsing-10tb-of-metadata-26m-domains.html