Archival Tools

From Bibliotheca Anonoma

Revision as of 05:50, 10 February 2019

Complete Website Archival

wpull is designed as a largely wget-compatible replacement, so in most cases you can swap wget for wpull in the commands below and they will work.

Wget

Outputs plain HTML.

wget -mbc -np "http://aya.shii.org" \
   --convert-links \
   --adjust-extension \
   --page-requisites --no-check-certificate --restrict-file-names=nocontrol \
   -e robots=off \
   --waitretry 5 \
   --timeout 60 \
   --tries 5 \
   --wait 1 \
   --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
  • `-k` and `-K` - use both together with `--adjust-extension` so the original file is kept as `*.orig`. Otherwise, when remirroring with wget, each renamed page will be redownloaded.

Wget WARC

Outputs in WARC format, ready for upload to the Internet Archive.

wget -mbc -np "http://aya.shii.org" \
   --page-requisites --no-check-certificate --restrict-file-names=nocontrol \
   -e robots=off \
   --waitretry 5 \
   --timeout 60 \
   --tries 5 \
   --wait 1 \
   --warc-file=aya.shii.org \
   --warc-cdx \
   --warc-max-size=1G \
   --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
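
As a rough sketch, the WARC command above could be wrapped in a small shell function so that the WARC name is derived from the URL instead of being typed by hand. The helper names `warc_name` and `archive_site` are hypothetical, and the sketch drops `-b`, `-c`, and the user-agent string so it runs in the foreground with fewer moving parts; add them back if you need them.

```shell
#!/bin/sh
# warc_name URL: print the host part of URL, for use as the --warc-file name.
# (Hypothetical helper; userinfo such as user@host is not handled.)
warc_name() {
    host=${1#*://}      # drop the scheme, if any
    host=${host%%/*}    # drop the path
    host=${host%%:*}    # drop the port
    printf '%s\n' "$host"
}

# archive_site URL: mirror URL into a WARC named after its host.
# (Hypothetical wrapper around the wget options shown above.)
archive_site() {
    url=$1
    wget -m -np "$url" \
        --page-requisites --no-check-certificate --restrict-file-names=nocontrol \
        -e robots=off \
        --waitretry 5 --timeout 60 --tries 5 --wait 1 \
        --warc-file="$(warc_name "$url")" \
        --warc-cdx \
        --warc-max-size=1G
}

# Example (not run here): archive_site "http://aya.shii.org"
```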

grab-site

grab-site can be hard to install because it needs newer libraries than many distributions ship, but it can be done even on CentOS 7.

First, install all the dependencies.

sudo yum install gcc make gcc-c++ sqlite-devel readline-devel re2-devel libxml2-devel libffi-devel openssl-devel bzip2-devel

Then follow the guide linked below to compile Python 3.7 per user. This lets each user upgrade Python themselves without admin intervention, though compiling Python takes considerably more time than installing a package.

https://github.com/ludios/grab-site#install-on-ubuntu-1604-1804-debian-9-stretch-debian-10-buster
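
Before starting the Python build, it may help to confirm that the build tools are actually on the PATH. The sketch below is only a sanity check under that assumption: `missing_tools` is a hypothetical helper, and it checks commands only, not the `-devel` header packages from the yum line above.

```shell
#!/bin/sh
# missing_tools NAME...: print each NAME that is not found on PATH, one per line.
# (Hypothetical helper; prints nothing when every tool is present.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example (not run here): missing_tools gcc g++ make
```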

Archiving a list of URLs

Since I find it incredibly useful to be able to copy a list of links from a web page (or a Pastebin paste) to download with Wpull, I wanted others to know how to do this themselves.

  1. You’ll need the Firefox add-on “Copy All Links”. Install it.
  2. After installing it, go to the web page (or Pastebin paste–preferably as a raw paste so other nav links don’t interfere), right click, highlight “Copy All Links” -> “Current Tab” -> “All Links”.
    • To easily find Pastebin pastes with numerous links, restrict your Google Search to pastebin.com (add a subject and the word “links”). Like this: site:pastebin.com <subject> links.
  3. Head to Put Text In Alphabetical Order to organize the text alphabetically (and remove duplicate links). Check “Use a line break separator” and uncheck “Remove Punctuation, and Brackets”, under “Removal Options”. Leave “Remove Duplicates” checked, as it’s unnecessary to download a webpage twice.
  4. After you click “Alphabetize Text”, copy+paste your link list into a text file and save it.
  5. Open a Linux terminal, type this command (based on this command), and run:
    • wpull -i TEXTFILE --page-requisites --no-robots --no-check-certificate --tries 3 --timeout 60 --delete-after --warc-file WARCNAME --warc-max-size=4294967296 --database DATABASE.db --output-file OUTPUT.log --user-agent "Scraper v1.0"
    • --youtube-dl - (Optional) add this if there are videos you want to download. Download video-hosting links in a separate run, because this argument causes problems when used while downloading ordinary web pages.
    • --warc-append - (Optional) add this if a WARC stopped downloading and you want to resume.
  6. And lastly, you will need to install [internetarchive](https://github.com/jjjake/internetarchive). This link explains how to use and install the program.
  7. After installing internetarchive, use this command to upload, and you are finished:
ia upload <identifier> <file/foldername> \
--metadata="title:<title>" \
--metadata="subject:<tag>;<tag>;" \
--metadata="description:This is the description. You can use HTML tags in it to make <br> line breaks."
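
If you would rather not paste your links into a website, steps 3 and 4 can probably be done locally instead. The sketch below is an alternative to the alphabetizing site, not the method described above: the hypothetical `clean_links` function keeps only lines that look like http(s) URLs, then sorts them and removes duplicates.

```shell
#!/bin/sh
# clean_links FILE: print the http(s) links in FILE, sorted, without duplicates.
# (Hypothetical helper; one link per line is assumed.)
clean_links() {
    tr -d '\r' < "$1" |          # strip Windows line endings from pasted text
        grep -E '^https?://' |   # keep only lines that start with a URL scheme
        LC_ALL=C sort -u         # sort bytewise and drop duplicate lines
}

# Example (not run here): clean_links copied-links.txt > TEXTFILE
```

The resulting file can then be passed to wpull with `-i`, as in step 5.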

Youtube-dl

youtube-dl "http://www.youtube.com/playlist?list=PL3634152194A90D8B&feature=mh_lolz" \
   --write-thumbnail \
   --write-description \
   --write-info-json \
   --write-annotations \
   --write-sub \
   --all-subs \
   --add-metadata \
   --embed-subs \
   --restrict-filenames \
   --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"

AWS Mass Scraping

Just like empty cargo containers returning to China, incoming bandwidth is free on servers: effectively unlimited, and at massive speeds. Only outgoing bandwidth costs money.

Thus, the author of the post linked below leveraged this to scrape 10TB of data from domain names at very high speed. Of course, the domain names actually stored were only in the range of gigabytes.

http://blog.waleson.com/2016/01/parsing-10tb-of-metadata-26m-domains.html

Handy Programs

  • internetarchive Uploader
    • The Internet Archive requires metadata for creating new objects. For now, just create new objects using the web interface and upload a small file.
    • Then, upload the real files from the server using:
    • ia upload <identifier> file1 file2
  • Wget - Wget makes WARCs.
  • Archive entire Imgur galleries. https://github.com/iceTwy/imgur-scraper