Archival Tools
Complete Website Archival
For any of the wget commands below, you can substitute wpull for wget and they will work the same way.
Wget
Outputs plain HTML.
```
wget -mbc -np "http://aya.shii.org" \
    --convert-links \
    --adjust-extension \
    --page-requisites \
    --no-check-certificate \
    --restrict-file-names=nocontrol \
    -e robots=off \
    --waitretry 5 \
    --timeout 60 \
    --tries 5 \
    --wait 1 \
    --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
```
- `-k`/`-K` - `-k` (`--convert-links`, already in the command above) rewrites links for local browsing; adding `-K` (`--backup-converted`) keeps each original file as `*.orig`. Without the backups, re-mirroring with wget redownloads every page.
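For example, to keep the `*.orig` backups described above, add `-K` to the mirror command. This is a minimal sketch of the same command; per the note at the top of this section, the same flags should also work with wpull swapped in for wget.

```
# Same mirror as above; -K (--backup-converted) keeps each pre-conversion file as *.orig.
wget -mbc -np "http://aya.shii.org" \
    --convert-links -K \
    --adjust-extension \
    --page-requisites \
    --no-check-certificate \
    --restrict-file-names=nocontrol \
    -e robots=off \
    --waitretry 5 --timeout 60 --tries 5 --wait 1 \
    --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
```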
Wget WARC
Outputs in WARC format, ready for upload to the Internet Archive.
```
wget -mbc -np "http://aya.shii.org" \
    --page-requisites \
    --no-check-certificate \
    --restrict-file-names=nocontrol \
    -e robots=off \
    --waitretry 5 \
    --timeout 60 \
    --tries 5 \
    --wait 1 \
    --warc-file=aya.shii.org \
    --warc-cdx \
    --warc-max-size=1G \
    --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
```
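Because `--warc-max-size` is set, wget splits the capture into multiple gzipped WARC segments next to the mirror directory, plus a CDX index from `--warc-cdx`. A quick sanity check could look like this (a sketch; the exact segment name `aya.shii.org-00000.warc.gz` is an assumption based on the `--warc-file` prefix above):

```
ls aya.shii.org*                              # mirror directory, WARC segments, CDX index
zcat aya.shii.org-00000.warc.gz | head -n 20  # peek at the first WARC record headers
```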
grab-site
grab-site can be particularly hard to install because of all the up-to-date libraries it needs, but it can be done even on CentOS 7.
First, install all the dependencies.
```
sudo yum install gcc gcc-c++ make sqlite-devel readline-devel re2-devel libxml2-devel libffi-devel openssl-devel bzip2-devel
```
Then follow the guide linked below and, as root, compile Python 3.7. Installing grab-site per user afterwards lets you upgrade it yourself without admin intervention, though compiling Python does take quite a while.
https://github.com/ludios/grab-site#install-on-ubuntu-1604-1804-debian-9-stretch-debian-10-buster
```
# yum install gcc make gcc-c++ sqlite-devel readline-devel re2-devel libxml2-devel libffi-devel gcc openssl-devel bzip2-devel
# cd /usr/src
# wget https://www.python.org/ftp/python/3.7.1/Python-3.7.1.tgz
# tar xzf Python-3.7.1.tgz
# cd Python-3.7.1
# ./configure --enable-optimizations --enable-loadable-sqlite-extensions
# make altinstall
```
Log out of root and switch back to your own user:
```
$ cd ~
$ echo 'export PATH=$PATH:/usr/local/bin' >> ~/.bashrc
$ pip3.7 install --process-dependency-links --no-binary --upgrade git+https://github.com/ludios/grab-site --user  # install as user
```
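Once grab-site is installed, starting a crawl is straightforward. A minimal sketch, assuming pip's usual `--user` script location (`~/.local/bin`) and the dashboard defaults described in the grab-site README:

```
# pip --user puts the grab-site scripts in ~/.local/bin, so make sure that is on PATH too.
export PATH="$PATH:$HOME/.local/bin"

gs-server &                       # optional web dashboard (http://127.0.0.1:29000/ per the README)
grab-site 'http://aya.shii.org/'  # crawls the site and writes WARCs into a new directory
```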
Archiving a list of URLs
I find it incredibly useful to be able to copy a list of links from a web page (or a Pastebin paste) and download them with wpull, so here is how to do it yourself.
- You’ll need the Firefox add-on “Copy All Links”. Install it.
- After installing it, go to the web page (or the raw Pastebin paste, so other nav links don't interfere), right-click, and choose “Copy All Links” -> “Current Tab” -> “All Links”.
- To easily find Pastebin pastes with numerous links, restrict your Google search to pastebin.com and add a subject and the word “links”, like this: `site:pastebin.com <subject> links`.
- Head to Put Text In Alphabetical Order to sort the list alphabetically and remove duplicate links. Check “Use a line break separator” and, under “Removal Options”, uncheck “Remove Punctuation, and Brackets”. Leave “Remove Duplicates” checked, since there is no point downloading a page twice.
- After you click “Alphabetize Text”, copy+paste your link list into a text file and save it.
- Open a Linux terminal, type this command ([based on this command](https://github.com/chfoo/wpull)), and run it:
```
wpull -i TEXTFILE --page-requisites --no-robots --no-check-certificate \
    --tries 3 --timeout 60 --delete-after \
    --warc-file WARCNAME --warc-max-size=4294967296 \
    --database DATABASE.db --output-file OUTPUT.log \
    --user-agent "Scraper v1.0"
```
- `--youtube-dl` - (optional) add this if there are videos you want to download. **Please download video-hosting links on their own, because this argument causes problems when used while downloading ordinary web pages.**
- `--warc-append` - (optional) add this if a WARC stopped downloading and you want to resume.
- And lastly, you will need to install [internetarchive](https://github.com/jjjake/internetarchive). This link explains how to use and install the program.
- After installing internetarchive, use this command to upload, and you are finished (first-time users also need to configure their archive.org credentials; see the note right after the command):
```
ia upload <identifier> <file/foldername> \
    --metadata="title:<title>" \
    --metadata="subject:<tag>;<tag>;" \
    --metadata="description:This is the description. You can use HTML tags in it to make <br> line breaks."
```
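Note that `ia upload` only works once the internetarchive tool knows your archive.org account. A minimal first-time setup sketch (the `mediatype:web` metadata is just an assumed example for a WARC upload):

```
pip3 install --user internetarchive    # skip if it is already installed
ia configure                           # prompts once for your archive.org email and password
ia upload <identifier> <file/foldername> --metadata="mediatype:web" --metadata="title:<title>"
```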
Youtube-dl
youtube-dl "http://www.youtube.com/playlist?list=PL3634152194A90D8B&feature=mh_lolz" \ --write-thumbnail \ --write-description \ --write-info-json \ --write-annotations \ --write-sub \ --all-subs \ --add-metadata \ --embed-subs \ --restrict-filenames \ --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
AWS Mass Scraping
Just like empty cargo containers returning to China, incoming bandwidth on servers is free: effectively unlimited, at massive speeds. Only outgoing traffic costs money.
The author of the post below leveraged this to scrape 10 TB of data from domain names at light speed; of course, what was actually stored was only in the range of gigabytes.
http://blog.waleson.com/2016/01/parsing-10tb-of-metadata-26m-domains.html
Handy Programs
- [internetarchive Uploader](https://pypi.python.org/pypi/internetarchive/)
  - The Internet Archive requires metadata when creating new items. For now, just create new items using the web interface and upload a small file; existing metadata can also be adjusted from the command line (see the sketch after this list).
  - Then, upload the real files from the server using:
  - `ia upload <identifier> file1 file2`
- [Wget](https://github.com/BASLQC/BASLQC/wiki/Wget) - Wget makes WARCs.
- Archive entire Imgur galleries. https://github.com/iceTwy/imgur-scraper
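As noted above, metadata on an existing Internet Archive item can also be adjusted from the command line instead of the web interface. A minimal sketch using the internetarchive CLI's `metadata` subcommand (the title and subject values are just examples):

```
# Add or change metadata on an item that already exists.
ia metadata <identifier> --modify="title:My Archived Site" --modify="subject:archive"
```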