Editing Archival Tools

From Bibliotheca Anonoma

== Complete Website Archival ==
Wpull accepts most of the same options as Wget, so in the Wget commands below you can usually just replace <code>wget</code> with <code>wpull</code>.


=== Wget ===
</pre>
</pre>


=== grab-site ===
Grab-site can be hard to install because it needs newer libraries than older distributions ship, but it can be done even on CentOS 7.
First, install all the dependencies.
{{bc|
sudo yum install gcc make gcc-c++ sqlite-devel readline-devel re2-devel libxml2-devel libffi-devel openssl-devel bzip2-devel
}}
Then follow the guide linked below to compile Python 3.7 and install grab-site per user. A per-user install lets users upgrade on their own without admin intervention, though compiling Python takes a lot more time than installing a package.
https://github.com/ludios/grab-site#install-on-ubuntu-1604-1804-debian-9-stretch-debian-10-buster
<pre>
# yum install gcc make gcc-c++ sqlite-devel readline-devel re2-devel libxml2-devel libffi-devel openssl-devel bzip2-devel
# cd /usr/src
# wget https://www.python.org/ftp/python/3.7.1/Python-3.7.1.tgz
# tar xzf Python-3.7.1.tgz
# cd Python-3.7.1
# ./configure --enable-optimizations --enable-loadable-sqlite-extensions
# make altinstall
</pre>
Log out of the root account, then continue as your own user:
<pre>
$ cd ~
$ echo 'export PATH=$PATH:/usr/local/bin' >> ~/.bashrc
$ pip3.7 install  --process-dependency-links --no-binary --upgrade git+https://github.com/ludios/grab-site --user # install as user
</pre>
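Python's build still succeeds when the headers for an optional module are missing, so it can be worth checking that the compiled interpreter picked up the development packages installed above. The following is only a minimal sketch: the function name <code>check_modules</code> is made up for illustration, and the module list in the example (sqlite3, ssl, bz2, ctypes, readline) is an assumed mapping from the <code>-devel</code> packages to standard-library modules.

```shell
#!/bin/sh
# check_modules INTERPRETER MODULE...
# Tries to import each module with the given interpreter and reports the
# ones that fail. Returns non-zero if any import failed.
check_modules() {
    interp=$1
    shift
    missing=0
    for mod in "$@"; do
        if "$interp" -c "import $mod" >/dev/null 2>&1; then
            echo "ok      $mod"
        else
            echo "MISSING $mod"
            missing=1
        fi
    done
    return "$missing"
}

# Example (assumes python3.7 from the steps above is on the PATH):
# check_modules python3.7 sqlite3 ssl bz2 ctypes readline
```

If a module is reported as missing, install the matching <code>-devel</code> package and rebuild Python before installing grab-site.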
=== Archiving a list of URLs ===
Since I find it incredibly useful to be able to copy a list of links from a web page (or a Pastebin paste) to download with Wpull, I wanted others to know how to do this themselves.
# You’ll need the Firefox add-on “[https://addons.mozilla.org/en-US/firefox/addon/copy-all-links/ Copy All Links]”. Install it.
# After installing it, go to the web page (or Pastebin paste–preferably as a raw paste so other nav links don’t interfere), right click, highlight “Copy All Links” -> “Current Tab” -> “All Links”.
#* To easily find Pastebin pastes with numerous links, restrict your Google Search to pastebin.com (add a subject and the word “links”). Like this: ''site:pastebin.com <subject> links''.
# Head to [http://www.textfixer.com/tools/alphabetize-text-words.php Put Text In Alphabetical Order] to organize the text alphabetically (and remove duplicate links). Check “Use a line break separator” and uncheck “Remove Punctuation, and Brackets”, under “Removal Options”. Leave “Remove Duplicates” checked, as it’s unnecessary to download a webpage twice.
# After you click “Alphabetize Text”, copy+paste your link list into a text file and save it.
# Open a Linux terminal, type this command ( [https://github.com/chfoo/wpull based on this command] ), and run it:
#* <pre>wpull -i TEXTFILE --page-requisites --no-robots --no-check-certificate --tries 3 --timeout 60 --delete-after --warc-file WARCNAME --warc-max-size=4294967296 --database DATABASE.db --output-file OUTPUT.log --user-agent &quot;Scraper v1.0&quot;</pre>
#* <code>--youtube-dl</code> - (Optional) add this if there are videos you want to download. '''Please download links to video hosting sites in a separate run, because this argument causes problems when used while downloading ordinary web pages.'''
#* <code>--warc-append</code> - (Optional) add this if a WARC stopped downloading and you want to resume.
# And lastly, you will need to install [https://github.com/jjjake/internetarchive internetarchive]. The linked page explains how to install and use the program.
# After installing internetarchive, use this command to upload, and you are finished:
<pre><nowiki>
ia upload <identifier> <file/foldername> \
--metadata="title:<title>" \
--metadata="subject:<tag>;<tag>;" \
--metadata="description:This is the description. You can use HTML tags in it to make <br> line breaks."
</nowiki></pre>
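The sorting and de-duplication of the link list (steps 3 and 4 above) can also be done in the terminal instead of through the web tool. This is a minimal sketch using standard Unix tools. The function names and file names are placeholders, and the video host pattern is only an illustrative assumption to adapt, not a complete list.

```shell
#!/bin/sh
# dedupe_links INPUT
# Prints the lines of INPUT sorted, without duplicates or blank lines,
# after stripping Windows line endings and surrounding whitespace.
dedupe_links() {
    tr -d '\r' < "$1" \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | grep -v '^$' \
        | LC_ALL=C sort -u
}

# split_video_links INPUT VIDEO_OUT PAGES_OUT
# Writes links to a few video hosts into VIDEO_OUT and everything else
# into PAGES_OUT, so the video list can be fetched in a separate run.
split_video_links() {
    pattern='^https?://([^/]*\.)?(youtube\.com|youtu\.be|vimeo\.com)/'
    grep -E  "$pattern" "$1" > "$2" || true
    grep -Ev "$pattern" "$1" > "$3" || true
}

# Example:
# dedupe_links copied-links.txt > TEXTFILE
# split_video_links TEXTFILE video-links.txt page-links.txt
```

Splitting the list this way makes it easier to keep <code>--youtube-dl</code> out of the run that downloads ordinary web pages.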
== Youtube-dl ==


   --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
</pre>
</pre>
== AWS Mass Scraping ==
Just like empty cargo containers returning to China, incoming bandwidth on AWS servers is free: effectively unlimited, and at massive speeds. Only outgoing traffic costs money.
The author of the post below took advantage of this to process 10TB of metadata covering 26 million domains very quickly. Of course, the data he actually stored was only in the range of gigabytes.
http://blog.waleson.com/2016/01/parsing-10tb-of-metadata-26m-domains.html
== Handy Programs ==
* [https://pypi.python.org/pypi/internetarchive/ internetarchive Uploader]
** The Internet Archive requires metadata for creating new objects. For now, just create new objects using the web interface and upload a small file.
** Then, upload the real files from the server using:
** <code>ia upload &lt;identifier&gt; file1 file2</code>
* [https://github.com/BASLQC/BASLQC/wiki/Wget Wget] - Wget makes WARCs.
* [https://github.com/iceTwy/imgur-scraper imgur-scraper] - Archives entire Imgur galleries.
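
When a crawl produces several WARC files, the <code>ia upload</code> step can be wrapped so that every file in a directory goes into one item. This is only a sketch under stated assumptions: the function name <code>upload_warcs</code> and the <code>IA_CMD</code> variable are hypothetical, the WARCs are assumed to end in <code>.warc.gz</code>, and the item is assumed to exist already (created through the web interface as described above).

```shell
#!/bin/sh
# upload_warcs IDENTIFIER DIRECTORY
# Uploads every *.warc.gz file in DIRECTORY to the item IDENTIFIER with a
# single "ia upload" call. Set IA_CMD=echo to print the command instead of
# running it. Returns non-zero if the directory has no matching files.
upload_warcs() {
    identifier=$1
    dir=$2
    # Expand the glob into the function's own positional parameters.
    set -- "$dir"/*.warc.gz
    [ -e "$1" ] || { echo "no .warc.gz files in $dir" >&2; return 1; }
    ${IA_CMD:-ia} upload "$identifier" "$@"
}

# Example dry run, then the real upload:
# IA_CMD=echo; upload_warcs my-item ./warcs
# unset IA_CMD; upload_warcs my-item ./warcs
```

Running it with <code>IA_CMD=echo</code> first is a cheap way to confirm the file list before anything is sent.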