<pre>
--warc-max-size=1G \
--user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
</pre>

=== grab-site ===

Grab-site can be particularly hard to install because it needs newer libraries than most distributions ship, but even on CentOS 7 it can be done.

First, install all the dependencies:

{{bc|
sudo yum install gcc gcc-c++ make sqlite-devel readline-devel re2-devel libxml2-devel libffi-devel openssl-devel bzip2-devel
}}

Then follow the upstream guide to compile Python 3.7 from source. Installing grab-site itself per user (with pip's <code>--user</code> flag, below) lets each user upgrade it without admin intervention, though compiling Python does take considerably more time than installing a package.

https://github.com/ludios/grab-site#install-on-ubuntu-1604-1804-debian-9-stretch-debian-10-buster

<pre>
# cd /usr/src
# wget https://www.python.org/ftp/python/3.7.1/Python-3.7.1.tgz
# tar xzf Python-3.7.1.tgz
# cd Python-3.7.1
# ./configure --enable-optimizations --enable-loadable-sqlite-extensions
# make altinstall
</pre>

Log out of root, then switch back to your own user:

<pre>
$ cd ~
$ echo 'export PATH=$PATH:/usr/local/bin' >> ~/.bashrc
$ source ~/.bashrc
$ pip3.7 install --process-dependency-links --no-binary :all: --upgrade git+https://github.com/ludios/grab-site --user # install as user
</pre>

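Once installed with <code>--user</code>, grab-site's scripts land in <code>~/.local/bin</code>. A minimal usage sketch, per the upstream README (the URL is a placeholder):

<pre>
$ export PATH=$PATH:~/.local/bin
$ gs-server &                    # dashboard at http://127.0.0.1:29000/
$ grab-site http://example.com/  # placeholder URL; crawls it and writes WARCs into a new directory
</pre>
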
<pre><nowiki>
ia upload <identifier> <file/foldername> \
    --metadata="title:<title>" \
    --metadata="subject:<tag>;<tag>;" \
    --metadata="description:This is the description. You can use HTML tags in it to make <br/> line breaks."
</nowiki></pre>

== Youtube-dl ==

<pre>
--user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
</pre>

== AWS Mass Scraping ==

Just like empty cargo containers returning to China, incoming bandwidth on servers is free: effectively unlimited, at massive speeds. Only outgoing traffic costs money.

One person leveraged this to scrape 10 TB of data from 26 million domains in short order; the parsed results he actually stored were just in the range of gigabytes.

http://blog.waleson.com/2016/01/parsing-10tb-of-metadata-26m-domains.html

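The pattern generalizes: download in parallel on the server, parse in flight, and keep only the distilled output. A hedged sketch of that idea with standard tools (<code>domains.txt</code> and the <code>titles/</code> directory are placeholders, not from the article):

<pre>
$ mkdir -p titles
$ xargs -P 64 -I{} sh -c \
    'curl -sL --max-time 10 "http://{}" | grep -io "<title>[^<]*" > "titles/{}"' \
    < domains.txt
</pre>

This fetches up to 64 domains at once and stores only each page's title, so the terabytes of incoming HTML never hit the disk or the billed outgoing side.
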
== Handy Programs ==

* [https://pypi.python.org/pypi/internetarchive/ internetarchive Uploader]
** The Internet Archive requires metadata when creating new objects. For now, just create new objects using the web interface and upload a small file.
** Then upload the real files from the server using <code>ia upload <identifier> file1 file2</code> (see the sketch after this list).
* [https://github.com/BASLQC/BASLQC/wiki/Wget Wget] - Wget makes WARCs.
* [https://github.com/iceTwy/imgur-scraper imgur-scraper] - Archive entire Imgur galleries.

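A hedged sketch of that internetarchive workflow end to end (the identifier and filenames are placeholders):

<pre>
$ pip install internetarchive      # or: pip3.7 install --user internetarchive
$ ia configure                     # prompts for your archive.org credentials
$ ia upload my-test-item file1.warc.gz file2.warc.gz   # item created earlier via the web interface
</pre>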