Baidu Tieba/Archiver

The API is hard to work out just from looking at the site, so we need to reverse engineer it from existing scraper code. Thankfully, the code itself is written in English even when the comments are not.

How to Grab
We will need to grab a huge amount of data from Baidu. Luckily, AWS has a Japan region with good peering to the rest of Asia and free incoming bandwidth. One person reportedly used AWS to grab terabytes of data for $10:

http://blog.waleson.com/2016/01/parsing-10tb-of-metadata-26m-domains.html

Baidu might perceive this as abnormally high traffic, so start with a low request rate and raise it gradually (cook the frog slowly).
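A minimal sketch of the slow ramp-up idea, assuming plain sequential HTTP requests. The function names (`ramp_delays`, `fetch_slowly`) and the delay values are hypothetical placeholders, not anything Baidu documents; the right numbers would have to be found by experiment.

```python
import time
import urllib.request


def ramp_delays(n_requests, start_delay=10.0, min_delay=1.0, decay=0.9):
    """Yield one delay (in seconds) per request, shrinking geometrically
    from start_delay down to min_delay."""
    delay = start_delay
    for _ in range(n_requests):
        yield delay
        delay = max(min_delay, delay * decay)


def fetch_slowly(urls, sleep=time.sleep, fetch=None):
    """Fetch each URL in a list in turn, pausing before every request."""
    if fetch is None:
        # Default fetcher: a plain GET with a timeout.
        fetch = lambda url: urllib.request.urlopen(url, timeout=30).read()
    pages = []
    for url, delay in zip(urls, ramp_delays(len(urls))):
        sleep(delay)  # long waits at first, shorter ones later
        pages.append(fetch(url))
    return pages
```

The `sleep` and `fetch` parameters exist only so the pacing can be tried out without touching the network.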

Scrapers
This one supposedly records the contents of a thread:

https://github.com/omnbmh/baidu-tieba-capture/


 * baidu/tieba/capture

Logging into Baidu

 * Logging into Baidu

Gathering Tieba Threads

 * Gather Tieba Threads
 * URL Format:
 * (thread id) -
 * (page number to display, i.e. the starting page number to crawl) -
 * - Total number of pages to increment up to
 * Image URL (img src tag):
 * Example:
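The three parameters in the list (thread id, starting page, total number of pages) can be turned into a list of page URLs roughly as follows. This is only a sketch: the URL shape `https://tieba.baidu.com/p/<thread id>?pn=<page>` is an assumption based on how thread links commonly look, not something these notes confirm, and `thread_page_urls` is a made-up helper name.

```python
from urllib.parse import urlencode

# Assumed URL shape; not confirmed by the notes above.
THREAD_URL = "https://tieba.baidu.com/p/{thread_id}"


def thread_page_urls(thread_id, start_page=1, total_pages=1):
    """Return one URL per page, from start_page up to total_pages inclusive."""
    base = THREAD_URL.format(thread_id=thread_id)
    return [
        base + "?" + urlencode({"pn": page})
        for page in range(start_page, total_pages + 1)
    ]
```

For example, `thread_page_urls(4089961953, 1, 3)` gives three URLs ending in `?pn=1`, `?pn=2` and `?pn=3`.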

Grab data only if the post has not been deleted:

 if (html.select("li.l_reply_num span").size != 0) { // the post has not been deleted
   pages = Integer.parseInt(html.select("li.l_reply_num span").get(1).text);
 }

Example post data (Chinese characters are Unicode-escaped) found under

{"author":{"user_id":237987149,"user_name":"zangkuiyhq","props":{"1070002":{"num":8,"end_time":1403538299,"notice":0}}},"content":{"post_id":77078081563,"is_anonym":false,"forum_id":139226,"thread_id":4089961953,"content":"b22fce2a0165DIYba14066e261f.c0f427d23efb9812c5f53,.92742762ff60c38fdc6846fe24c.e7f44ad34.c34d34.5f9e8bd346f43a5390b7b.53e24b3bbe72.98116cb63.e0d981c4216ce73.97d97d70b70b4279c4.e0dee51eadf159c97dee552824be2d743229<c2814e,c2814e,c0f427e4b927fcc>.6f4e0d9816e0e3a015f97f6a22beba2d875f1eadf1. 72cd342a540d210e3aDIY427b9ee60c0f427.b9ee6071f927427709743229e0d01a7e56f43a526593a0a8684c0f427743229. ee1708ee540e0a83ef0fdc3162fe0befbDIY927427e3b684019009ebae4be004e6.71ff850a86848683b0. 981c42bcf468728ebf5f65f4e0dc11e8e5929002 0fd19f7e5c0f427684e492a1002e76e140fdb6386efd0528c0f427743229002 1ea9c9e26934ef42a4427185b21e8f0020fd3ca5f6bf9a813d1e8bef6f5c1fa3cd620002 16cb63002b636f4002e0dee5743c0b9c1002  2a540d83
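Once a complete copy of such a JSON blob is in hand, pulling out the useful fields is straightforward. A rough sketch, assuming the blob has the `author` and `content` keys seen in the example; `SAMPLE` is a shortened stand-in with placeholder text, not real post data, and `parse_post` is a hypothetical helper name.

```python
import json

# Shortened sample with the same keys as the example above; the "content"
# text is a placeholder, not real post data.
SAMPLE = (
    '{"author": {"user_id": 237987149, "user_name": "zangkuiyhq"},'
    ' "content": {"post_id": 77078081563, "is_anonym": false,'
    ' "forum_id": 139226, "thread_id": 4089961953,'
    ' "content": "\\u4f60\\u597d"}}'
)


def parse_post(raw):
    """Flatten one post's JSON into a small dict of fields worth archiving."""
    data = json.loads(raw)
    author = data.get("author", {})
    content = data.get("content", {})
    return {
        "user_id": author.get("user_id"),
        "user_name": author.get("user_name"),
        "post_id": content.get("post_id"),
        "thread_id": content.get("thread_id"),
        # json.loads has already turned Unicode escapes into real characters
        "text": content.get("content"),
    }
```

Using `.get()` throughout means a post with missing fields yields `None` values instead of raising, which is convenient when crawling in bulk.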

Tieba Automatic Sign In Script
https://github.com/kikyous/tieba

https://github.com/skyline75489/baidu-tieba-auto-sign/blob/master/baidu-tieba-auto-sign.py

Apparently uses Python's urllib to sign in automatically every day.

Tieba Bot
https://github.com/piglei/tieba_poster/blob/master/baidu_poster.py

A bot written in Python that posts to a Tieba forum automatically.

Tieba Washer
https://github.com/tigerstudent/TiebaWasher

Spams Tieba forums with junk posts.