Baidu Tieba/Archiver

From Bibliotheca Anonoma

It is hard to figure out how the API works just by looking at the site, so we need to work it out by reverse engineering existing code. Thankfully, code can only be written in English.

How to Grab

We will need to grab fucktons of data from Baidu. Luckily, AWS has a Japan region with good peering to the rest of Asia and free incoming bandwidth, and someone reportedly used AWS to grab terabytes of data for $10:

http://blog.waleson.com/2016/01/parsing-10tb-of-metadata-26m-domains.html

Baidu might perceive this as high traffic, so cook the frog slowly: start with a low request rate and ramp up gradually.
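Cooking the frog slowly can be sketched as a small throttle that enforces a minimum gap between requests, plus some random jitter. The class name Throttle, the default delays, and the injectable clock and sleep parameters are illustrative assumptions; nothing here comes from Baidu, and workable delays would have to be found by experiment.

```python
import random
import time


class Throttle:
    """Enforce a minimum, slightly randomized gap between requests."""

    def __init__(self, min_delay=5.0, jitter=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay  # seconds between requests (assumed value)
        self.jitter = jitter        # extra random wait, up to this many seconds
        self._clock = clock         # injectable so the class can be tested
        self._sleep = sleep
        self._last = None           # time of the previous request, if any

    def wait(self):
        """Block until enough time has passed since the previous request."""
        now = self._clock()
        if self._last is not None:
            gap = self.min_delay + random.uniform(0, self.jitter)
            remaining = gap - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call wait() on a shared Throttle instance before every request. To ramp up gradually, lower min_delay in small steps over several days rather than all at once.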

Scrapers

This scraper supposedly records the contents of a thread:

https://github.com/omnbmh/baidu-tieba-capture/

Logging into Baidu

Gathering Tieba Threads

Grab data only if the post wasn't deleted (帖子没有被删除, "the post has not been deleted"):

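// Assumes jsoup; html is the parsed thread page (org.jsoup.nodes.Document).
Elements spans = html.select("li.l_reply_num span");
if (spans.size() > 1) {
    // The post has not been deleted; the second span holds the page count.
    pages = Integer.parseInt(spans.get(1).text());
}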
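For a scraper written in Python, the same check can be sketched with only the standard library. ReplyNumParser and page_count are hypothetical names, and the idea that the second span is the page count is inferred from the variable name in the Java snippet, not from any Baidu documentation.

```python
from html.parser import HTMLParser


class ReplyNumParser(HTMLParser):
    """Collect the text of every <span> inside an <li> with class l_reply_num."""

    def __init__(self):
        super().__init__()
        self.spans = []       # text of each matching span, in document order
        self._li_depth = 0    # > 0 while inside a matching <li>
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and (self._li_depth or "l_reply_num" in classes):
            self._li_depth += 1
        elif tag == "span" and self._li_depth:
            self._in_span = True
            self.spans.append("")

    def handle_endtag(self, tag):
        if tag == "li" and self._li_depth:
            self._li_depth -= 1
        elif tag == "span":
            self._in_span = False

    def handle_data(self, data):
        if self._in_span:
            self.spans[-1] += data


def page_count(html):
    """Return the page count, or None if the thread looks deleted."""
    parser = ReplyNumParser()
    parser.feed(html)
    parser.close()
    if len(parser.spans) > 1:
        return int(parser.spans[1].strip())
    return None
```

A return value of None corresponds to the deleted-post case, where the li.l_reply_num element is missing from the page.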

Example post data (Chinese characters are Unicode-escaped), found under <div class="p_postlist" id="j_p_postlist"><div class="l_post l_post_bright j_l_post clearfix "

{"author":{"user_id":237987149,"user_name":"zangkuiyhq","props":{"1070002":{"num":8,"end_time":1403538299,"notice":0}}},"content":{"post_id":77078081563,"is_anonym":false,"forum_id":139226,"thread_id":4089961953,"content":"b22fce2a0165DIYba14066e261f.c0f427d23efb9812c5f53,.92742762ff60c38fdc6846fe24c.e7f44ad34.c34d34.5f9e8bd346f43a5390b7b.53e24b3bbe72.98116cb63.e0d981c4216ce73.97d97d70b70b4279c4.e0dee51eadf159c97dee552824be2d743229<c2814e,c2814e,c0f427e4b927fcc>.6f4e0d9816e0e3a015f97f6a22beba2d875f1eadf1.<br>72cd342a540d210e3aDIY427b9ee60c0f427.b9ee6071f927427709743229e0d01a7e56f43a526593a0a8684c0f427743229.<br>ee1708ee540e0a83ef0fdc3162fe0befbDIY927427e3b684019009ebae4be004e6.71ff850a86848683b0.<br><br><br>981c42bcf468728ebf5f65f4e0dc11e8e5929002<br>0fd19f7e5c0f427684e492a1002e76e140fdb6386efd0528c0f427743229002<br>1ea9c9e26934ef42a4427185b21e8f0020fd3ca5f6bf9a813d1e8bef6f5c1fa3cd620002<br>16cb63002b636f4002e0dee5743c0b9c1002<br><br><br>2a540d83
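Once a post's JSON has been pulled out of the page, Python's json module decodes it and turns the \uXXXX escapes back into Chinese characters. The sketch below is only an illustration: SAMPLE is trimmed from the example above and written with plain ASCII quotes, its content string is a made-up stand-in because the original is truncated, and summarize_post is a hypothetical helper name.

```python
import json

# Trimmed from the example above; the "content" string is a made-up
# stand-in because the original value is truncated.
SAMPLE = (
    '{"author":{"user_id":237987149,"user_name":"zangkuiyhq"},'
    '"content":{"post_id":77078081563,"is_anonym":false,"forum_id":139226,'
    '"thread_id":4089961953,"content":"\\u4f60\\u597d<br>DIY"}}'
)


def summarize_post(raw):
    """Decode one post's JSON and return the fields worth archiving."""
    data = json.loads(raw)  # \uXXXX escapes become real characters here
    author, content = data["author"], data["content"]
    return {
        "user_id": author["user_id"],
        "user_name": author["user_name"],
        "post_id": content["post_id"],
        "thread_id": content["thread_id"],
        "forum_id": content["forum_id"],
        "text": content["content"],  # still contains HTML such as <br>
    }
```

The text field is left as raw HTML, so the archive keeps exactly what Baidu served; stripping tags can happen later without losing anything.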

Tieba Automatic Sign In Script

https://github.com/kikyous/tieba

https://github.com/skyline75489/baidu-tieba-auto-sign/blob/master/baidu-tieba-auto-sign.py

Apparently uses urllib to sign in every day.

Tieba Bot

https://github.com/piglei/tieba_poster/blob/master/baidu_poster.py

A bot written in Python that posts to a tieba automatically.

Tieba Washer

https://github.com/tigerstudent/TiebaWasher

Spams tiebas with junk.

Sources