Baidu Tieba/Archiver

From Bibliotheca Anonoma

The API is hard to work out just by looking at the site, so we need to reverse engineer it from existing code. Thankfully, code can only be written in English.

How to Grab

We will need to grab fucktons of data from Baidu. Luckily, AWS has a Japan node which has good peering to Asia and free incoming bandwidth. Reportedly, one person used AWS to grab terabytes of data for $10:

http://blog.waleson.com/2016/01/parsing-10tb-of-metadata-26m-domains.html

A crawl of this size may be perceived as abnormally high traffic, so cook the frog slowly: start with a low request rate and raise it gradually.
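One way to cook the frog slowly is to put a long pause between the first requests and shorten it a little after each one, never dropping below a fixed floor. The sketch below is only an illustration: the class name SlowRamp and the start, floor and decay values are hypothetical, and none of them reflect known Baidu limits.

```java
/** Hypothetical helper: spaces out requests and tightens the gap only gradually. */
class SlowRamp {
    private final long floorDelayMs;   // never wait less than this between requests
    private final double decay;        // multiplier applied after each request, 0 < decay < 1
    private double currentDelayMs;     // gap to wait before the next request

    SlowRamp(long startDelayMs, long floorDelayMs, double decay) {
        if (floorDelayMs < 0 || startDelayMs < floorDelayMs || decay <= 0 || decay >= 1) {
            throw new IllegalArgumentException("need 0 <= floor <= start and 0 < decay < 1");
        }
        this.floorDelayMs = floorDelayMs;
        this.decay = decay;
        this.currentDelayMs = startDelayMs;
    }

    /** Returns the delay to wait before the next request, then shortens it slightly. */
    long nextDelayMs() {
        long delay = Math.round(currentDelayMs);
        currentDelayMs = Math.max(floorDelayMs, currentDelayMs * decay);
        return delay;
    }

    /** Sleeps for the current delay; call this before every request. */
    void pause() throws InterruptedException {
        Thread.sleep(nextDelayMs());
    }
}
```

Calling pause() before each fetch keeps the request rate low at first and lets it rise over time. A real crawl would pick much gentler numbers than a test would, for example a start of 30 seconds and a decay of 0.999.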

Scrapers

This one supposedly records the contents of a thread:

https://github.com/omnbmh/baidu-tieba-capture/

Logging into Baidu

Gathering Tieba Threads

Grab data only if the post has not been deleted (帖子没有被删除, "the post has not been deleted").

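// Uses jsoup (org.jsoup.select.Elements); html is the parsed thread page.
Elements replySpans = html.select("li.l_reply_num span");
if (replySpans.size() > 1) {
    // The post has not been deleted; the second span holds the page count.
    pages = Integer.parseInt(replySpans.get(1).text().trim());
}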
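Once pages is known, each page of the thread can be fetched in turn. The sketch below builds the list of page URLs. It assumes threads are served at https://tieba.baidu.com/p/<thread_id>?pn=<page>; that pattern is not taken from the code above and should be checked against a live thread. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that lists the URLs of every page in a thread. */
class ThreadPages {
    // Assumed URL prefix; verify before relying on it.
    static final String THREAD_BASE = "https://tieba.baidu.com/p/";

    /** Builds one URL per page, numbering pages from 1 to pages inclusive. */
    static List<String> pageUrls(long threadId, int pages) {
        List<String> urls = new ArrayList<>();
        for (int pn = 1; pn <= pages; pn++) {
            urls.add(THREAD_BASE + threadId + "?pn=" + pn);
        }
        return urls;
    }
}
```

If pages stays at 0 because the thread was deleted, the list is empty and nothing is fetched.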

Example post data (Chinese characters are Unicode-escaped), found under <div class="p_postlist" id="j_p_postlist"><div class="l_post l_post_bright j_l_post clearfix "

{"author":{"user_id":237987149,"user_name":"zangkuiyhq","props":{"1070002":{"num":8,"end_time":1403538299,"notice":0}}},"content":{"post_id":77078081563,"is_anonym":false,"forum_id":139226,"thread_id":4089961953,"content":"b22fce2a0165DIYba14066e261f.c0f427d23efb9812c5f53,.92742762ff60c38fdc6846fe24c.e7f44ad34.c34d34.5f9e8bd346f43a5390b7b.53e24b3bbe72.98116cb63.e0d981c4216ce73.97d97d70b70b4279c4.e0dee51eadf159c97dee552824be2d743229<c2814e,c2814e,c0f427e4b927fcc>.6f4e0d9816e0e3a015f97f6a22beba2d875f1eadf1.<br>72cd342a540d210e3aDIY427b9ee60c0f427.b9ee6071f927427709743229e0d01a7e56f43a526593a0a8684c0f427743229.<br>ee1708ee540e0a83ef0fdc3162fe0befbDIY927427e3b684019009ebae4be004e6.71ff850a86848683b0.<br><br><br>981c42bcf468728ebf5f65f4e0dc11e8e5929002<br>0fd19f7e5c0f427684e492a1002e76e140fdb6386efd0528c0f427743229002<br>1ea9c9e26934ef42a4427185b21e8f0020fd3ca5f6bf9a813d1e8bef6f5c1fa3cd620002<br>16cb63002b636f4002e0dee5743c0b9c1002<br><br><br>2a540d83
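The numeric IDs in this data are the fields an archiver most needs for indexing. The sketch below pulls them out with a regular expression, using only the Java standard library. It assumes the JSON has already been extracted from the page and uses plain double quotes. The class and method names are hypothetical, and a proper JSON parser would be more robust than a regex. Note that post_id does not fit in a 32-bit int, so the values are read as long.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for reading numeric fields out of the post JSON. */
class PostFields {
    /** Finds the first "name":<digits> pair in the JSON text, if any. */
    static OptionalLong numericField(String json, String name) {
        // Match the quoted key, a colon, then an unsigned run of digits.
        Pattern p = Pattern.compile("\"" + Pattern.quote(name) + "\"\\s*:\\s*(\\d+)");
        Matcher m = p.matcher(json);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }
}
```

For the example above, numericField(json, "post_id") would yield 77078081563 and numericField(json, "thread_id") would yield 4089961953. A key that is absent returns an empty OptionalLong rather than throwing.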

Tieba Automatic Sign In Script

https://github.com/kikyous/tieba

https://github.com/skyline75489/baidu-tieba-auto-sign/blob/master/baidu-tieba-auto-sign.py

Apparently uses Python's urllib to sign in every day.

Tieba Bot

https://github.com/piglei/tieba_poster/blob/master/baidu_poster.py

A bot written in Python that automatically posts to a tieba.

Tieba Washer

https://github.com/tigerstudent/TiebaWasher

Spams tiebas with junk.

Sources