Baidu Tieba/Archiver

The API is hard to work out just by looking at the site, so we need to figure it out by reverse engineering existing code. Thankfully, code has to be written in English.

How to Grab

We will need to grab fucktons of data from Baidu. Luckily, AWS has a Japan region with good peering to the rest of Asia, and incoming bandwidth is free. Reportedly, one person used AWS to parse terabytes of data for $10:

http://blog.waleson.com/2016/01/parsing-10tb-of-metadata-26m-domains.html

A bulk crawl may be perceived as high traffic, so cook the frog slowly: start with a low request rate and ramp up gradually.
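One way to cook the frog slowly is to pause between requests with a randomized delay and back off when a fetch fails. This is only a minimal sketch: the `polite_fetch` name, the injected `fetch` callable, and every delay value are hypothetical, and nothing here reflects a documented Baidu limit.

```python
import random
import time


def polite_fetch(urls, fetch, base_delay=2.0, jitter=1.0, max_retries=3,
                 sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is any callable that takes a URL and returns the body, for
    example a thin wrapper around urllib.request.urlopen. All delay
    values are guesses; tune them against real behaviour.
    """
    results = {}
    for url in urls:
        for attempt in range(max_retries):
            try:
                results[url] = fetch(url)
                break
            except OSError:
                # Back off exponentially after a failed request.
                sleep(base_delay * (2 ** attempt))
        else:
            results[url] = None  # gave up after max_retries failures
        # Randomized pause so requests do not arrive at a fixed rhythm.
        sleep(base_delay + random.uniform(0, jitter))
    return results
```

Passing `sleep` as a parameter keeps the function easy to test without actually waiting.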

Scrapers

baidu-tieba-capture supposedly records the contents of a thread:

https://github.com/omnbmh/baidu-tieba-capture/

baidu/tieba/capture

Logging into Baidu

Gathering Tieba Threads

Gather Tieba Threads
URL format: "http://tieba.baidu.com/p/%s?pn=%s" % (id, pn)
id - thread ID
pn - page number to display (the starting page number to scrape)
pages - total number of pages; increment pn up to this value
Image URL prefix (img src attribute): "http://imgsrc.baidu.com/forum/"
Example: http://imgsrc.baidu.com/forum/w%3D580/sign=457ea272292eb938ec6d7afae56385fe/632799504fc2d56294ac717ee11190ef77c66c98.jpg
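The URL scheme above can be sketched in a few lines of Python. The function names are hypothetical, and starting `pn` at 1 is an assumption rather than something the notes confirm.

```python
THREAD_URL = "http://tieba.baidu.com/p/%s?pn=%s"
IMAGE_PREFIX = "http://imgsrc.baidu.com/forum/"


def thread_page_urls(thread_id, pages, start_pn=1):
    """Return the URL of every page from start_pn up to pages, inclusive."""
    return [THREAD_URL % (thread_id, pn) for pn in range(start_pn, pages + 1)]


def is_forum_image(src):
    """True if an <img> src value points at the Tieba image host."""
    return src.startswith(IMAGE_PREFIX)
```

For example, `thread_page_urls(4089961953, 3)` yields the three page URLs for that thread, and `is_forum_image` can be used to pick post images out of all the `img` tags on a page.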

Grab data only if the post has not been deleted:

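// html is a parsed org.jsoup.nodes.Document; Elements is org.jsoup.select.Elements.
Elements replyNums = html.select("li.l_reply_num span");
if (replyNums.size() > 1) {
    // The post has not been deleted; the second span is read as the page count.
    pages = Integer.parseInt(replyNums.get(1).text().trim());
}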
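The same check can be sketched in Python using only the standard library. The parser class, the `total_pages` helper, and the sample markup in the usage note are hypothetical; only the `li.l_reply_num span` selector and the use of the second span come from the snippet above.

```python
from html.parser import HTMLParser


class ReplyNumParser(HTMLParser):
    """Collect the text of every <span> inside <li class="l_reply_num">."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0      # > 0 while inside a matching <li>
        self.in_span = False
        self.spans = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and (self.li_depth or "l_reply_num" in classes):
            self.li_depth += 1
        elif tag == "span" and self.li_depth:
            self.in_span = True
            self.spans.append("")

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth:
            self.li_depth -= 1
        elif tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.spans[-1] += data


def total_pages(html):
    """Return the page count, or None if the thread looks deleted."""
    parser = ReplyNumParser()
    parser.feed(html)
    if len(parser.spans) > 1:
        return int(parser.spans[1].strip())
    return None
```

Calling `total_pages` on a page that lacks the `l_reply_num` element returns `None`, which the caller can treat as "deleted, skip it". This sketch does not handle nested spans.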

Example post data (Chinese characters are Unicode-escaped), found under <div class="p_postlist" id="j_p_postlist"> in each <div class="l_post l_post_bright j_l_post clearfix "> element:

{"author":{"user_id":237987149,"user_name":"zangkuiyhq","props":{"1070002":{"num":8,"end_time":1403538299,"notice":0}}},"content":{"post_id":77078081563,"is_anonym":false,"forum_id":139226,"thread_id":4089961953,"content":"b22fce2a0165DIYba14066e261f.c0f427d23efb9812c5f53,.92742762ff60c38fdc6846fe24c.e7f44ad34.c34d34.5f9e8bd346f43a5390b7b.53e24b3bbe72.98116cb63.e0d981c4216ce73.97d97d70b70b4279c4.e0dee51eadf159c97dee552824be2d743229<c2814e,c2814e,c0f427e4b927fcc>.6f4e0d9816e0e3a015f97f6a22beba2d875f1eadf1.<br>72cd342a540d210e3aDIY427b9ee60c0f427.b9ee6071f927427709743229e0d01a7e56f43a526593a0a8684c0f427743229.<br>ee1708ee540e0a83ef0fdc3162fe0befbDIY927427e3b684019009ebae4be004e6.71ff850a86848683b0.<br><br><br>981c42bcf468728ebf5f65f4e0dc11e8e5929002<br>0fd19f7e5c0f427684e492a1002e76e140fdb6386efd0528c0f427743229002<br>1ea9c9e26934ef42a4427185b21e8f0020fd3ca5f6bf9a813d1e8bef6f5c1fa3cd620002<br>16cb63002b636f4002e0dee5743c0b9c1002<br><br><br>2a540d83
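Since the post data is JSON, Python's `json` module should decode the Unicode escapes without extra work. A minimal sketch follows: the field names are taken from the example above, but the `content` text is a made-up placeholder (the original example is truncated) and `parse_post` is a hypothetical helper.

```python
import json

# Field names come from the example above; the "content" value is a
# made-up placeholder, since the original example is truncated.
raw = (
    '{"author":{"user_id":237987149,"user_name":"zangkuiyhq"},'
    '"content":{"post_id":77078081563,"is_anonym":false,'
    '"forum_id":139226,"thread_id":4089961953,'
    '"content":"\\u4f60\\u597d<br>DIY"}}'
)


def parse_post(raw_json):
    """Flatten one post record into the fields an archiver might keep."""
    data = json.loads(raw_json)  # \uXXXX escapes are decoded here
    author, content = data["author"], data["content"]
    return {
        "user_id": author["user_id"],
        "user_name": author["user_name"],
        "post_id": content["post_id"],
        "thread_id": content["thread_id"],
        "text": content["content"],
    }


post = parse_post(raw)
```

The post body keeps its inline HTML such as `<br>`, so an archiver would need to decide whether to store it raw or strip the tags.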

Tieba Automatic Sign In Script

https://github.com/kikyous/tieba

https://github.com/skyline75489/baidu-tieba-auto-sign/blob/master/baidu-tieba-auto-sign.py

Apparently uses urllib to sign in automatically every day.

Tieba Bot

https://github.com/piglei/tieba_poster/blob/master/baidu_poster.py

A bot written in Python that posts to a tieba automatically.

Tieba Washer

https://github.com/tigerstudent/TiebaWasher

Spams tiebas with junk posts.

Sources

Official Baidu Tieba SDK: http://pan.baidu.com/s/1pJ18AiJ
