Baidu Tieba/Archiver
It is hard to work out how the API works just by looking at the site, so we need to figure it out by reverse engineering existing code. Thankfully, code can only be written in English.
How to Grab
We will need to grab enormous amounts of data from Baidu. Luckily, AWS has a Japan region with good peering to the rest of Asia and free incoming bandwidth. Reportedly, someone has used AWS to process terabytes of data for about $10:
http://blog.waleson.com/2016/01/parsing-10tb-of-metadata-26m-domains.html
A crawl of this size might be perceived as abnormally high traffic, so ramp up the request rate gradually (cook the frog slowly).
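One way to cook the frog slowly is to start with a long pause between requests and shorten it a little after each one. The sketch below is only an illustration: the class name `SlowRamp` and all of its default numbers are assumptions, not values known to be safe for Baidu.

```python
import time


class SlowRamp:
    """Pause between requests, shrinking the pause gradually toward a floor.

    Hypothetical helper: the default delays are guesses, not measured limits.
    """

    def __init__(self, start_delay=10.0, min_delay=1.0, factor=0.9,
                 sleep=time.sleep):
        self.delay = start_delay
        self.min_delay = min_delay
        self.factor = factor
        self._sleep = sleep  # injectable so the class can be tried without waiting

    def wait(self):
        """Sleep for the current delay, then shorten the next one slightly."""
        self._sleep(self.delay)
        self.delay = max(self.min_delay, self.delay * self.factor)

    def back_off(self):
        """Double the delay, e.g. after an error response or a captcha page."""
        self.delay *= 2
```

Call `wait()` before every request and `back_off()` whenever a response looks like throttling, so the crawl slows down again instead of pushing harder.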
Scrapers
Supposedly this one records the contents of a thread:
https://github.com/omnbmh/baidu-tieba-capture/
Logging into Baidu
Gathering Tieba Threads
- URL Format: "http://tieba.baidu.com/p/%s?pn=%s" % (id, pn)
  - id: thread id
  - pn: page number to display (the starting page number to crawl)
  - pages: total number of pages to increment pn up to
- Image URL (img src tag): "http://imgsrc.baidu.com/forum/"
  - Example: http://imgsrc.baidu.com/forum/w%3D580/sign=457ea272292eb938ec6d7afae56385fe/632799504fc2d56294ac717ee11190ef77c66c98.jpg
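The URL format above can be turned into a list of pages to fetch. This is a minimal sketch: the function names are made up for illustration, and it assumes that pn starts at 1 and that the last page number equals the total page count.

```python
THREAD_URL = "http://tieba.baidu.com/p/%s?pn=%s"
IMAGE_PREFIX = "http://imgsrc.baidu.com/forum/"


def thread_page_urls(thread_id, pages, start_pn=1):
    """Return one URL per page, from start_pn up to and including pages.

    Assumption: pn is 1-based and 'pages' is the total page count.
    """
    return [THREAD_URL % (thread_id, pn) for pn in range(start_pn, pages + 1)]


def is_forum_image(src):
    """True if an <img> src value points at the Tieba image host."""
    return src.startswith(IMAGE_PREFIX)
```

For example, `thread_page_urls(4089961953, 3)` yields the three page URLs of that thread, and `is_forum_image` can be used to keep only post images and skip avatars or emoticons served from other hosts.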
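Grab data only if the thread has not been deleted. The second span under li.l_reply_num holds the total number of pages:
if (html.select("li.l_reply_num span").size() > 1) {
    // The thread has not been deleted; get(1) needs at least two spans
    pages = Integer.parseInt(html.select("li.l_reply_num span").get(1).text());
}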
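The same check can be sketched in Python with only the standard library. The names `ReplyNumParser` and `total_pages` are hypothetical, and the only thing taken from the snippet above is the `li.l_reply_num span` selector and the use of the second span as the page count.

```python
from html.parser import HTMLParser


class ReplyNumParser(HTMLParser):
    """Collect the text of every <span> inside <li class="l_reply_num">."""

    def __init__(self):
        super().__init__()
        self.spans = []
        self._in_li = False
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "l_reply_num" in classes:
            self._in_li = True
        elif tag == "span" and self._in_li:
            self._in_span = True
            self.spans.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False
        elif tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._in_span:
            self.spans[-1] += data


def total_pages(html):
    """Return the page count, or None if the thread looks deleted."""
    parser = ReplyNumParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    if len(parser.spans) < 2:
        return None  # no reply counter: treat the thread as deleted
    return int(parser.spans[1].strip())
```

Returning `None` instead of raising lets the caller skip deleted threads without a try/except around every fetch.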
Example post data (Chinese characters are Unicode-escaped), found under <div class="p_postlist" id="j_p_postlist"> in each <div class="l_post l_post_bright j_l_post clearfix ">:
{"author":{"user_id":237987149,"user_name":"zangkuiyhq","props":{"1070002":{"num":8,"end_time":1403538299,"notice":0}}},"content":{"post_id":77078081563,"is_anonym":false,"forum_id":139226,"thread_id":4089961953,"content":"b22fce2a0165DIYba14066e261f.c0f427d23efb9812c5f53,.92742762ff60c38fdc6846fe24c.e7f44ad34.c34d34.5f9e8bd346f43a5390b7b.53e24b3bbe72.98116cb63.e0d981c4216ce73.97d97d70b70b4279c4.e0dee51eadf159c97dee552824be2d743229<c2814e,c2814e,c0f427e4b927fcc>.6f4e0d9816e0e3a015f97f6a22beba2d875f1eadf1.<br>72cd342a540d210e3aDIY427b9ee60c0f427.b9ee6071f927427709743229e0d01a7e56f43a526593a0a8684c0f427743229.<br>ee1708ee540e0a83ef0fdc3162fe0befbDIY927427e3b684019009ebae4be004e6.71ff850a86848683b0.<br><br><br>981c42bcf468728ebf5f65f4e0dc11e8e5929002<br>0fd19f7e5c0f427684e492a1002e76e140fdb6386efd0528c0f427743229002<br>1ea9c9e26934ef42a4427185b21e8f0020fd3ca5f6bf9a813d1e8bef6f5c1fa3cd620002<br>16cb63002b636f4002e0dee5743c0b9c1002<br><br><br>2a540d83
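Pulling this JSON out of a page might look like the sketch below. It assumes the JSON sits in a `data-field` attribute on each `l_post` div; the notes above only say it is found under that div, so the attribute name is a guess to verify against a real page. `PostFieldParser` and `extract_posts` are hypothetical names. `json.loads` decodes the `\uXXXX` escapes back into Chinese characters.

```python
import json
from html.parser import HTMLParser


class PostFieldParser(HTMLParser):
    """Collect the decoded JSON attached to every post <div>."""

    def __init__(self):
        super().__init__()
        self.posts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        raw = attrs.get("data-field")  # assumed attribute name, not confirmed
        if tag == "div" and "l_post" in classes and raw:
            # HTMLParser has already unescaped entities such as &quot;
            self.posts.append(json.loads(raw))


def extract_posts(html):
    """Return a list of dicts, one per post found in the page."""
    parser = PostFieldParser()
    parser.feed(html)
    parser.close()
    return parser.posts
```

Each returned dict should then expose fields such as `post["author"]["user_name"]` and `post["content"]["post_id"]`, matching the example above.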
Tieba Automatic Sign In Script
https://github.com/kikyous/tieba
https://github.com/skyline75489/baidu-tieba-auto-sign/blob/master/baidu-tieba-auto-sign.py
Apparently uses urllib to sign in automatically every day.
Tieba Bot
https://github.com/piglei/tieba_poster/blob/master/baidu_poster.py
A bot written in Python that posts to a tieba automatically.
Tieba Washer
https://github.com/tigerstudent/TiebaWasher
Spams tiebas with junk posts.
Sources
- Official Baidu Tieba SDK: http://pan.baidu.com/s/1pJ18AiJ