Baidu Tieba/Archiver
I can't work out how the API works just by looking at the site, so we need to figure it out by reverse engineering existing code. Thankfully, code is written mostly in English.
How to Grab
We will need to grab huge amounts of data from Baidu. Luckily, AWS has a Japan (Tokyo) region with good peering to the rest of Asia, and incoming bandwidth is free. One person reportedly used it to grab terabytes of data for $10:
http://blog.waleson.com/2016/01/parsing-10tb-of-metadata-26m-domains.html
This might get flagged as high traffic, so cook the frog slowly: start with a low request rate and increase it gradually.
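A minimal sketch of what "slowly" could look like in practice: long pauses between requests at first, shrinking gradually toward a floor. The function names, the default delays, and the decay factor are all assumptions made for illustration; nothing here reflects known Baidu rate limits.

```python
import time


def ramp_delays(start=10.0, floor=1.0, decay=0.9):
    """Yield per-request delays in seconds, shrinking gradually toward a floor."""
    delay = start
    while True:
        yield delay
        delay = max(floor, delay * decay)


def crawl_slowly(urls, fetch, sleep=time.sleep, delays=None):
    """Call fetch(url) for each URL, pausing after every request.

    fetch is a caller-supplied function (hypothetical here); sleep is
    injectable so the pacing can be checked without actually waiting.
    """
    delays = delays if delays is not None else ramp_delays()
    results = []
    for url, delay in zip(urls, delays):
        results.append(fetch(url))
        sleep(delay)
    return results
```

The pacing is kept separate from the fetching, so the same schedule can wrap whichever HTTP client ends up being used.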
Scrapers
This scraper supposedly records the contents of a thread:
https://github.com/omnbmh/baidu-tieba-capture/
Logging into Baidu
Gathering Tieba Threads
- URL format:
"http://tieba.baidu.com/p/%s?pn=%s" % (id, pn)
  - id: thread ID
  - pn: page number to display (the starting page number to crawl)
  - pages: total number of pages; increment pn up to this value
- Image URL (img src attribute):
"http://imgsrc.baidu.com/forum/"
  - Example:
http://imgsrc.baidu.com/forum/w%3D580/sign=457ea272292eb938ec6d7afae56385fe/632799504fc2d56294ac717ee11190ef77c66c98.jpg
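The URL format above can be turned into a small helper that lists every page of a thread. This is only a sketch: the function name is made up, and it assumes pn starts at 1 and that pages is an inclusive upper bound, neither of which is confirmed here.

```python
URL_FORMAT = "http://tieba.baidu.com/p/%s?pn=%s"


def thread_page_urls(thread_id, pages, start_pn=1):
    """Return the URL of every page from start_pn up to and including pages."""
    return [URL_FORMAT % (thread_id, pn) for pn in range(start_pn, pages + 1)]
```

For example, thread_page_urls(4089961953, 3) gives the pn=1, pn=2 and pn=3 URLs for that thread.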
Grab data only if the thread has not been deleted:
if (html.select("li.l_reply_num span").size() != 0) {
    // The thread has not been deleted; the second span holds the page count.
    pages = Integer.parseInt(html.select("li.l_reply_num span").get(1).text());
}
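The same check can be sketched in Python using only the standard library. The class and function names are hypothetical, and the page layout (a second span inside li.l_reply_num holding the page count) is taken from the Java snippet above rather than verified against the live site. Unlike that snippet, this version returns None when fewer than two spans are found instead of failing.

```python
from html.parser import HTMLParser


class ReplyNumParser(HTMLParser):
    """Collect the text of every <span> inside <li class="l_reply_num">."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0  # greater than 0 while inside a matching <li>
        self.in_span = False
        self.spans = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and (self.li_depth or "l_reply_num" in classes):
            self.li_depth += 1
        elif tag == "span" and self.li_depth:
            self.in_span = True
            self.spans.append("")

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth:
            self.li_depth -= 1
        elif tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.spans[-1] += data


def total_pages(page_html):
    """Return the page count, or None if the marker is missing (deleted thread)."""
    parser = ReplyNumParser()
    parser.feed(page_html)
    if len(parser.spans) < 2:
        return None
    return int(parser.spans[1].strip())
```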
Example post data (Chinese characters are Unicode-escaped), found under <div class="p_postlist" id="j_p_postlist"><div class="l_post l_post_bright j_l_post clearfix "
{"author":{"user_id":237987149,"user_name":"zangkuiyhq","props":{"1070002":{"num":8,"end_time":1403538299,"notice":0}}},"content":{"post_id":77078081563,"is_anonym":false,"forum_id":139226,"thread_id":4089961953,"content":"b22fce2a0165DIYba14066e261f.c0f427d23efb9812c5f53,.92742762ff60c38fdc6846fe24c.e7f44ad34.c34d34.5f9e8bd346f43a5390b7b.53e24b3bbe72.98116cb63.e0d981c4216ce73.97d97d70b70b4279c4.e0dee51eadf159c97dee552824be2d743229<c2814e,c2814e,c0f427e4b927fcc>.6f4e0d9816e0e3a015f97f6a22beba2d875f1eadf1.<br>72cd342a540d210e3aDIY427b9ee60c0f427.b9ee6071f927427709743229e0d01a7e56f43a526593a0a8684c0f427743229.<br>ee1708ee540e0a83ef0fdc3162fe0befbDIY927427e3b684019009ebae4be004e6.71ff850a86848683b0.<br><br><br>981c42bcf468728ebf5f65f4e0dc11e8e5929002<br>0fd19f7e5c0f427684e492a1002e76e140fdb6386efd0528c0f427743229002<br>1ea9c9e26934ef42a4427185b21e8f0020fd3ca5f6bf9a813d1e8bef6f5c1fa3cd620002<br>16cb63002b636f4002e0dee5743c0b9c1002<br><br><br>2a540d83
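A rough sketch of how such a blob could be decoded once it has been pulled out of the page. It assumes the JSON may arrive HTML-entity-escaped (for instance when copied from an attribute value) and that only the fields visible in the example above are of interest; the function name and the output layout are made up for illustration.

```python
import html
import json


def parse_post(raw):
    """Decode one post blob into a flat dict of the fields shown above."""
    # json.loads turns \uXXXX escapes back into Chinese characters.
    data = json.loads(html.unescape(raw))
    author, content = data["author"], data["content"]
    return {
        "user_id": author["user_id"],
        "user_name": author["user_name"],
        "post_id": content["post_id"],
        "thread_id": content["thread_id"],
        # Turn <br> line breaks into newlines; other markup is left untouched.
        "text": content["content"].replace("<br>", "\n"),
    }
```

Keeping the IDs as integers makes it easy to key stored posts by (thread_id, post_id).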
Tieba Automatic Sign In Script
https://github.com/kikyous/tieba
https://github.com/skyline75489/baidu-tieba-auto-sign/blob/master/baidu-tieba-auto-sign.py
Uses urllib to sign in automatically, apparently once a day.
Tieba Bot
https://github.com/piglei/tieba_poster/blob/master/baidu_poster.py
A bot written in Python that posts to Tieba automatically.
Tieba Washer
https://github.com/tigerstudent/TiebaWasher
Spams tiebas with junk posts.
Sources
- Official Baidu Tieba SDK: http://pan.baidu.com/s/1pJ18AiJ