Reversing bilibili's w_rid parameter

Recently, while scraping bilibili comments, I found that direct requests no longer work: they have to carry a w_rid parameter. So I started reverse-engineering it.

1. Analysis

First, open the developer tools on the comment page and locate the request; you can see its URL:

[screenshot: the comment request in DevTools]
[screenshot: the same request for the next page]

It is easy to spot that from the second request on, pagination_str carries a session_id (plus a few other fields) that must be sent along with subsequent requests.
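The escaping in pagination_str looks intimidating in DevTools, but it is just JSON nested inside JSON: the "offset" value is itself a JSON-encoded string. A minimal sketch (the session_id here is the placeholder value used later in the script; the real one comes from the previous response):

```python
import json

# Inner offset object; session_id is a placeholder, taken from the prior response in practice.
inner = {"type": 1, "direction": 1, "session_id": "1766149779423829", "data": {}}

# pagination_str is JSON whose "offset" value is itself a JSON *string*,
# which is why it appears double-escaped in DevTools.
pagination_str = json.dumps({"offset": json.dumps(inner, separators=(',', ':'))},
                            separators=(',', ':'))
print(pagination_str)
```

Decoding it twice with json.loads recovers the inner object, which is handy for pulling the next session_id out of a response.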
Setting a breakpoint on the signing function:

[screenshot: breakpoint inside the signing function]

you can see that y holds some of the request parameters, while a appears to be a fixed value. The function simply concatenates the two input strings y and a into one string and MD5-hashes it…
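In isolation, that concatenate-and-hash step is only a few lines (the a value is the fixed salt observed at the breakpoint; the y string below is a made-up toy input, just to show the output shape):

```python
import hashlib

def sign(y: str, a: str) -> str:
    # w_rid is just the hex MD5 digest of the concatenated string y + a
    return hashlib.md5((y + a).encode('utf-8')).hexdigest()

# Toy input, only to show the shape of the result (32 hex characters)
print(sign("mode=3&oid=1556265189&wts=1700000000",
           "ea1db124af3c7062474693fa704f4ff8"))
```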
Here is the full script:

```python
import requests
import hashlib
import time
import os
import csv
from hanlp_restful import HanLPClient

def md5_to_w_rid(y, a):
    # w_rid is simply the hex MD5 digest of the concatenation y + a
    input_string = y + a
    return hashlib.md5(input_string.encode('utf-8')).hexdigest()

def save_to_csv(s, uname, content, emotion):
    file_exists = os.path.exists('ba_analysis.csv')
    print('...writing to file')
    with open('ba_analysis.csv', 'a', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['sentiment', 'floor', 'username', 'comment']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        if not file_exists:
            writer.writeheader()

        writer.writerow({
            'sentiment': emotion,
            'floor': s,
            'username': uname,
            'comment': content
        })

def fetch_comments(oid, headers, cookies, s):
    a = 'ea1db124af3c7062474693fa704f4ff8'  # fixed salt observed at the breakpoint
    wts = int(time.time())
    session_id = "1766149779423829"  # taken from pagination_str in the previous response
    y = f'mode=3&oid={oid}&pagination_str=%7B%22offset%22%3A%22%7B%5C%22type%5C%22%3A1%2C%5C%22direction%5C%22%3A1%2C%5C%22session_id%5C%22%3A%5C%22{session_id}%5C%22%2C%5C%22data%5C%22%3A%7B%7D%7D%22%7D&plat=1&type=1&web_location=1315875&wts={wts}'
    w_rid = md5_to_w_rid(y, a)

    url = "https://api.bilibili.com/x/v2/reply/wbi/main"
    params = {
        "oid": oid,
        "type": "1",  # 17 for dynamic (post) comments, 1 for video comments
        "mode": "3",
        "plat": "1",
        "pagination_str": "{\"offset\":\"{\\\"type\\\":1,\\\"direction\\\":1,\\\"session_id\\\":\\\"" + session_id + "\\\",\\\"data\\\":{}}\"}",
        "web_location": "1315875",
        "w_rid": w_rid,
        "wts": wts
    }

    try:
        response = requests.get(url, headers=headers, cookies=cookies, params=params).json()
        comments = response.get('data', {}).get('replies', [])

        if not comments:
            print("No more comments.")
            return s, False

        HanLP = HanLPClient('https://www.hanlp.com/api', auth='', language='zh')

        for comment in comments:
            uname = comment['member']['uname']
            content = comment['content']['message']
            emotion = HanLP.sentiment_analysis(content)

            s += 1
            print(f'Floor {s}\nUser: {uname}\nComment: {content}\n')
            save_to_csv(s, uname, content, emotion)

        return s, True
    except Exception as e:
        print(f"Request failed: {e}")
        return s, False

# Main program
if __name__ == "__main__":
    oid = 1556265189  # oid of the video or post
    headers = {
        # fill in your own User-Agent etc.
    }
    cookies = {
        # fill in your logged-in cookies
    }

    s = 0  # floor counter
    for i in range(20):
        s, success = fetch_comments(oid, headers, cookies, s)
        if not success:
            break
        time.sleep(1)
```

[screenshot: successful response]

As you can see, requests now go through normally. Note that pagination is driven by the timestamp, so don't crawl too fast; keep an interval between requests.
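The hard-coded y string in the script works but is brittle if you change any parameter. Here is a sketch of building it programmatically, under the assumption (suggested by the layout of the hard-coded string, not verified against other endpoints) that the signed string is simply the URL-encoded query with keys sorted alphabetically and the current wts included; the salt is the fixed value from above:

```python
import hashlib
import time
from urllib.parse import urlencode, quote

SALT = 'ea1db124af3c7062474693fa704f4ff8'  # fixed value observed at the breakpoint

def signed_params(params: dict) -> dict:
    # Assumption: y is the URL-encoded query string with keys sorted
    # alphabetically, with the current timestamp wts appended.
    p = dict(params, wts=int(time.time()))
    y = urlencode(sorted(p.items()), quote_via=quote)
    p['w_rid'] = hashlib.md5((y + SALT).encode('utf-8')).hexdigest()
    return p
```

The returned dict can be passed straight to requests.get as params=, so w_rid and wts always match the parameters actually sent.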

2. Summary

All in all, this one wasn't hard (the hard ones are beyond me anyway…). The point of scraping the comments is to run sentiment analysis on them and get a rough read on the public opinion around a topic. For that I recommend HanLP: the other domestic APIs require real-name verification, which is a hassle, and they are less accurate than HanLP.