This post walks through a case study of crawling WeChat Official Account articles and their comments with Python. It isn't widely documented, so I'm sharing it as a reference; hopefully you'll come away with something useful by the end.
Background
WeChat Official Accounts feel like one of the harder platforms to crawl, but after some tinkering the results were decent. I didn't use Scrapy here (crawling too fast would probably trip anti-crawling limits anyway), though I plan to write up more hands-on pieces later. A quick rundown of the environment for this project:
python3
requests
psycopg2 (for working with the PostgreSQL database)
Capturing and analyzing the traffic
There's no restriction on which Official Account you target, but each account has to be analyzed before every crawl. Open Fiddler and configure the phone to use it as a proxy; to cut down on noise, add a filter rule in Fiddler so that only the WeChat domain mp.weixin.qq.com is shown:
Fiddler filter rule configuration
I follow quite a few accounts; this walkthrough uses the "36氪" (36Kr) account as the example. Read on:
The "36氪" Official Account
Account page, top-right corner -> 全部消息 (All Messages)
On the account's profile page, tap the three dots in the top-right corner to open the message view, scroll down and tap "全部消息" (All Messages), then pull down to load a few pages of historical articles. Back in Fiddler you should now see those requests. The responses are JSON, and the article data itself is embedded as a JSON string in the general_msg_list field:
Captured request for the article list
Analyzing the article-list API
Paste the request URL and Cookie here for analysis:
https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz=MzI2NDk5NzA0Mw==&f=json&offset=10&count=10&is_ok=1&scene=126&uin=777&key=777&pass_ticket=QhOypNwH5dAr5w6UgMjyBrTSOdMEUT86vWc73GANoziWFl8xJd1hIMbMZ82KgCpN&wxtoken=&appmsg_token=971_LwY7Z%252BFBoaEv5z8k_dFWfJkdySbNkMR4OmFxNw~~&x5=1&f=json

Cookie: pgv_pvid=2027337976; pgv_info=ssid=s3015512850; rewardsn=; wxtokenkey=777; wxuin=2089823341; devicetype=android-26; version=26070237; lang=zh_CN; pass_ticket=NDndxxaZ7p6Z9PYulWpLqMbI0i3ULFeCPIHBFu1sf5pX2IhkGfyxZ6b9JieSYRUy; wap_sid2=CO3YwOQHEogBQnN4VTNhNmxQWmc3UHI2U3kteWhUeVExZHFVMnN0QXlsbzVJRUJKc1pkdVFUU2Y5UzhSVEtOZmt1VVlYTkR4SEllQ2huejlTTThJWndMQzZfYUw2SldLVGVMQUthUjc3QWdVMUdoaGN0Nml2SU05cXR1dTN2RkhRUVd1V2Y3SFJ5d01BQUF+fjCB1pLcBTgNQJVO
The important parameters are explained below (anything not mentioned doesn't matter much); a minimal request sketch follows the list:
__biz: effectively the ID of the Official Account (a unique, fixed identifier)
offset: the paging offset of the article-list API (starting from 0). Each JSON response carries the offset to use for the next request; note that it does not grow by any fixed rule.
count: the number of items per request (10 is the maximum, from my testing)
pass_ticket: think of it as a request ticket; it expires after a while (roughly a few hours), which is one reason Official Accounts are hard to crawl on a fixed schedule
appmsg_token: likewise a non-fixed ticket with an expiry policy
Cookie: you can paste the whole thing, but at minimum only the wap_sid2 part is required
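Before writing the full crawler, a quick standalone request is a handy way to confirm the captured values still work. Here is a minimal sketch; the ticket and Cookie values are placeholders you would copy out of Fiddler, and the complete crawler later in the post wraps this same call:

import requests

biz = 'MzI2NDk5NzA0Mw=='                       # __biz of the target account
pass_ticket = '<pass_ticket from Fiddler>'     # expires after a few hours
appmsg_token = '<appmsg_token from Fiddler>'   # also expires
cookie = 'wap_sid2=<wap_sid2 from Fiddler>'

api = ('https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz={0}'
       '&f=json&offset={1}&count=10&is_ok=1&scene=124&uin=777&key=777'
       '&pass_ticket={2}&wxtoken=&appmsg_token={3}&x5=1&f=json').format(
    biz, 0, pass_ticket, appmsg_token)

resp = requests.get(api, headers={'Cookie': cookie, 'User-Agent': 'Mozilla/5.0'}).json()
# ret == 0 / errmsg == 'ok' means the tickets are still valid;
# next_offset is the offset to use for the next page request
print(resp.get('ret'), resp.get('errmsg'), resp.get('next_offset'))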
It may feel like a bit of a hassle, but since the goal isn't a large-scale, professional crawler, this per-account analysis is enough to keep going. Here's a trimmed piece of the JSON response, which we'll use to design the article table:
{ "ret": 0, "errmsg": "ok", "msg_count": 10, "can_msg_continue": 1, "general_msg_list": "{\"list\":[{\"comm_msg_info\":{\"id\":1000005700,\"type\":49,\"datetime\":1535100943,\"fakeid\":\"3264997043\",\"status\":2,\"content\":\"\"},\"app_msg_ext_info\":{\"title\":\"金融危機(jī)又十年:錢荒之下,二手基金迎來(lái)高光時(shí)刻\",\"digest\":\"退出永遠(yuǎn)是基金的主旋律。\",\"content\":\"\",\"fileid\":100034824,\"content_url\":\"http:\\/\\/mp.weixin.qq.com\\/s?__biz=MzI2NDk5NzA0Mw==&mid=2247518479&idx=1&sn=124ab52f7478c1069a6b4592cdf3c5f5&chksm=eaa6d8d3ddd151c5bb95a7ae118de6d080023246aa0a419e1d53bfe48a8d9a77e52b752d9b80&scene=27#wechat_redirect\",\"source_url\":\"\",\"cover\":\"http:\\/\\/mmbiz.qpic.cn\\/mmbiz_jpg\\/QicyPhNHD5vYgdpprkibtnWCAN7l4ZaqibKvopNyCWWLQAwX7QpzWicnQSVfcBZmPrR5YuHS45JIUzVjb0dZTiaLPyA\\/0?wx_fmt=jpeg\",\"subtype\":9,\"is_multi\":0,\"multi_app_msg_item_list\":[],\"author\":\"石亞瓊\",\"copyright_stat\":11,\"duration\":0,\"del_flag\":1,\"item_show_type\":0,\"audio_fileid\":0,\"play_url\":\"\",\"malicious_title_reason_id\":0,\"malicious_content_type\":0}}]}", "next_offset": 20, "video_count": 1, "use_video_tab": 1, "real_type": 0 }
The fields we want are easy to pull out. The article table is defined as follows, together with the SQL to create it:
Article table structure
-- ----------------------------
-- Table structure for tb_article
-- ----------------------------
DROP TABLE IF EXISTS "public"."tb_article";
CREATE TABLE "public"."tb_article" (
  "id" serial4 PRIMARY KEY,
  "msg_id" int8 NOT NULL,
  "title" varchar(200) COLLATE "pg_catalog"."default" NOT NULL,
  "author" varchar(20) COLLATE "pg_catalog"."default",
  "cover" varchar(500) COLLATE "pg_catalog"."default",
  "digest" varchar(200) COLLATE "pg_catalog"."default",
  "source_url" varchar(800) COLLATE "pg_catalog"."default",
  "content_url" varchar(600) COLLATE "pg_catalog"."default" NOT NULL,
  "post_time" timestamp(6),
  "create_time" timestamp(6) NOT NULL
);
COMMENT ON COLUMN "public"."tb_article"."id" IS 'auto-increment primary key';
COMMENT ON COLUMN "public"."tb_article"."msg_id" IS 'message id (unique)';
COMMENT ON COLUMN "public"."tb_article"."title" IS 'title';
COMMENT ON COLUMN "public"."tb_article"."author" IS 'author';
COMMENT ON COLUMN "public"."tb_article"."cover" IS 'cover image';
COMMENT ON COLUMN "public"."tb_article"."digest" IS 'digest / keywords';
COMMENT ON COLUMN "public"."tb_article"."source_url" IS 'original source URL';
COMMENT ON COLUMN "public"."tb_article"."content_url" IS 'article URL';
COMMENT ON COLUMN "public"."tb_article"."post_time" IS 'publish time';
COMMENT ON COLUMN "public"."tb_article"."create_time" IS 'insert time';
COMMENT ON TABLE "public"."tb_article" IS 'Official Account article table';

-- ----------------------------
-- Indexes structure for table tb_article
-- ----------------------------
CREATE UNIQUE INDEX "unique_msg_id" ON "public"."tb_article" USING btree (
  "msg_id" "pg_catalog"."int8_ops" ASC NULLS LAST
);
Here is the code that requests the article-list API, parses the response, and saves the data to the database:
import json
import time
from datetime import datetime

import requests

from utils import pgs


class WxMps(object):
    """WeChat Official Account article & comment crawler"""

    def __init__(self, _biz, _pass_ticket, _app_msg_token, _cookie, _offset=0):
        self.offset = _offset
        self.biz = _biz  # Official Account identifier
        self.msg_token = _app_msg_token  # ticket (not fixed)
        self.pass_ticket = _pass_ticket  # ticket (not fixed)
        self.headers = {
            'Cookie': _cookie,  # Cookie (not fixed)
            'User-Agent': 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 '
        }
        wx_mps = 'wxmps'  # database name, user and password all the same here (replace with your own)
        self.postgres = pgs.Pgs(host='localhost', port='5432', db_name=wx_mps, user=wx_mps, password=wx_mps)

    def start(self):
        """Request the article-list API of the Official Account"""
        offset = self.offset
        while True:
            api = 'https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz={0}&f=json&offset={1}' \
                  '&count=10&is_ok=1&scene=124&uin=777&key=777&pass_ticket={2}&wxtoken=&appmsg_token' \
                  '={3}&x5=1&f=json'.format(self.biz, offset, self.pass_ticket, self.msg_token)

            resp = requests.get(api, headers=self.headers).json()
            ret, status = resp.get('ret'), resp.get('errmsg')  # status info
            if ret == 0 or status == 'ok':
                print('Crawl article: ' + api)
                offset = resp['next_offset']  # offset for the next request
                general_msg_list = resp['general_msg_list']
                msg_list = json.loads(general_msg_list)['list']  # article list

                for msg in msg_list:
                    comm_msg_info = msg['comm_msg_info']  # data shared by all articles in this push
                    msg_id = comm_msg_info['id']  # article id
                    post_time = datetime.fromtimestamp(comm_msg_info['datetime'])  # publish time
                    # msg_type = comm_msg_info['type']  # article type
                    # msg_data = json.dumps(comm_msg_info, ensure_ascii=False)  # raw msg data

                    app_msg_ext_info = msg.get('app_msg_ext_info')  # raw article data
                    if app_msg_ext_info:
                        # the first article of this push
                        self._parse_articles(app_msg_ext_info, msg_id, post_time)

                        # the remaining articles of this push
                        multi_app_msg_item_list = app_msg_ext_info.get('multi_app_msg_item_list')
                        if multi_app_msg_item_list:
                            for item in multi_app_msg_item_list:
                                msg_id = item['fileid']  # article id
                                if msg_id == 0:
                                    msg_id = int(time.time() * 1000)  # generate a unique id; some articles have id=0, which would break the unique index
                                self._parse_articles(item, msg_id, post_time)
                print('next offset is %d' % offset)
            else:
                print('Before break , Current offset is %d' % offset)
                break

    def _parse_articles(self, info, msg_id, post_time):
        """Parse the nested article data and save it to the database"""
        title = info.get('title')  # title
        cover = info.get('cover')  # cover image
        author = info.get('author')  # author
        digest = info.get('digest')  # digest / keywords
        source_url = info.get('source_url')  # original source URL
        content_url = info.get('content_url')  # WeChat article URL
        # ext_data = json.dumps(info, ensure_ascii=False)  # raw data

        self.postgres.handler(self._save_article(), (msg_id, title, author, cover, digest,
                                                     source_url, content_url, post_time,
                                                     datetime.now()), fetch=True)

    @staticmethod
    def _save_article():
        sql = 'insert into tb_article(msg_id,title,author,cover,digest,source_url,content_url,post_time,create_time) ' \
              'values(%s,%s,%s,%s,%s,%s,%s,%s,%s)'
        return sql


if __name__ == '__main__':
    biz = 'MzI2NDk5NzA0Mw=='  # "36氪"
    pass_ticket = 'NDndxxaZ7p6Z9PYulWpLqMbI0i3ULFeCPIHBFu1sf5pX2IhkGfyxZ6b9JieSYRUy'
    app_msg_token = '971_Z0lVNQBcGsWColSubRO9H13ZjrPhjuljyxLtiQ~~'
    cookie = 'wap_sid2=CO3YwOQHEogBQnN4VTNhNmxQWmc3UHI2U3kteWhUeVExZHFVMnN0QXlsbzVJRUJKc1pkdVFUU2Y5UzhSVEtOZmt1VVlYTkR4SEllQ2huejlTTThJWndMQzZfYUw2SldLVGVMQUthUjc3QWdVMUdoaGN0Nml2SU05cXR1dTN2RkhRUVd1V2Y3SFJ5d01BQUF+fjCB1pLcBTgNQJVO'
    # the values above have to be refreshed with a packet-capture tool for each account and each crawl
    wxMps = WxMps(biz, pass_ticket, app_msg_token, cookie)
    wxMps.start()  # start crawling articles
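A note on the database layer: pgs.Pgs imported from utils is a small custom database helper that isn't shown in the post (psycopg2 is the underlying driver, per the environment list). A minimal sketch of what such a helper could look like, assuming the handler(sql, params, fetch) interface used above; duplicate-key handling and connection pooling are omitted:

import psycopg2


class Pgs(object):
    """Minimal psycopg2 wrapper matching the handler() calls above (a sketch, not the original utils.pgs)."""

    def __init__(self, host, port, db_name, user, password):
        self.conn = psycopg2.connect(host=host, port=port, dbname=db_name,
                                     user=user, password=password)

    def handler(self, sql, params, fetch=False):
        # Execute one statement; if fetch is True and the statement returns rows
        # (e.g. an INSERT ... RETURNING id), hand back the first column of the first row.
        with self.conn.cursor() as cur:
            cur.execute(sql, params)
            row = cur.fetchone() if fetch and cur.description else None
            self.conn.commit()
            return row[0] if row else None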
Analyzing the comment API
Fetching the comments works roughly the same way, just with a bit more hassle. First open an article that has comments on the phone, then look at the request captured by Fiddler:
Official Account article comments
Captured request for the article comment API
Extract the URL and Cookie from it and analyze again:
https://mp.weixin.qq.com/mp/appmsg_comment?action=getcomment&scene=0&__biz=MzI2NDk5NzA0Mw==&appmsgid=2247518723&idx=1&comment_id=433253969406607362&offset=0&limit=100&uin=777&key=777&pass_ticket=NDndxxaZ7p6Z9PYulWpLqMbI0i3ULFeCPIHBFu1sf5pX2IhkGfyxZ6b9JieSYRUy&wxtoken=777&devicetype=android-26&clientversion=26070237&appmsg_token=971_dLK7htA1j8LbMUk8pvJKRlC_o218HEgwDbS9uARPOyQ34_vfXv3iDstqYnq2gAyze1dBKm4ZMTlKeyfx&x5=1&f=json

Cookie: pgv_pvid=2027337976; pgv_info=ssid=s3015512850; rewardsn=; wxuin=2089823341; devicetype=android-26; version=26070237; lang=zh_CN; pass_ticket=NDndxxaZ7p6Z9PYulWpLqMbI0i3ULFeCPIHBFu1sf5pX2IhkGfyxZ6b9JieSYRUy; wap_sid2=CO3YwOQHEogBdENPSVdaS3pHOWc1V2QzY1NvZG9PYk1DMndPS3NfbGlHM0Vfal8zLU9kcUdkWTQxdUYwckFBT3RZM1VYUXFaWkFad3NVaWFXZ28zbEFIQ2pTa1lqZktfb01vcGdPLTQ0aGdJQ2xOSXoxTVFvNUg3SVpBMV9GRU1lbnotci1MWWl5d01BQUF+fjCj45PcBTgNQAE=; wxtokenkey=777
Now for the parameters (a short request sketch follows the list):
__biz: same as above
pass_ticket: same as above
Cookie: same as above
offset and limit: the offset and page size; since an article shows at most 100 comments, these two can be left alone
comment_id: the id used to fetch this article's comment data; fixed, but it has to be extracted from the article page's HTML
appmsgid: a ticket id; not fixed, it has to be extracted from the article page's HTML each time
appmsg_token: a ticket token; not fixed, it has to be extracted from the article page's HTML each time
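For reference, here is a minimal sketch of this comment request, with placeholders standing in for the tickets and for the three HTML-derived values; the complete crawler at the end of the post builds the same URL:

import requests

biz = 'MzI2NDk5NzA0Mw=='
pass_ticket = '<pass_ticket from Fiddler>'
cookie = 'wap_sid2=<wap_sid2 from Fiddler>'
app_msg_id = '<appmsgid parsed from the article HTML>'
comment_id = '<comment_id parsed from the article HTML>'
appmsg_token = '<appmsg_token parsed from the article HTML>'

api = ('https://mp.weixin.qq.com/mp/appmsg_comment?action=getcomment&scene=0'
       '&__biz={0}&appmsgid={1}&idx=1&comment_id={2}&offset=0&limit=100'
       '&uin=777&key=777&pass_ticket={3}&wxtoken=777&devicetype=android-26'
       '&clientversion=26060739&appmsg_token={4}&x5=1&f=json').format(
    biz, app_msg_id, comment_id, pass_ticket, appmsg_token)

resp = requests.get(api, headers={'Cookie': cookie, 'User-Agent': 'Mozilla/5.0'}).json()
for comment in resp.get('elected_comment', []):  # the top comments shown under the article
    print(comment.get('nick_name'), comment.get('like_num'), comment.get('content'))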
That leaves the last three parameters, which have to be parsed out of the article's HTML (it honestly took me a long time before I thought of looking at the page markup). The article-list API already gives us the article URL in the content_url field, but the URL needs some cleanup before requesting it, otherwise those three parameters will be missing and no comments can be fetched:
def _parse_article_detail(self, content_url, article_id):
    """Extract from the article page the parameters needed to fetch comments; article_id is the saved article's id"""
    try:
        api = content_url.replace('amp;', '').replace('#wechat_redirect', '').replace('http', 'https')
        html = requests.get(api, headers=self.headers).text
    except:
        print('Failed to fetch comments: ' + content_url)
    else:
        # group(0) is the whole matched line
        str_comment = re.search(r'var comment_id = "(.*)" \|\| "(.*)" \* 1;', html)
        str_msg = re.search(r"var appmsgid = '' \|\| '(.*)'\|\|", html)
        str_token = re.search(r'window.appmsg_token = "(.*)";', html)

        if str_comment and str_msg and str_token:
            comment_id = str_comment.group(1)  # comment id (fixed)
            app_msg_id = str_msg.group(1)  # ticket id (not fixed)
            appmsg_token = str_token.group(1)  # ticket token (not fixed)
Now look at the JSON this comment API returns; after working out its structure, define the comment table (SQL included):
Article comment table structure
-- ----------------------------
-- Table structure for tb_article_comment
-- ----------------------------
DROP TABLE IF EXISTS "public"."tb_article_comment";
CREATE TABLE "public"."tb_article_comment" (
  "id" serial4 PRIMARY KEY,
  "article_id" int4 NOT NULL,
  "comment_id" varchar(50) COLLATE "pg_catalog"."default",
  "nick_name" varchar(50) COLLATE "pg_catalog"."default" NOT NULL,
  "logo_url" varchar(300) COLLATE "pg_catalog"."default",
  "content_id" varchar(50) COLLATE "pg_catalog"."default" NOT NULL,
  "content" varchar(3000) COLLATE "pg_catalog"."default" NOT NULL,
  "like_num" int2,
  "comment_time" timestamp(6),
  "create_time" timestamp(6) NOT NULL
);
COMMENT ON COLUMN "public"."tb_article_comment"."id" IS 'auto-increment primary key';
COMMENT ON COLUMN "public"."tb_article_comment"."article_id" IS 'article foreign key id';
COMMENT ON COLUMN "public"."tb_article_comment"."comment_id" IS 'comment API id';
COMMENT ON COLUMN "public"."tb_article_comment"."nick_name" IS 'user nickname';
COMMENT ON COLUMN "public"."tb_article_comment"."logo_url" IS 'avatar URL';
COMMENT ON COLUMN "public"."tb_article_comment"."content_id" IS 'comment id (unique)';
COMMENT ON COLUMN "public"."tb_article_comment"."content" IS 'comment content';
COMMENT ON COLUMN "public"."tb_article_comment"."like_num" IS 'number of likes';
COMMENT ON COLUMN "public"."tb_article_comment"."comment_time" IS 'comment time';
COMMENT ON COLUMN "public"."tb_article_comment"."create_time" IS 'insert time';
COMMENT ON TABLE "public"."tb_article_comment" IS 'Official Account article comment table';

-- ----------------------------
-- Indexes structure for table tb_article_comment
-- ----------------------------
CREATE UNIQUE INDEX "unique_content_id" ON "public"."tb_article_comment" USING btree (
  "content_id" COLLATE "pg_catalog"."default" "pg_catalog"."text_ops" ASC NULLS LAST
);
The long march is almost over. Here's the final code; since the comments need the article URL first, it includes the article-crawling code from above:
import json
import re
import time
from datetime import datetime

import requests

from utils import pgs


class WxMps(object):
    """WeChat Official Account article & comment crawler"""

    def __init__(self, _biz, _pass_ticket, _app_msg_token, _cookie, _offset=0):
        self.offset = _offset
        self.biz = _biz  # Official Account identifier
        self.msg_token = _app_msg_token  # ticket (not fixed)
        self.pass_ticket = _pass_ticket  # ticket (not fixed)
        self.headers = {
            'Cookie': _cookie,  # Cookie (not fixed)
            'User-Agent': 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 '
        }
        wx_mps = 'wxmps'  # database name, user and password all the same here (replace with your own)
        self.postgres = pgs.Pgs(host='localhost', port='5432', db_name=wx_mps, user=wx_mps, password=wx_mps)

    def start(self):
        """Request the article-list API of the Official Account"""
        offset = self.offset
        while True:
            api = 'https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz={0}&f=json&offset={1}' \
                  '&count=10&is_ok=1&scene=124&uin=777&key=777&pass_ticket={2}&wxtoken=&appmsg_token' \
                  '={3}&x5=1&f=json'.format(self.biz, offset, self.pass_ticket, self.msg_token)

            resp = requests.get(api, headers=self.headers).json()
            ret, status = resp.get('ret'), resp.get('errmsg')  # status info
            if ret == 0 or status == 'ok':
                print('Crawl article: ' + api)
                offset = resp['next_offset']  # offset for the next request
                general_msg_list = resp['general_msg_list']
                msg_list = json.loads(general_msg_list)['list']  # article list

                for msg in msg_list:
                    comm_msg_info = msg['comm_msg_info']  # data shared by all articles in this push
                    msg_id = comm_msg_info['id']  # article id
                    post_time = datetime.fromtimestamp(comm_msg_info['datetime'])  # publish time
                    # msg_type = comm_msg_info['type']  # article type
                    # msg_data = json.dumps(comm_msg_info, ensure_ascii=False)  # raw msg data

                    app_msg_ext_info = msg.get('app_msg_ext_info')  # raw article data
                    if app_msg_ext_info:
                        # the first article of this push
                        self._parse_articles(app_msg_ext_info, msg_id, post_time)

                        # the remaining articles of this push
                        multi_app_msg_item_list = app_msg_ext_info.get('multi_app_msg_item_list')
                        if multi_app_msg_item_list:
                            for item in multi_app_msg_item_list:
                                msg_id = item['fileid']  # article id
                                if msg_id == 0:
                                    msg_id = int(time.time() * 1000)  # generate a unique id; some articles have id=0, which would break the unique index
                                self._parse_articles(item, msg_id, post_time)
                print('next offset is %d' % offset)
            else:
                print('Before break , Current offset is %d' % offset)
                break

    def _parse_articles(self, info, msg_id, post_time):
        """Parse the nested article data and save it to the database"""
        title = info.get('title')  # title
        cover = info.get('cover')  # cover image
        author = info.get('author')  # author
        digest = info.get('digest')  # digest / keywords
        source_url = info.get('source_url')  # original source URL
        content_url = info.get('content_url')  # WeChat article URL
        # ext_data = json.dumps(info, ensure_ascii=False)  # raw data

        content_url = content_url.replace('amp;', '').replace('#wechat_redirect', '').replace('http', 'https')
        article_id = self.postgres.handler(self._save_article(), (msg_id, title, author, cover, digest,
                                                                  source_url, content_url, post_time,
                                                                  datetime.now()), fetch=True)
        if article_id:
            self._parse_article_detail(content_url, article_id)

    def _parse_article_detail(self, content_url, article_id):
        """Extract from the article page the parameters needed to fetch comments; article_id is the saved article's id"""
        try:
            html = requests.get(content_url, headers=self.headers).text
        except:
            print('Failed to fetch comments: ' + content_url)
        else:
            # group(0) is the whole matched line
            str_comment = re.search(r'var comment_id = "(.*)" \|\| "(.*)" \* 1;', html)
            str_msg = re.search(r"var appmsgid = '' \|\| '(.*)'\|\|", html)
            str_token = re.search(r'window.appmsg_token = "(.*)";', html)

            if str_comment and str_msg and str_token:
                comment_id = str_comment.group(1)  # comment id (fixed)
                app_msg_id = str_msg.group(1)  # ticket id (not fixed)
                appmsg_token = str_token.group(1)  # ticket token (not fixed)

                # all three are required
                if appmsg_token and app_msg_id and comment_id:
                    print('Crawl article comments: ' + content_url)
                    self._crawl_comments(app_msg_id, comment_id, appmsg_token, article_id)

    def _crawl_comments(self, app_msg_id, comment_id, appmsg_token, article_id):
        """Crawl the comments of an article"""
        api = 'https://mp.weixin.qq.com/mp/appmsg_comment?action=getcomment&scene=0&__biz={0}' \
              '&appmsgid={1}&idx=1&comment_id={2}&offset=0&limit=100&uin=777&key=777' \
              '&pass_ticket={3}&wxtoken=777&devicetype=android-26&clientversion=26060739' \
              '&appmsg_token={4}&x5=1&f=json'.format(self.biz, app_msg_id, comment_id,
                                                     self.pass_ticket, appmsg_token)

        resp = requests.get(api, headers=self.headers).json()
        ret, status = resp['base_resp']['ret'], resp['base_resp']['errmsg']
        if ret == 0 or status == 'ok':
            elected_comment = resp['elected_comment']
            for comment in elected_comment:
                nick_name = comment.get('nick_name')  # nickname
                logo_url = comment.get('logo_url')  # avatar
                comment_time = datetime.fromtimestamp(comment.get('create_time'))  # comment time
                content = comment.get('content')  # comment content
                content_id = comment.get('content_id')  # id
                like_num = comment.get('like_num')  # number of likes
                # reply_list = comment.get('reply')['reply_list']  # replies to the comment

                self.postgres.handler(self._save_article_comment(), (article_id, comment_id, nick_name,
                                                                     logo_url, content_id, content,
                                                                     like_num, comment_time,
                                                                     datetime.now()))

    @staticmethod
    def _save_article():
        sql = 'insert into tb_article(msg_id,title,author,cover,digest,source_url,content_url,post_time,create_time) ' \
              'values(%s,%s,%s,%s,%s,%s,%s,%s,%s) returning id'
        return sql

    @staticmethod
    def _save_article_comment():
        sql = 'insert into tb_article_comment(article_id,comment_id,nick_name,logo_url,content_id,content,like_num,' \
              'comment_time,create_time) values(%s,%s,%s,%s,%s,%s,%s,%s,%s)'
        return sql


if __name__ == '__main__':
    biz = 'MzI2NDk5NzA0Mw=='  # "36氪"
    pass_ticket = 'NDndxxaZ7p6Z9PYulWpLqMbI0i3ULFeCPIHBFu1sf5pX2IhkGfyxZ6b9JieSYRUy'
    app_msg_token = '971_Z0lVNQBcGsWColSubRO9H13ZjrPhjuljyxLtiQ~~'
    cookie = 'wap_sid2=CO3YwOQHEogBQnN4VTNhNmxQWmc3UHI2U3kteWhUeVExZHFVMnN0QXlsbzVJRUJKc1pkdVFUU2Y5UzhSVEtOZmt1VVlYTkR4SEllQ2huejlTTThJWndMQzZfYUw2SldLVGVMQUthUjc3QWdVMUdoaGN0Nml2SU05cXR1dTN2RkhRUVd1V2Y3SFJ5d01BQUF+fjCB1pLcBTgNQJVO'
    # the values above have to be refreshed with a packet-capture tool for each account and each crawl
    wxMps = WxMps(biz, pass_ticket, app_msg_token, cookie)
    wxMps.start()  # start crawling articles and comments
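One practical note that isn't part of the code above: start() fires page requests back to back, and as mentioned in the background section, crawling too fast is likely to trip anti-crawling limits. A small random pause at the end of each loop iteration is a cheap safeguard:

import random
import time

# inside the while-loop in start(), after a page of articles has been processed:
time.sleep(random.uniform(2, 5))  # wait a few seconds before requesting the next offset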
Wrapping up
Finally, a look at the data that ended up in the database. Single-threaded crawling is slow, and I don't actually have any need for this data, so this was really just a quick exercise:
That's everything in this case study of crawling WeChat Official Account articles and comments with Python. Thanks for reading; hopefully the walkthrough was helpful. For more, follow the 億速云 industry news channel.
免責(zé)聲明:本站發(fā)布的內(nèi)容(圖片、視頻和文字)以原創(chuàng)、轉(zhuǎn)載和分享為主,文章觀點(diǎn)不代表本網(wǎng)站立場(chǎng),如果涉及侵權(quán)請(qǐng)聯(lián)系站長(zhǎng)郵箱:is@yisu.com進(jìn)行舉報(bào),并提供相關(guān)證據(jù),一經(jīng)查實(shí),將立刻刪除涉嫌侵權(quán)內(nèi)容。