Scrapy was installed in the previous section; now let's build a first test program.
Scrapy is a crawler framework. Its basic workflow is illustrated by the diagram below (taken from the internet): the engine takes requests from the scheduler, the downloader fetches the pages, the spider parses each response into items, and the item pipelines post-process them.
In short, we need an items file that defines the data structure to be returned, a spider file containing the actual crawling logic, and a pipeline file for follow-up work such as saving the data.
Let's see how to put this together, using dangdang.com as an example.
In this example I want to crawl the first 20 pages of down-jacket listings, collecting each product's title, link, and comment count. First create the project and generate a spider skeleton:
scrapy startproject dangdang
scrapy genspider -t basic dd dangdang.com
These two commands automatically create the spider scaffolding; the files we care about are shown below.
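For reference, a project generated this way typically has the following layout (details can vary slightly between Scrapy versions):

dangdang/
    scrapy.cfg
    dangdang/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            dd.py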
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    comment = scrapy.Field()
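An Item behaves like a dict with a fixed set of keys. A quick hand-filled sketch (the values here are made-up placeholders, not scraped data):

from dangdang.items import DangdangItem

item = DangdangItem()
item['title'] = ['sample title']        # placeholder, not real data
item['url'] = ['http://example.com']    # placeholder
item['comment'] = ['100']               # placeholder
print(item['title'])                    # fields read back like dict entries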
The genspider command in step two already generated a spider template, so we only need to fill it in.
dd.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from dangdang.items import DangdangItem


class DdSpider(scrapy.Spider):
    name = 'dd'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://category.dangdang.com/pg1-cid4010275.html']

    def parse(self, response):
        item = DangdangItem()
        # The product links carry a dd_name attribute we can anchor on
        item['title'] = response.xpath("//a[@dd_name='單品標(biāo)題']/text()").extract()
        item['url'] = response.xpath("//a[@dd_name='單品標(biāo)題']/@href").extract()
        item['comment'] = response.xpath("//a[@dd_name='單品評論']/text()").extract()
        yield item
        # Queue pages 2-20 (page 1 is the start URL); Scrapy's duplicate
        # filter drops the re-queued URLs on later calls
        for i in range(2, 21):
            url = 'http://category.dangdang.com/pg%d-cid4010275.html' % i
            yield Request(url, callback=self.parse)
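Before running the full crawl, the XPath expressions can be verified interactively. scrapy shell fetches a page and drops you into a Python prompt with response already populated (the selectors simply return empty lists if dangdang has changed its markup since this was written):

scrapy shell "http://category.dangdang.com/pg1-cid4010275.html"
>>> response.xpath("//a[@dd_name='單品標(biāo)題']/text()").extract()[:3]
>>> response.xpath("//a[@dd_name='單品評論']/text()").extract()[:3]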
To enable the pipeline, the settings file needs a small change; while there, I also turned off the robots.txt check.
settings.py
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,
}
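The value 300 is the pipeline's priority: Scrapy accepts integers from 0 to 1000 here, and pipelines with lower numbers run first. With a single pipeline, any value in that range works.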
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class DangdangPipeline(object):
    def process_item(self, item, spider):
        conn = pymysql.connect(host='127.0.0.1', user='root', passwd='root',
                               db='dangdang', use_unicode=True, charset='utf8')
        cursor = conn.cursor()
        for i in range(len(item['title'])):
            title = item['title'][i]
            link = item['url'][i]
            comment = item['comment'][i]
            # Parameterized query: quotes in a title no longer break the insert
            sql = "insert into dd(title,link,comment) values (%s,%s,%s)"
            try:
                cursor.execute(sql, (title, link, comment))
            except Exception:
                pass
        conn.commit()   # pymysql does not autocommit by default
        conn.close()
        return item
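Opening a fresh MySQL connection for every scraped item works, but it is wasteful. A minimal alternative sketch (not from the original post) that reuses one connection through Scrapy's open_spider/close_spider hooks:

import pymysql


class DangdangPipeline(object):
    def open_spider(self, spider):
        # One connection for the whole crawl instead of one per item
        self.conn = pymysql.connect(host='127.0.0.1', user='root', passwd='root',
                                    db='dangdang', charset='utf8')
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        for title, link, comment in zip(item['title'], item['url'], item['comment']):
            self.cursor.execute(
                "insert into dd(title,link,comment) values (%s,%s,%s)",
                (title, link, comment))
        return item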
我最后的數(shù)據(jù)要保存到mysql里面,python里面可以通過pymysql進行操作。我提前在mysql命令行界面里面創(chuàng)建了一個數(shù)據(jù)庫和空表
mysql> create database dangdang;
mysql> create table dd(id int auto_increment primary key, title varchar(100), link varchar(100), comment varchar(32));
With everything in place, run the spider from the project directory:
scrapy crawl dd
If you don't want to see the log output, use:
scrapy crawl dd --nolog
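Once the crawl finishes, a quick row count in MySQL shows whether anything was actually saved:

mysql> select count(*) from dd;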
test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Yuan Li
import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', passwd='root',
                       db='dangdang', use_unicode=True, charset='utf8')
# DictCursor returns each row as a dict instead of a tuple
cursor = conn.cursor(cursor=pymysql.cursors.DictCursor)
# Query everything the crawl inserted
cursor.execute("select * from dd")
rows = cursor.fetchall()
for row in rows:
    print(row)
conn.close()
The test succeeds: the stored rows print as expected.