#Python Crawler# -- Scrapy's Item Pipeline Component (ITEM_PIPELINES)

雨橙


This post uses the most popular movie reviews on Douban as the example.
Generate the spider template code:
scrapy genspider -t basic douban https://movie.douban.com/review/best/
 
The completed spider code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from ..items import DoubandemoItem

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/review/best/']

    def parse(self, response):
        # No reload(sys)/setdefaultencoding hack: it only exists on Python 2,
        # and Python 3 strings are already Unicode.
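        # Each review on the page sits in a div with class "main-bd".
        for sel in response.xpath("//div[@class='main-bd']"):
            # extract_first() returns the default instead of raising
            # IndexError when the XPath matches nothing.
            title = sel.xpath("h2/a/text()").extract_first(default="")
            content = sel.xpath("div[@class='review-short']//div[@class='short-content']//text()").extract_first(default="")

            dbitem = DoubandemoItem()
            dbitem["title"] = title.strip()
            dbitem["content"] = content.strip()
            yield dbitem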
 
XPath is used to extract title and content, which are then stored in the Item.
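To make the extraction step easier to follow without hitting the live site, here is a minimal sketch that applies similar XPath expressions to a small, made-up HTML fragment. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors (ElementTree only supports a subset of XPath, so findtext replaces text()). The SAMPLE markup and the extract_reviews helper are hypothetical and far simpler than the real Douban page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page structure is more complex.
SAMPLE = """
<html><body>
  <div class="main-bd">
    <h2><a href="#"> Review one </a></h2>
    <div class="review-short"><div class="short-content"> First text </div></div>
  </div>
  <div class="main-bd">
    <h2><a href="#">Review two</a></h2>
    <div class="review-short"><div class="short-content"></div></div>
  </div>
</body></html>
"""


def extract_reviews(markup):
    """Return a list of {'title', 'content'} dicts, one per review block."""
    root = ET.fromstring(markup)
    reviews = []
    for block in root.iterfind(".//div[@class='main-bd']"):
        # findtext returns the default when the path matches nothing.
        title = block.findtext("h2/a", default="")
        content = block.findtext(
            "div[@class='review-short']//div[@class='short-content']", default=""
        )
        reviews.append({"title": title.strip(), "content": content.strip()})
    return reviews


for review in extract_reviews(SAMPLE):
    print(review)
```

The second block has an empty content field, which is the kind of item the filtering pipeline in this post is meant to drop.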
 
The complete code of items.py is as follows:
# -*- coding: utf-8 -*-
import scrapy

class DoubandemoItem(scrapy.Item):    
    title = scrapy.Field()
    content = scrapy.Field()

 
The complete code of pipelines.py is as follows:
# -*- coding: utf-8 -*-

import json
from scrapy.exceptions import DropItem

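class DoubandemotojsonPipeline(object):
    def open_spider(self, spider):
        # Open in text mode so that str lines can be written on Python 3.
        self.file = open("db.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # Close the file when the spider finishes so the data is flushed.
        self.file.close()

    def process_item(self, item, spider):
        # One JSON object per line (JSON Lines); keep Chinese text readable.
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item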


class DoubandemotonullPipeline(object):
    def process_item(self, item, spider):
        if item["content"]:
            return item
        else:
            # The item is dropped because its content is empty, not because it is a duplicate.
            raise DropItem("Missing content in %s" % item)
 
The Item Pipeline configuration in settings.py is as follows:
ITEM_PIPELINES = {
   'DoubanDemo.pipelines.DoubandemotojsonPipeline': 300,
   'DoubanDemo.pipelines.DoubandemotonullPipeline': 400,
}

 
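Items pass through the pipelines in ascending order of these numbers (lower values run first); by convention the numbers are defined in the 0-1000 range.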
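The ordering rule can be illustrated with a few lines of plain Python. This is only a sketch that imitates the rule by sorting the setting by its numbers; it is not Scrapy's internal code, and pipeline_order is a hypothetical helper name.

```python
# Priorities copied from the ITEM_PIPELINES setting in settings.py.
ITEM_PIPELINES = {
    'DoubanDemo.pipelines.DoubandemotojsonPipeline': 300,
    'DoubanDemo.pipelines.DoubandemotonullPipeline': 400,
}


def pipeline_order(setting):
    """Return the pipeline class paths sorted by ascending priority number."""
    return [path for path, priority in sorted(setting.items(), key=lambda kv: kv[1])]


for position, path in enumerate(pipeline_order(ITEM_PIPELINES), start=1):
    print(position, path)
```

With these values the JSON-writing pipeline (300) sees every item before the filtering pipeline (400) does.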
 
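The first pipeline runs first and writes every item to the file db.jl.
Then run the command scrapy crawl douban -o douban.json to export the filtered items, that is, those that were not dropped by the second pipeline.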
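Why the two outputs differ can be seen in a small simulation that runs without Scrapy. It is a rough sketch under stated assumptions: DropItem here is a local stand-in for scrapy.exceptions.DropItem, the class names ToJsonPipeline and NullFilterPipeline and the run_pipelines helper are hypothetical, an in-memory buffer replaces db.jl, and the two sample items are made up.

```python
import io
import json


class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


class ToJsonPipeline:
    """Writes every item it sees as one JSON line, then passes the item on."""

    def __init__(self, stream):
        self.stream = stream

    def process_item(self, item, spider=None):
        self.stream.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item


class NullFilterPipeline:
    """Drops items whose content is empty."""

    def process_item(self, item, spider=None):
        if item["content"]:
            return item
        raise DropItem("Missing content in %s" % item)


def run_pipelines(items, pipelines):
    """Pass each item through the pipelines in order; a DropItem ends its journey."""
    kept = []
    for item in items:
        try:
            for pipeline in pipelines:
                item = pipeline.process_item(item)
        except DropItem:
            continue
        kept.append(item)
    return kept


db_jl = io.StringIO()  # stands in for the db.jl file
items = [
    {"title": "A", "content": "some text"},
    {"title": "B", "content": ""},
]
kept = run_pipelines(items, [ToJsonPipeline(db_jl), NullFilterPipeline()])

print(db_jl.getvalue())  # both items: written before the filter ran
print(kept)              # only the item with non-empty content
```

Because the writing pipeline sits in front of the filter, its file contains every scraped item, while anything exported after the last pipeline contains only the survivors.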