As an example, we crawl Douban's most popular movie reviews.
Generate the spider template code:
scrapy genspider -t basic douban https://movie.douban.com/review/best/
Then complete the spider as follows:
# -*- coding: utf-8 -*-
import scrapy

from ..items import DoubandemoItem


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/review/best/']

    def parse(self, response):
        # Python 3 strings are already Unicode, so the Python 2
        # reload(sys) / sys.setdefaultencoding("utf-8") hack is not needed.
        # Each review on the page sits in its own div.main-bd block.
        for sel in response.xpath("//div[@class='main-bd']"):
            # extract_first() returns the default instead of raising
            # IndexError when a node is missing.
            title = sel.xpath("h2/a/text()").extract_first(default="")
            content = sel.xpath(
                "div[@class='review-short']//div[@class='short-content']//text()"
            ).extract_first(default="")
            dbitem = DoubandemoItem()
            dbitem["title"] = title.strip()
            dbitem["content"] = content.strip()
            yield dbitem
XPath expressions extract the title and content, which are then stored in the item.
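The extraction step can be tried without a network request. The following is a minimal sketch, not the spider itself: it swaps Scrapy's selectors for the standard library's xml.etree.ElementTree, whose limited XPath subset is enough for these expressions. The SAMPLE fragment and the extract_reviews helper are hypothetical, made up for illustration, and the real Douban markup may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment imitating the review list markup.
SAMPLE = """
<div id="content">
  <div class="main-bd">
    <h2><a href="#">  A great film  </a></h2>
    <div class="review-short">
      <div class="short-content">  Loved every minute.  </div>
    </div>
  </div>
  <div class="main-bd">
    <h2><a href="#">Untitled draft</a></h2>
    <div class="review-short"></div>
  </div>
</div>
"""


def extract_reviews(markup):
    """Return one {'title', 'content'} dict per div.main-bd block."""
    root = ET.fromstring(markup)
    reviews = []
    for block in root.iterfind(".//div[@class='main-bd']"):
        # Paths are relative to the current block, as in the spider.
        link = block.find("h2/a")
        body = block.find(
            "div[@class='review-short']//div[@class='short-content']"
        )
        reviews.append({
            # A missing node becomes an empty string instead of an error.
            "title": (link.text or "").strip() if link is not None else "",
            "content": "".join(body.itertext()).strip() if body is not None else "",
        })
    return reviews


reviews = extract_reviews(SAMPLE)
```

The second block has no short-content node, so its content comes back empty. An item like that is what the filtering pipeline later drops.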
The complete items.py is as follows:
# -*- coding: utf-8 -*-
import scrapy


class DoubandemoItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
The complete pipelines.py is as follows:
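# -*- coding: utf-8 -*-
import json

from scrapy.exceptions import DropItem


class DoubandemotojsonPipeline(object):
    """Write every item to db.jl, one JSON object per line."""

    def __init__(self):
        # Text mode with UTF-8: json.dumps returns str, not bytes.
        self.file = open("db.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # Flush and close the file when the crawl ends.
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the file.
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item


class DoubandemotonullPipeline(object):
    """Drop items whose content is empty."""

    def process_item(self, item, spider):
        if item["content"]:
            return item
        raise DropItem("Missing content in %s" % item)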
The Item Pipeline configuration in settings.py is as follows:
ITEM_PIPELINES = {
    'DoubanDemo.pipelines.DoubandemotojsonPipeline': 300,
    'DoubanDemo.pipelines.DoubandemotonullPipeline': 400,
}
Each item passes through the pipelines in ascending order of these numbers, which are conventionally chosen from the 0-1000 range.
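To make the ordering concrete, here is a rough simulation of how priorities decide the sequence. It is a sketch that runs without Scrapy, not Scrapy's actual implementation: the DropItem class is a local stand-in for scrapy.exceptions.DropItem, and run_pipelines, ToJsonPipeline and NullFilterPipeline are hypothetical names that only mirror the two pipelines above.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class ToJsonPipeline:
    """Records every item it sees (stands in for writing db.jl)."""

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(item)
        return item


class NullFilterPipeline:
    """Drops items whose content is empty."""

    def process_item(self, item, spider):
        if item["content"]:
            return item
        raise DropItem("Missing content in %s" % item)


def run_pipelines(items, prioritized):
    """Pass each item through the pipelines, lowest priority number first.

    `prioritized` is a list of (priority, pipeline) pairs.
    Returns the items that survived every pipeline.
    """
    ordered = [p for _, p in sorted(prioritized, key=lambda pair: pair[0])]
    kept = []
    for item in items:
        try:
            for pipeline in ordered:
                item = pipeline.process_item(item, spider=None)
        except DropItem:
            continue  # a dropped item skips all later pipelines
        kept.append(item)
    return kept


writer = ToJsonPipeline()
items = [
    {"title": "A", "content": "text"},
    {"title": "B", "content": ""},
]
# Listed out of order on purpose: 300 still runs before 400.
kept = run_pipelines(items, [(400, NullFilterPipeline()), (300, writer)])
```

Because the writer has the lower number, it sees both items before the filter drops the empty one. Swapping the two numbers would keep item B out of the writer's output as well.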
The first pipeline runs first and writes every item, unfiltered, to the file db.jl.
Then the command scrapy crawl douban -o douban.json exports the items that remain after filtering.
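The two output files use different layouts: db.jl holds one JSON object per line, while douban.json holds a single JSON array. As a small sketch of how the two might be compared after a crawl, the code below reads both formats. The read_json_lines helper and the sample contents are hypothetical, used here in place of the real files.

```python
import io
import json


def read_json_lines(fp):
    """Yield one decoded object per non-empty line of a JSON Lines stream."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical contents standing in for db.jl (unfiltered)
# and douban.json (filtered).
db_jl = io.StringIO(
    '{"title": "A", "content": "text"}\n'
    '{"title": "B", "content": ""}\n'
)
douban_json = '[{"title": "A", "content": "text"}]'

all_items = list(read_json_lines(db_jl))
filtered_items = json.loads(douban_json)

# Items present in db.jl but absent from douban.json were dropped.
dropped_titles = [i["title"] for i in all_items if i not in filtered_items]
```

With real files, the io.StringIO object would be replaced by open("db.jl", encoding="utf-8"), and the string by the contents of douban.json.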