
Preface
I have been picking up Python in my spare time, and starting from something I find interesting is a good way to learn. Web crawling happens to be one of those interests. Following younghz's Scrapy tutorial step by step, I tried to scrape the post data from my Logdown blog as a practice exercise, and this post records the process.
Tool: Scrapy
Logdown post data
For now I am scraping the following four fields:
- Post title: article_name
- Post URL: article_url
- Post date: article_time
- Post tags: article_tags
Let's Go!
1. Create the project
scrapy startproject LogdownBlog
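If memory serves, this command generates a project skeleton roughly like the following (newer Scrapy versions may add an extra file or two); the spider we write in step 5 will go into the spiders folder:
+ LogdownBlog
| + LogdownBlog
| | + spiders
| | | - __init__.py
| | - __init__.py
| | - items.py
| | - pipelines.py
| | - settings.py
| - scrapy.cfg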
2. Writing items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class LogdownblogspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    article_name = scrapy.Field()   # post title
    article_url = scrapy.Field()    # post URL
    article_time = scrapy.Field()   # publication date
    article_tags = scrapy.Field()   # list of tags
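A Scrapy Item behaves very much like a dictionary, which is exactly what the pipeline relies on later when it calls dict(item). A minimal sketch of that interface, with made-up values purely for illustration:

from LogdownBlog.items import LogdownblogspiderItem

item = LogdownblogspiderItem()
item['article_name'] = 'Some post title'        # made-up values, only to show
item['article_url'] = 'http://example.com/1'    # the dict-like interface
item['article_tags'] = ['tag1', 'tag2']
print dict(item)    # -> {'article_name': 'Some post title', ...}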
3. Writing pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
import codecs


class LogdownblogspiderPipeline(object):
    def __init__(self):
        # open the output file once, in overwrite mode, with UTF-8 encoding
        self.file = codecs.open('LogdownBlogArticles.json', mode='w', encoding='utf-8')

    def process_item(self, item, spider):
        # one JSON object per line; json.dumps escapes non-ASCII text as \uXXXX,
        # and decoding with 'unicode_escape' turns it back into readable unicode
        # before it is written to the UTF-8 file
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line.decode('unicode_escape'))
        return item

    def close_spider(self, spider):
        # close the output file when the spider finishes
        self.file.close()
Each item goes through the pipeline and is written to the file LogdownBlogArticles.json, which is opened in mode w (so previous contents are overwritten), serialized as JSON, and encoded in UTF-8.
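Because every line of LogdownBlogArticles.json is a self-contained JSON object, the output can be sanity-checked afterwards with a tiny throwaway script (not part of the Scrapy project; the file name matches the one opened in the pipeline):

# check_output.py -- quick sanity check of the scraped data
import json
import codecs

with codecs.open('LogdownBlogArticles.json', encoding='utf-8') as f:
    for line in f:
        article = json.loads(line)
        print article['article_name'], article['article_time']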
4. Writing settings.py
BOT_NAME = 'LogdownBlog'

SPIDER_MODULES = ['LogdownBlog.spiders']
NEWSPIDER_MODULE = 'LogdownBlog.spiders'

# disable cookies so the crawl is not tracked via cookies (helps avoid bans)
COOKIES_ENABLED = False

# register the pipeline from step 3; 300 is its priority
ITEM_PIPELINES = {
    'LogdownBlog.pipelines.LogdownblogspiderPipeline': 300,
}
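The number 300 is simply the pipeline's priority (any value from 0 to 1000; lower values run first when several pipelines are registered). If you prefer to keep anti-ban tweaks in one place, the crawl delay can also be set globally here instead of on the spider. A small sketch of optional settings, not part of the original project:

# optional extras for settings.py (assumptions, not in the original tutorial)
DOWNLOAD_DELAY = 1          # global equivalent of the spider's download_delay
USER_AGENT = 'LogdownBlog (+http://childhood.logdown.com)'   # identify the bot politely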
5. Writing LogdownSpider.py: the spider's parsing logic
Create a new file named LogdownSpider.py under the spiders folder; the directory layout then looks like this:
+ LogdownBlog
| + LogdownBlog
| | + spiders
| | | - __init__.py
| | | - LogdownSpider.py
| | - __init__.py
| | - items.py
| | - pipelines.py
| | - settings.py
| - scrapy.cfg
# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector

from LogdownBlog.items import LogdownblogspiderItem


class LogdownSpider(Spider):
    '''LogdownSpider'''
    name = 'LogdownSpider'
    download_delay = 1                               # be gentle with the server
    allowed_domains = ["childhood.logdown.com"]
    first_blog_url = raw_input("Please enter the URL of your first blog post: ")
    start_urls = [
        first_blog_url
    ]

    def parse(self, response):
        sel = Selector(response)
        item = LogdownblogspiderItem()

        # pull the four fields out of the current post page
        article_name = sel.xpath('//article[@class="post"]/h2/a/text()').extract()[0]
        article_url = sel.xpath('//article[@class="post"]/h2/a/@href').extract()[0]
        article_time = sel.xpath('//article[@class="post"]/div[@class="meta"]/div[@class="date"]/time/@datetime').extract()[0]
        article_tags = sel.xpath('//article[@class="post"]/div[@class="meta"]/div[@class="tags"]/a/text()').extract()

        item['article_name'] = article_name.encode('utf-8')
        item['article_url'] = article_url.encode('utf-8')
        item['article_time'] = article_time.encode('utf-8')
        item['article_tags'] = [n.encode('utf-8') for n in article_tags]
        yield item

        # get the next article's url; the last post has no "next" link,
        # so guard against an empty result instead of indexing blindly
        next_links = sel.xpath('//nav[@id="pagenavi"]/a[@class="next"]/@href').extract()
        if next_links:
            nextUrl = next_links[0]
            print nextUrl
            yield Request(nextUrl, callback=self.parse)
A few points worth noting:
- Getting the XPath expressions right matters, since they directly determine how the data we want is extracted from the page's HTML.
Working out an XPath naturally requires inspecting the HTML. I used Chrome: right-click and choose "Inspect Element" to see where the target data sits in the page hierarchy. The "next post" link serves as the example below.

So the XPath '//nav[@id="pagenavi"]/a[@class="next"]/@href' gives us the URL of the next post (the scrapy shell sketch after this list shows a quick way to try such expressions before wiring them into the spider).
- Set download_delay to lighten the load on the server and avoid getting banned.
- yield Request(nextUrl, callback=self.parse) hands the URL of each page's "next post" back to the engine, so the crawl loops on to the following page.
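Before committing an XPath to the spider, it is handy to try it in Scrapy's interactive shell. Something along these lines, reusing the post URL from the sample output further below (depending on the Scrapy version, the loaded page is exposed as sel and/or response):

scrapy shell "http://childhood.logdown.com/posts/196165/physicseditorexporter-for-quickcocos2dx-instructions-for-use"
# inside the shell:
>>> sel.xpath('//article[@class="post"]/h2/a/text()').extract()
>>> sel.xpath('//nav[@id="pagenavi"]/a[@class="next"]/@href').extract()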
6. Run it
scrapy crawl LogdownSpider
The run looks like the screenshot at the top of this post, and each output line has the following format:
...
...
{"article_name": " PhysicsEditorExporter for QuickCocos2dx 使用说明", "article_tags": ["exporter", "Chipmunk", "physicseditor", "quick-x"], "article_url": "http://childhood.logdown.com/posts/196165/physicseditorexporter-for-quickcocos2dx-instructions-for-use", "article_time": "2014-04-28 14:08:00 UTC"}
...
...
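One possible refinement, not part of the code above: raw_input at class level makes the spider awkward to run non-interactively. Scrapy's -a command-line option passes keyword arguments to the spider's constructor, so the first post URL could be supplied when launching the crawl instead. A sketch, assuming parse() stays exactly as in step 5:

from scrapy.spider import Spider

class LogdownSpider(Spider):
    name = 'LogdownSpider'
    download_delay = 1
    allowed_domains = ["childhood.logdown.com"]

    def __init__(self, first_url=None, *args, **kwargs):
        super(LogdownSpider, self).__init__(*args, **kwargs)
        # first_url arrives from the -a option on the command line
        self.start_urls = [first_url] if first_url else []

    # parse() unchanged from step 5

The crawl would then be started with something like: scrapy crawl LogdownSpider -a first_url=<first-post-url>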