HappyCoding:] ChildhoodAndy

The development of the world of thought is, in a sense, a continuous flight from wonder. — Einstein

Scraping Logdown Blog Post Data with Scrapy


Preface
I've been exploring Python in my spare time, and starting from a point of interest is a good approach — for me, that point is web crawling. Following younghz's Scrapy tutorial step by step, I tried scraping the data of my Logdown blog posts as a practice exercise, and I'm recording the process here.

Tool: Scrapy

Logdown blog post data

For now, four fields are collected:

  • Post title article_name
  • Post URL article_url
  • Post date article_time
  • Post tags article_tags

Let's Go!

1. Create the project

cd into a directory on the command line, then run:
scrapy startproject LogdownBlog

2. Writing items.py

items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class LogdownblogspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    article_name = scrapy.Field()
    article_url = scrapy.Field()
    article_time = scrapy.Field()
    article_tags = scrapy.Field()

3. Writing pipelines.py

pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
import codecs


class LogdownblogspiderPipeline(object):
    def __init__(self):
        self.file = codecs.open('LogdownBlogArticles.json', mode='w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the output
        # instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

The pipeline writes each item to the file LogdownBlogArticles.json, one JSON object per line, opened in mode 'w' (overwriting any previous run) with UTF-8 encoding.
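The pipeline needs that extra care because json.dumps escapes non-ASCII characters to \uXXXX by default, which makes Chinese titles unreadable in the output file. A quick standalone demonstration (the item dict here is made up for illustration):

```python
import json

# A made-up item with a Chinese title, mimicking one scraped post
item = {"article_name": "使用说明", "article_tags": ["quick-x"]}

# Default: non-ASCII characters come out as \uXXXX escapes
print(json.dumps(item))

# ensure_ascii=False keeps the original characters
print(json.dumps(item, ensure_ascii=False))
```

The second form is what we want to write to the JSON file.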

4. Writing settings.py

settings.py
BOT_NAME = 'LogdownBlog'

SPIDER_MODULES = ['LogdownBlog.spiders']
NEWSPIDER_MODULE = 'LogdownBlog.spiders'

COOKIES_ENABLED = False

ITEM_PIPELINES = {
    'LogdownBlog.pipelines.LogdownblogspiderPipeline': 300
}

5. Writing LogdownSpider.py: the crawling logic

Create a new file named LogdownSpider.py under the spiders folder; the directory tree now looks like this:

Project directory layout
+ LogdownBlog
|  + LogdownBlog
|  |  + spiders
|  |  |  - __init__.py
|  |  |  - LogdownSpider.py
|  |  - __init__.py
|  |  - items.py
|  |  - pipelines.py
|  |  - settings.py
|  - scrapy.cfg
LogdownSpider.py spider code
# -*- coding: utf-8 -*-

from scrapy import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from LogdownBlog.items import LogdownblogspiderItem


class LogdownSpider(Spider):
    '''LogdownSpider'''

    name = 'LogdownSpider'

    download_delay = 1
    allowed_domains = ["childhood.logdown.com"]

    first_blog_url = input("Enter the URL of your first blog post: ")
    start_urls = [
        first_blog_url
    ]

    def parse(self, response):
        sel = Selector(response)
        item = LogdownblogspiderItem()

        item['article_name'] = sel.xpath('//article[@class="post"]/h2/a/text()').extract()[0]
        item['article_url'] = sel.xpath('//article[@class="post"]/h2/a/@href').extract()[0]
        item['article_time'] = sel.xpath('//article[@class="post"]/div[@class="meta"]/div[@class="date"]/time/@datetime').extract()[0]
        item['article_tags'] = sel.xpath('//article[@class="post"]/div[@class="meta"]/div[@class="tags"]/a/text()').extract()

        yield item

        # get the next article's url; the last post has no "next" link,
        # so check before indexing
        next_urls = sel.xpath('//nav[@id="pagenavi"]/a[@class="next"]/@href').extract()
        if next_urls:
            print(next_urls[0])
            yield Request(next_urls[0], callback=self.parse)

A few points to note:

  • Getting each XPath right matters, because it directly determines whether the data we want can be extracted from the page's HTML.

Analyzing an XPath naturally means analyzing the HTML first. I used the Chrome browser, right-clicking and choosing "Inspect Element" to see where the target data sits in the page's hierarchy. Take the "next post" link as an example.

So the XPath '//nav[@id="pagenavi"]/a[@class="next"]/@href' yields the URL of the next post.
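An XPath like this can be sanity-checked offline before running the whole spider. Below is a minimal sketch using only the standard library's ElementTree on a trimmed, hypothetical snippet of the page markup (ElementTree supports a subset of XPath, so the trailing /@href step becomes a .get('href') call; Scrapy's Selector handles the full expression above):

```python
import xml.etree.ElementTree as ET

# A trimmed, made-up snippet mimicking the pagenavi markup
html = """<div>
  <nav id="pagenavi">
    <a class="prev" href="/posts/100">Prev</a>
    <a class="next" href="/posts/102">Next</a>
  </nav>
</div>"""

root = ET.fromstring(html)
# ElementTree supports [@attr='value'] predicates, but not the /@href
# step, so read the attribute from the matched element instead
next_link = root.find(".//nav[@id='pagenavi']/a[@class='next']")
print(next_link.get("href"))  # /posts/102
```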

  • Set download_delay to ease the load on the server and avoid getting banned.
  • Yielding a Request with callback=self.parse hands each page's "next post" URL back to the engine, which is what loops the crawl on to the next page.
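To see why yielding a Request keeps the crawl going, here is a framework-free sketch of the same loop: a parse() generator yields an item and then, when a "next" link exists, the next URL, while a tiny stand-in for Scrapy's engine keeps calling it until no next page remains (the pages dict is made up for illustration):

```python
# Fake "site": each URL maps to a post title and an optional next link
pages = {
    "/posts/1": {"name": "first", "next": "/posts/2"},
    "/posts/2": {"name": "second", "next": None},
}

def parse(url):
    page = pages[url]
    yield {"article_name": page["name"]}  # the scraped item
    if page["next"]:                      # only follow when a next link exists
        yield ("request", page["next"])   # stand-in for scrapy.http.Request

# Minimal stand-in for the engine: drain the generator, collect items,
# and follow the yielded request until there is none
items, url = [], "/posts/1"
while url:
    next_url = None
    for result in parse(url):
        if isinstance(result, tuple):
            next_url = result[1]
        else:
            items.append(result)
    url = next_url

print([i["article_name"] for i in items])  # ['first', 'second']
```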

6. Run

scrapy crawl LogdownSpider

The screenshot at the top of this article shows the run; the output format looks like this:

...
...
{"article_name": "  PhysicsEditorExporter for QuickCocos2dx 使用说明", "article_tags": ["exporter", "Chipmunk", "physicseditor", "quick-x"], "article_url": "http://childhood.logdown.com/posts/196165/physicseditorexporter-for-quickcocos2dx-instructions-for-use", "article_time": "2014-04-28 14:08:00 UTC"}
...
...

Related references
