【Scrapy】Memo: from install to running a spider

Python · 2020.05.21

I'm working through a Udemy video course on Scrapy, a web-scraping framework written in Python.

This is a memo for when I forget the Scrapy setup steps. It assumes a Mac environment; the Scrapy version at the time of writing is 2.1.0.

The scraping target is http://quotes.toscrape.com/.

Official site: https://scrapy.org/

Scrapy: from install to first run

First, a summary of just the commands to run.

  1. python -m venv venv
  2. pip install scrapy
  3. scrapy version
  4. scrapy startproject <project name>
  5. cd <project name> && scrapy genspider <file name> <domain>
  6. scrapy crawl <spider name> -o file.csv

Below, the same steps in order, including logs.

First, move to a suitable folder and create a Python virtual environment with venv.

scrapy $ python -m venv venv

Activate the virtual environment

scrapy $ source venv/bin/activate
(venv) scrapy $

Install Scrapy

(venv) scrapy $ pip install scrapy
Collecting scrapy
Downloading https://files.pythonhosted.org/packages/9a/d3/5af102af577f57f706fcb302ea47d40e09355778488de904b3594d4e48d2/Scrapy-2.1.0-py2.py3-none-any.whl (239kB)
100% |████████████████████████████████| 245kB 118kB/s
...(omitted)
Successfully installed Automat-20.2.0 PyDispatcher-2.0.5 PyHamcrest-2.0.2 Twisted-20.3.0 attrs-19.3.0 cffi-1.14.0 constantly-15.1.0 cryptography-2.9.2 cssselect-1.1.0 hyperlink-19.0.0 idna-2.9 incremental-17.5.0 lxml-4.5.1 parsel-1.6.0 protego-0.1.16 pyOpenSSL-19.1.0 pyasn1-0.4.8 pyasn1-modules-0.2.8 pycparser-2.20 queuelib-1.5.0 scrapy-2.1.0 service-identity-18.1.0 six-1.14.0 w3lib-1.22.0 zope.interface-5.1.0
(venv) scrapy $

Confirm the install with scrapy version

(venv) scrapy $ scrapy version
Scrapy 2.1.0
(venv) scrapy $

Create a Scrapy project with scrapy startproject <project name>

(venv) scrapy $ scrapy startproject pagescrape
New Scrapy project 'pagescrape', using template directory '/Users/me/Desktop/python-practice/scrapy/venv/lib/python3.7/site-packages/scrapy/templates/project', created in:
/Users/me/Desktop/python-practice/scrapy/pagescrape

You can start your first spider with:
cd pagescrape
scrapy genspider example example.com
(venv) scrapy $

Create a spider file inside the project with scrapy genspider <file name> <domain>

(venv) scrapy $ cd pagescrape
(venv) pagescrape $
(venv) pagescrape $ scrapy genspider page_spider quotes.toscrape.com
Created spider 'page_spider' using template 'basic' in module:
pagescrape.spiders.page_spider
(venv) pagescrape $

After these steps, VSCode shows a project tree and a generated spider file like the following.

[Screenshot: generated Scrapy project tree]

Change the parse method to the following:

page_spider.py
# -*- coding: utf-8 -*-
import scrapy

class PageSpiderSpider(scrapy.Spider):
    name = 'page_spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        h1_tag = response.xpath('//h1/a/text()').extract_first()
        tags = response.xpath('//*[@class="tag-item"]/a/text()').extract()

        yield {'H1 Tag': h1_tag, 'Tags': tags}

scrapy crawl <spider name> -o file.csv runs the crawl and exports in one shot; whatever parse() yields is written out. json and xml output also work (the format follows the file extension).

(venv) pagescrape $ scrapy crawl page_spider -o file.csv
2020-05-21 00:34:38 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: pagescrape)
...(omitted)...
2020-05-21 00:34:38 [scrapy.core.engine] INFO: Spider opened
2020-05-21 00:34:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-21 00:34:38 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-21 00:34:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2020-05-21 00:34:40 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'H1 Tag': 'Quotes to Scrape', 'Tags': ['love', 'inspirational', 'life', 'humor', 'books', 'reading', 'friendship', 'friends', 'truth', 'simile']}
2020-05-21 00:34:40 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-21 00:34:40 [scrapy.extensions.feedexport] INFO: Stored csv feed (1 items) in: file.csv
2020-05-21 00:34:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 219,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2342,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.375834,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 20, 15, 34, 40, 190982),
'item_scraped_count': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'memusage/max': 50552832,
'memusage/startup': 50552832,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 5, 20, 15, 34, 38, 815148)}
2020-05-21 00:34:40 [scrapy.core.engine] INFO: Spider closed (finished)
(venv) pagescrape $
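
The two XPath expressions in parse() can also be tried outside Scrapy. The sketch below uses only the standard library's xml.etree.ElementTree on a hand-written snippet of the page (Scrapy itself uses parsel/lxml, which handle real-world HTML and full XPath; ElementTree accepts only well-formed XML and a small XPath subset, so this is purely illustrative):

```python
# Stand-alone sketch of what the parse() XPaths pull out, using only
# the stdlib on a minimal, hand-written snippet of the target page.
import xml.etree.ElementTree as ET

html = """
<html><body>
  <h1><a href="/">Quotes to Scrape</a></h1>
  <span class="tag-item"><a href="/tag/love/">love</a></span>
  <span class="tag-item"><a href="/tag/life/">life</a></span>
</body></html>
"""

root = ET.fromstring(html)
h1_tag = root.find('.//h1/a').text                                  # like extract_first()
tags = [a.text for a in root.findall('.//*[@class="tag-item"]/a')]  # like extract()

print({'H1 Tag': h1_tag, 'Tags': tags})
# -> {'H1 Tag': 'Quotes to Scrape', 'Tags': ['love', 'life']}
```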

Example Scrapy script for a site that requires login

POST the form with FormRequest().

login_spider.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser

class LoginSpiderSpider(scrapy.Spider):
    name = 'login_spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        csrf_token = response.xpath('//input[@name="csrf_token"]/@value').extract_first()
        yield FormRequest('http://quotes.toscrape.com/login',
                            formdata={'csrf_token': csrf_token,
                            'username': 'foobar',
                            'password': 'foobar' },
                            callback=self.parse_after_login)
    
    def parse_after_login(self, response):
        if response.xpath('//a[text()="Logout"]'):
            # self.log('You logged in!')
            # Open the post-login page in a browser
            open_in_browser(response)

open_in_browser() is there only to check the result.
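
Under the hood, FormRequest essentially URL-encodes the formdata dict into the POST body. A minimal stdlib sketch of that step (the csrf_token value here is a made-up placeholder):

```python
# Roughly what FormRequest does with formdata: URL-encode it into the
# request body. 'abc123' is a made-up placeholder token.
from urllib.parse import urlencode

formdata = {'csrf_token': 'abc123', 'username': 'foobar', 'password': 'foobar'}
body = urlencode(formdata)
print(body)  # csrf_token=abc123&username=foobar&password=foobar
```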

Scrapy shell operations

Enter the shell with scrapy shell

(venv) pagescrape $ scrapy shell
2020-05-20 22:52:51 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: pagescrape)
...(omitted)...
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x10f101f90>
[s] item {}
[s] settings <scrapy.settings.Settings object at 0x10f3cc090>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>>

Fetch a page with fetch('<URL>') (or do both at once with scrapy shell '<URL>')

>>> fetch("http://quotes.toscrape.com/")
2020-05-20 22:53:49 [scrapy.core.engine] INFO: Spider opened
2020-05-20 22:53:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-05-20 22:53:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
>>>

Extract with response.xpath('//xxx')

>>> response.xpath('//h1')
[<Selector xpath='//h1' data='<h1>\n <a href="/" ...'>]
>>>
>>> response.xpath('//h1/a')
[<Selector xpath='//h1/a' data='<a href="/" style="text-decoration: n...'>]
>>>
>>> response.xpath('//h1/a/text()')
[<Selector xpath='//h1/a/text()' data='Quotes to Scrape'>]
>>>
>>> response.xpath('//h1/a/text()').extract()
['Quotes to Scrape']
>>>
>>> response.xpath('//h1/a/text()').extract_first()
'Quotes to Scrape'
>>>

>>> response.xpath('//*[@class="tag-item"]/a/text()').extract_first()
'love'
>>> response.xpath('//*[@class="tag-item"]/a/text()').extract()
['love', 'inspirational', 'life', 'humor', 'books', 'reading', 'friendship', 'friends', 'truth', 'simile']
>>>
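
One note on extract() vs extract_first(): the former always returns a list (possibly empty), while the latter returns the first match, or None when nothing matches (Scrapy also offers .getall() / .get() as newer aliases). The same contract can be mimicked with the stdlib for illustration:

```python
# findall() returns a (possibly empty) list; find() returns the first
# match or None — the same shape of contract as Scrapy's
# .extract() / .extract_first().
import xml.etree.ElementTree as ET

root = ET.fromstring('<div><p>a</p><p>b</p></div>')
all_p = [p.text for p in root.findall('.//p')]   # like .extract()
first_h1 = root.find('.//h1')                    # like .extract_first()
print(all_p, first_h1)  # ['a', 'b'] None
```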