Scrapy框架的学习(一)

Scrapy框架的学习(一),第1张

1. Scrapy概述 1. 为什么要学习scrapy框架
  • 爬虫必备的技术,面试会问相关的知识。
  • 让我们的爬虫更快更强大。(支持异步爬虫)
2. 什么是Scrapy?

  • 异步爬虫框架:Scrapy是一个基于Python开发的爬虫框架,用于抓取网站并从其页面中提取结构化数据,也是当前Python爬虫生态中最流行的爬虫框架,Scrapy框架架构清晰,可扩展性强,可以灵活高效的完成各种爬虫需求。
    程序状态转换图:
3. 如何学习Scrapy?
  • 官网:https://scrapy.org/
  • 官方文档1(中文):https://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html
  • 官方文档2(英文):https://docs.scrapy.org/en/latest/
4. Scrapy工作流程



分工介绍表
版块介绍要求
Scrapy engine(引擎)总指挥:负责数据和信号在不同模块之间的传递Scrapy已经实现
Scheduler(调度器)一个队列,存放引擎发过来的request请求Scrapy已经实现
Downoader(下载器)下载引擎发过来的requests请求的源码(即response),将源码返回给引擎scrapy已经实现
Spider(爬虫)处理引擎发来的response,提取数据,提取url,并交给引擎需要手写
Item Pipline(管道)处理引擎传过来的数据,比如存储数据需要手写
Downloader Middlewares(下载中间件)可以自定义的下载扩展,比如设置代理一般不用手写
Spider Middlewares(中间件)可以自定义requests请求和进行response过滤一般不用手写
2. Scrapy快速入门(小案例) 1. 安装
pip install scrapy
pip install scrapy==2.5.1   # 指定安装2.5.1版本的scrapy


在终端内输入“scrapy”命令验证是否安装好了:

以上显示就说明已经安装好了。

2. 创建项目
  • 需要进入到项目保存位置的cmd中。
# scrapy startproject 项目名称
scrapy startproject my_Scrapy

3. 项目结构分析

  • my_Scrapy
    • my_Scrapy
      • spiders
        • __init__.py
      • __init__.py
      • items.py
      • middlewares.py
      • piplelines.py
      • settings.py
    • scrapy.cfg

功能介绍:

  • scrapy.cfg:Scrapy项目配置文件,定义项目的配置文件的路径,部署信息。(一般不需要改)
  • items.py:定义了item的数据结构,所有item的定义都可以放在这里。(定义爬取的数据内容有哪些)
  • piplelines.py:定义item Pipeline的实现。
  • settings.py:定义项目的全局配置。
  • middlewares.py:中间件文件,定义了Spider Middlewares和Downloader Middlewares的实现。
  • spiders:里面包含一个个spider(爬虫)的实现。每一个spider对应一个.py文件。
4. 创建Spider
# 先进入项目目录:
cd my_Scrapy
# scrapy genspider 爬虫文件名 爬取网站的域名
scrapy genspider spider1 www.baidu.com


  • 修改spider1.py文件:
import scrapy


class Spider1Spider(scrapy.Spider):
    # spider(爬虫)名称,需要记住,通过名字来启动爬虫:
    name = 'spider1'
    # 允许爬取的域名:可更改(限制爬虫,不要爬到其他网站去了)
    allowed_domains = ['http://quotes.toscrape.com/']
    # 初始request请求:
    start_urls = ['http://quotes.toscrape.com/']

    # 解析方法:
    def parse(self, response):
        print(response.text)

官方参考案例网站:http://quotes.toscrape.com/

5. 创建item
  • item是保存爬取数据的容器,定义爬取的数据结构。
    修改项目中的items.py文件如下:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class MyScrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # 采集的目标内容:名言、名人、分类标签
    # 名言:
    text = scrapy.Field()
    # 名人:
    author = scrapy.Field()
    # 标签:
    tags = scrapy.Field()
6. 解析Response 1. 仅仅爬取第一页的数据
  • 修改spider1.py文件中的parse()方法,该方法用于解析源码中的目标内容。
import scrapy
from lxml import etree


class Spider1Spider(scrapy.Spider):
    # spider(爬虫)名称,需要记住,通过名字来启动爬虫:
    name = 'spider1'
    # 允许爬取的域名:可更改(限制爬虫,不要爬到其他网站去了)
    allowed_domains = ['http://quotes.toscrape.com/']
    # 初始request请求:
    start_urls = ['http://quotes.toscrape.com/']

    # 解析方法:
    def parse(self, response):
        # print(response.text)
        """
        解析返回响应,提取数据或进一步生成要处理的请求。
        :param response:
        :return:
        """
        # 方法一:通过css选择器来进行解析
        """ 
        quotes = response.css('.quote')   # 列表
        for quote in quotes:
            # text = quote.css('span.text::text')   # ::text表示获取里面的文本(注意它是一个对象)
            # 旧方法:
            # extract_first()  返回第一个数据。(字符串)
            # extract()  返回全部的数据。(字符串列表)
            text = quote.css('span.text::text').extract_first()   # ::text表示获取里面的文本  (extract_first()方法可以获取到对象里面的文本内容)
            # author = quote.css('small.author::text')  # 作者  (得到一个对象)
            author = quote.css('small.author::text').extract_first()  # 作者  (得到文本内容)
            # tags = quote.css('div.tags a.tag::text')   # 标签(CSS对象)
            tags = quote.css('div.tags a.tag::text').extract()   # 返回全部的数据
            # print(tags)
            # print(text, '      ——————', author, tags)

            # 新方法:
            # get()   返回一条数据
            # getall()  返回全部的数据
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, '      ——————', author)
        """

        # 方法二:通过xpath进行解析
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@]/a/text()')
            print(text, tags, '    ------', author)

运行start.py文件的结果:

D:\Anaconda\python.exe C:/Users/lv/Desktop/scrapy框架的学习/my_Scrapy/start.py
2022-04-03 19:24:47 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: my_Scrapy)
2022-04-03 19:24:47 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.19041-SP0
2022-04-03 19:24:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-03 19:24:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'my_Scrapy',
 'NEWSPIDER_MODULE': 'my_Scrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['my_Scrapy.spiders']}
2022-04-03 19:24:47 [scrapy.extensions.telnet] INFO: Telnet Password: e2250e171a87ebd6
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-03 19:24:47 [scrapy.core.engine] INFO: Spider opened
2022-04-03 19:24:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-03 19:24:47 [py.warnings] WARNING: D:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\offsite.py:65: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry http://quotes.toscrape.com/ in allowed_domains.
  warnings.warn(message, URLWarning)

2022-04-03 19:24:47 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-03 19:24:48 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2022-04-03 19:24:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2022-04-03 19:24:48 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-03 19:24:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 448,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2582,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 1.264597,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 4, 3, 11, 24, 48, 658256),
 'httpcompression/response_bytes': 11053,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 4, 3, 11, 24, 47, 393659)}
2022-04-03 19:24:48 [scrapy.core.engine] INFO: Spider closed (finished)
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.['change', 'deep-thoughts', 'thinking', 'world']     ------ Albert Einstein
“It is our choices, Harry, that show what we truly are, far more than our abilities.['abilities', 'choices']     ------ J.K. Rowling
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.['inspirational', 'life', 'live', 'miracle', 'miracles']     ------ Albert Einstein
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.['aliteracy', 'books', 'classic', 'humor']     ------ Jane Austen
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.” ['be-yourself', 'inspirational']     ------ Marilyn Monroe
“Try not to become a man of success. Rather become a man of value.['adulthood', 'success', 'value']     ------ Albert Einstein
“It is better to be hated for what you are than to be loved for what you are not.['life', 'love']     ------ André Gide
“I have not failed. I've just found 10,000 ways that won't work.['edison', 'failure', 'inspirational', 'paraphrased']     ------ Thomas A. Edison
“A woman is like a tea bag; you never know how strong it is until it's in hot water.” ['misattributed-eleanor-roosevelt']     ------ Eleanor Roosevelt
“A day without sunshine is like, you know, night.['humor', 'obvious', 'simile']     ------ Steve Martin

Process finished with exit code 0

2. 翻页爬取数据
  • 修改的主要是spider1.py文件中的部分语句。
import scrapy
from lxml import etree
from my_Scrapy.items import MyScrapyItem   # 从items文件中导入MyScrapyItem类


class Spider1Spider(scrapy.Spider):
    # spider(爬虫)名称,需要记住,通过名字来启动爬虫:
    name = 'spider1'
    # # 允许爬取的域名:可更改(限制爬虫,不要爬到其他网站去了)
    # allowed_domains = ['http://quotes.toscrape.com/']   # 不设置限制,就可以一直爬取下一页了
    # 初始request请求:
    start_urls = ['http://quotes.toscrape.com/']

    # 解析方法:
    def parse(self, response):
        # print(response.text)
        """
        解析返回响应,提取数据或进一步生成要处理的请求。
        :param response:
        :return:
        """
        # 方法一:通过css选择器来进行解析
        """ 
        quotes = response.css('.quote')   # 列表
        for quote in quotes:
            # text = quote.css('span.text::text')   # ::text表示获取里面的文本(注意它是一个对象)
            # 旧方法:
            # extract_first()  返回第一个数据。(字符串)
            # extract()  返回全部的数据。(字符串列表)
            text = quote.css('span.text::text').extract_first()   # ::text表示获取里面的文本  (extract_first()方法可以获取到对象里面的文本内容)
            # author = quote.css('small.author::text')  # 作者  (得到一个对象)
            author = quote.css('small.author::text').extract_first()  # 作者  (得到文本内容)
            # tags = quote.css('div.tags a.tag::text')   # 标签(CSS对象)
            tags = quote.css('div.tags a.tag::text').extract()   # 返回全部的数据
            # print(tags)
            # print(text, '      ——————', author, tags)

            # 新方法:
            # get()   返回一条数据
            # getall()  返回全部的数据
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, '      ——————', author)
        """

        # 方法二:通过xpath进行解析
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@]/a/text()')
            # print(text, tags, '    ------', author)

            # 将数据放入容器中,便于保存数据
            item = MyScrapyItem()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            # print(item)

            # 通过迭代的方式将字典中的每一条数据交给Pipeline
            yield item

        # 定义翻页 *** 作
        next = response.css('ul.pager li.next a::attr("href")').get()   # 获得“下一页”a标签按钮上面href属性值
        print(next)  # /page/2/
        url = self.start_urls[0]   # 获取当前爬取的url(方法一)
        # print(url)
        url = response.url    # 获取当前爬取的url(方法二)
        # print(url)
        # 拼接形成下一页的url
        url = response.urljoin(next)
        print(url)
        # 将请求交给调度器,重新构造下一个请求
        yield scrapy.Request(url, callback=self.parse)   # 对于新的请求,仍然执行parse()解析方法

运行结果部分截图:

7. 保存数据 1. 通过执行scrapy命令进行保存数据 1. 方式一:在终端执行命令
# scrapy crawl 爬虫文件名 -o 数据保存文件名
scrapy crawl spider1 -o demo.csv

2. 方式二:通过修改start.py启动文件的cmd命令行语句
# 使用Scrapy框架的该爬虫是一个项目,不能在爬虫文件中用右键运行,需要在终端里面输入scrapy crawl 爬虫文件名“的命令进行启动。
# 如果不想用终端输入命令的启动方式,可以创建该“start.py”文件
from scrapy import cmdline

# cmdline.execute('scrapy crawl spider1'.split())   # 调用终端命令
cmdline.execute('scrapy crawl spider1 -o demo.csv'.split())

# 红色的不是报错,而是scrapy框架自行打印的初始化信息。白色的内容就是print()语句输出的内容。

2. 通过自定义的方式保存(修改pipelines.py文件)
  1. 修改pipelines.py文件如下:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class MyScrapyPipeline:
    def process_item(self, item, spider):
        with open('demo.txt', 'a', encoding="utf-8") as f:
            f.write(item['text'] + '           ——' + item['author'] + "\n")
        return item

  1. 将settings.py文件中pipelines.py文件对应的注释取消掉(否则就无法成功将数据保存在txt文件中)
# Scrapy settings for my_Scrapy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'my_Scrapy'

SPIDER_MODULES = ['my_Scrapy.spiders']
NEWSPIDER_MODULE = 'my_Scrapy.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'my_Scrapy (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'my_Scrapy.middlewares.MyScrapySpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'my_Scrapy.middlewares.MyScrapyDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'my_Scrapy.pipelines.MyScrapyPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

  1. 运行结果截图:
8. 运行项目 1. 在终端内运行
# scrapy crawl 爬虫文件名
scrapy crawl spider1

最前面是爬虫运行的提示信息:

中间的就是网页源代码:

最后面是关闭爬虫的提示信息:

2. 通过PyCharm运行

需要在项目文件夹下创建一个启动项目的文件start.py:

# 使用Scrapy框架的该爬虫是一个项目,不能在爬虫文件中用右键运行,需要在终端里面输入scrapy crawl 爬虫文件名“的命令进行启动。
# 如果不想用终端输入命令的启动方式,可以创建该“start.py”文件
from scrapy import cmdline

cmdline.execute('scrapy crawl spider1'.split())   # 调用终端命令

# 红色的不是报错,而是爬虫的初始化信息。白色的内容就是print()语句输出的内容

运行结果:

D:\Anaconda\python.exe C:/Users/lv/Desktop/scrapy框架的学习/my_Scrapy/start.py
2022-04-03 16:34:35 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: my_Scrapy)
2022-04-03 16:34:35 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.19041-SP0
2022-04-03 16:34:35 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-03 16:34:35 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'my_Scrapy',
 'NEWSPIDER_MODULE': 'my_Scrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['my_Scrapy.spiders']}
2022-04-03 16:34:35 [scrapy.extensions.telnet] INFO: Telnet Password: b9d4a8fccbb5b978
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-03 16:34:35 [scrapy.core.engine] INFO: Spider opened
2022-04-03 16:34:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-03 16:34:35 [py.warnings] WARNING: D:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\offsite.py:65: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry http://quotes.toscrape.com/ in allowed_domains.
  warnings.warn(message, URLWarning)

2022-04-03 16:34:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-03 16:34:36 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2022-04-03 16:34:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" /    > 
            
            <a class="tag" href="/tag/change/page/1/">change</a>
            
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            
            <a class="tag" href="/tag/world/page/1/">world</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.</span>
        <span>by <small class="author" itemprop="author">J.K. Rowling</small>
        <a href="/author/J-K-Rowling">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="abilities,choices" /    > 
            
            <a class="tag" href="/tag/abilities/page/1/">abilities</a>
            
            <a class="tag" href="/tag/choices/page/1/">choices</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="inspirational,life,live,miracle,miracles" /    > 
            
            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
            
            <a class="tag" href="/tag/life/page/1/">life</a>
            
            <a class="tag" href="/tag/live/page/1/">live</a>
            
            <a class="tag" href="/tag/miracle/page/1/">miracle</a>
            
            <a class="tag" href="/tag/miracles/page/1/">miracles</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.</span>
        <span>by <small class="author" itemprop="author">Jane Austen</small>
        <a href="/author/Jane-Austen">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" /    > 
            
            <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>
            
            <a class="tag" href="/tag/books/page/1/">books</a>
            
            <a class="tag" href="/tag/classic/page/1/">classic</a>
            
            <a class="tag" href="/tag/humor/page/1/">humor</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it&#39;s better to be absolutely ridiculous than absolutely boring.”
        <span>by <small class="author" itemprop="author">Marilyn Monroe</small>
        <a href="/author/Marilyn-Monroe">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="be-yourself,inspirational" /    > 
            
            <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>
            
            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="adulthood,success,value" /    > 
            
            <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>
            
            <a class="tag" href="/tag/success/page/1/">success</a>
            
            <a class="tag" href="/tag/value/page/1/">value</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.</span>
        <span>by <small class="author" itemprop="author">André Gide</small>
        <a href="/author/Andre-Gide">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="life,love" /    > 
            
            <a class="tag" href="/tag/life/page/1/">life</a>
            
            <a class="tag" href="/tag/love/page/1/">love</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“I have not failed. I&#39;ve just found 10,000 ways that won't work.”
        <span>by <small class="author" itemprop="author">Thomas A. Edison</small>
        <a href="/author/Thomas-A-Edison">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="edison,failure,inspirational,paraphrased" /    > 
            
            <a class="tag" href="/tag/edison/page/1/">edison</a>
            
            <a class="tag" href="/tag/failure/page/1/">failure</a>
            
            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
            
            <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it&#39;s in hot water.”
        <span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>
        <a href="/author/Eleanor-Roosevelt">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt" /    > 
            
            <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“A day without sunshine is like, you know, night.</span>
        <span>by <small class="author" itemprop="author">Steve Martin</small>
        <a href="/author/Steve-Martin">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="humor,obvious,simile" /    > 
            
            <a class="tag" href="/tag/humor/page/1/">humor</a>
            
            <a class="tag" href="/tag/obvious/page/1/">obvious</a>
            
            <a class="tag" href="/tag/simile/page/1/">simile</a>
            
        </div>
    </div>

    <nav>
        <ul class="pager">
            
            
            <li class="next">
                <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
            </li>
            
        </ul>
    </nav>
    </div>
    <div class="col-md-4 tags-box">
        
            <h2>Top Ten tags</h2>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 28px" href="/tag/love/">love</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 26px" href="/tag/inspirational/">inspirational</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 26px" href="/tag/life/">life</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 24px" href="/tag/humor/">humor</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 22px" href="/tag/books/">books</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 14px" href="/tag/reading/">reading</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 10px" href="/tag/friendship/">friendship</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 8px" href="/tag/friends/">friends</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>
            </span>
            
        
    </div>
</div>

    </div>
    <footer class="footer">
        <div class="container">
            <p class="text-muted">
                Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
            </p>
            <p class="copyright">
                Made with <span class='sh-red'></span> by <a href="https://scrapinghub.com">Scrapinghub</a>
            </p>
        </div>
    </footer>
</body>
</html>
2022-04-03 16:34:36 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-03 16:34:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 448,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2578,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 1.29309,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 4, 3, 8, 34, 36, 608493),
 'httpcompression/response_bytes': 11053,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 4, 3, 8, 34, 35, 315403)}
2022-04-03 16:34:36 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0

3. srcapy shell 的使用 1. 在终端内使用scrapy shell命令进行单次请求内容提取的测试

爬取网址:https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
在终端内输入命令:

scrapy shell https://docs.scrapy.org/en/latest/_st
atic/selectors-sample1.html
Microsoft Windows [版本 10.0.19042.1586]
(c) Microsoft Corporation。保留所有权利。
(base) C:\Users\吕成鑫\Desktop\scrapy框架的学习\my_Scr
apy>scrapy shell https://docs.scrapy.org/en/latest/_st
atic/selectors-sample1.html
2022-04-04 20:04:11 [scrapy.utils.log] INFO: Scrapy 2.
5.1 started (bot: my_Scrapy)
2022-04-04 20:04:11 [scrapy.utils.log] INFO: Versions:
 lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel
1.6.0, w3lib 1.22.0, Twisted 18.7.0, Python 3.7.0 (def
ault, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64
)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cr
yptography 2.3.1, Platform Windows-10-10.0.19041-SP0
2022-04-04 20:04:11 [scrapy.utils.log] DEBUG: Using re
actor: twisted.internet.selectreactor.SelectReactor
2022-04-04 20:04:11 [scrapy.crawler] INFO: Overridden
settings:
{'BOT_NAME': 'my_Scrapy',
 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilte
r',
 'LOGSTATS_INTERVAL': 0,
 'NEWSPIDER_MODULE': 'my_Scrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['my_Scrapy.spiders']}
2022-04-04 20:04:11 [scrapy.extensions.telnet] INFO: T
elnet Password: 358ca5f9dee7f2d7
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled
extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled
downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMidd
leware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddle
ware',
 'scrapy.downloadermiddlewares.downloadtimeout.Downloa
dTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultH
eadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMidd
leware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',

 'scrapy.downloadermiddlewares.redirect.MetaRefreshMid
dleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCom
pressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddle
ware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddlewa
re',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMidd
leware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']

2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled
spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddlewa
re',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',

 'scrapy.spidermiddlewares.referer.RefererMiddleware',

 'scrapy.spidermiddlewares.urllength.UrlLengthMiddlewa
re',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled
item pipelines:
['my_Scrapy.pipelines.MyScrapyPipeline']
2022-04-04 20:04:11 [scrapy.extensions.telnet] INFO: T
elnet console listening on 127.0.0.1:6023
2022-04-04 20:04:11 [scrapy.core.engine] INFO: Spider
opened
2022-04-04 20:04:12 [scrapy.core.engine] DEBUG: Crawle
d (200) <GET https://docs.scrapy.org/robots.txt> (refe
rer: None)
2022-04-04 20:04:14 [scrapy.core.engine] DEBUG: Crawle
d (200) <GET https://docs.scrapy.org/en/latest/_static
/selectors-sample1.html> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Reques
t, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0
0000229978B6E80>
[s]   item       {}
[s]   request    <GET https://docs.scrapy.org/en/lates
t/_static/selectors-sample1.html>
[s]   response   <200 https://docs.scrapy.org/en/lates
t/_static/selectors-sample1.html>
[s]   settings   <scrapy.settings.Settings object at 0
x00000229978B6A20>
[s]   spider     <DefaultSpider 'default' at 0x22997d9
8898>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update
 local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Reque
st and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: response
Out[1]: <200 https://docs.scrapy.org/en/latest/_static
/selectors-sample1.html>

In [2]: response.text
Out[2]: "<html>\n <head>\n  <base href='http://example
.com/' />\n  <title>Example website</title>\n </head>\
n <body>\n  <div id='images'>\n   <a href='image1.html
'>Name: My image 1 
image1_thumb.jpg' / ></a>\n <a href='image2.html'>Name: My image 2 <br / ><img src='image2_thumb.jpg' /></a>\n <a href='image 3.html'>Name: My image 3
image3_thumb. jpg' />\n image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n <a href= 'image5.html'>Name: My image 5 <br /><img src='image5_ thumb.jpg' /></a>\n </div>\n </body>\n</html>\n\n" In [3]: response.xpath('//a') Out[3]: [<Selector xpath='//a' data='<a href="image1.html">Nam e: My image ...'>, <Selector xpath='//a' data='<a href="image2.html">Nam e: My image ...'>, <Selector xpath='//a' data='<a href="image3.html">Nam e: My image ...'>, <Selector xpath='//a' data='<a href="image4.html">Nam e: My image ...'>, <Selector xpath='//a' data='<a href="image5.html">Nam e: My image ...'>] In [4]: response.xpath('//a').xpath('./img') Out[4]: [<Selector xpath='./img' data='<img src="image1_thumb. jpg">'>, <Selector xpath='./img' data='<img src="image2_thumb. jpg">'>, <Selector xpath='./img' data='<img src="image3_thumb. jpg">'>, <Selector xpath='./img' data='<img src="image4_thumb. jpg">'>, <Selector xpath='./img' data='<img src="image5_thumb. jpg">'>] In [5]: response.xpath('//a').xpath('./img')[0] Out[5]: <Selector xpath='./img' data='<img src="image1 _thumb.jpg">'> In [6]: response.xpath('//a').xpath('./img').getall() ...: Out[6]: ['', '', '', '', ''] In [7]: response.xpath('//a').xpath('./img').get() Out[7]: '' In [8]: result = response.xpath('//a') In [9]: result Out[9]: [<Selector xpath='//a' data='<a href="image1.html">Nam e: My image ...'>, <Selector xpath='//a' data='<a href="image2.html">Nam e: My image ...'>, <Selector xpath='//a' data='<a href="image3.html">Nam e: My image ...'>, <Selector xpath='//a' data='<a href="image4.html">Nam e: My image ...'>, <Selector xpath='//a' data='<a href="image5.html">Nam e: My image ...'>] In [10]: result.xpath('./img').getall() Out[10]: ['', '', '', '', ''] In [11]: response.xpath("//img") Out[11]: [<Selector xpath='//img' data='<img src="image1_thumb. jpg">'>, <Selector xpath='//img' data='<img src="image2_thumb. jpg">'>, <Selector xpath='//img' data='<img src="image3_thumb. jpg">'>, <Selector xpath='//img' data='<img src="image4_thumb. jpg">'>, <Selector xpath='//img' data='<img src="image5_thumb. jpg">'>] In [12]: response.css('a') Out[12]: [<Selector xpath='descendant-or-self::a' data='<a href ="image1.html">Name: My image ...'>, <Selector xpath='descendant-or-self::a' data='<a href ="image2.html">Name: My image ...'>, <Selector xpath='descendant-or-self::a' data='<a href ="image3.html">Name: My image ...'>, <Selector xpath='descendant-or-self::a' data='<a href ="image4.html">Name: My image ...'>, <Selector xpath='descendant-or-self::a' data='<a href ="image5.html">Name: My image ...'>] In [13]: response.css('div#images') Out[13]: [<Selector xpath="descendant-or-self::div[@id = 'images']" data='images">\n ima ge1....'>] In [14]: response.css('div#images').get() Out[14]: '<div id="images">\n <a href="image 1.html">Name: My image 1
image1_ thumb.jpg">\n image2.html">Name : My image 2 <br><img src="image2_thumb.jpg">< /a>\n <a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>\n <a h html">Name: My image 5
image5_thumb.jpg"></a>\n </div>' In [15]: response.xpath('//a/text()').re('Name:\s(.*)') Out[15]: ['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 '] In [16]: response.re('.*') # 不能这样直接使用re,需要在解析的内容后面使用re正则表达式 --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-16-a22dedc07090> in <module>() ----> 1 response.re('.*') AttributeError: 'HtmlResponse' object has no attribute 're' In [17]:
4. 实现翻页功能

如何翻页?

  • 回忆:

    • requests模块时如何发送翻页的请求的?
      • 1.找到下一页的地址
      • 2.之后调用requests.get(url)
  • 思路:

    • 1.找到下一页的地址
    • 2.构造一个关于下一页url地址的request请求传递给调度器
1. 通过在最后进行拼接成url和回调实现翻页
import scrapy
from lxml import etree
from my_Scrapy.items import MyScrapyItem   # 从items文件中导入MyScrapyItem类


class Spider2Spider(scrapy.Spider):
    # spider(爬虫)名称,需要记住,通过名字来启动爬虫:
    name = 'spider2'
    # # 允许爬取的域名:可更改(限制爬虫,不要爬到其他网站去了)
    # allowed_domains = ['quotes.toscrape.com/']   # 不设置限制,就可以一直爬取下一页了
    # 初始request请求:
    base_url = 'http://quotes.toscrape.com/page/{}/'
    page = 1
    start_urls = [base_url.format(page)]

    # 解析方法:
    def parse(self, response):
        # print(response.text)
        """
        解析返回响应,提取数据或进一步生成要处理的请求。
        :param response:
        :return:
        """
        # 方法一:通过css选择器来进行解析
        """ 
        quotes = response.css('.quote')   # 列表
        for quote in quotes:
            # text = quote.css('span.text::text')   # ::text表示获取里面的文本(注意它是一个对象)
            # 旧方法:
            # extract_first()  返回第一个数据。(字符串)
            # extract()  返回全部的数据。(字符串列表)
            text = quote.css('span.text::text').extract_first()   # ::text表示获取里面的文本  (extract_first()方法可以获取到对象里面的文本内容)
            # author = quote.css('small.author::text')  # 作者  (得到一个对象)
            author = quote.css('small.author::text').extract_first()  # 作者  (得到文本内容)
            # tags = quote.css('div.tags a.tag::text')   # 标签(CSS对象)
            tags = quote.css('div.tags a.tag::text').extract()   # 返回全部的数据
            # print(tags)
            # print(text, '      ——————', author, tags)

            # 新方法:
            # get()   返回一条数据
            # getall()  返回全部的数据
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, '      ——————', author)
        """

        # 方法二:通过xpath进行解析
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@]/a/text()')
            # print(text, tags, '    ------', author)

            # 将数据放入容器中,便于保存数据
            item = MyScrapyItem()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            # print(item)
            # 通过迭代的方式将字典中的每一条数据交给Pipeline
            yield item

        self.page += 1
        # 注意:需要控制翻页的结束
        if self.page < 11:
            # 构造下一个请求:(方法一)
            # yield scrapy.Request(self.base_url.format(self.page), callback=self.parse)

            # 构造下一个请求:(方法二)
            yield from response.follow_all(response.css('.pager .next a::attr("href")'), callback=self.parse)
        """
        # 原本定义的翻页 *** 作:
        next = response.css('ul.pager li.next a::attr("href")').get()   # 获得“下一页”a标签按钮上面href属性值
        print(next)  # /page/2/
        url = self.start_urls[0]   # 获取当前爬取的url(方法一)
        # print(url)
        url = response.url    # 获取当前爬取的url(方法二)
        # print(url)
        # 拼接形成下一页的url
        url = response.urljoin(next)
        print(url)
        # 将请求交给调度器,重新构造下一个请求
        yield scrapy.Request(url, callback=self.parse)   # 对于新的请求,仍然执行parse()解析方法
        """
2. 通过重写strat_requests()方法实现翻页
import scrapy
from lxml import etree
from my_Scrapy.items import MyScrapyItem   # 从items文件中导入MyScrapyItem类


class Spider3Spider(scrapy.Spider):
    # spider(爬虫)名称,需要记住,通过名字来启动爬虫:
    name = 'spider3'
    # # 允许爬取的域名:可更改(限制爬虫,不要爬到其他网站去了)
    # allowed_domains = ['quotes.toscrape.com/']   # 不设置限制,就可以一直爬取下一页了
    # 初始request请求:
    base_url = 'http://quotes.toscrape.com/page/{}/'
    page = 1
    start_urls = [base_url.format(page)]

    # 通过封装方法的形式构造翻页功能:
    def start_requests(self):   # 在爬虫开始请求的时候会执行的 *** 作
        for page in range(1, 11):
            url = self.base_url.format(page)
            yield scrapy.Request(url, callback=self.parse)

    # 解析方法:
    def parse(self, response):
        # print(response.text)
        """
        解析返回响应,提取数据或进一步生成要处理的请求。
        :param response:
        :return:
        """
        # 方法一:通过css选择器来进行解析
        """ 
        quotes = response.css('.quote')   # 列表
        for quote in quotes:
            # text = quote.css('span.text::text')   # ::text表示获取里面的文本(注意它是一个对象)
            # 旧方法:
            # extract_first()  返回第一个数据。(字符串)
            # extract()  返回全部的数据。(字符串列表)
            text = quote.css('span.text::text').extract_first()   # ::text表示获取里面的文本  (extract_first()方法可以获取到对象里面的文本内容)
            # author = quote.css('small.author::text')  # 作者  (得到一个对象)
            author = quote.css('small.author::text').extract_first()  # 作者  (得到文本内容)
            # tags = quote.css('div.tags a.tag::text')   # 标签(CSS对象)
            tags = quote.css('div.tags a.tag::text').extract()   # 返回全部的数据
            # print(tags)
            # print(text, '      ——————', author, tags)

            # 新方法:
            # get()   返回一条数据
            # getall()  返回全部的数据
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, '      ——————', author)
        """

        # 方法二:通过xpath进行解析
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@]/a/text()')
            # print(text, tags, '    ------', author)

            # 将数据放入容器中,便于保存数据
            item = MyScrapyItem()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            # print(item)
            # 通过迭代的方式将字典中的每一条数据交给Pipeline
            yield item

        """
        self.page += 1
        # 注意:需要控制翻页的结束
        if self.page < 11:
            # 构造下一个请求:(方法一)
            # yield scrapy.Request(self.base_url.format(self.page), callback=self.parse)

            # 构造下一个请求:(方法二)
            # 该方法是2.0版本之后出现的    拼接请求,进行回调
            yield from response.follow_all(response.css('.pager .next a::attr("href")'), callback=self.parse)
        """
        """
        # 原本定义的翻页 *** 作:
        next = response.css('ul.pager li.next a::attr("href")').get()   # 获得“下一页”a标签按钮上面href属性值
        print(next)  # /page/2/
        url = self.start_urls[0]   # 获取当前爬取的url(方法一)
        # print(url)
        url = response.url    # 获取当前爬取的url(方法二)
        # print(url)
        # 拼接形成下一页的url
        url = response.urljoin(next)
        print(url)
        # 将请求交给调度器,重新构造下一个请求
        yield scrapy.Request(url, callback=self.parse)   # 对于新的请求,仍然执行parse()解析方法
        """
3. 修改start.py文件保存数据
# 使用Scrapy框架的该爬虫是一个项目,不能在爬虫文件中用右键运行,需要在终端里面输入scrapy crawl 爬虫文件名“的命令进行启动。
# 如果不想用终端输入命令的启动方式,可以创建该“start.py”文件
from scrapy import cmdline

# cmdline.execute('scrapy crawl spider1'.split())   # 调用终端命令
# cmdline.execute('scrapy crawl spider1 -o demo.csv'.split())
# cmdline.execute('scrapy crawl spider2'.split())   # 调用终端命令
cmdline.execute('scrapy crawl spider3'.split())   # 调用终端命令

# 红色的不是报错,而是爬虫的初始化信息。白色的内容就是输出内容

5. Scrapy框架-案例2 1. 分析网站
  1. 目标网站:腾讯招聘网站
  2. 目标:
    1. 爬取招聘岗位信息
    2. 翻页
      虚假的url:https://talent.antgroup.com/off-campus
  3. 数据加载方式:动态和静态
    抓包获取的含有数据的data-url:
    第1页:
    https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1649127078399&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
    第2页:
    https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1649127078399&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=&keyword=&pageIndex=2&pageSize=10&language=zh-cn&area=cn
    详情页:
    url:https://careers.tencent.com/jobdesc.html?postId=1310124481703845888
    data-url:https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1649156817199&postId=1310124481703845888&language=zh-cn
  4. 爬取思路:
    1. 第一页url
    2. 解析第一页上每个岗位对应postid
    3. 构造url
2. 实现步骤
  1. 创建项目
scrapy startproject tencent
  1. 创建爬虫程序
cd tencent
scrapy gensipider spider1 tencent.com

运行结果:

C:\Users\lv\Desktop\scrapy框架的学习>scrapy startproject tencent
New Scrapy project 'tencent', using template directory 'd:\anaconda\lib\site-packages\scrapy\templates\project', created in: C:\Users\lv\Desktop\scrapy框架的学习\tencent

You can start your first spider with:
    cd tencent
    scrapy genspider example example.com

C:\Users\lv\Desktop\scrapy框架的学习>cd tencent

C:\Users\lv\Desktop\scrapy框架的学习\tencent>scrapy genspider spider1 tencent.com
Created spider 'spider1' using template 'basic' in module:
  tencent.spiders.spider1

C:\Users\lv\Desktop\scrapy框架的学习\tencent>
  1. 用PyCharm打开tencent项目:
  2. 在命令行使用如下命令生成一个spider1.py文件:
scrapy genspider spider1 tencent.com
  1. 编辑spider1.py文件如下:


  1. 打开settings.py文件下的pipelindes的注释:
补充一:Spider类的使用 1. Spider的运行流程
  1. 定义爬取网站的逻辑
  2. 分析爬取下来的页面
2. Spider类的分析
  • name:设置爬虫名称。
  • allowed_domains:允许访问的域名,防止爬虫爬到其他网址去。
  • start_urls:请求的url列表。
  • custom_settings:一个字典,专属于本spider的配置,这个配置会覆盖项目的全局配置,这个配置必须
  • crawler:该属性由from_crawler()方法设置,代表spider对应的爬虫对象。可以通过该属性来获取项目的配置信息。
  • closed:当前spider关闭时,方法会被调用,释放一些资源。
补充二:Request对象 1. 介绍
  • Request对象是在构造新的请求时需要用到的scrapy的一个对象。
    例如:
yield scrapy.Request(url=detail_url, callback=self.parse_detail)
2. 参数说明
  • url:新请求的url地址。该url会被放入队列中。
  • callback:回调的解析数据的函数。
  • priority:请求的优先级。(自定义队列中哪个url需要先被请求。)默认是0,调度器进行request调度时使用它。数值越大,越优先被调度执行。
  • method:请求方式,默认是“GET”。
  • dont_filter:是否需要重复请求,默认为False。
  • errback:设置请求发生错误后的处理方法,默认为None。(很少用到)
    例如:
    def parse(self, response):
    	...
        yield scrapy.Request(url=detail_url, callback=self.parse_detail, errback=self.func)

    def func(self):
        print("请求出现错误后执行的方法")
  • body:request内容。
  • headers:请求头。
  • cookies
  • meta:通过response携带参数进行传递。相当于额外附加的信息。
    例如:
    def parse(self, response):
        # 解析数据(由于获取到的不是网页源代码,而是数据包,是字典或json类型的)
        data = json.loads(response.text)
        for job in data['Data']['Posts']:
            item = TencentItem()

            post_id = job['PostId']
            # print(post_id)
            item['job_name'] = job['RecruitPostName']

            # 构建详情页url
            detail_url = self.two_url.format(post_id)
            print(detail_url)

            # 构造请求:
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={"item": item})

    # 解析详情页面的数据
    def parse_detail(self, response):
        item = response.meta.get('item')
        print(item)
  • encoding:编码格式,默认“utf-8”。
  • cb_kwargs:设置回调方法需要额外携带的参数,可以通过字典传递。
    例如:
	def parse(self, response):
			...
            # 构造请求:
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, cb_kwargs={"num": 1})

    # 解析详情页面的数据
    def parse_detail(self, response, num):
        print(num)
补充三:CSS选择器

"""
解析工具:
    1. 正则表达式                效率高      语法难记
    2. xpath                   效率中等    语法中等
    3. BS4(bs语法和css选择器)     效率低     语法最简单
"""
from bs4 import BeautifulSoup
# 推荐一个第三方库: parsel
import parsel   # 内置了正则、xpath、css三种选择器


html = """
The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a wel.

...

"""
# 一、通过BeautifulSoup模块使用css选择器: # 解析 # lxml是第三方的解析器,比起默认的html.parser解析器速度快很多 soup = BeautifulSoup(html, features="lxml") # BeautifulSoup会自动补全不完整的html(例如加上、等) # print(soup) # 1. 通过标签名称进行查找 a_tags = soup.select('a') print(a_tags) # 2. 通过类名称进行查找 sister_class = soup.select('.sister') print(sister_class) # 3. 通过id名进行查找 link1_id = soup.select("#link1") print(link1_id) # 4. 组合查找 a_link2 = soup.select("p #link2") print(a_link2) a_link2 = soup.select("p > #link2") # > 代表直接的下一级 print(a_link2) p_sister_class = soup.select("p > .sister") print(p_sister_class) # 同一个标签的id和class不能同时用 # p_sister_class_id = soup.select("p > .sister#link1") # print(p_sister_class_id) # 5. 通过属性查找 a_href = soup.select('a[href="http://example.com/elsie"]') print(a_href) # 6. 获取标签内的文本内容 text1 = soup.select('title')[0].get_text() print(text1) # 7. 获取标签属性的值(如获取href属性的值) href = soup.select('a#link1')[0]['href'] print(href) print("---"*20) # 二、通过parsel模块使用CSS选择器: selector = parsel.Selector(html) # 创建选择器对象 # selector.re() # selector.xpath() # selector.css() # 1. 通过标签名查找 object_list = selector.css("a") print(object_list.getall()) # getall()方法获取全部满足的结果 # for item in object_list: # print(item.get()) # 2. 通过类名称进行查找 print(selector.css('.sister').get()) # get()方法获取第一个满足条件的结果 print(selector.css('.sister').getall()) # 3. 通过id名进行查找 print(selector.css('#link1').getall()) # 4. 组合查找 print(selector.css('p.story a#link2').getall()) # 5. 通过属性查找 print(selector.css('.story').get()) # 6. 获取标签内的文本内容 print(selector.css('p > #link1::text').get()) # 7. 获取标签属性的值(如获取href属性的值) print(selector.css('p > #link1::attr(href)').get()) # 8. 伪类选择器 print(selector.css('a').getall()[1]) print(selector.css('a:nth-child(1)').getall()) # 选择第几个

欢迎分享,转载请注明来源:内存溢出

原文地址:https://54852.com/langs/917617.html

(0)
打赏 微信扫一扫微信扫一扫 支付宝扫一扫支付宝扫一扫
上一篇 2022-05-16
下一篇2022-05-16

发表评论

登录后才能评论

评论列表(0条)

    保存