Python multithreaded crawler for scraping novel content from 顶点小说 (BeautifulSoup + urllib)

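I previously wrote a Python crawler for novels on 起点中文网. For the multithreaded version, first store the scraped chapter links in a list, then write a function get_text that takes one chapter link per call. Fetching n chapters therefore takes n calls to that function, so a for loop can create n threads, each with get_text as its target and a chapter URL as its argument.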
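The approach described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual code: the extra `results` and `fetch` parameters of get_text, the helper crawl_chapters, and the stand-in fake_fetch are assumptions added so the example runs without network access. In a real crawler, `fetch` would download and parse the chapter page.

```python
import threading


def get_text(url, results, fetch):
    """Download one chapter and store its text under its URL."""
    results[url] = fetch(url)


def crawl_chapters(chapter_urls, fetch):
    """Start one thread per chapter URL, then wait for all of them."""
    results = {}
    threads = []
    for url in chapter_urls:
        # target is get_text; the chapter URL is passed as an argument
        t = threading.Thread(target=get_text, args=(url, results, fetch))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return results


def fake_fetch(url):
    # Hypothetical stand-in for a real download of the chapter page.
    return "text of " + url


chapters = crawl_chapters(["ch1.html", "ch2.html", "ch3.html"], fake_fetch)
print(len(chapters))
```

Each thread writes to its own key in `results`, so no lock is used in this sketch.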

I opened one at random; it is a bit of an eyesore, haha.

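My impression is that multithreading did not bring a large speedup: throughput is roughly 20 txt files per minute. Is there a way to push crawl speed further on a single machine?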
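One option that is often tried on a single machine is a fixed-size thread pool instead of one thread per chapter. The sketch below is only an illustration under assumptions: crawl_with_pool, fake_fetch, and the max_workers value are hypothetical, and whether this actually beats 20 files per minute depends on where the bottleneck is (network latency, server-side throttling, parsing), which was not measured here.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_with_pool(chapter_urls, fetch, max_workers=16):
    """Fetch all URLs with a bounded number of worker threads.

    pool.map returns results in the same order as chapter_urls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, chapter_urls))


def fake_fetch(url):
    # Hypothetical stand-in for a real download.
    return url.upper()


pages = crawl_with_pool(["a", "b", "c"], fake_fetch, max_workers=2)
print(pages)
```

Varying max_workers and timing the run is a simple way to check whether thread count is the limiting factor at all.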

Next I plan to try some crawling that is likely to get my IP banned, and then study distributed crawlers. Keep going~

You can use urllib:

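import urllib.request

# Replace "website address" with the URL of the page to scan.
response = urllib.request.urlopen("website address")
# read() returns bytes in Python 3, so decode before searching
page = response.read().decode("utf-8", errors="ignore")

pos = page.find('<a href="')
while pos != -1:
    # skip past the 9 characters of '<a href="'
    page = page[pos + 9:]
    # the link ends at the next double quote
    lim = page.find('"')
    print("You've found a link: %s" % page[:lim])
    pos = page.find('<a href="')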
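The string search above only matches tags that literally begin with `<a href="`, so it misses links where another attribute comes first or where single quotes are used. As a hedged alternative, here is a small sketch using the standard library's html.parser (swapped in here instead of BeautifulSoup so the example has no third-party dependency). The names LinkCollector and extract_links and the sample HTML are illustrative assumptions.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


sample = (
    '<p><a href="/ch1.html">1</a> '
    "<a class=\"x\" href='/ch2.html'>2</a> "
    "<a>no link</a></p>"
)
links = extract_links(sample)
print(links)
```

To use it on a live page, pass the decoded page text from urllib.request.urlopen to extract_links.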


Example: multithreaded Python crawling of image src attributes from a website (original)

2017-05-16 11:18:51

pergoods


# coding=utf-8
'''
Created on 2017-05-16
@author: chenkai
Multithreaded Python crawl of image addresses from the 无聊图 section of a certain site
(requests + BeautifulSoup + threading + queue modules)
'''
import queue
import threading
import time

import requests
from bs4 import BeautifulSoup


class Spider_Test(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.__queue = queue

    def run(self):
        while not self.__queue.empty():
            page_url = self.__queue.get()  # take a URL from the queue
            print(page_url)
            self.spider(page_url)

    def spider(self, url):
        r = requests.get(url)  # request the URL
        # r.content is the response body; parse it with lxml into a BeautifulSoup object
        soup = BeautifulSoup(r.content, 'lxml')
        # find all img tags and get their attribute values (returned as a list)
        imgs = soup.find_all(name='img', attrs={})
        for img in imgs:
            # img tags whose attributes include onload are animated GIFs
            if 'onload' in str(img):
                # The source listing breaks off here, mid-statement: print '>
                pass
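Since the listing is cut off, the following is only a sketch of the same queue-plus-worker-threads pattern, not a reconstruction of the author's missing code. It makes several labeled assumptions: the standard library's html.parser stands in for requests + BeautifulSoup so the example is self-contained, the download step is an injected `fetch` function, and ImgSrcCollector, worker, crawl_images, and fake_fetch are hypothetical names. It also uses get_nowait() rather than checking empty() before get(), because another thread can take the last item between those two calls and leave get() blocking.

```python
import queue
import threading
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


def worker(url_queue, fetch, found, lock):
    """Pull URLs until the queue is empty, collecting image src values."""
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        parser = ImgSrcCollector()
        parser.feed(fetch(url))
        with lock:
            found.extend(parser.srcs)


def crawl_images(urls, fetch, num_threads=4):
    url_queue = queue.Queue()
    for url in urls:
        url_queue.put(url)
    found = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(url_queue, fetch, found, lock))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return found


def fake_fetch(url):
    # Hypothetical stand-in for downloading a page; returns HTML text.
    return (
        '<img src="%s/a.jpg"><img onload="x()" src="%s/b.gif"><img alt="none">'
        % (url, url)
    )


srcs = crawl_images(["p1", "p2"], fake_fetch, num_threads=2)
print(sorted(srcs))
```

The order of `srcs` depends on thread scheduling, which is why the usage line sorts it before printing.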


Feel free to share; when reposting, please credit the source: 内存溢出

Original URL: https://54852.com/web/9325406.html
