A Beginner-Friendly Python Web Scraping Tutorial: Using urllib and Parsing Pages

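Using urllib to fetch the source of the Baidu homepage

If a GET request parameter contains Chinese characters, they must be percent-encoded first, for example with urllib.parse.quote. Without encoding, the request fails with an error.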
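
A minimal sketch of the encoding step. The keyword and the Baidu-style `wd` search parameter are illustrative assumptions, not taken from the original text.

```python
from urllib.parse import quote, unquote

# Illustrative search keyword; any non-ASCII text is handled the same way.
keyword = '周杰伦'

# quote() percent-encodes the UTF-8 bytes of the string.
encoded = quote(keyword)

# Assumed URL pattern: a search URL carrying the keyword in the "wd" parameter.
url = 'https://www.baidu.com/s?wd=' + encoded

print(encoded)
print(url)

# unquote() reverses the encoding.
print(unquote(encoded))
```

The resulting URL contains only ASCII characters, so it can be passed to urlopen() or Request() safely.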

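When to use urlencode: when the request has several parameters. It takes a dict and encodes all keys and values in one call.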
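
A short sketch of urlencode with more than one parameter. The parameter names and values are hypothetical and chosen only to show the encoding.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical query parameters; the names are for illustration only.
params = {
    'wd': '周杰伦',
    'sex': '男',
}

# urlencode() percent-encodes every key and value and joins the pairs with '&'.
query = urlencode(params)
url = 'https://www.baidu.com/s?' + query

print(query)
print(url)

# parse_qs() decodes the query string back into a dict of lists.
print(parse_qs(query))
```

Compared with calling quote() on each value and concatenating by hand, urlencode keeps the code short as the number of parameters grows.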

Why learn handlers?

Why use a proxy? Some websites block crawlers, and crawling from your real IP address makes it likely that the address gets banned.
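
As background, urlopen() alone cannot customize a request, Request adds custom headers, and handlers go further by covering things such as proxies and cookies. The sketch below shows one way to build an opener around ProxyHandler. The helper name `build_proxy_opener` and the proxy address are placeholders introduced for this example.

```python
from urllib import request


def build_proxy_opener(proxy_address):
    """Return an opener that routes http and https requests through a proxy."""
    # ProxyHandler takes a mapping of URL scheme -> proxy URL.
    handler = request.ProxyHandler({
        'http': proxy_address,
        'https': proxy_address,
    })
    # build_opener() chains this handler with urllib's default handlers.
    return request.build_opener(handler)


# Placeholder address: replace it with a proxy you are allowed to use.
opener = build_proxy_opener('http://127.0.0.1:8888')

req = request.Request(
    'http://www.baidu.com/',
    headers={'User-Agent': 'Mozilla/5.0'},
)

# Sending the request needs a reachable proxy, so it is left commented out:
# response = opener.open(req, timeout=10)
# print(response.status)
```

The flow is always the same three steps: create a handler, build an opener from it, then call opener.open() instead of urlopen().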

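2. Parsing techniques

1. Install the lxml library

2. Import lxml.etree

3. etree.parse(): parse a local file

4. etree.HTML(): parse a server response

5. Query the parsed tree for DOM elements (XPath syntax)

1. Path queries

2. Predicate queries

3. Attribute queries

4. Fuzzy queries

5. Content queries

6. Logical operators

Example: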
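
A minimal sketch of the six query types, assuming lxml is installed (pip install lxml). The HTML snippet, its ids, and its class names are made up for illustration.

```python
from lxml import etree

# Hypothetical page content used only for this demonstration.
html = """
<html><body>
  <ul>
    <li id="l1" class="c1">Beijing</li>
    <li id="l2">Shanghai</li>
    <li id="c3">Shenzhen</li>
    <li>Wuhan</li>
  </ul>
</body></html>
"""

# etree.HTML() parses a string, e.g. the decoded body of a server response.
# For a local file, etree.parse(path, etree.HTMLParser()) plays the same role.
tree = etree.HTML(html)

# 1. Path query: // matches descendants at any depth, / matches direct children.
all_items = tree.xpath('//ul/li/text()')

# 2. Predicate query: keep only <li> elements that have an id attribute.
with_id = tree.xpath('//li[@id]/text()')

# 3. Attribute query: read the class attribute of the <li> whose id is "l1".
classes = tree.xpath('//li[@id="l1"]/@class')

# 4. Fuzzy query: match on part of an attribute value.
contains_l = tree.xpath('//li[contains(@id, "l")]/text()')
starts_c = tree.xpath('//li[starts-with(@id, "c")]/text()')

# 5. Content query: text() returns the text inside the matched elements.
first_text = tree.xpath('//li[1]/text()')

# 6. Logical operators: "and" inside a predicate, "|" to union two paths.
both = tree.xpath('//li[@id="l1" and @class="c1"]/text()')
either = tree.xpath('//li[@id="l1"]/text() | //li[@id="l2"]/text()')

print(all_items, with_id, classes, contains_l, starts_c, first_text, both, either)
```

xpath() always returns a list, even when a single node matches, so index into the result before using it.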

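jsonpath works on JSON data that has already been loaded into Python objects, typically from a local file. A server response must be parsed first, for example with json.loads.

Install with pip: pip install jsonpath

Using jsonpath:

Example:

Parse the JSON data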

Drawback: BeautifulSoup is slower than lxml.

Advantage: its API is user-friendly and easy to use.

pip install bs4 -i https://pypi.douban.com/simple

from bs4 import BeautifulSoup

1. Find nodes by tag name

soup.a.attrs

2. Functions

find('a'): returns only the first a tag

find('a', title='name')

find('a', class_='name')

find_all('a'): returns all a tags

find_all(['a', 'span']): returns all a and span tags

find_all('a', limit=2): returns only the first two a tags

obj.string

obj.get_text() (recommended)

tag.name: the tag name

tag.attrs: the attributes as a dict

obj.attrs.get('title') (commonly used)

obj.get('title')

obj['title']

Example:

Parse the HTML with BeautifulSoup
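
A minimal sketch of the calls listed above, assuming bs4 is installed. The markup is invented for illustration, and the built-in "html.parser" is used here so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this demonstration.
html = """
<div>
  <a href="/a1" title="first" class="nav">Link one</a>
  <a href="/a2" title="second">Link <b>two</b></a>
  <span>Plain text</span>
  <a href="/a3">Link three</a>
</div>
"""

# "html.parser" ships with Python; "lxml" can be passed instead if it is installed.
soup = BeautifulSoup(html, 'html.parser')

# Tag-name access returns only the first matching tag.
first_a = soup.a
print(first_a.name, first_a.attrs)

# find() returns the first match; keyword arguments filter by attribute.
by_title = soup.find('a', title='second')
by_class = soup.find('a', class_='nav')

# find_all() returns every match; a list matches several tag names.
all_links = soup.find_all('a')
links_and_spans = soup.find_all(['a', 'span'])
first_two = soup.find_all('a', limit=2)

# .string is None when a tag has more than one child; get_text() joins all text.
print(by_title.string)
print(by_title.get_text())

# Three ways to read an attribute; only the last raises KeyError if it is missing.
print(first_a.attrs.get('title'), first_a.get('title'), first_a['title'])
```

The second link shows why get_text() is the recommended accessor: it contains a nested b tag, so .string gives None while get_text() still returns the full text.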

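from urllib import request
import ssl

url = 'http://www.baidu.com/'

"""
Parameters of request.urlopen():
url: the target URL
data=None: None (the default) sends a GET request; bytes data sends a POST request
timeout: request timeout in seconds
cafile=None, capath=None, cadefault=False: legacy certificate options, deprecated since Python 3.6 in favor of context
context=None: an ssl.SSLContext; passing an unverified context skips certificate verification
"""

# Create a context that skips certificate verification (only matters for https URLs)
context = ssl._create_unverified_context()
response = request.urlopen(url, timeout=10, context=context)

# HTTP status code
code = response.status
print(code)

# Response body as bytes
b_html = response.read()
print(type(b_html), len(b_html))

# All response headers
res_headers = response.getheaders()
print(res_headers)

# A single response header
cookie_data = response.getheader('Set-Cookie')
print(cookie_data)

# Reason phrase, e.g. OK
reason = response.reason
print(reason)

# Decode the bytes into a str
str_html = b_html.decode('utf-8')
print(type(str_html))

# The file is opened in text mode, so write the decoded str, not the raw bytes
with open('b_baidu.page.html', 'w', encoding='utf-8') as file:
    # file.write(b_html)
    file.write(str_html)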

"""

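Parameters of request.Request():
url: the URL to request
data=None: None (the default) sends a GET request; bytes data sends a POST request
headers={}: request headers, given as a dict
origin_req_host=None: the host of the origin transaction, i.e. the request the user originally initiated
unverifiable=False: whether the request is unverifiable as defined by RFC 2965 (the user had no option to approve the URL); it is unrelated to SSL verification
method=None: the HTTP method to use, such as 'GET' or 'POST'
"""

req_header = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}

# Wrap the URL and headers in a Request object, then send it
req = request.Request(url, headers=req_header)
response = request.urlopen(req)

# The response object offers the same accessors as before
response.status
response.read()
response.getheaders()
response.getheader('Server')
response.reason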
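
Since a non-None data argument turns the request into a POST, a short sketch of that case may help. The endpoint and form fields are hypothetical, and the actual send is commented out for that reason.

```python
from urllib import request
from urllib.parse import urlencode

# Hypothetical endpoint and form fields, for illustration only.
url = 'http://www.example.com/login'
form = {'user': 'alice', 'kw': '爬虫'}

# POST data must be bytes: urlencode() builds the string, encode() converts it.
data = urlencode(form).encode('utf-8')

req = request.Request(
    url,
    data=data,
    headers={'User-Agent': 'Mozilla/5.0'},
)

# With data set and no explicit method, the request is sent as POST.
print(req.get_method())

# Sending is left commented out because the endpoint above is a placeholder:
# response = request.urlopen(req, timeout=10)
# print(response.read().decode('utf-8'))
```

Passing the str returned by urlencode() directly as data raises a TypeError when the request is sent, which is why the encode() step is needed.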

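In Python 2, str and bytes data are not clearly distinguished.

In Python 3, str and bytes data are clearly distinguished.

To convert bytes to a str, use decode('encoding').

To convert a str to bytes, use encode('encoding').

bytearray and bytes differ: the former is mutable, the latter is immutable.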
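
The conversions and the mutability difference can be checked with a few lines. The sample strings are arbitrary.

```python
# str -> bytes with encode(); bytes -> str with decode().
text = '爬虫'
raw = text.encode('utf-8')
print(type(raw), len(raw))

restored = raw.decode('utf-8')
print(type(restored), restored)

# bytes is immutable: item assignment raises TypeError.
try:
    raw[0] = 0
    bytes_mutable = True
except TypeError:
    bytes_mutable = False
print(bytes_mutable)

# bytearray holds the same kind of data but can be modified in place.
buf = bytearray(b'abc')
buf[0] = ord('x')
buf.extend(b'de')
print(buf)
```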


Original article: https://54852.com/bake/11736913.html
