How do I handle this error when scraping images with a Python crawler?

雷曼大冒险 • 2023-4-4 • Notes • 21 views

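Hello! The error comes from the links taken from the src attribute of the img tags in the HTML page. The src values are probably written like this:

<img src="//hao123.com/xxx/xxx/xxx/"></img>

Links extracted this way lack a scheme (http or https), which is why the program raises a ValueError.

A well-formed URL looks like this: https://www.baidu.com/

That is: scheme://username:password@subdomain.domain.tld:port/directory/filename.extension?parameter=value#fragment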
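To see the missing scheme concretely, here is a minimal sketch using only the standard library. It assumes urllib as the HTTP layer, since the original question's download code is not shown; the helper names `has_scheme` and `try_build_request` are hypothetical. The request object is only constructed, never sent, so nothing touches the network.

```python
from urllib.parse import urlsplit
from urllib.request import Request


def has_scheme(url: str) -> bool:
    """Return True when the URL carries an explicit scheme such as http or https."""
    return bool(urlsplit(url).scheme)


def try_build_request(url: str) -> str:
    """Build (but do not send) a urllib request and report whether the URL is accepted."""
    try:
        Request(url)
    except ValueError as exc:
        return f"ValueError: {exc}"
    return "ok"


if __name__ == "__main__":
    for candidate in ("//hao123.com/xxx/xxx/xxx/", "https://www.baidu.com/"):
        parts = urlsplit(candidate)
        # For the protocol-relative link, scheme is empty while the host is still parsed.
        print(candidate, "scheme =", repr(parts.scheme), "host =", repr(parts.netloc))
        print("  ->", try_build_request(candidate))
```

The protocol-relative link still parses into a host and a path; only the scheme component is empty, and that is what urllib rejects.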


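You can change the download_links.append call in the first for loop of your code to:

for pic_tag in soup.find_all('img'):

    pic_link = pic_tag.get('src')

    # Assumes every src starts with "//"; a link that already has a scheme would be broken by the prefix.
    download_links.append('http:' + pic_link)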
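If the page mixes protocol-relative, relative, and absolute links, prepending 'http:' to everything is fragile. A more defensive variant, sketched here as an assumption rather than as part of the original answer, resolves each src against the address of the page it came from with `urllib.parse.urljoin`. The function name `normalize_image_links` and the example page URL are hypothetical.

```python
from urllib.parse import urljoin


def normalize_image_links(raw_links, page_url):
    """Turn src values into absolute URLs, resolved against the page they were found on."""
    links = []
    for src in raw_links:
        src = (src or "").strip()
        # Skip img tags without a src and inline data: URIs, which are not downloadable links.
        if not src or src.startswith("data:"):
            continue
        # urljoin fills in the scheme for "//host/path", resolves relative paths,
        # and leaves URLs that are already absolute unchanged.
        links.append(urljoin(page_url, src))
    return links


if __name__ == "__main__":
    page = "https://www.hao123.com/index.html"  # hypothetical page address
    found = ["//hao123.com/a/b.png", "images/logo.png", "http://example.com/x.jpg", None]
    for link in normalize_image_links(found, page):
        print(link)
```

In the loop from the answer, this would mean collecting `pic_tag.get('src')` values first and normalizing them in one pass before downloading.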

import re

from bs4 import BeautifulSoup


def check_flag(flag):
    # re.match anchors at the start, so only src values that begin with "images/" pass.
    regex = re.compile(r'images/')
    return regex.match(flag) is not None


# Sample markup for experimenting; the script below reads favour-en.html instead.
html_content = '''
<a href="https://xxx.com">Test 01</a>
<a href="https://yyy.com/123">Test 02</a>
<a href="https://xxx.com">Test 01</a>
<a href="https://xxx.com">Test 01</a>
'''

# Parse the source page; the with-block closes the file once parsing is done.
with open(r'favour-en.html', 'r', encoding="UTF-8") as file:
    soup = BeautifulSoup(file, 'html.parser')

for element in soup.find_all('img'):
    if 'src' in element.attrs:
        print(element.attrs['src'])
        if check_flag(element.attrs['src']):
            # Alternative filter: if 'png' in element.attrs['src']:
            element.attrs['src'] = "michenxxxxxxxxxxxx" + '/' + element.attrs['src']

print("##################################")

# Write the modified document out as index.html.
with open('index.html', 'w', encoding="UTF-8") as fp:
    fp.write(soup.prettify())  # prettify() re-indents the parsed tree so the output is readable
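One detail of check_flag is easy to miss: `re.match` only succeeds at the very start of the string, so a src such as "static/images/a.png" is left untouched. The short sketch below contrasts it with `re.search`; the helper names and sample paths are made up for illustration.

```python
import re

IMAGES_PATTERN = re.compile(r"images/")


def starts_with_images(src: str) -> bool:
    """Same test as check_flag: the pattern must match at position 0."""
    return IMAGES_PATTERN.match(src) is not None


def contains_images(src: str) -> bool:
    """Looser test: the pattern may match anywhere in the string."""
    return IMAGES_PATTERN.search(src) is not None


if __name__ == "__main__":
    for sample in ("images/a.png", "static/images/a.png", "/images/a.png"):
        print(sample, starts_with_images(sample), contains_images(sample))
```

Which of the two is appropriate depends on how the paths in the page are written; the script above uses the stricter prefix test.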

Feel free to share; when reposting, please credit the source: 内存溢出

Original URL: https://54852.com/zaji/7330562.html
