
- URL: http://www.ci123.com/baike/nbnc/31
Output: a single table (Excel or a database) with three fields: category, title, and HTML rich text.
- The crawler code:
import requests
from bs4 import BeautifulSoup
import xlwt

url = 'http://www.ci123.com/baike/nbnc/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'
}
resp = requests.get(url, headers=headers)
main_page = BeautifulSoup(resp.text, "html.parser")  # specify the HTML parser

# Collect each category name and its link from the category list
# (class_ is used because "class" is a Python keyword)
aList = main_page.find("dl", class_="catagory").find_all('a')
cateSrc = {}
for a in aList:
    cateSrc[a.string] = a.get('href')

# Prepare the output workbook with a header row
book = xlwt.Workbook(encoding='utf-8')
worksheet = book.add_sheet('sheet')
worksheet.write(0, 0, "No.")
worksheet.write(0, 1, "Category")
worksheet.write(0, 2, "Title")
worksheet.write(0, 3, "HTML rich text")
index = 1

for k, v in cateSrc.items():
    resp1 = requests.get(v, headers=headers)
    main_page1 = BeautifulSoup(resp1.text, "html.parser")
    aCate = main_page1.find('ul', class_="food-list").find_all('div', class_="detail")
    # Collect each title in this category and the link behind it,
    # in preparation for scraping the content of each detail page.
    # Reset the dict per category so earlier entries are not re-scraped.
    cateTitleSrc = {}
    for cate in aCate:
        cateTitleSrc[cate.find('a').string] = cate.find('a').get('href')
    for k1, v1 in cateTitleSrc.items():
        resp1_1 = requests.get(v1, headers=headers)
        main_page1_1 = BeautifulSoup(resp1_1.text, "html.parser")
        # Scrape the HTML rich-text content of the detail page
        cateHtml = main_page1_1.find('div', class_="container")
        print("Category: " + k, "Title: " + k1)
        worksheet.write(index, 0, index)
        worksheet.write(index, 1, k)
        worksheet.write(index, 2, k1)
        worksheet.write(index, 3, str(cateHtml))
        index += 1

book.save('test1.xls')
- The code above could be improved by wrapping the "fetch a URL and parse its text" step into a single reusable method.
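A minimal sketch of such a helper, assuming the same requests/BeautifulSoup stack as above; the name `fetch_soup`, the `timeout` value, and the encoding fix-up are illustrative choices, not part of the original code:

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'
}

def fetch_soup(url, headers=HEADERS, parser="html.parser"):
    """Fetch a URL and return the parsed BeautifulSoup document."""
    # timeout is a hypothetical choice; without it a hung request blocks forever
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error page
    resp.encoding = resp.apparent_encoding  # guard against a mis-detected charset
    return BeautifulSoup(resp.text, parser)
```

With this helper, every `requests.get(...)` + `BeautifulSoup(...)` pair in the crawler collapses to one call, e.g. `main_page = fetch_soup(url)` and `main_page1 = fetch_soup(v)`.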