
Method 1: BeautifulSoup version
I wrote a quick version that only scrapes links; when I added title extraction it kept throwing errors and I haven't found the cause yet, so I'm pasting it here as-is (Method 2 works fine).
from BeautifulSoup import BeautifulSoup
import urllib2
import re

def grabHref(url, localfile):
    html = urllib2.urlopen(url).read()
    html = unicode(html, 'gb2312', 'ignore').encode('utf-8', 'ignore')
    content = BeautifulSoup(html).findAll('a')
    myfile = open(localfile, 'w')
    pat = re.compile(r'href="([^"]*)"')
    pat2 = re.compile(r'/tools/')
    for item in content:
        h = pat.search(str(item))
        if h is None:  # guard added: some <a> tags have no href
            continue
        href = h.group(1)
        if pat2.search(href):
            # s = BeautifulSoup(item)
            # myfile.write(s.a.string)
            # myfile.write('\r\n')
            myfile.write(href)
            myfile.write('\r\n')
    myfile.close()

def main():
    url = "..."  # target URL truncated in the original post
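A likely cause of the title errors mentioned above: `s.a.string` is `None` whenever the anchor wraps nested tags (for example an `<img>`), and writing `None` to a file raises an exception. Below is a minimal Python 3 sketch, using only the standard-library `html.parser` (the class name `LinkTitleParser` is invented for this example), that collects href/anchor-text pairs and tolerates text-less anchors:

```python
from html.parser import HTMLParser

class LinkTitleParser(HTMLParser):
    """Collect (href, anchor text) pairs, tolerating anchors with no text."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            # An anchor wrapping only an <img> yields '' here, never None.
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

parser = LinkTitleParser()
parser.feed('<a href="/tools/x">Tool X</a><a href="/"><img src="logo.png"></a>')
print(parser.links)  # [('/tools/x', 'Tool X'), ('/', '')]
```

Because the text defaults to an empty string rather than `None`, the write calls in Method 1 would no longer fail on image-only anchors.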
Method 2: taking Baidu as an example

# -*- coding: utf-8 -*-
import requests
import urlparse
import os
from bs4 import BeautifulSoup

def process(url):
    headers = {'content-type': 'application/json',
               'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}
    pageSourse = requests.get(url, headers=headers).text
    page_soup = BeautifulSoup(pageSourse)
    a_all = page_soup.findAll("a")
    link_urls = [i.get('href') for i in a_all]  # some hrefs are JavaScript event triggers; write your own filter
    img_all = page_soup.findAll("img")
    img_urls = [i.get("src") for i in img_all]
    print link_urls, img_urls
    return (link_urls, img_urls)

process("")  # the Baidu URL was stripped from the original post
The output looks like:

[u'/', u'javascript:;', u'javascript:;', u'javascript:;', u'/', u'javascript:;', u'']

Feel free to point out any problems.
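The comment in Method 2 leaves filtering of the javascript: pseudo-links to the reader. A minimal sketch (the helper name `filter_links` is my own) that drops empty hrefs, bare fragments, and javascript: triggers:

```python
def filter_links(hrefs):
    """Drop None/empty hrefs, bare fragments, and javascript: pseudo-links."""
    cleaned = []
    for h in hrefs:
        if not h:                      # None or '' (e.g. <a> without href)
            continue
        h = h.strip()
        if h.startswith('javascript:') or h.startswith('#'):
            continue
        cleaned.append(h)
    return cleaned

print(filter_links([u'/', u'javascript:;', u'javascript:;', u'/', u'']))
# ['/', '/']
```

Applied to the output above, only the real paths survive; relative ones can then be resolved against the page URL with urlparse.urljoin.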
That is all of the content above on extracting links and titles from web pages with Python, covering: extracting links and titles with Python, crawling pages with the requests-html parsing library, and questions about parsing page-source tags with the beautifulsoup4 library. If you want to learn more, you can follow us; your support is what keeps us updating!
Sharing is welcome; please credit the source when reposting: 内存溢出