
Method 1: BeautifulSoup version
I wrote a quick version that only scrapes links; when I added title extraction it kept throwing errors and I haven't found the cause yet, so I'm pasting it here as-is (Method 2 works fine).
from BeautifulSoup import BeautifulSoup
import urllib2
import re

def grabHref(url, localfile):
    html = urllib2.urlopen(url).read()
    html = unicode(html, 'gb2312', 'ignore').encode('utf-8', 'ignore')
    content = BeautifulSoup(html).findAll('a')
    myfile = open(localfile, 'w')
    pat = re.compile(r'href="([^"]*)"')
    pat2 = re.compile(r'/tools/')
    for item in content:
        h = pat.search(str(item))
        if h is None:  # guard added: some <a> tags have no href
            continue
        href = h.group(1)
        if pat2.search(href):
            # s = BeautifulSoup(item)
            # myfile.write(s.a.string)
            # myfile.write('\r\n')
            myfile.write(href)
            myfile.write('\r\n')
    myfile.close()

def main():
    url = "..."  # target URL truncated in the original post
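A likely cause of the title errors mentioned above: `s.a.string` is `None` whenever the anchor wraps nested tags (for example an `<img>`), and writing `None` to a file raises an exception. Below is a minimal Python 3 sketch, using only the standard-library `html.parser` (the class name `LinkTitleParser` is invented for this example), that collects href/anchor-text pairs and tolerates text-less anchors:

```python
from html.parser import HTMLParser

class LinkTitleParser(HTMLParser):
    """Collect (href, anchor text) pairs, tolerating anchors with no text."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            # An anchor wrapping only an <img> yields '' here, never None.
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

parser = LinkTitleParser()
parser.feed('<a href="/tools/x">Tool X</a><a href="/"><img src="logo.png"></a>')
print(parser.links)  # [('/tools/x', 'Tool X'), ('/', '')]
```

Because the text defaults to an empty string rather than `None`, the write calls in Method 1 would no longer fail on image-only anchors.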
Method 2: taking Baidu as an example

# -*- coding: utf-8 -*-
import requests
import urlparse
import os
from bs4 import BeautifulSoup

def process(url):
    headers = {'content-type': 'application/json',
               'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}
    pageSourse = requests.get(url, headers=headers).text
    page_soup = BeautifulSoup(pageSourse)
    a_all = page_soup.findAll("a")
    link_urls = [i.get('href') for i in a_all]  # some hrefs are JavaScript event triggers; write your own filter
    img_all = page_soup.findAll("img")
    img_urls = [i.get("src") for i in img_all]
    print link_urls, img_urls
    return (link_urls, img_urls)

process("")  # the Baidu URL was stripped from the original post
The output looks like:

[u'/', u'javascript:;', u'javascript:;', u'javascript:;', u'/', u'javascript:;', u'']

Feel free to point out any problems.
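The comment in Method 2 leaves filtering of the javascript: pseudo-links to the reader. A minimal sketch (the helper name `filter_links` is my own) that drops empty hrefs, bare fragments, and javascript: triggers:

```python
def filter_links(hrefs):
    """Drop None/empty hrefs, bare fragments, and javascript: pseudo-links."""
    cleaned = []
    for h in hrefs:
        if not h:                      # None or '' (e.g. <a> without href)
            continue
        h = h.strip()
        if h.startswith('javascript:') or h.startswith('#'):
            continue
        cleaned.append(h)
    return cleaned

print(filter_links([u'/', u'javascript:;', u'javascript:;', u'/', u'']))
# ['/', '/']
```

Applied to the output above, only the real paths survive; relative ones can then be resolved against the page URL with urlparse.urljoin.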
That is all of the content above on extracting links and titles from web pages with Python, covering: extracting links and titles with Python, crawling pages with the requests-html parsing library, and questions about parsing page-source tags with the beautifulsoup4 library. If you want to learn more, you can follow us; your support is what keeps us updating!
Sharing is welcome; please credit the source when reposting: 内存溢出