How to read the content of a local website with Python

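The approach is as follows:

Open the page with the urllib2 library (Python 2; in Python 3 the equivalent is urllib.request), read its content, and then extract the data you need with regular expressions.

Here is a sample for reference: it scrapes the posts of a Baidu Tieba thread and saves them to a file.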

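# -*- coding: utf-8 -*-
# Python 2 code: urllib2 does not exist in Python 3.
import urllib2
import re

url = ''  # the thread URL is missing here; supply one before running

# Download the page and decode it (the page is GBK-encoded)
page = urllib2.urlopen(url).read().decode('gbk')

# Patterns for stripping tags and for locating the title and the post bodies
none_re = re.compile(r'<a href=.*?>|</a>|<img.*?>')
br_re = re.compile(r'<br>')
title_re = re.compile(r'<h1 class="core_title_txt " title="(.*?)"')
content_re = re.compile(r'<div id="post_content_\d*" class="d_post_content j_d_post_content ">(.*?)</div>')

title = re.search(title_re, page)
# Remove characters that are not allowed in file names
title = title.group(1).replace('\\', '').replace('/', '').replace(':', '').replace('*', '').replace('?', '').replace('"', '').replace('>', '').replace('<', '').replace('|', '')
content = re.findall(content_re, page)

# Write every post to "<title>.txt", one post per block
with open('%s.txt' % title, 'w') as f:
    for i in content:
        i = re.sub(none_re, '', i)   # drop links and images
        i = re.sub(br_re, '\n', i)   # turn <br> into newlines
        f.write(i.encode('utf-8').strip() + '\n')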
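The snippet above is written for Python 2. As a rough sketch of the same idea in Python 3, assuming the page markup still matches the patterns above: the function names `fetch`, `safe_filename`, `parse_thread` and `save_thread` are made up for illustration, and the `re.S` flag is an addition so that a post may span several lines.

```python
import re
import urllib.request

# Same patterns as in the Python 2 example above
NONE_RE = re.compile(r'<a href=.*?>|</a>|<img.*?>')
BR_RE = re.compile(r'<br>')
TITLE_RE = re.compile(r'<h1 class="core_title_txt " title="(.*?)"')
CONTENT_RE = re.compile(
    r'<div id="post_content_\d*" class="d_post_content j_d_post_content ">(.*?)</div>',
    re.S,  # assumption: allow a post body to contain newlines
)


def fetch(url, encoding="gbk"):
    """Download a page and decode it (the encoding is an assumption)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode(encoding, errors="replace")


def safe_filename(title):
    """Drop the characters \\ / : * ? " < > | from a title."""
    return re.sub(r'[\\/:*?"<>|]', '', title)


def parse_thread(page):
    """Return (title, [post texts]) extracted from the page source."""
    match = TITLE_RE.search(page)
    title = safe_filename(match.group(1)) if match else "untitled"
    posts = []
    for raw in CONTENT_RE.findall(page):
        text = NONE_RE.sub('', raw)   # drop links and images
        text = BR_RE.sub('\n', text)  # turn <br> into newlines
        posts.append(text.strip())
    return title, posts


def save_thread(url):
    """Fetch a thread and write its posts to '<title>.txt'."""
    title, posts = parse_thread(fetch(url))
    with open("%s.txt" % title, "w", encoding="utf-8") as f:
        for post in posts:
            f.write(post + "\n")
```

Keeping the parsing in `parse_thread` separate from the download makes it possible to try the regular expressions on a saved copy of the page before touching the network.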

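First, fetch the whole page that you want to download from.

getjpg.py

#coding=utf-8
# Python 2 code
import urllib

def getHtml(url):
    # Open the URL and return the raw page source
    page = urllib.urlopen(url)
    html = page.read()
    return html

url = ''  # the page URL is missing here; supply one before running
html = getHtml(url)
print html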

The urllib module provides an interface for reading web page data, so we can read data from the web much as we would read a local file.
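To illustrate the "like a local file" point, here is a minimal Python 3 sketch. It assumes `urllib.request` rather than the Python 2 `urllib` used above, and the helper names `read_url` and `demo_local_page` are hypothetical. The demo reads a page from disk through a `file://` URL, so it needs no network.

```python
import pathlib
import tempfile
import urllib.request


def read_url(url, encoding="utf-8"):
    """Read everything behind a URL, much like open(path).read()."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode(encoding)


def demo_local_page():
    """Write a small HTML file and read it back through a file:// URL."""
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "index.html"
        path.write_text("<html><body>hello</body></html>", encoding="utf-8")
        # as_uri() turns the absolute path into a file:// URL
        return read_url(path.as_uri())
```

The same `read_url` call works with an `http://` or `https://` address, which is what makes the file analogy useful.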

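I. Understanding web pages

A web page has three parts: HTML (structure), CSS (style), and JavaScript (behavior).

II. Getting started with scraping

1. Parse the page with BeautifulSoup: Soup = BeautifulSoup(html, 'lxml').

2. Use the browser's "Copy CSS selector" command to copy the location of a page element.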
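The two steps above can be sketched as follows. This is only an illustration: the sample HTML, the selectors and the function name `parse_listings` are invented, and the parser is `'html.parser'` (bundled with Python) instead of `'lxml'` so that the sketch needs only the `bs4` package.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a real listing page
SAMPLE_HTML = """
<div class="listing">
  <p class="title"><a href="/house/1.htm">Two-bedroom flat</a></p>
  <span class="price">300</span>
</div>
<div class="listing">
  <p class="title"><a href="/house/2.htm">Studio</a></p>
  <span class="price">150</span>
</div>
"""


def parse_listings(html):
    """Extract title, link and price with CSS selectors."""
    soup = BeautifulSoup(html, "html.parser")
    # Selectors like these are what the browser's copy command produces
    titles = soup.select("div.listing > p.title > a")
    prices = soup.select("div.listing > span.price")
    return [
        {
            "title": t.get_text(strip=True),
            "link": t.get("href"),
            "price": p.get_text(strip=True),
        }
        for t, p in zip(titles, prices)
    ]
```

A selector copied from the browser usually points at one specific element; trimming it down to the class names, as here, is what lets it match every listing on the page.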

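III. Scraping listings from Fang.com (房天下)

1. Import requests and BeautifulSoup.

2. Define a function spider_ftx that specifies all the information to scrape.

3. Call spider_ftx.

4. Page through the second-hand housing listings.

Each page shows at most 40 listings. Look at how the URL changes from page to page, then write a loop that calls the function for every page, so that all 100 pages are scraped.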
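The paging loop might look like the sketch below. The URL template is a placeholder, not the site's real pattern, and `page_urls` and `crawl_all` are hypothetical helpers. The per-page scraper (for example `spider_ftx`) is passed in as `spider` and is assumed to return a list of rows.

```python
def page_urls(template, last_page=100):
    """Yield one URL per page; `template` must contain a {page} placeholder."""
    for page in range(1, last_page + 1):
        yield template.format(page=page)


def crawl_all(template, spider, last_page=100):
    """Call `spider` on every page URL and collect the rows it returns."""
    rows = []
    for url in page_urls(template, last_page):
        rows.extend(spider(url))
    return rows
```

Usage, with a stand-in scraper that simply echoes the URL it was given: `crawl_all("https://example.com/list/page{page}/", lambda url: [url], last_page=3)`.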

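IV. Summary

At present only 100 pages of the site can be scraped: as an anti-scraping measure, the site caps the number of browsable pages at 100. To scrape everything, you can go through the listings category by category. How to do that in Python is covered in the next installment.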

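Open a browser with Selenium's Chrome or Firefox webdriver:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # or webdriver.Firefox()
driver.get(url)  # visit your page
form = driver.find_element(By.XPATH, "xxx")

Once you have located the form element on the page by XPath, id, or a similar method, fill it in with:

form.send_keys("xxx")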

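Use an infinite while True loop: first check whether there is a next page. If there is, continue from that page and call get_next_pages again; if not, break out of the loop.

url = "first page URL"

while True:
    # get_next_pages is expected to return the next page's URL, or nothing on the last page
    next_page = get_next_pages(url)
    if next_page:
        url = next_page  # continue from the page just found
    else:
        break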
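A self-contained sketch of this loop is given below. It makes several assumptions that are not in the original: the next-page link is taken to be an `<a>` tag with `class="next"` placed before its `href`, the helper is named `get_next_page` and takes the page source rather than a URL, and `fetch` is any function that returns the source for a URL. The `visited` list also guards against link cycles.

```python
import re

# Assumed markup: <a class="next" href="..."> marks the next page
NEXT_RE = re.compile(r'<a[^>]*class="next"[^>]*href="([^"]+)"')


def get_next_page(html):
    """Return the next page's URL found in `html`, or None on the last page."""
    match = NEXT_RE.search(html)
    return match.group(1) if match else None


def crawl(first_url, fetch, max_pages=1000):
    """Follow next-page links from `first_url`; return the URLs visited."""
    visited = []
    url = first_url
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        url = get_next_page(fetch(url))
    return visited
```

Passing `fetch` in as a parameter keeps the loop independent of how pages are downloaded, so it can be tried on a dictionary of saved pages first.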

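The simplest way needs no third-party libraries: fetch the page source and match it with regular expressions:

import urllib, re
# url = ... (the snippet breaks off here)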
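As a hedged illustration of the standard-library approach, the Python 3 sketch below fetches a page and pulls out every link with a regular expression. The names `get_source` and `extract_links` and the `href` pattern are assumptions made for this example; a regular expression like this is only an approximation of real HTML parsing.

```python
import re
import urllib.request
from urllib.parse import urljoin

# Matches href="..." or href='...' (a simplification of real HTML)
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.I)


def get_source(url, encoding="utf-8"):
    """Download a page and return its source as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode(encoding, errors="replace")


def extract_links(html, base_url):
    """Return absolute, de-duplicated links in order of first appearance."""
    links = []
    for href in HREF_RE.findall(html):
        absolute = urljoin(base_url, href)  # resolve relative links
        if absolute not in links:
            links.append(absolute)
    return links
```

Usage would be along the lines of `extract_links(get_source(url), url)`, where `url` is the page you want to inspect.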


Feel free to share. When reposting, please credit the source: 内存溢出

Original URL: https://54852.com/web/9344386.html
