
# Code
import re
html_text = '''
<p>When ES cells differentiate, they migrate out from colonies on gelatin-coated dishes, similar to the ES cells on the
<xref ref-type="bibr" rid="pone0000015-Rogers1">[17]</xref> and <italic>nanog</italic> ,
,<xref ref-type="bibr" rid="pone0000015-Chambers1">[19]</xref> well-known markers for undifferentiated ES cells </p>
<p>(A) R1 cells were cultured for 5 days in the presence of
<xref ref-type="bibr" rid="pone0000015-Rogers1">[1]</xref> and <italic>nanog</italic>
<xref ref-type="bibr" rid="pone0000015-Mitsui1">[2]</xref>, <xref ref-type="bibr" rid="pone0000015-Chambers1">[3]</xref> various doses of LIF (0–1,000 units/ml) </p>
'''
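pattern = r'(<p>.*?</p>)'  # lazy .*? so each <p>...</p> pair is captured separately
html_text = re.sub('\n', '', html_text)  # drop newlines so . can span a whole paragraph
text = re.findall(pattern, html_text)
print(text)
# Output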
['<p>When ES cells differentiate, they migrate out from colonies on gelatin-coated dishes, similar to the ES cells on the <xref ref-type="bibr" rid="pone0000015-Rogers1">[17]</xref> and <italic>nanog</italic> ,,<xref ref-type="bibr" rid="pone0000015-Chambers1">[19]</xref> well-known markers for undifferentiated ES cells </p>',
'<p>(A) R1 cells were cultured for 5 days in the presence of <xref ref-type="bibr" rid="pone0000015-Rogers1">[1]</xref> and <italic>nanog</italic> <xref ref-type="bibr" rid="pone0000015-Mitsui1">[2]</xref>, <xref ref-type="bibr" rid="pone0000015-Chambers1">[3]</xref> various doses of LIF (0–1,000 units/ml) </p>']
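As a variant of the code above, the newline-stripping step can probably be skipped by passing the re.DOTALL flag, which lets . match newlines too. This is only a minimal sketch: the sample string and the helper name extract_paragraphs are made up for illustration, and for real XML a proper parser is generally more robust than a regex.

```python
import re

def extract_paragraphs(markup):
    """Return the inner text of every <p>...</p> block (hypothetical helper)."""
    # re.DOTALL lets '.' match '\n', so a paragraph may span several lines;
    # '.*?' is lazy, so each match stops at the nearest closing </p>.
    return re.findall(r'<p>(.*?)</p>', markup, flags=re.DOTALL)

# Made-up sample input, not taken from the article above.
sample = "<p>first\nparagraph</p>\n<p>second <italic>one</italic></p>"
paragraphs = extract_paragraphs(sample)
print(paragraphs)
```

Here the parentheses sit inside the tags, so the <p> and </p> markers themselves are left out of the result.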
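Some explanation:
1. The r prefix makes the pattern a raw string, so backslashes in it reach the regex engine unchanged and do not have to be doubled.
2. [ ] strips the special meaning from most metacharacters, which means the ( inside [(] is just an ordinary parenthesis.
3. The ( ) in a pattern form a capture group: out of the whole match, they extract the part that matches the sub-pattern inside the parentheses.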
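The three points above can be put together in a short sketch that pulls out the text between literal parentheses. The sample string is invented for illustration; the pattern is one plausible way to write it, not necessarily the one the original answer used.

```python
import re

text = 'call(arg1) then call(arg2)'  # made-up sample input

# [(] and [)] match literal parentheses; the bare ( ) form a capture group,
# so findall returns only what sits between the literal parentheses.
inside = re.findall(r'[(](.*?)[)]', text)
print(inside)

# The same pattern written with backslash escapes instead of character classes.
same = re.findall(r'\((.*?)\)', text)
print(same)
```

Both spellings should give the same list; the character-class form simply avoids backslashes.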
>>> import re, pprint
>>> s='''21899 6% S 15 173928K 38024K fg app_108 comtencentqq
21899 34% S 14 191436K 50888K fg app_108 comtencentqq
21899 49% S 14 183928K 41584K fg app_108 comtencentqq
21899 28% S 15 176984K 40240K fg app_108 comtencentqq
21899 6% S 15 177004K 40448K fg app_108 comtencentqq
21899 6% S 14 176048K 40564K fg app_108 comtencentqq
21899 10% S 14 176196K 40472K fg app_108 comtencentqq
21899 9% S 14 176232K 40712K fg app_108 comtencentqq
21899 12% S 14 176288K 40820K fg app_108 comtencentqq
21899 10% S 14 176288K 40820K fg app_108 comtencentqq
21899 12% S 16 179376K 40904K fg app_108 comtencentqq'''
>>> with open('atxt', 'w') as f:
...     _ = f.write(s)
...
>>> f = open('atxt')
>>> f.read()
'21899 6% S 15 173928K 38024K fg app_108 comtencentqq\n21899 34% S 14 191436K 50888K fg app_108 comtencentqq\n21899 49% S 14 183928K 41584K fg app_108 comtencentqq\n21899 28% S 15 176984K 40240K fg app_108 comtencentqq\n21899 6% S 15 177004K 40448K fg app_108 comtencentqq\n21899 6% S 14 176048K 40564K fg app_108 comtencentqq\n21899 10% S 14 176196K 40472K fg app_108 comtencentqq\n21899 9% S 14 176232K 40712K fg app_108 comtencentqq\n21899 12% S 14 176288K 40820K fg app_108 comtencentqq\n21899 10% S 14 176288K 40820K fg app_108 comtencentqq\n21899 12% S 16 179376K 40904K fg app_108 comtencentqq'
>>> pprint.pprint([re.findall(r'\d+ +(\d+%) +S +\d+ +(\d+K) +(\d+K)', x) for x in s.split('\n')])
[[('6%', '173928K', '38024K')],
[('34%', '191436K', '50888K')],
[('49%', '183928K', '41584K')],
[('28%', '176984K', '40240K')],
[('6%', '177004K', '40448K')],
[('6%', '176048K', '40564K')],
[('10%', '176196K', '40472K')],
[('9%', '176232K', '40712K')],
[('12%', '176288K', '40820K')],
[('10%', '176288K', '40820K')],
[('12%', '179376K', '40904K')]]
>>> pprint.pprint([re.findall(r'\d+ +(\d+%) +S +\d+ +(\d+K) +(\d+K)', x) for x in open('atxt').read().split('\n')])
[[('6%', '173928K', '38024K')],
[('34%', '191436K', '50888K')],
[('49%', '183928K', '41584K')],
[('28%', '176984K', '40240K')],
[('6%', '177004K', '40448K')],
[('6%', '176048K', '40564K')],
[('10%', '176196K', '40472K')],
[('9%', '176232K', '40712K')],
[('12%', '176288K', '40820K')],
[('10%', '176288K', '40820K')],
[('12%', '179376K', '40904K')]]
>>>
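The per-line extraction in the session above could also be written with a compiled pattern and named groups, which makes it easier to turn the fields into numbers. A rough sketch only: the group names cpu, vss and rss are guesses at what the columns mean, and parse_line is a hypothetical helper.

```python
import re

# Named groups; the names cpu, vss and rss are assumptions about the columns.
LINE_RE = re.compile(r'\d+ +(?P<cpu>\d+)% +S +\d+ +(?P<vss>\d+)K +(?P<rss>\d+)K')

def parse_line(line):
    """Return (cpu, vss, rss) as ints, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return int(m.group('cpu')), int(m.group('vss')), int(m.group('rss'))

# Two lines copied from the sample data in the session above.
sample = '''21899 6% S 15 173928K 38024K fg app_108 comtencentqq
21899 34% S 14 191436K 50888K fg app_108 comtencentqq'''

rows = [parse_line(line) for line in sample.splitlines()]
print(rows)
```

Returning None for lines that do not match keeps malformed input from silently turning into empty lists.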
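# Enter a Baidu Tieba URL and the name of a folder to create in the current directory; the images are downloaded and saved into the new folder.
# This is the first small program I have written over the past few days, though it still has a few bugs,
# for example an invalid URL or an already existing folder of the same name is not handled.
# Only the image URL matching uses a little of re.
# Honestly, Python is really simple; basically every feature you want is already there!!!
# It even gives a small sense of achievement, (^__^) hee hee...
# coding: utf-8
import urllib.request
import re
import os

def getHtml(url):
    page = urllib.request.urlopen(url)
    # Decoding as UTF-8 is an assumption; the original Python 2 code used the raw bytes.
    html = page.read().decode('utf-8', errors='ignore')
    return html

def getImg(html):
    # The middle of this pattern was lost in the source; .+? is an assumed stand-in.
    reg = r'src="(.+?\.jpg)"'
    imgre = re.compile(reg)
    imglist = imgre.findall(html)
    return imglist

def main():
    url = input('input the url : ')
    forder = input('input the forder name : ')
    os.mkdir(forder)  # still fails if the folder already exists, as noted above
    html = getHtml(url)
    count = 0
    for imgurl in getImg(html):
        count += 1
        print(imgurl)
        urllib.request.urlretrieve(imgurl, '%s/%s.jpg' % (forder, count))
    print('total saved : %s pictures to : %s ' % (count, forder))

if __name__ == '__main__':
    main()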
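The two bugs the author mentions (an invalid URL and a folder that already exists) could be handled roughly as follows. This is only a sketch under assumptions: looks_like_http_url and ensure_folder are hypothetical helper names, and the URL check is deliberately crude.

```python
import os
from urllib.parse import urlparse

def looks_like_http_url(url):
    """Rough check: the scheme is http or https and a host part is present."""
    parts = urlparse(url)
    return parts.scheme in ('http', 'https') and bool(parts.netloc)

def ensure_folder(path):
    """Create the folder if needed; an existing folder is not treated as an error."""
    os.makedirs(path, exist_ok=True)
    return path

# example.com is a placeholder address, not one from the original program.
print(looks_like_http_url('http://example.com/p/1'))
print(looks_like_http_url('not a url'))
```

In main(), the URL check would go before getHtml(url), and ensure_folder(forder) would stand in for os.mkdir(forder).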
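It is a similar regex: add a pair of parentheses and the part you want gets picked out.
That happens because with the regex r'a(.+?)b|wz' the results s and sd land in capture group 1, whereas wz only lands in group 0. So either wrap wz in parentheses as well, r'a(.+?)b|(wz)', and read the data from groups 1 and 2 respectively, or use r'(?<=a).+?(?=b)|wz' with a lookbehind (?<=...) and a lookahead (?=...). That pattern has no capture groups, so every result is in group 0.
The complete Python programs for both methods are as follows: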
#!/usr/bin/python
import re
text = 'asb,fasdbwz'
u = r'a(.+?)b|(wz)'
result = re.findall(u, text)
for i in range(0, len(result)):
    # each result is a (group 1, group 2) tuple; the group that did not match is ''
    if result[i][0] == '':
        print(result[i][1])
    else:
        print(result[i][0])
Output:
s
sd
wz
The second method:
#!/usr/bin/python
import re
text = 'asb,fasdbwz'
u = r'(?<=a).+?(?=b)|wz'
result = re.findall(u, text)
for i in range(0, len(result)):
    # no capture groups here, so findall returns the whole matches
    print(result[i])
Output:
s
sd
wz
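The group numbering behind the two methods can be inspected directly with re.finditer, which yields match objects rather than plain strings. A small sketch using the same sample text and the single-group pattern r'a(.+?)b|wz':

```python
import re

text = 'asb,fasdbwz'
pattern = r'a(.+?)b|wz'

# group(0) is the whole match; group(1) is None when the 'wz' branch matched.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(pattern, text)]
for whole, first in pairs:
    print(repr(whole), repr(first))

# With exactly one capture group, findall returns group 1 only,
# so the 'wz' match shows up as an empty string.
only_group1 = re.findall(pattern, text)
print(only_group1)
```

This is why the plain pattern seems to lose wz: it is matched, but it sits in group 0, which findall does not report once a capture group is present.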