How to use regular expressions in Python to extract the content between <p> tags in XML

# Code

import re

# Sample XML; the <p>...</p> pairs are the delimiters named in the title
html_text = '''
<p>When ES cells differentiate, they migrate out from colonies on gelatin-coated dishes, similar to the ES cells on the
 <xref ref-type="bibr" rid="pone0000015-Rogers1">[17]</xref> and <italic>nanog</italic> ,
,<xref ref-type="bibr" rid="pone0000015-Chambers1">[19]</xref> well-known markers for undifferentiated ES cells </p>
<p>(A) R1 cells were cultured for 5 days in the presence of
 <xref ref-type="bibr" rid="pone0000015-Rogers1">[1]</xref> and <italic>nanog</italic>
 <xref ref-type="bibr" rid="pone0000015-Mitsui1">[2]</xref>, <xref ref-type="bibr" rid="pone0000015-Chambers1">[3]</xref> various doses of LIF (0–1,000 units/ml) </p>
'''

# Assumed pattern: capture, non-greedily, everything between <p> and </p>
pattern = r'<p>(.*?)</p>'

# Remove the line breaks so that '.' can match across the original lines
html_text = re.sub('\n', '', html_text)

text = re.findall(pattern, html_text)

print(text)

# Output

['When ES cells differentiate, they migrate out from colonies on gelatin-coated dishes, similar to the ES cells on the <xref ref-type="bibr" rid="pone0000015-Rogers1">[17]</xref> and <italic>nanog</italic> ,,<xref ref-type="bibr" rid="pone0000015-Chambers1">[19]</xref> well-known markers for undifferentiated ES cells ',

'(A) R1 cells were cultured for 5 days in the presence of <xref ref-type="bibr" rid="pone0000015-Rogers1">[1]</xref> and <italic>nanog</italic> <xref ref-type="bibr" rid="pone0000015-Mitsui1">[2]</xref>, <xref ref-type="bibr" rid="pone0000015-Chambers1">[3]</xref> various doses of LIF (0–1,000 units/ml) ']


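Explanation:

1. The r prefix on the pattern makes it a raw string, so backslashes in it are not treated as Python escape characters and do not have to be doubled.

2. [ ] strips the special meaning from most metacharacters: in [(], the ( is just a plain parenthesis.

3. The () in the pattern is a capturing group: it extracts only the part of the overall match that matches the expression inside the parentheses.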
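The three points above can be checked with a small sketch. The sample strings and variable names here are made up for illustration and do not come from the original answer.

```python
import re

# 1. A raw string keeps backslashes as-is, so regex escapes need no doubling.
assert r'\d+' == '\\d+'
digits = re.findall(r'\d+', 'a1b22c333')

# 2. Inside [...], '(' is an ordinary character rather than a group opener.
parens = re.findall(r'[(]', 'f(x) + g(y)')

# 3. With one capturing group, re.findall returns only the group's text.
inner = re.findall(r'<b>(.*?)</b>', '<b>one</b> and <b>two</b>')

print(digits)  # ['1', '22', '333']
print(parens)  # ['(', '(']
print(inner)   # ['one', 'two']
```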

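As an alternative to deleting the newlines first, the `re.DOTALL` flag lets `.` match newline characters, so a `<p>` element that spans several lines is still captured. This is only a sketch: the sample text is an assumption for illustration, and for real XML a parser such as `xml.etree.ElementTree` is usually more robust than a regex.

```python
import re

# Hypothetical two-paragraph sample; the first <p> spans two lines.
sample = '''<p>first line
second line</p>
<p>another paragraph</p>'''

# re.DOTALL makes '.' match '\n' as well; '.*?' stays non-greedy so each
# <p>...</p> pair is matched separately.
paragraphs = re.findall(r'<p>(.*?)</p>', sample, flags=re.DOTALL)

print(paragraphs)  # ['first line\nsecond line', 'another paragraph']
```

Unlike the `re.sub('\n', '', ...)` approach, the line breaks inside each paragraph are preserved in the result.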


>>> import re, pprint
>>> s='''21899 6% S 15 173928K 38024K fg app_108 comtencentqq
21899 34% S 14 191436K 50888K fg app_108 comtencentqq
21899 49% S 14 183928K 41584K fg app_108 comtencentqq
21899 28% S 15 176984K 40240K fg app_108 comtencentqq
21899 6% S 15 177004K 40448K fg app_108 comtencentqq
21899 6% S 14 176048K 40564K fg app_108 comtencentqq
21899 10% S 14 176196K 40472K fg app_108 comtencentqq
21899 9% S 14 176232K 40712K fg app_108 comtencentqq
21899 12% S 14 176288K 40820K fg app_108 comtencentqq
21899 10% S 14 176288K 40820K fg app_108 comtencentqq
21899 12% S 16 179376K 40904K fg app_108 comtencentqq'''
>>> open('atxt','w').write(s)
>>> f=open('atxt')
>>> f.read()
'21899 6% S 15 173928K 38024K fg app_108 comtencentqq\n21899 34% S 14 191436K 50888K fg app_108 comtencentqq\n21899 49% S 14 183928K 41584K fg app_108 comtencentqq\n21899 28% S 15 176984K 40240K fg app_108 comtencentqq\n21899 6% S 15 177004K 40448K fg app_108 comtencentqq\n21899 6% S 14 176048K 40564K fg app_108 comtencentqq\n21899 10% S 14 176196K 40472K fg app_108 comtencentqq\n21899 9% S 14 176232K 40712K fg app_108 comtencentqq\n21899 12% S 14 176288K 40820K fg app_108 comtencentqq\n21899 10% S 14 176288K 40820K fg app_108 comtencentqq\n21899 12% S 16 179376K 40904K fg app_108 comtencentqq'
>>> pprint.pprint([re.findall(r'\d+ +(\d+%) +S +\d+ +(\d+K) +(\d+K)', x) for x in s.split('\n')])
[[('6%', '173928K', '38024K')],
 [('34%', '191436K', '50888K')],
 [('49%', '183928K', '41584K')],
 [('28%', '176984K', '40240K')],
 [('6%', '177004K', '40448K')],
 [('6%', '176048K', '40564K')],
 [('10%', '176196K', '40472K')],
 [('9%', '176232K', '40712K')],
 [('12%', '176288K', '40820K')],
 [('10%', '176288K', '40820K')],
 [('12%', '179376K', '40904K')]]
>>> pprint.pprint([re.findall(r'\d+ +(\d+%) +S +\d+ +(\d+K) +(\d+K)', x) for x in open('atxt').read().split('\n')])
[[('6%', '173928K', '38024K')],
 [('34%', '191436K', '50888K')],
 [('49%', '183928K', '41584K')],
 [('28%', '176984K', '40240K')],
 [('6%', '177004K', '40448K')],
 [('6%', '176048K', '40564K')],
 [('10%', '176196K', '40472K')],
 [('9%', '176232K', '40712K')],
 [('12%', '176288K', '40820K')],
 [('10%', '176288K', '40820K')],
 [('12%', '179376K', '40904K')]]
>>>
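The per-line extraction above can also be written with a precompiled pattern and named groups, which avoids positional tuples and turns the numbers into integers. This is a sketch under assumptions: the group names `cpu`, `vss` and `rss` are hypothetical labels chosen here, since the source does not say what the columns mean.

```python
import re

# Two sample lines in the same layout as the listing above.
lines = [
    '21899 6% S 15 173928K 38024K fg app_108 comtencentqq',
    '21899 34% S 14 191436K 50888K fg app_108 comtencentqq',
]

# Named groups (cpu, vss, rss are hypothetical labels) replace positional tuples.
row_re = re.compile(r'\d+ +(?P<cpu>\d+)% +S +\d+ +(?P<vss>\d+)K +(?P<rss>\d+)K')

rows = []
for line in lines:
    m = row_re.match(line)
    if m:  # skip lines that do not fit the layout
        rows.append({name: int(value) for name, value in m.groupdict().items()})

print(rows)
```

Compiling the pattern once keeps the loop body short, and `m.groupdict()` yields one dictionary per line instead of a list containing a single tuple.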

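# Enter a Baidu Tieba URL and the name of a folder to create in the current directory; the images are downloaded and saved into the new folder.
# This is the first small program I have written over the past few days, though it still has a few bugs.
# For example, an invalid URL or an already existing folder with the same name is not handled.
# Only the image URL matching uses a little bit of re.
# Honestly, Python is really simple; pretty much every feature you want is already there!
# It even feels like a small accomplishment, hehe (^__^)

import os
import re
import urllib.request

def getHtml(url):
    # Download the page and decode it so the regex can work on text
    page = urllib.request.urlopen(url)
    html = page.read().decode('utf-8', errors='ignore')
    return html

def getImg(html):
    # The pattern is garbled in the source; this assumed stand-in captures
    # any src="..." value that ends in .jpg
    reg = r'src="(.+?\.jpg)"'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    return imglist

def main():
    url = input('input the url : ')
    forder = input('input the forder name : ')
    os.mkdir(forder)
    html = getHtml(url)
    count = 0
    for imgurl in getImg(html):
        count += 1
        print(imgurl)
        # Save each image as <folder>/<n>.jpg
        urllib.request.urlretrieve(imgurl, '%s/%s.jpg' % (forder, count))
    print('total saved : %s pictures to : %s ' % (count, forder))

if __name__ == '__main__':
    main()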
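The image-matching step can be tried offline on a literal string, without any network access. The HTML fragment and the `example.com` URLs below are placeholders invented for this sketch. It uses `[^"]+` instead of `.+?` so that a match cannot run past the closing quote of one `src` attribute into the next tag, which a lazy `.+?` could do when a non-jpg image precedes a jpg one.

```python
import re

# Hypothetical HTML fragment; the example.com URLs are placeholders.
html = ('<img src="http://example.com/a.jpg" width="100">'
        '<img src="http://example.com/logo.png">'
        '<img src="http://example.com/b.jpg">')

# [^"]+ cannot cross a closing quote, so each match stays inside one src attribute.
img_re = re.compile(r'src="([^"]+\.jpg)"')
img_urls = img_re.findall(html)

print(img_urls)  # ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```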

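Here is a similar regex: just add parentheses and the part you want is picked out.

That is because, with the regular expression r'a(.+?)b|wz', the results s and sd are in the first capturing group, whereas wz is only in group 0 (the whole match). So either wrap wz in parentheses as well, r'a(.+?)b|(wz)', and read the first and second capturing groups respectively, or use r'(?<=a).+?(?=b)|wz', where the lookbehind (?<=...) and lookahead (?=...) assertions mean there are no capturing groups and every result is in group 0.

The complete Python programs for both methods are as follows.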

#!/usr/bin/python
import re

text = 'asb,fasdbwz'
u = r'a(.+?)b|(wz)'
result = re.findall(u, text)
# Each item is a (group 1, group 2) tuple; the group that did not match is ''
for i in range(0, len(result)):
    if result[i][0] == '':
        print(result[i][1])
    else:
        print(result[i][0])

Output

s
sd
wz

Second method

#!/usr/bin/python
import re

text = 'asb,fasdbwz'
# Lookbehind (?<=a) and lookahead (?=b) check the context without capturing it
u = r'(?<=a).+?(?=b)|wz'
result = re.findall(u, text)
for i in range(0, len(result)):
    print(result[i])

Output

s
sd
wz
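Another way to handle the alternation, not taken from the original answer, is `re.finditer`: each match object exposes both group 0 and group 1, so `wz` does not need its own parentheses and no lookaround is required. A minimal sketch on the same sample text:

```python
import re

text = 'asb,fasdbwz'

# Group 1 is set only when the a...b branch matched.
pattern = re.compile(r'a(.+?)b|wz')

results = []
for m in pattern.finditer(text):
    # m.group(1) is None when the 'wz' branch matched, so fall back to group 0.
    results.append(m.group(1) if m.group(1) is not None else m.group(0))

print(results)  # ['s', 'sd', 'wz']
```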

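That covers how to use regular expressions in Python to extract the content between <p> tags in XML, including: extracting the content between <p> tags in XML, extracting the content inside parentheses from a string with re, and using a regular expression to extract the needed information from each line.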

Feel free to share; when reposting, please credit the source: 内存溢出

Original URL: https://54852.com/web/9289985.html
