
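I previously used Automator to download HTML files from a website, and now I'm struggling to parse the source code.
Ideally, I'd like to extract the information in the table, and I need to repeat this for 1800 different HTML files.
Here is a sample of the source code: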
</head><body><div id="header"> <div > <span > <div id="fb-root"></div> <span > Gold Account: <a title="Account Details" href="http://www.hedge-professionals.com/account-details.html" >Active </a> Logged in as Edward | <a href="javascript:void(0);" onclick='logout()' >Sign Out</a> </span> </span> </div><!-- /wrapper --></div><!-- /header --><div id="masthead"> <div > <a href="http://www.hedge-professionals.com" ><img src="http://www.hedge-professionals.com/images/hedgep_logo_white.png" alt="Hedge Professionals Database" width="333" height="46" border="0" /></a> <div id="navigation"> <ul><li ><a href='http://www.hedge-professionals.com/dashboard.html' >Dashboard</a></li> <li ><a href='http://www.hedge-professionals.com/people.html'class='current' >People</a></li><li ><a href='http://www.hedge-professionals.com/watchLists.html' >My WatchLists</a></li><li ><a href='http://www.hedge-professionals.com/my-searches.html' >My Searches</a></li><li ><a href='http://www.hedge-professionals.com/my-profile.html' >My Profile</a></li></ul> </div><!-- /navigation --> </div><!-- /wrapper --> </div><!-- /masthead --><div id="content"> <div > <div id="main-content"> <!-- per Project stuff --> <span > <img src="http://www.hedge-professionals.com/images/people/noimage_53x53.jpg" alt="Christian Sieling" width="52" height="53" id="profile-pic-104947"/> <h1><span id="profile-name-104947" >Christian Sieling</span></h1> <ul > <li><a href="http://www.hedge-professionals.com/people.html">« Back </a></li> <li><a href="http://www.hedge-professionals.com/addtoWatchList.php?usr=114752" id="row-104947" Title='Add to WatchList' >Add to WatchList</a></li> </ul> <div > <span id="profile-updated-date" >Updated On: 4 Aug,2010</span><br/> <a href="http://www.hedge-professionals.com/profile/suggest/people/104947/Christian-Sieling" Title='Report Inaccurate Data' >Report Inaccurate Data</a> </div> <h2><span id="profile-details-104947" > at <a 
href="http://www.hedge-professionals.com/quicksearch/search/Lumix+Capital+Management+Ltd." ><span Title='Lumix Capital Management Ltd.' >Lumix Capital Management Ltd.</span></a></span><input type="hidden" name="sub-id" id="sub-id" value="114752"></h2> </span> <table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table"> <tr> <th>Role</th> <td> <p>Other</p> </td> </tr> <tr> <th>Organisation Type</th> <td> <p>Asset Manager</p> </td> </tr> <tr> <th>Email</th> <td><a href="mailto:cs@lumixcapital.com" title="cs@lumixcapital.com" >cs@lumixcapital.com</a></td> </tr> <tr> <th>Website</th> <td><a href="http://www.lumixcapital.com/" target="_new" title="http://www.lumixcapital.com/" >http://www.lumixcapital.com/</a></td> </tr> <tr> <th>Phone</th> <td>41 78 616 7334</td> </tr> <tr> <th>Fax</th> <td></td> </tr> <tr> <th>Mailing Address</th> <td>Birrenstrasse 30</td> </tr> <tr> <th>City</th> <td>Schindellegi</td> </tr> <tr> <th>State</th> <td>CH</td> </tr> <tr> <th>Country</th> <td>Switzerland</td> </tr> <tr> <th >Zip/ Postal Code</th> <td >8834</td> </tr> </table> </div><!-- /main-content --> <div id="sidebar" > </div> <div id="similar_sidebar" > </div> </div><!-- /wrapper --></div><!-- /content --><div id="footer"></div>
My AppleScript attempt uses text item delimiters to extract the table, along these lines:
set p to input
set ex to extractBetween(p, "<table>", "</table>") -- extract the URL

to extractBetween(SearchText, startText, endText)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to startText
	set endItems to text of text item -1 of SearchText
	set AppleScript's text item delimiters to endText
	set beginningToEnd to text of text item 1 of endItems
	set AppleScript's text item delimiters to tid
	return beginningToEnd
end extractBetween
How can I parse the table out of the HTML file?
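Solution
You're really close. The problem is your startText variable. The opening tag "<table>" never appears in the HTML text in that exact form, so it can't be found. The line that starts the table is actually:
<table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">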
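To see why the exact "<table>" delimiter fails, here is a minimal sketch using a shortened, made-up stand-in for the page (the `sample` string and the variable names are assumptions for illustration only). When a delimiter never occurs, the text splits into a single item, so `text item -1` is simply the whole input.

```applescript
-- Hypothetical shortened sample; the real files carry more markup.
set sample to "<h2>x</h2> <table width=\"100%\" id=\"profile-table\"> <tr><td>a</td></tr> </table>"

set tid to AppleScript's text item delimiters

-- The exact tag never occurs, so nothing is split off.
set AppleScript's text item delimiters to "<table>"
set exactCount to count of text items of sample

-- The tag prefix does occur, just before the attributes.
set AppleScript's text item delimiters to "<table"
set prefixCount to count of text items of sample

set AppleScript's text item delimiters to tid

{exactCount, prefixCount} -- result: {1, 2}
```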
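So I modified your code to look for that tag in two steps. First it splits on…
<table
and then on…
>
This way we can ignore everything that comes with the table tag (width, border, and so on), since I assume those attributes vary from file to file. After doing this, we are left with only the table's code. Try this: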
set p to input
set ex to extractBetween(p, "<table", ">", "</table>")

to extractBetween(SearchText, startText1, startText2, endText)
	set tid to AppleScript's text item delimiters
	-- keep everything after the last "<table"
	set AppleScript's text item delimiters to startText1
	set endItems to text item -1 of SearchText
	-- cut off everything from "</table>" onward
	set AppleScript's text item delimiters to endText
	set beginningToEnd to text item 1 of endItems
	-- drop the rest of the opening tag, up to its first ">"
	set AppleScript's text item delimiters to startText2
	set finalText to (text items 2 thru -1 of beginningToEnd) as text
	set AppleScript's text item delimiters to tid
	return finalText
end extractBetween
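As a possible next step, the extracted table text can be broken into label/value pairs with the same delimiter technique. This is only a sketch: the handler names (rowsOf, cellText, stripTags) are hypothetical, and it assumes every row holds one th cell and one td cell, as in the sample above. It would also be confused by other tags that start with "<tr", "<th" or "<td".

```applescript
-- Split the table markup into rows, then pull one label and one value per row.
on rowsOf(tableText)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "<tr"
	set rowChunks to rest of (text items of tableText) -- drop text before the first row
	set AppleScript's text item delimiters to tid
	set pairs to {}
	repeat with rowChunk in rowChunks
		set chunk to contents of rowChunk
		set end of pairs to {cellText(chunk, "<th", "</th>"), cellText(chunk, "<td", "</td>")}
	end repeat
	return pairs
end rowsOf

-- Return the plain text of the first cell delimited by openTag and closeTag.
on cellText(chunk, openTag, closeTag)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to openTag
	if (count of text items of chunk) < 2 then
		set AppleScript's text item delimiters to tid
		return "" -- no such cell in this row
	end if
	set afterOpen to text item 2 of chunk
	set AppleScript's text item delimiters to closeTag
	set cellMarkup to text item 1 of afterOpen
	-- skip the remainder of the opening tag, up to its first ">"
	set AppleScript's text item delimiters to ">"
	set inner to (text items 2 thru -1 of cellMarkup) as text
	set AppleScript's text item delimiters to tid
	return stripTags(inner)
end cellText

-- Remove anything between "<" and ">", then trim surrounding spaces.
on stripTags(markup)
	set plain to ""
	set inTag to false
	repeat with ch in characters of markup
		set c to contents of ch
		if c is "<" then
			set inTag to true
		else if c is ">" then
			set inTag to false
		else if not inTag then
			set plain to plain & c
		end if
	end repeat
	repeat while plain begins with " "
		if length of plain is 1 then return ""
		set plain to text 2 thru -1 of plain
	end repeat
	repeat while plain ends with " "
		set plain to text 1 thru -2 of plain
	end repeat
	return plain
end stripTags

-- Shortened stand-in for the text returned by extractBetween.
set tableText to " <tr> <th>Role</th> <td> <p>Other</p> </td> </tr> <tr> <th>Fax</th> <td></td> </tr> "
rowsOf(tableText)
```

Each pair is a two-item list, so an empty cell such as Fax comes back as an empty string rather than being skipped, which keeps the columns aligned when the results from many files are combined.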