Parsing HTML source code with AppleScript


I am trying to parse an HTML file that I converted to a TXT file inside Automator.

I previously used Automator to download the HTML files from a website, and now I am struggling to parse the source code.

Ideally, I would like to extract the information in the table, and I need to repeat this operation for 1800 different HTML files.

Here is a sample of the source code:

</head><body>
<div id="header">
    <div >
        <span >
        <div id="fb-root"></div>
    <span >
     Gold Account: <a  title="Account Details" href="http://www.hedge-professionals.com/account-details.html" >Active </a>
       Logged in as Edward&nbsp;&nbsp; | &nbsp;&nbsp;<a href="javascript:void(0);" onclick='logout()' >Sign Out</a>
    </span>
    </span>
    </div><!-- /wrapper -->
</div><!-- /header -->
<div id="masthead">
    <div >
        <a href="http://www.hedge-professionals.com" ><img src="http://www.hedge-professionals.com/images/hedgep_logo_white.png" alt="Hedge Professionals Database" width="333" height="46"  border="0" /></a>
        <div id="navigation">
            <ul><li ><a href='http://www.hedge-professionals.com/dashboard.html' >Dashboard</a></li>
            <li ><a href='http://www.hedge-professionals.com/people.html'class='current' >People</a></li>
            <li ><a href='http://www.hedge-professionals.com/watchLists.html' >My WatchLists</a></li>
            <li ><a href='http://www.hedge-professionals.com/my-searches.html' >My Searches</a></li>
            <li ><a href='http://www.hedge-professionals.com/my-profile.html' >My Profile</a></li></ul>
        </div><!-- /navigation -->
    </div><!-- /wrapper -->
</div><!-- /masthead -->
<div id="content">
    <div >
        <div id="main-content"> <!-- per Project stuff -->
            <span >
                <img src="http://www.hedge-professionals.com/images/people/noimage_53x53.jpg" alt="Christian Sieling" width="52" height="53"  id="profile-pic-104947"/>
                <h1><span id="profile-name-104947" >Christian Sieling</span></h1>
                <ul >
                    <li><a  href="http://www.hedge-professionals.com/people.html">&laquo; Back </a></li>
                    <li><a  href="http://www.hedge-professionals.com/addtoWatchList.php?usr=114752"  id="row-104947" Title='Add to WatchList' >Add to WatchList</a></li>
                </ul>
                <div  >
                <span id="profile-updated-date" >Updated On: 4 Aug,2010</span><br/>
                <a  href="http://www.hedge-professionals.com/profile/suggest/people/104947/Christian-Sieling"  Title='Report Inaccurate Data' >Report Inaccurate Data</a>
                </div>
                <h2><span id="profile-details-104947" > at <a href="http://www.hedge-professionals.com/quicksearch/search/Lumix+Capital+Management+Ltd." ><span Title='Lumix Capital Management Ltd.' >Lumix Capital Management Ltd.</span></a></span><input type="hidden" name="sub-id" id="sub-id" value="114752"></h2>
            </span>
            <table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
                <tr>
                    <th>Role</th>
                    <td>
                    <p>Other</p>
                    </td>
                </tr>
                <tr>
                    <th>Organisation Type</th>
                    <td>
                    <p>Asset Manager</p>
                    </td>
                </tr>
                <tr>
                    <th>Email</th>
                    <td><a href="mailto:cs@lumixcapital.com" title="cs@lumixcapital.com" >cs@lumixcapital.com</a></td>
                </tr>
                <tr>
                    <th>Website</th>
                    <td><a href="http://www.lumixcapital.com/" target="_new" title="http://www.lumixcapital.com/" >http://www.lumixcapital.com/</a></td>
                </tr>
                <tr>
                    <th>Phone</th>
                    <td>41 78 616 7334</td>
                </tr>
                <tr>
                    <th>Fax</th>
                    <td></td>
                </tr>
                <tr>
                    <th>Mailing Address</th>
                    <td>Birrenstrasse 30</td>
                </tr>
                <tr>
                    <th>City</th>
                    <td>Schindellegi</td>
                </tr>
                <tr>
                    <th>State</th>
                    <td>CH</td>
                </tr>
                <tr>
                    <th>Country</th>
                    <td>Switzerland</td>
                </tr>
                <tr>
                    <th  >Zip/ Postal Code</th>
                    <td  >8834</td>
                </tr>
            </table>
        </div><!-- /main-content -->
        <div id="sidebar"  >
        </div>
        <div id="similar_sidebar"  >
        </div>
    </div><!-- /wrapper -->
</div><!-- /content -->
<div id="footer"></div>

My AppleScript attempt uses text item delimiters to extract the table:

set p to input
set ex to extractBetween(p, "<table>", "</table>") -- extract the URL

to extractBetween(SearchText, startText, endText)
    set tid to AppleScript's text item delimiters
    set AppleScript's text item delimiters to startText
    set endItems to text of text item -1 of SearchText
    set AppleScript's text item delimiters to endText
    set beginningToEnd to text of text item 1 of endItems
    set AppleScript's text item delimiters to tid
    return beginningToEnd
end extractBetween

How can I parse the table out of the HTML file?

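Solution

You're really close. The problem is your startText variable. The opening tag "<table>" never appears in the HTML text in that exact form, so it cannot be found. The line that starts the table is actually...

<table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">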
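
To see why the exact "<table>" delimiter fails, here is a minimal sketch that counts the text items produced by each delimiter. The sampleHTML string is a shortened, made-up stand-in for the real file contents, not the actual page. When a delimiter does not occur, AppleScript returns the whole text as a single item, which is why the original handler returned everything from the start of the file up to "</table>".

```applescript
-- Illustrative only: sampleHTML is a hypothetical, shortened stand-in for the real file.
set sampleHTML to "<h2>Profile</h2><table width=\"100%\" id=\"profile-table\"><tr><th>Role</th><td>Other</td></tr></table>"

-- Save the current delimiters so they can be restored afterwards.
set tid to AppleScript's text item delimiters

-- The literal "<table>" never occurs, so the text is not split at all.
set AppleScript's text item delimiters to "<table>"
set exactCount to count of text items of sampleHTML

-- The bare prefix "<table" does occur once, so the text splits into two items.
set AppleScript's text item delimiters to "<table"
set prefixCount to count of text items of sampleHTML

set AppleScript's text item delimiters to tid
return {exactCount, prefixCount} -- {1, 2}
```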

So I modified your code to look for that tag in two steps. First, it splits on...

<table

and then, separately, on...

>

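This way we can ignore all the attributes that come with the table tag (width, border, and so on), since I assume they vary between files. After doing this, we get only the code of the table itself. Try this...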

set p to input
set ex to extractBetween(p, "<table", ">", "</table>")

to extractBetween(SearchText, startText1, startText2, endText)
    set tid to AppleScript's text item delimiters
    set AppleScript's text item delimiters to startText1
    set endItems to text item -1 of SearchText
    set AppleScript's text item delimiters to endText
    set beginningToEnd to text item 1 of endItems
    set AppleScript's text item delimiters to startText2
    set finalText to (text items 2 thru -1 of beginningToEnd) as text
    set AppleScript's text item delimiters to tid
    return finalText
end extractBetween
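
Once the table markup has been extracted, the same delimiter technique can be taken one step further to turn each row into a label/value pair. The following is only a sketch and not part of the original answer: the handlers splitText, stripTags, trimText and tableRows are hypothetical helpers introduced for illustration. It assumes every row has one <th> followed by one <td>, as in the sample above, and that no attribute value contains "<" or ">". Entities such as &nbsp; are not decoded.

```applescript
-- Hypothetical helper: split theText on theDelimiter and return the pieces.
on splitText(theText, theDelimiter)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to theDelimiter
	set theItems to text items of theText
	set AppleScript's text item delimiters to tid
	return theItems
end splitText

-- Hypothetical helper: remove everything between "<" and ">".
on stripTags(theText)
	if theText is "" then return ""
	set pieces to splitText(theText, "<")
	set plainText to item 1 of pieces
	repeat with i from 2 to (count of pieces)
		set afterTag to splitText(item i of pieces, ">")
		-- A piece without ">" is an unterminated tag and is dropped.
		if (count of afterTag) > 1 then
			set tid to AppleScript's text item delimiters
			set AppleScript's text item delimiters to ">"
			set plainText to plainText & ((items 2 thru -1 of afterTag) as text)
			set AppleScript's text item delimiters to tid
		end if
	end repeat
	return plainText
end stripTags

-- Hypothetical helper: trim leading and trailing whitespace.
on trimText(theText)
	set blanks to {" ", tab, return, linefeed}
	repeat while theText is not "" and (character 1 of theText) is in blanks
		if (count of theText) is 1 then return ""
		set theText to text 2 thru -1 of theText
	end repeat
	repeat while theText is not "" and (character -1 of theText) is in blanks
		if (count of theText) is 1 then return ""
		set theText to text 1 thru -2 of theText
	end repeat
	return theText
end trimText

-- Returns a list of {label, value} pairs, one per table row.
on tableRows(tableHTML)
	set rowList to {}
	set rowChunks to splitText(tableHTML, "<tr")
	repeat with i from 2 to (count of rowChunks)
		set cellParts to splitText(item i of rowChunks, "<td")
		if (count of cellParts) > 1 then
			-- Re-attach the tag prefixes so stripTags sees complete tags.
			set labelText to trimText(stripTags("<tr" & item 1 of cellParts))
			set valueText to trimText(stripTags("<td" & item 2 of cellParts))
			set end of rowList to {labelText, valueText}
		end if
	end repeat
	return rowList
end tableRows

-- Usage with a made-up, shortened table body:
set sampleTable to "<tr><th>Role</th><td><p>Other</p></td></tr><tr><th>Phone</th><td>41 78 616 7334</td></tr>"
tableRows(sampleTable) -- {{"Role", "Other"}, {"Phone", "41 78 616 7334"}}
```

In the Automator workflow, the value returned by extractBetween could be passed to tableRows in place of sampleTable. For markup that is less regular than this sample, a real HTML parser is likely to be more robust than delimiter splitting.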

Original article: https://54852.com/web/1103778.html
