
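I want to extract the third table from a nested-table layout. That table contains a series of nested tables, each of which publishes one result. But my code does not work: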

    include('simple_html_dom.php');
    $url = 'http://exams.keralauniversity.ac.in/Login/index.PHP?reslt=1';
    $html = file_get_contents($url);
    $result = $html->find("table", 2);
    echo $result;

I also used cURL to fetch the site, but the problem is that its tags are out of order, so I cannot extract the table with Simple HTML DOM elements.

    function curl($url) {
        $ch = curl_init(); // Initialising cURL
        curl_setopt($ch, CURLOPT_URL, $url); // Setting cURL's URL option with the $url variable passed into the function
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
        $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
        curl_close($ch); // Closing cURL
        return $data; // Returning the data from the function
    }

    function scrape_between($data, $start, $end) {
        $data = stristr($data, $start); // Stripping all data from before $start
        $data = substr($data, strlen($start)); // Stripping $start
        $stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
        $data = substr($data, $stop); // Stripping all data from after and including the $end of the data to scrape
        return $data; // Returning the scraped data from the function
    }

    $scraped_page = curl($url); // Executing our curl function to scrape the webpage http://www.example.com and return the results into the $scraped_website variable
    $scraped_data = scrape_between($scraped_page, ' </html>', '</table></td><td></td></tr> </table>');
    echo $scraped_data;

    $myfile = fopen("newfile.html", "w") or die("Unable to open file!");
    fwrite($myfile, $scraped_data);
    fclose($myfile);

How can I scrape the results and save the PDFs?

Solution

Simple HTML DOM cannot handle that HTML. So switch to this library first (the code below loads advanced_html_dom.php), then do:

    require_once('advanced_html_dom.php');

    $dom = file_get_html('http://exams.keralauniversity.ac.in/Login/index.PHP?reslt=1');
    $rows = array();

    // Each result row has the class Function_Text_normal and a third cell
    foreach ($dom->find('tr.Function_Text_normal:has(td[3])') as $tr) {
        $row['num'] = $tr->find('td[2]', 0)->text;
        $row['text'] = $tr->find('td[3]', 0)->text;
        $row['pdf'] = $tr->find('td[3] a', 0)->href;

        // The date is read from the digits in the first <u> element of the parent
        if (preg_match_all('/\d+/', $tr->parent->find('u', 0)->text, $m)) {
            list($row['day'], $row['month'], $row['year']) = $m[0];
        }

        // uncomment next 2 lines to save the pdf
        // $filename = preg_replace('/.*\//', '', $row['pdf']);
        // file_put_contents($filename, file_get_contents($row['pdf']));

        $rows[] = $row;
    }

    var_dump($rows);
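
If installing another library is not an option, the same idea can be sketched with PHP's built-in DOM extension, which tolerates unclosed tags. This is only a minimal sketch under stated assumptions, not the method from the answer above. The HTML string is a made-up stand-in for the results page, the `pdf/result_one.pdf` style links are hypothetical, and the real page nests tables, so the row selection would probably need tightening there.

```php
<?php
// Made-up sample markup (an assumption, not the real page): three tables,
// with <td> and <tr> tags left unclosed to mimic out-of-order markup.
$html = <<<PAGE
<table><tr><td>header</table>
<table><tr><td>menu</table>
<table>
  <tr class="Function_Text_normal"><td>x<td>1<td><a href="pdf/result_one.pdf">Result one</a>
  <tr class="Function_Text_normal"><td>x<td>2<td><a href="pdf/result_two.pdf">Result two</a>
</table>
PAGE;

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Third table in document order (the index is zero-based).
$table = $doc->getElementsByTagName('table')->item(2);

$rows = array();
foreach ($table->getElementsByTagName('tr') as $tr) {
    $cells = $tr->getElementsByTagName('td');
    if ($cells->length < 3) {
        continue; // skip rows without a third cell
    }
    // First link inside the third cell, if there is one.
    $link = $cells->item(2)->getElementsByTagName('a')->item(0);
    $rows[] = array(
        'num'  => trim($cells->item(1)->textContent),
        'text' => trim($cells->item(2)->textContent),
        'pdf'  => $link ? $link->getAttribute('href') : null,
    );
}

print_r($rows);
```

To save the PDFs, each `pdf` value could be passed to `file_get_contents` and `file_put_contents` as in the commented-out lines of the answer. If the links are relative, they would first have to be resolved against the page URL.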