
Use the HttpClient or HtmlUnit toolkit; both work as crawlers for fetching web pages. With HtmlUnit, for example, you can grab a page's source like this:
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HelloHtmlUnit {
    public static void main(String[] args) throws Exception {
        String str;
        // Create a WebClient
        WebClient webClient = new WebClient();
        // HtmlUnit's CSS and JavaScript support is weak, so turn both off
        webClient.getOptions().setJavaScriptEnabled(false);
        webClient.getOptions().setCssEnabled(false);
        // Fetch the page
        HtmlPage page = webClient.getPage("http://www.baidu.com/");
        // Get the page TITLE
        str = page.getTitleText();
        System.out.println(str);
        // Get the page as XML
        str = page.asXml();
        System.out.println(str);
        // Get the page as plain text
        str = page.asText();
        System.out.println(str);
        // Close the WebClient
        webClient.closeAllWindows();
    }
}
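Beyond the title and source, HtmlUnit can also pull individual elements out of the page; that is what imports like HtmlAnchor are for. A minimal sketch, to be dropped into the main method above after the page is fetched (it additionally needs java.util.List and com.gargoylesoftware.htmlunit.html.HtmlAnchor imported):

// List every link (<a> element) on the page with its href value
List<HtmlAnchor> anchors = page.getAnchors();
for (HtmlAnchor anchor : anchors) {
    System.out.println(anchor.getHrefAttribute());
}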
If you go with HttpClient, you can search for its tutorials online. There is also a book called 《自己动手写网络爬虫》 (roughly, "Write Your Own Web Crawler"), which teaches crawling with Java as the base language; it's worth a read for anyone starting out with crawlers.
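For completeness, here is a minimal sketch of the same page fetch with Apache HttpClient (assuming the 4.x API; the class and variable names here are my own, not from the book):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HelloHttpClient {
    public static void main(String[] args) throws Exception {
        // Create a default client and issue a GET request
        CloseableHttpClient httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet("http://www.baidu.com/");
        try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
            // Read the response body as a string
            String html = EntityUtils.toString(response.getEntity(), "UTF-8");
            System.out.println(html);
        } finally {
            httpClient.close();
        }
    }
}

The longer example below is the skeleton of a GUI crawler, SearchCrawler: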
import java.awt.*;
import java.awt.event.*;
import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;
import javax.swing.*;
import javax.swing.table.*;

// A web crawler ("crawling" here means the same as fetching or scraping pages)
public class SearchCrawler extends JFrame {
    // Maximum-URL choices offered in the dropdown
    private static final String[] MAX_URLS = {"50", "100", "500", "1000"};
    // Cache of robots.txt disallow lists, keyed by host
    private HashMap<String, ArrayList<String>> disallowListCache = new HashMap<String, ArrayList<String>>();
    // Search GUI controls
    private JTextField startTextField;
    private JComboBox maxComboBox;
    private JCheckBox limitCheckBox;
    private JTextField logTextField;
    private JTextField searchTextField;
    private JCheckBox caseCheckBox;
    private JButton searchButton;
    // Search status GUI controls
    private JLabel crawlingLabel2;
    private JLabel crawledLabel2;
    private JLabel toCrawlLabel2;
    private JProgressBar progressBar;
    private JLabel matchesLabel2;
    // Table listing the search matches
    private JTable table;
    // Flag marking whether the crawler is currently crawling
    private boolean crawling;
    // Writer for the log file of matches
    private PrintWriter logFileWriter;

    // Constructor
    public SearchCrawler() {
        // Set the application title bar
        setTitle("Search Crawler");
        // Set the window size
        setSize(600, 600);
        // Handle the window-close event
        addWindowListener(new WindowAdapter() {
            public void windowClosing(WindowEvent e) {
                actionExit();
            }
        });
        // Set up the File menu
        JMenuBar menuBar = new JMenuBar();
        JMenu fileMenu = new JMenu("File");
        fileMenu.setMnemonic(KeyEvent.VK_F);
        JMenuItem fileExitMenuItem = new JMenuItem("Exit", KeyEvent.VK_X);
        fileExitMenuItem.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent e) {
                actionExit();
            }
        });
        fileMenu.add(fileExitMenuItem);
        menuBar.add(fileMenu);
        setJMenuBar(menuBar);
        // ... (the rest of the constructor and the crawling logic are cut off in the original excerpt)
    }

    // Exit the program (minimal stub standing in for the full implementation, so the excerpt compiles)
    private void actionExit() {
        System.exit(0);
    }
}
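The disallowListCache field above hints at how such a crawler stays polite: before crawling a host, it downloads /robots.txt once, caches the Disallow: paths, and skips URLs that match. A minimal sketch of that idea as a method inside the class, assuming the same cache field (the method name isRobotAllowed and the naive parsing are illustrative, not the book's exact code):

// Sketch: returns true if robots.txt on the URL's host does not disallow its path
private boolean isRobotAllowed(URL urlToCheck) {
    String host = urlToCheck.getHost().toLowerCase();
    ArrayList<String> disallowList = disallowListCache.get(host);
    if (disallowList == null) {
        disallowList = new ArrayList<String>();
        try {
            URL robotsUrl = new URL("http://" + host + "/robots.txt");
            BufferedReader reader = new BufferedReader(
                new InputStreamReader(robotsUrl.openStream()));
            String line;
            while ((line = reader.readLine()) != null) {
                // Naive parse: collect every Disallow: path, ignoring User-agent sections
                if (line.trim().startsWith("Disallow:")) {
                    String path = line.trim().substring("Disallow:".length()).trim();
                    if (path.length() > 0) {
                        disallowList.add(path);
                    }
                }
            }
            reader.close();
        } catch (Exception e) {
            // If robots.txt cannot be read, assume crawling is allowed
            return true;
        }
        disallowListCache.put(host, disallowList);
    }
    // Disallow the URL if its path starts with any cached disallowed prefix
    String file = urlToCheck.getFile();
    for (String disallowedPath : disallowList) {
        if (file.startsWith(disallowedPath)) {
            return false;
        }
    }
    return true;
}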