Scraping Specific Data with a Java Crawler

Using the classes the JDK provides for network programming, you can fetch the HTML source of the page behind a URL.
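As a minimal sketch of this step, the class below fetches a page with the JDK's own `HttpURLConnection` and returns its source line by line. The class name `PageFetcher`, its method names, the ten-second timeouts, and the UTF-8 assumption are all illustrative choices, not part of the original answer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for illustration only.
class PageFetcher {

    /** Reads every line from the stream, decoding bytes with the given charset. */
    static List<String> readLines(InputStream in, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    /** Opens an HTTP connection to pageUrl and returns the page source line by line. */
    static List<String> fetchLines(String pageUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(pageUrl).toURL().openConnection();
        conn.setConnectTimeout(10_000); // milliseconds
        conn.setReadTimeout(10_000);
        try {
            // Assumption: the page is UTF-8. A real crawler would honor the
            // charset declared in the Content-Type response header.
            return readLines(conn.getInputStream(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling `PageFetcher.fetchLines("https://example.com/")` would return the page's HTML as a list of lines, ready for the regular-expression step.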

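Once you have the HTML, regular expressions let you pull out the content you want.

For example, to collect all the text on a page that contains the keyword "java", match the page source line by line against a regular expression. The result strips the HTML tags and unrelated content and leaves only the text that contains "java".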
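The line-by-line matching described above could look roughly like this. `KeywordFilter` and `linesContaining` are hypothetical names, and the case-insensitive match is an assumption. A tag-stripping regex like this one is fragile on tags that span several lines, so treat it as a sketch rather than a robust HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
class KeywordFilter {
    // Matches anything that looks like an HTML tag, e.g. <p> or </a>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Strips tags from each line and keeps the lines whose text contains the keyword. */
    static List<String> linesContaining(List<String> htmlLines, String keyword) {
        // Assumption: the keyword is matched literally and case-insensitively.
        Pattern word = Pattern.compile(Pattern.quote(keyword), Pattern.CASE_INSENSITIVE);
        List<String> hits = new ArrayList<>();
        for (String line : htmlLines) {
            String text = TAG.matcher(line).replaceAll("").trim();
            if (!text.isEmpty() && word.matcher(text).find()) {
                hits.add(text);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> page = List.of(
                "<html><body>",
                "<h1>Learning Java</h1>",
                "<p>Nothing to see here</p>",
                "<p>The <b>java</b> keyword appears here</p>",
                "</body></html>");
        // Prints the two lines of text that mention "java", without their tags.
        linesContaining(page, "java").forEach(System.out::println);
    }
}
```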

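Scraping images from a page follows essentially the same flow as scraping text, with one extra step.

First match the img tags with a regular expression for img tags, then extract the URL in each tag's src attribute with a second regular expression. Then read the data at that URL through a buffered input stream and write it to a local file with a file output stream.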
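Here is one way the two-regex approach might be sketched. The class name `ImageGrabber`, both patterns, and the 8 KB buffer are assumptions for illustration. The sketch also assumes each src value is quoted and is an absolute URL; relative paths would first have to be resolved against the page's own URL.

```java
import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
class ImageGrabber {
    // Step 1: match a whole <img ...> tag.
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Step 2: inside that tag, capture the quoted value of the src attribute.
    // The leading \s keeps attributes such as data-src from matching.
    private static final Pattern SRC_ATTR =
            Pattern.compile("\\ssrc\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the src URL of every img tag that has one, in document order. */
    static List<String> imageUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher img = IMG_TAG.matcher(html);
        while (img.find()) {
            Matcher src = SRC_ATTR.matcher(img.group());
            if (src.find()) {
                urls.add(src.group(1));
            }
        }
        return urls;
    }

    /** Step 3: read the image through a buffered input stream and write it to a local file. */
    static void download(String imageUrl, String localPath) throws IOException {
        try (InputStream in = new BufferedInputStream(URI.create(imageUrl).toURL().openStream());
             OutputStream out = new FileOutputStream(localPath)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}
```

A caller would loop over `imageUrls(html)` and pass each URL to `download` together with a local file name of its choosing.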

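For pages that require a login, the principle is to persist the cookie: save the cookie you receive after logging in, and send it in the request headers on every later page fetch. The server identifies the user by the cookie, so holding the cookie means holding the logged-in state, and every later request is made as the user that cookie belongs to.

Note: Java is an object-oriented programming language for writing cross-platform applications. It is known for its versatility, efficiency, portability, and security, and it is widely used on PCs, in data centers, on game consoles, scientific supercomputers, and mobile phones, and across the Internet. It also has one of the world's largest professional developer communities.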
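To make the cookie mechanism concrete, the sketch below uses only JDK classes: `java.net.CookieManager` stores what the server sets, and the stored values are turned back into a `Cookie` request header for later requests. The class name `CookieReplay`, its method names, and the `SESSIONID` cookie in the usage note are made up for illustration, and `newClient` assumes Java 11 or later.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.http.HttpClient;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, named for illustration only.
class CookieReplay {
    // In-memory cookie jar; ACCEPT_ALL keeps every cookie the server sets.
    private final CookieManager jar = new CookieManager(null, CookiePolicy.ACCEPT_ALL);

    /** Records the Set-Cookie headers that a response from the given URL carried. */
    void remember(String url, List<String> setCookieHeaders) {
        try {
            jar.put(URI.create(url), Map.of("Set-Cookie", setCookieHeaders));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds the Cookie request header value to send with a later request to the URL. */
    String cookieHeaderFor(String url) {
        try {
            Map<String, List<String>> headers = jar.get(URI.create(url), Collections.emptyMap());
            return String.join("; ", headers.getOrDefault("Cookie", Collections.emptyList()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** An HTTP client that shares the jar, so cookies are stored and replayed automatically. */
    HttpClient newClient() {
        return HttpClient.newBuilder().cookieHandler(jar).build();
    }
}
```

After `remember("http://example.com/login", List.of("SESSIONID=abc123; Path=/"))`, a later request to the same host would carry the header value returned by `cookieHeaderFor`. In practice the client from `newClient()` does both steps on its own.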

Here is the full source code of a login example built on Apache HttpClient. Hope it helps!

package com.ly.mainprocess;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.http.Consts;
import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.StatusLine;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class Test1 {

    public static void main(String[] args) {
        Test1 test1 = new Test1();
        System.out.println(test1.process("", ""));
    }

    @SuppressWarnings("deprecation")
    public boolean process(String username, String password) {
        boolean ret = false;
        DefaultHttpClient httpclient = new DefaultHttpClient();
        try {
            HttpResponse response;
            HttpEntity entity;
            List<Cookie> cookies;

            // Build the login POST request
            HttpPost httppost = new HttpPost("http://login.hi.mop.com/Login.do"); // user login
            List<NameValuePair> nvps = new ArrayList<NameValuePair>();
            nvps.add(new BasicNameValuePair("nickname", username));
            nvps.add(new BasicNameValuePair("password", password));
            nvps.add(new BasicNameValuePair("origURL", "http://hi.mop.com/SysHome.do"));
            nvps.add(new BasicNameValuePair("loginregFrom", "index"));
            nvps.add(new BasicNameValuePair("ss", "10101"));
            httppost.setEntity(new UrlEncodedFormEntity(nvps, Consts.UTF_8));
            httppost.addHeader("Referer", "http://hi.mop.com/SysHome.do");
            httppost.addHeader("Connection", "keep-alive");
            httppost.addHeader("Content-Type", "application/x-www-form-urlencoded");
            httppost.addHeader("Accept-Language", "zh-CN,zh;q=0.8");
            httppost.addHeader("Origin", "http://hi.mop.com");
            httppost.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36");
            response = httpclient.execute(httppost);
            entity = response.getEntity();
            // System.out.println("Login form get: " + response.getStatusLine());
            EntityUtils.consume(entity);

            // The client's cookie store now holds the login cookies
            // and sends them with every later request.
            // System.out.println("Post logon cookies:");
            cookies = httpclient.getCookieStore().getCookies();
            if (cookies.isEmpty()) {
                // System.out.println("None");
            } else {
                for (int i = 0; i < cookies.size(); i++) {
                    // System.out.println("- " + cookies.get(i).toString());
                }
            }

            // Follow the redirect
            String url = ""; // redirect target
            Header locationHeader = response.getFirstHeader("Location");
            // System.out.println(locationHeader.getValue());
            if (locationHeader != null) {
                url = locationHeader.getValue(); // the href to redirect to
                HttpGet httpget1 = new HttpGet(url);
                response = httpclient.execute(httpget1);
                // Login succeeded
            }
            entity = response.getEntity();
            // System.out.println(response.getStatusLine());
            if (entity != null) {
                // System.out.println("Response content length: " + entity.getContentLength());
            }

            // Read the result (printing is commented out)
            BufferedReader reader = new BufferedReader(new InputStreamReader(entity.getContent(), "UTF-8"));
            String line = null;
            while ((line = reader.readLine()) != null) {
                // System.out.println(line);
            }

            // Automatic daily check-in:
            // request a sub-page of the site.
            HttpPost httppost1 = new HttpPost("http://home.hi.mop.com/ajaxGetContinusLoginAward.do"); // profile settings page
            httppost1.addHeader("Content-Type", "text/plain;charset=UTF-8");
            httppost1.addHeader("Accept", "text/plain, */*");
            httppost1.addHeader("X-Requested-With", "XMLHttpRequest");
            httppost1.addHeader("Referer", "http://home.hi.mop.com/Home.do");
            response = httpclient.execute(httppost1);
            entity = response.getEntity();
            // System.out.println(response.getStatusLine());
            if (response.getStatusLine().toString().indexOf("HTTP/1.1 200 OK") >= 0) {
                ret = true;
            }
            if (entity != null) {
                // System.out.println("Response content length: " + entity.getContentLength());
            }

            // Print the result
            reader = new BufferedReader(new InputStreamReader(entity.getContent(), "UTF-8"));
            line = null;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (Exception e) {
            e.printStackTrace(); // report the failure instead of swallowing it silently
        } finally {
            httpclient.getConnectionManager().shutdown();
        }
        return ret;
    }
}

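This kind of page is implemented with JavaScript, so the later content is actually generated dynamically, whereas a web crawler fetches only the static page.

As for solutions, a few are commonly suggested online:

The first is to use an automated testing tool such as Selenium, which can simulate clicks and other actions, though this is really quite different from a crawler.

The second is to use a library that executes the JavaScript on the back end. Such libraries exist for Python; I am not sure about Java.

The third is to find the relevant JavaScript code on the page yourself, work out the request URLs it calls, and request those URLs directly. JavaScript is usually minified or obfuscated, which makes this harder, but it is worth a try.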
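For the third approach, once the browser's network panel reveals which URL the page's script requests, a crawler can call that URL itself. The sketch below assumes Java 11 or later and uses `java.net.http`. The class name `AjaxEndpointClient`, the header values, and the endpoint in the usage note are placeholders, since the real endpoint depends entirely on the site being scraped.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper class, named for illustration only.
class AjaxEndpointClient {

    /** Builds a GET request that mimics the page's own XMLHttpRequest call. */
    static HttpRequest buildRequest(String endpointUrl) {
        return HttpRequest.newBuilder(URI.create(endpointUrl))
                .timeout(Duration.ofSeconds(10))
                // Many sites check this header to tell AJAX calls from page loads.
                .header("X-Requested-With", "XMLHttpRequest")
                .header("Accept", "application/json, text/plain, */*")
                .GET()
                .build();
    }

    /** Sends the request and returns the response body, failing on any non-200 status. */
    static String fetch(HttpClient client, HttpRequest request)
            throws IOException, InterruptedException {
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }
}
```

With a placeholder endpoint such as `https://example.com/api/items?page=2`, the call `fetch(HttpClient.newHttpClient(), buildRequest(url))` would return the raw payload the page normally renders with JavaScript.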

Feel free to share; when reposting, please credit the source: 内存溢出

Original article: https://54852.com/zaji/12459812.html
