Several tools for parsing HTML in Java

Parsing HTML is a fairly complex task. The Java world has several convenient tools for it:

1.Jsoup

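Jsoup is an HTML parser that is both powerful and convenient. Its main convenience is that elements can be selected with jQuery-style CSS selectors, so developers familiar with JavaScript face almost no learning curve.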

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
String content = "blabla";
Document doc = Jsoup.parse(content);
Elements links = doc.select("a[href]");

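Jsoup also supports whitelist-based filtering (the Safelist class in recent releases), which is useful for protecting a site against XSS attacks.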

2.HtmlParser

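HtmlParser is fairly complete and flexible, but it is hardly convenient. The project has not been maintained for a long time; the latest version is 2.1. Its core abstraction is Node, which corresponds to an HTML tag and supports tree traversal through methods such as getChildren(). The other core abstraction is NodeFilter: by implementing the NodeFilter interface you can filter page elements. For a usage walkthrough, see the article "使用 HttpClient 和 HtmlParser 实现简易爬虫" (Building a simple crawler with HttpClient and HtmlParser).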
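The Node and NodeFilter ideas can be illustrated with the DOM classes that ship with the JDK. This is not the HtmlParser API: the sketch assumes well-formed XHTML input, and the names NodeFilterSketch, collect, and the Predicate-based filter are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class NodeFilterSketch {

    // Parse a well-formed XHTML string and return the names of all element
    // nodes accepted by the filter, in document order.
    static List<String> collect(String xhtml, Predicate<Node> filter) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            List<String> out = new ArrayList<>();
            walk(doc.getDocumentElement(), filter, out);
            return out;
        } catch (Exception e) {
            // Assumption: callers pass well-formed markup; real HTML often is not.
            throw new IllegalStateException("input must be well-formed XHTML", e);
        }
    }

    // Depth-first traversal over child nodes, similar in spirit to getChildren().
    private static void walk(Node node, Predicate<Node> filter, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE && filter.test(node)) {
            out.add(node.getNodeName());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), filter, out);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"a\"><a href=\"/x\">x</a></div><p>t</p></body></html>";
        // Keep only elements that carry an href attribute.
        System.out.println(collect(page, n -> n.getAttributes().getNamedItem("href") != null));
    }
}
```

The filter is just a predicate over nodes, so different selection rules can be swapped in without touching the traversal code.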

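3.Apache Tika

Tika is built for content extraction and also supports PDF, Zip, and even Java class files. To analyze HTML with Tika, you define your own content-extraction handler that extends org.xml.sax.helpers.DefaultHandler; parsing then follows the standard XML (SAX) approach. crawler4j uses Tika as its parser. Streaming SAX parsing is very useful for large files, though I personally do not think it matters much for HTML.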

// page, contentHandler and metadata are supplied by the surrounding crawler code
HtmlParser htmlParser = new HtmlParser();
htmlParser.parse(new ByteArrayInputStream(page.getContentData()),
        contentHandler, metadata, new ParseContext());
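The handler-based approach can be tried without Tika by using the SAX parser bundled with the JDK. The sketch below is an assumption-laden stand-in: it only accepts well-formed XHTML, and the class name LinkHandlerSketch and the helper extractLinks are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class LinkHandlerSketch extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    // Called once per opening tag while the parser streams through the input.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("a".equalsIgnoreCase(qName)) {
            String href = attrs.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    List<String> getLinks() {
        return links;
    }

    // Hypothetical helper: run the handler over a well-formed XHTML string.
    static List<String> extractLinks(String xhtml) {
        try {
            LinkHandlerSketch handler = new LinkHandlerSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.getLinks();
        } catch (Exception e) {
            throw new IllegalStateException("input must be well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractLinks("<html><body><a href=\"/a\">A</a></body></html>"));
    }
}
```

Because the handler only reacts to events, no tree is kept in memory, which is the property that makes SAX attractive for large inputs.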

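4.HtmlCleaner and XPath

The biggest advantage of HtmlCleaner is that it supports selecting elements with XPath. XPath is a language for locating information in XML, and it can also be used to extract HTML elements. XPath and CSS selectors overlap in most of what they can do, but CSS selectors are designed specifically for HTML and are more concise, whereas XPath is a general-purpose standard that can return attribute values themselves rather than only elements. XPath takes some effort to learn, but for anyone who writes crawlers regularly the investment is well worth it.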
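To give a feel for XPath selection, here is a minimal sketch that uses the javax.xml.xpath package from the JDK rather than HtmlCleaner. It assumes the markup is already well-formed XHTML (which is what a cleaner would normally guarantee), and the names XPathSketch and select are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {

    // Evaluate an XPath expression and return the text of every matched node.
    // For attribute nodes the text is the attribute value.
    static List<String> select(String xhtml, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(xhtml)),
                            XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("bad expression or malformed input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div class=\"title\"><a href=\"/first\">First</a></div>"
                + "<div class=\"other\"><a href=\"/second\">Second</a></div></body></html>";
        // Select the href attribute itself, not the <a> element.
        System.out.println(select(page, "//div[@class='title']/a/@href"));
    }
}
```

The expression //div[@class='title']/a/@href returns the attribute node directly; a CSS selector such as div.title > a would stop at the element and leave reading the attribute to the caller.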

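The browser's rendering engine receives the requested document from the network layer, usually in chunks of 8KB.

The main job of the HTML parser is to parse the HTML document and produce a parse tree.

The parse tree is a tree whose nodes are DOM elements and attributes. DOM stands for Document Object Model; it is the object representation of the HTML document and also the interface that HTML elements expose to the outside world (for example, JavaScript). The root of the tree is the Document object. The DOM maps almost one-to-one onto the HTML markup.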

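Parsing algorithm

HTML cannot be parsed with the usual top-down or bottom-up parsers. The main reasons are that the language is forgiving by design, that browsers tolerate well-known cases of invalid markup, and that parsing is reentrant: a script can add content (for example via document.write) while the document is still being parsed.

Because conventional parsing techniques do not apply, browsers implement a parser built specifically for HTML. The parsing algorithm is described in detail in the HTML5 specification and consists of two main stages: tokenization and tree construction.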
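The tokenization stage can be pictured with a deliberately tiny sketch. It is far simpler than the state machine in the HTML5 specification: it ignores attributes, comments, entities, and script content, and the class name TokenizerSketch and the "START:/END:/TEXT:" string format are invented here for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class TokenizerSketch {

    // Split markup into start-tag, end-tag, and text tokens.
    // Text is trimmed and whitespace-only text is dropped.
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            // A tag is a '<' that has a matching '>' somewhere after it.
            int close = html.charAt(i) == '<' ? html.indexOf('>', i) : -1;
            if (close != -1) {
                String inside = html.substring(i + 1, close).trim();
                boolean end = inside.startsWith("/");
                // The tag name runs up to the first whitespace or '/'.
                String name = (end ? inside.substring(1) : inside)
                        .split("[\\s/]", 2)[0].toLowerCase();
                tokens.add((end ? "END:" : "START:") + name);
                i = close + 1;
            } else {
                // Text runs to the next '<'; a '<' without '>' is kept as text.
                int next = html.indexOf('<', i + 1);
                if (next == -1) {
                    next = html.length();
                }
                String text = html.substring(i, next).trim();
                if (!text.isEmpty()) {
                    tokens.add("TEXT:" + text);
                }
                i = next;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<p>Hello <b>world</b></p>"));
    }
}
```

The second stage, tree construction, would consume these tokens one by one and attach elements to the DOM tree.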

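After parsing finishes

The browser starts loading the page's external resources (CSS, images, JavaScript files, and so on).

At this point the browser marks the document as interactive and starts executing scripts in deferred mode, that is, scripts meant to run after the document has been parsed. The document state then becomes complete and the browser fires the load event.

Note that parsing an HTML page never produces an Invalid Syntax error: the browser repairs any invalid content and carries on parsing.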
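The "repair and continue" behaviour can be imitated with a small stack-based sketch. This is not the recovery algorithm from the HTML5 specification, which is considerably more involved; it only shows the general idea of dropping stray end tags and closing elements that were left open. The class name LenientTreeSketch and the tag notation ("b" opens, "/b" closes) are assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class LenientTreeSketch {

    // Turn a possibly unbalanced tag sequence into a balanced one
    // without ever reporting a syntax error.
    static List<String> repair(List<String> tags) {
        Deque<String> open = new ArrayDeque<>(); // stack of currently open elements
        List<String> out = new ArrayList<>();
        for (String tag : tags) {
            if (!tag.startsWith("/")) {
                open.push(tag);
                out.add(tag);
                continue;
            }
            String name = tag.substring(1);
            if (!open.contains(name)) {
                continue; // stray end tag: drop it and keep going
            }
            // Implicitly close everything opened after the matching element.
            while (!open.peek().equals(name)) {
                out.add("/" + open.pop());
            }
            out.add("/" + open.pop());
        }
        // End of input: close whatever is still open.
        while (!open.isEmpty()) {
            out.add("/" + open.pop());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(repair(List.of("div", "b", "i", "/b", "/p", "span")));
    }
}
```

In the sample input, the unclosed i is closed before b, the unmatched /p is ignored, and span and div are closed at the end of input.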

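Method 1:

Use System.Net.WebClient to download the web page into a local file or a string, then analyze it with regular expressions. This approach suits applications such as web crawlers that need to analyze many pages.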

using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

namespace HttpGet
{
    class Class1
    {
        [STAThread]
        static void Main(string[] args)
        {
            // download the raw bytes of the page and decode them as UTF-8
            System.Net.WebClient client = new WebClient();
            byte[] page = client.DownloadData("http://www.google.com");
            string content = System.Text.Encoding.UTF8.GetString(page);

            // match href="..." or href='...' attributes that hold absolute or relative URLs
            string regex = "href=[\\\"\\\'](http:\\/\\/|\\.\\/|\\/)?\\w+(\\.\\w+)*(\\/\\w+(\\.\\w+)?)*(\\/|\\?\\w*=\\w*(&\\w*=\\w*)*)?[\\\"\\\']";
            Regex re = new Regex(regex);
            MatchCollection matches = re.Matches(content);

            // print every matched href attribute, one per line
            foreach (Match match in matches)
            {
                Console.WriteLine(match.Value);
            }
        }
    }
}
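Since the article is mainly about Java, the same regular-expression idea can be sketched with java.util.regex. This is not a port of the pattern above: it uses a simpler, more permissive expression, works on markup that has already been downloaded into a string, and the names RegexLinkSketch and extractLinks are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLinkSketch {

    // Matches href="..." or href='...' (any letter case) and captures
    // the attribute value in group 2; group 1 is the opening quote.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">A</a> <A HREF='/b?x=1'>B</A>";
        System.out.println(extractLinks(html));
    }
}
```

A regular expression knows nothing about document structure, so it will also match href text inside comments or scripts; for anything beyond quick link harvesting a real parser is the safer choice.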

Method 2:

Use Winista.Htmlparser.Net to parse the HTML. It is an open-source HTML parser for the .NET platform, and its source code can be downloaded online.

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using Winista.Text.HtmlParser;
using Winista.Text.HtmlParser.Lex;
using Winista.Text.HtmlParser.Util;
using Winista.Text.HtmlParser.Tags;
using Winista.Text.HtmlParser.Filters;

namespace HTMLParser
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
            AddUrl();
        }

        private void btnParser_Click(object sender, EventArgs e)
        {
            #region Fetch the page HTML
            try
            {
                txtHtmlWhole.Text = "";
                string url = CBUrl.SelectedItem.ToString().Trim();
                System.Net.WebClient aWebClient = new System.Net.WebClient();
                aWebClient.Encoding = System.Text.Encoding.Default;
                string html = aWebClient.DownloadString(url);
                txtHtmlWhole.Text = html;
            }
            catch (Exception ex)
            {
                MessageBox.Show(ex.Message);
            }
            #endregion

            #region Parse the HTML nodes
            Lexer lexer = new Lexer(this.txtHtmlWhole.Text);
            Parser parser = new Parser(lexer);
            NodeList htmlNodes = parser.Parse(null);

            // rebuild the tree view under a single "root" node
            this.treeView1.Nodes.Clear();
            this.treeView1.Nodes.Add("root");
            TreeNode treeRoot = this.treeView1.Nodes[0];
            for (int i = 0; i < htmlNodes.Count; i++)
            {
                this.RecursionHtmlNode(treeRoot, htmlNodes[i], false);
            }
            #endregion
        }

        private void RecursionHtmlNode(TreeNode treeNode, INode htmlNode, bool siblingRequired)
        {
            if (htmlNode == null || treeNode == null) return;

            TreeNode current = treeNode;
            TreeNode content;

            // current node
            if (htmlNode is ITag)
            {
                ITag tag = (htmlNode as ITag);
                if (!tag.IsEndTag())
                {
                    string nodeString = tag.TagName;
                    if (tag.Attributes != null && tag.Attributes.Count > 0)
                    {
                        if (tag.Attributes["ID"] != null)
                        {
                            nodeString = nodeString + " { id=\"" + tag.Attributes["ID"].ToString() + "\" }";
                        }
                        if (tag.Attributes["HREF"] != null)
                        {
                            nodeString = nodeString + " { href=\"" + tag.Attributes["HREF"].ToString() + "\" }";
                        }
                    }
                    current = new TreeNode(nodeString);
                    treeNode.Nodes.Add(current);
                }
            }

            // get the content between the tags
            if (htmlNode.Children != null && htmlNode.Children.Count > 0)
            {
                this.RecursionHtmlNode(current, htmlNode.FirstChild, true);
                content = new TreeNode(htmlNode.FirstChild.GetText());
                treeNode.Nodes.Add(content);
            }

            // the sibling nodes
            if (siblingRequired)
            {
                INode sibling = htmlNode.NextSibling;
                while (sibling != null)
                {
                    this.RecursionHtmlNode(treeNode, sibling, false);
                    sibling = sibling.NextSibling;
                }
            }
        }

        private void AddUrl()
        {
            CBUrl.Items.Add("http://www.xxx.com");
        }
    }
}

Sharing is welcome. When reposting, please credit the source: 内存溢出

Original article: https://54852.com/zaji/7229033.html
