Getting Started with Jsoup: Extracting HTML Data Precisely with Java

The jsoup website describes the library this way: jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

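jsoup's main features:

Parse HTML from a URL, a file, or a string;
Find and extract data using DOM methods or CSS selectors;
Manipulate HTML elements, attributes, and text.

jsoup is released under the MIT license, so it can safely be used in commercial projects.
For more about Jsoup, visit the official site: http://jsoup.org/

This article collects some Jsoup usage and examples from the official site. For more detail, read the Cookbook there.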

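Installing the dependency

If, like me, you manage your Java project with Maven, you don't need to download the Jsoup jar. Just add the following dependency to the <dependencies> section of your POM file: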

<dependency>
  <!-- jsoup HTML parser library @ https://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.13.1</version>
</dependency>

For other setups, see the Jsoup site: https://jsoup.org/download

Starting with a simple example

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
System.out.println(doc.toString());
// The output shows that the HTML string has been parsed into a DOM tree

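1. Parsing a document from a URL or a local file

Jsoup's parse() method tries to produce a clean parse from whatever HTML you give it, whether or not that HTML is well-formed. It copes with unclosed tags and implicit tags, and it always builds a reliable document structure (an html element containing head and body, with only appropriate elements inside each).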
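As a small illustration of this leniency, here is a sketch that hands parse() a fragment with two unclosed p tags and no html, head, or body tags at all. The class name LenientParseDemo, its helper methods, and the sample markup are made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LenientParseDemo {

    // Sample input (hypothetical): unclosed <p> tags, no <html>/<head>/<body>.
    static final String MESSY = "<title>Messy</title><p>One<p>Two";

    // The title is recovered even though no <head> was written.
    static String titleOf(String html) {
        return Jsoup.parse(html).title();
    }

    // Each unclosed <p> still becomes its own element.
    static int paragraphCount(String html) {
        return Jsoup.parse(html).select("p").size();
    }

    // The parsed document always has exactly one head and one body under html.
    static boolean hasHeadAndBody(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("html > head").size() == 1
            && doc.select("html > body").size() == 1;
    }

    public static void main(String[] args) {
        System.out.println(titleOf(MESSY));         // Messy
        System.out.println(paragraphCount(MESSY));  // 2
        System.out.println(hasHeadAndBody(MESSY));  // true
    }
}
```

Printing Jsoup.parse(MESSY) in full is a quick way to see the normalized structure the parser built.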

// Parse HTML from a URL
Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();

// Parse HTML from a file (the file's location is used as the base URI)
File file = new File("c:/Users/admin/Desktop/score.html");
Document fileDoc = Jsoup.parse(file, "UTF-8");

// Parse HTML from a file, with an explicit base URI
File input = new File("/tmp/input.html");
Document inputDoc = Jsoup.parse(input, "UTF-8", "http://example.com/");

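The parse(File in, String charsetName, String baseUri) method loads and parses an HTML file. If an error occurs while loading the file, it throws an IOException, which you should handle appropriately. The baseUri is what the parser uses to resolve relative URLs in the document. If you are only working on the local file system, you can use the sibling method parse(File in, String charsetName), which uses the file's location as the baseUri.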
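To show what the baseUri is for without touching the file system, here is a minimal sketch that parses an in-memory string with Jsoup.parse(String html, String baseUri) and resolves a relative link with absUrl(). The class name BaseUriDemo, the helper resolveFirstLink, and the sample link are assumptions made for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriDemo {

    // Returns the first link's href as an absolute URL, resolved against baseUri.
    // Returns "" when the document contains no link with an href.
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/docs/page.html\">Docs</a></p>";
        // Prints http://example.com/docs/page.html
        System.out.println(resolveFirstLink(html, "http://example.com/"));
    }
}
```

The same absUrl() call works on documents loaded with the File overloads, since they carry a base URI too.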

2. Navigating a document with DOM methods

Elements provide a range of DOM-like methods for finding elements and for extracting and manipulating their data.

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}

Finding elements:

getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key) (and related methods)
Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
Graph: parent(), children(), child(int index)

Element data:

attr(String key) to get and attr(String key, String value) to set attributes
attributes() to get all attributes
id(), className() and classNames()
text() to get and text(String value) to set the text content
html() to get and html(String value) to set the inner HTML content
outerHtml() to get the outer HTML value
data() to get data content (e.g. of script and style tags)
tag() and tagName()
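A few of the methods listed above can be tried on an in-memory snippet. The sketch below is only an illustration: the class name DomMethodsDemo, its helpers, and the sample markup are invented here, and it guards against getElementById() returning null when nothing matches.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class DomMethodsDemo {

    // Sample markup (hypothetical) with an id, a class, and two links.
    static final String HTML =
        "<div id=\"content\">"
      + "<a class=\"ext\" href=\"http://example.com/a\">First</a>"
      + "<a href=\"http://example.com/b\">Second</a>"
      + "</div>";

    // Collects "text -> href" for every <a> inside the element with id "content".
    static List<String> describeLinks(String html) {
        Document doc = Jsoup.parse(html);
        Element content = doc.getElementById("content");
        List<String> out = new ArrayList<>();
        if (content == null) {
            return out; // no element with that id
        }
        for (Element link : content.getElementsByTag("a")) {
            out.add(link.text() + " -> " + link.attr("href"));
        }
        return out;
    }

    // Combines tagName(), className(), and parent().id() for the first link.
    static String firstLinkSummary(String html) {
        Element first = Jsoup.parse(html).getElementsByTag("a").first();
        if (first == null) {
            return "";
        }
        return first.tagName() + "." + first.className() + " in #" + first.parent().id();
    }

    public static void main(String[] args) {
        System.out.println(describeLinks(HTML));    // [First -> http://example.com/a, Second -> http://example.com/b]
        System.out.println(firstLinkSummary(HTML)); // a.ext in #content
    }
}
```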

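3. Finding elements with selector syntax

jsoup elements support a CSS-like (or jQuery-like) selector syntax for finding matching elements, which makes it easy to pick out exactly the elements you need.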

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]"); // img with src ending .png

Element masthead = doc.select("div.masthead").first(); // div with class=masthead

Elements resultLinks = doc.select("h3.r > a"); // direct a after h3

Note that select() is available on Document, Element, and Elements, and returns a list of elements (Elements). For more, see the selector syntax page of the Cookbook.
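To see how these selectors behave without needing /tmp/input.html, the sketch below runs the same kinds of queries against an in-memory string. The class name SelectorDemo, its helpers, and the sample markup are assumptions for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class SelectorDemo {

    // Sample markup (hypothetical): two images, one direct and one nested link, one bare <a>.
    static final String HTML =
        "<div class=\"masthead\"><img src=\"logo.png\"><img src=\"photo.jpg\"></div>"
      + "<h3 class=\"r\"><a href=\"/one\">One</a></h3>"
      + "<h3 class=\"r\"><span><a href=\"/nested\">Nested</a></span></h3>"
      + "<a>no href</a>";

    // Number of elements in the whole document matching a CSS query.
    static int count(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    // Text of the first match, or "" when nothing matches.
    static String firstText(String html, String cssQuery) {
        Element first = Jsoup.parse(html).select(cssQuery).first();
        return first == null ? "" : first.text();
    }

    // select() on a single Element: the inner query is scoped to the first outer match.
    static int countWithin(String html, String outerQuery, String innerQuery) {
        Element outer = Jsoup.parse(html).select(outerQuery).first();
        return outer == null ? 0 : outer.select(innerQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count(HTML, "a[href]"));                  // 2
        System.out.println(count(HTML, "img[src$=.png]"));           // 1
        System.out.println(count(HTML, "h3.r > a"));                 // 1 (the nested link is not a direct child)
        System.out.println(countWithin(HTML, "div.masthead", "img")); // 2
    }
}
```

Since first() returns null when there is no match, checking for null before using the result is a sensible habit, as the helpers here do.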

A complete example of parsing HTML with Jsoup

package io.ting.helloworld;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;

public class dataCatch {
    public static void main(String[] args) throws IOException {

        File file = new File("c:/Users/admin/Desktop/score.html");
        Document doc = Jsoup.parse(file,"UTF-8");

        ArrayList<teamInfo> teams = new ArrayList<>();

        Elements allTeams = doc.select("tr[id^=TR_]");
        for (Element team : allTeams) {
            // Read the id and keep only the part after the last underscore
            String teamId = team.id();
            String[] arr = teamId.split("_");
            teamId = arr[arr.length-1];

            // Read the scores from the full-result and "bg" result cells
            Elements tds = team.children();
            String full = tds.select("td.acc_result_full").eachText().get(0);
            String bg = tds.select("td.acc_result_bg").eachText().get(0);

//            System.out.println(teamId+" "+full+" "+bg);
            teams.add(new teamInfo(teamId,full,bg));

        }

        for (teamInfo t : teams) {
            System.out.println(t.getTeamId()+" "+t.getResult_full()+" "+t.getResult_half());
        }
    }
}
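The listing above relies on a teamInfo class that is not shown. Here is a guess at what it might look like, reconstructed only from the constructor call and the three getters the listing uses. The field names are assumptions, and the lowercase class name is kept so that it matches the listing.

```java
// Hypothetical reconstruction: the original teamInfo class is not shown in this article.
public class teamInfo {
    private final String teamId;
    private final String result_full; // text of the td.acc_result_full cell
    private final String result_half; // text of the td.acc_result_bg cell

    public teamInfo(String teamId, String result_full, String result_half) {
        this.teamId = teamId;
        this.result_full = result_full;
        this.result_half = result_half;
    }

    public String getTeamId() {
        return teamId;
    }

    public String getResult_full() {
        return result_full;
    }

    public String getResult_half() {
        return result_half;
    }
}
```

Placed in the io.ting.helloworld package alongside dataCatch (with a matching package line added), a class like this should be enough for the listing to compile.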