jsoup is a Java library for working with HTML and XML markup. It provides an API to extract and manipulate data, letting us fetch and parse HTML and XML from a URL, file, or string.
Install jsoup with Maven
If you use Maven to manage project dependencies, add the following snippet to the dependencies section of your POM file.
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.15.3</version>
</dependency>
Install jsoup with Gradle
implementation 'org.jsoup:jsoup:1.15.3'
This tutorial covers most of the APIs provided by jsoup, each described with a code example. Feel free to download the entire project from the link provided at the end of the tutorial. The sample HTML used to exercise the API methods is listed below.
<html>
<head>
<title>Page Title</title>
</head>
<body>
<div id='toc' class='toc first'>Table of content</div>
<div id='index' class='toc'>
<table>
<tr><td>Name</td><td>Address</td></tr>
</table>
</div>
<div id='dynamic' class='toc'>Original Content</div>
<custom class='customClass commonClass'>Custom tag</custom>
<p>Parsed HTML into a doc.</p>
<a href='https://www/example.com/link1' class='hide'>Link 1</a>
<a href='https://www/example.com/link2' class='normal'>Link 2</a>
<a href='https://www/example.com/link3' class='normal'>Link 3</a>
<a href='https://www/example.com/link4' class='normal'>Link 4</a>
<a href='https://www/example.com/link5' class='normal'>Link 5</a>
<div>
<form action='/submit' name='myinputs' id='form'><input type='text' value='Name'/></form>
</div>
<input type='text' name='text box 1'/><button type='submit' class='hidden' value='Submit 1'/>
<input type='text' name='text box 2'/><button type='submit' value='Submit 2'/>
<input type='text' name='text box 3'/><button type='submit' value='Submit 3'/>
<input type='text' name='text box 4'/><button type='submit' class='hidden' value='Submit 4'/>
<input type='text' name='text box 5'/><button type='submit' value='Submit 5'/>
<customForm>
<form action='/submit' name='customInputs' id='custom'>
<input type='text' value='Lance'/></form>
</customForm>
</body>
</html>
Before interacting with HTML elements, we must parse the HTML string into a document model.
This document is similar to the browser's DOM, the document object we work with in JavaScript.
In jsoup, it is represented by the class org.jsoup.nodes.Document.
Document document = Jsoup.parse(html);
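The introduction mentions three input sources: a string, a file, and a URL. The following is a minimal sketch of the first two, not part of the tutorial's project: the class name `ParseSources`, the shortened HTML, and the temporary file are assumptions made for illustration. Fetching from a URL goes through `Jsoup.connect(url).get()`, which needs network access, so it appears only as a comment.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseSources {
    // Shortened, hypothetical stand-in for the sample HTML above.
    static final String HTML =
        "<html><head><title>Page Title</title></head><body><p>Hi</p></body></html>";

    // Parse markup held in a String.
    public static Document fromString() {
        return Jsoup.parse(HTML);
    }

    // Parse markup stored in a file; the charset name tells jsoup how to decode it.
    public static Document fromFile() {
        try {
            File file = File.createTempFile("jsoup-sample", ".html");
            file.deleteOnExit();
            Files.write(file.toPath(), HTML.getBytes(StandardCharsets.UTF_8));
            return Jsoup.parse(file, "UTF-8");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fromString().title());
        System.out.println(fromFile().title());
        // From a URL (requires network access, so not executed here):
        // Document doc = Jsoup.connect("https://example.com/").get();
    }
}
```

Both calls return the same kind of Document object, so every method shown in the rest of the tutorial applies regardless of where the markup came from.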
The code below sets up the scaffolding for testing our methods. We will add API calls to the parseHtml()
method as we progress through this tutorial.
Because many of the jsoup calls modify the HTML document, we call the method resetAndReloadDocument()
to restore the document object to its original state before the next API method.
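import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JSoup {
    static String html = ""; // Insert the sample HTML above here
    static Document document = Jsoup.parse(html);

    public static void main(String[] args) {
        parseHtml();
    }

    // Re-parses the original HTML so the next example starts from a clean document
    public static void resetAndReloadDocument() {
        document = Jsoup.parse(html);
    }

    public static void parseHtml() {
        // The test methods below go here
    }
}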
Getting page title
document.title()
retrieves the string contents of the document's title element.
System.out.println(document.title());
Output :
Page Title
Find elements from the document
document.select(query)
finds elements that match the selector query. The matches may include the element the search starts from and any of its descendants.
Elements elements = document.select("a");
Element a1 = elements.get(0);
System.out.println(a1.absUrl("href"));
a1 = elements.get(1);
System.out.println(a1.absUrl("href"));
a1 = elements.get(2);
System.out.println(a1.absUrl("href"));
Output :
https://www/example.com/link1
https://www/example.com/link2
https://www/example.com/link3
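The selector query accepts CSS-style syntax, so it can be narrowed beyond a bare tag name. The sketch below tries a few common forms on a trimmed-down copy of the sample markup; the class name `SelectorSketch` and the shortened href values are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorSketch {
    // A trimmed-down, hypothetical version of the tutorial's sample markup.
    static final String HTML =
        "<div id='toc' class='toc first'>Table of content</div>"
      + "<a href='/link1' class='hide'>Link 1</a>"
      + "<a href='/link2' class='normal'>Link 2</a>"
      + "<a href='/link3' class='normal'>Link 3</a>";

    public static Document doc() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = doc();
        System.out.println(doc.select("a").size());             // by tag name
        System.out.println(doc.select("a.normal").size());      // tag name plus class
        System.out.println(doc.select("#toc").text());          // by id
        System.out.println(doc.select("a[class=hide]").text()); // by attribute value
    }
}
```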
Find element by id
document.getElementById(id)
finds the first element with the matching ID, searching from this element downwards. To get the inner HTML of the element, use document.getElementById(id).html()
System.out.println(document.getElementById("toc"));
Output :
<div id="toc" class="toc first">
Table of content
</div>
Find element by attribute key and value
document.getElementsByAttributeValue(key, value)
finds elements that have an attribute with the specific value.
The key and value comparison is case insensitive. Use document.getElementsByAttributeValue("class", "customClass commonClass").html()
to get the inner HTML.
System.out.println(document.getElementsByAttributeValue("class", "customClass commonClass"));
Output :
<custom class="customClass commonClass">
Custom tag
</custom>
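Note that the value is compared against the whole attribute string, not against individual class names. The sketch below contrasts this with getElementsByClass, which matches a single class name; the class name `AttributeValueSketch` and its helper methods are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class AttributeValueSketch {
    static final String HTML =
        "<custom class='customClass commonClass'>Custom tag</custom>";

    static Document doc() {
        return Jsoup.parse(HTML);
    }

    // The full attribute value matches.
    public static int fullValueMatches() {
        return doc().getElementsByAttributeValue("class", "customClass commonClass").size();
    }

    // Only one of the two class names: the attribute value as a whole differs, so no match.
    public static int partialValueMatches() {
        return doc().getElementsByAttributeValue("class", "customClass").size();
    }

    // Different letter case still matches, since the comparison ignores case.
    public static int differentCaseMatches() {
        return doc().getElementsByAttributeValue("CLASS", "CUSTOMCLASS COMMONCLASS").size();
    }

    // getElementsByClass looks at individual class names instead.
    public static int classNameMatches() {
        return doc().getElementsByClass("customClass").size();
    }

    public static void main(String[] args) {
        System.out.println(fullValueMatches());
        System.out.println(partialValueMatches());
        System.out.println(differentCaseMatches());
        System.out.println(classNameMatches());
    }
}
```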
Replace an element's inner HTML
element.html(html)
clears the existing inner HTML and sets the new value as the inner HTML.
Element element = document.getElementById("dynamic");
element.html("Replaced content");
System.out.println(element.toString());
Output :
<div id="dynamic" class="toc">
Replaced content
</div>
Find element by selector query
document.select(query)
finds elements that match the selector query. Matched elements can contain their children.
elements = document.select("custom");
element = elements.get(0);
System.out.println(element.html());
element.html("Replaced custom tag content");
System.out.println(element.html());
Output :
Custom tag
Replaced custom tag content
Append an element to the end of the list
elements.add(Element e)
appends the specified element to the end of this list.
elements = document.select("custom");
element = new Element("anotherCustomTag");
element.html("Another custom tag");
elements.add(element);
System.out.println(elements);
Output :
<custom class="customClass commonClass">
Replaced custom tag content
</custom>
<anotherCustomTag>
Another custom tag
</anotherCustomTag>
Add an element to the specified location
elements.add(index, Element)
inserts the element at the specified index in this list. The element currently at that position (if any) and all subsequent elements shift one position to the right.
elements = document.select("custom");
element = new Element("custom_0").html("Another custom tag 0");
elements.add(0, element);
System.out.println(elements);
System.out.println("---------------------------");
element = new Element("custom_1");
element.html("Another custom tag 1");
elements.add(2, element);
System.out.println(elements);
Output :
<custom_0>
Another custom tag 0
</custom_0>
<custom class="customClass commonClass">
Replaced custom tag content
</custom>
---------------------------
<custom_0>
Another custom tag 0
</custom_0>
<custom class="customClass commonClass">
Replaced custom tag content
</custom>
<custom_1>
Another custom tag 1
</custom_1>
Append a collection of elements
elements.addAll(Collection<? extends Element> c)
appends all of the DOM elements in the collection to the end of the list.
elements = document.select("custom");
List<Element> collection = new ArrayList<>();
element = new Element("custom_0");
element.html("Another custom tag 0");
collection.add(element);
element = new Element("custom_1");
element.html("Another custom tag 1");
collection.add(element);
elements.addAll(collection);
System.out.println(elements);
Output :
<custom class="customClass commonClass">
Replaced custom tag content
</custom>
<custom_0>
Another custom tag 0
</custom_0>
<custom_1>
Another custom tag 1
</custom_1>
Append a collection of elements to a specified index position
elements.addAll(int index, Collection<? extends Element> c)
inserts all of the DOM elements in the collection into the list, starting at the specified index position.
elements = document.select("custom");
collection = new ArrayList<>();
element = new Element("custom_0");
element.html("Another custom tag 0");
collection.add(element);
element = new Element("custom_1");
element.html("Another custom tag 1");
collection.add(element);
elements.addAll(0,collection);
System.out.println(elements);
Output :
<custom_0>
Another custom tag 0
</custom_0>
<custom_1>
Another custom tag 1
</custom_1>
<custom class="customClass commonClass">
Replaced custom tag content
</custom>
Add a class name to an element
elements.addClass(String className)
adds the class name to every matched element's class attribute.
elements = document.select("custom");
elements.addClass("dynamicClass");
System.out.println(elements);
Output :
<custom class="customClass commonClass dynamicClass">
Custom tag
</custom>
Add element after given element
elements.after(String html)
inserts the HTML after each matched element's outer HTML.
elements = document.select("custom");
elements.after("<anotherCustom>Another custom element</anotherCustom>");
System.out.println(document);
Output :
...
<custom class="customClass commonClass">Custom tag</custom>
<anothercustom>Another custom element</anothercustom>
...
Add an element to the end of the inner HTML
elements.append(String html)
adds the supplied HTML to the end of each matched element's inner HTML.
elements = document.select("custom");
elements.append("<anotherCustom>Another custom element</anotherCustom>");
System.out.println(elements);
Output :
<custom class="customClass commonClass">
Custom tag
<anothercustom>
Another custom element
</anothercustom>
</custom>
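The difference between after() and append() is where the new markup lands: after() creates a sibling, append() creates a child. The sketch below checks both on a small, made-up fragment; the class name `AfterVsAppend` and the markup are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AfterVsAppend {
    static final String HTML = "<div id='box'><p>One</p></div>";

    // after(): the new markup becomes the next sibling of the matched element.
    public static Document withAfter() {
        Document doc = Jsoup.parse(HTML);
        doc.select("p").after("<span>Two</span>");
        return doc;
    }

    // append(): the new markup becomes the last child of the matched element.
    public static Document withAppend() {
        Document doc = Jsoup.parse(HTML);
        doc.select("p").append("<span>Two</span>");
        return doc;
    }

    public static void main(String[] args) {
        Element afterBox = withAfter().getElementById("box");
        System.out.println(afterBox.children().size());   // <p> and <span> are siblings

        Element appendBox = withAppend().getElementById("box");
        System.out.println(appendBox.children().size());  // <span> is nested inside <p>
    }
}
```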
Get attribute value of an element
element.attr(String attributeKey)
gets the value of the attribute with the given key from the element. The attributeKey
lookup is case insensitive.
element = document.getElementById("dynamic");
System.out.println(element.attr("class"));
Output :
toc
Set attribute
element.attr(String attributeKey, String attributeValue)
sets an attribute value on the element. If the element already has an attribute with the key, its value is updated. Otherwise, the new attribute is added.
element = document.getElementById("dynamic");
element.attr("class", "newClass");
System.out.println(element);
Output :
<div id="dynamic" class="newClass">
Original Content
</div>
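Two details of attr() are easy to miss: reading an attribute that is not present returns an empty string rather than null, and the setter returns the element so calls can be chained. A minimal sketch follows, assuming a one-element document; the class name `AttrSketch` and the `title` attribute are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class AttrSketch {
    // Returns a fresh copy of the sample element on every call.
    public static Element sample() {
        return Jsoup.parse("<div id='dynamic' class='toc'>Original Content</div>")
                    .getElementById("dynamic");
    }

    public static void main(String[] args) {
        Element div = sample();
        System.out.println(div.attr("class"));            // value of an existing attribute
        System.out.println(div.attr("title").isEmpty());  // a missing attribute yields an empty string

        // The setter returns the element, so several attributes can be set in one chain.
        div.attr("class", "newClass").attr("title", "Demo");
        System.out.println(div.attr("class") + " / " + div.attr("title"));
    }
}
```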
Insert element before given element
elements.before(String html)
inserts the supplied HTML before each matched element's outer HTML.
elements = document.select("custom");
elements.before("<before>Insert before</before>");
System.out.println(document);
Output :
...
<before>
Insert before
</before>
<custom class="customClass commonClass">
Custom tag
</custom>
...
Remove all elements
elements.clear()
removes all of the DOM elements from this list. The list will be empty after this call returns.
elements = document.select("custom");
System.out.println(elements);
elements.clear();
System.out.println(elements.isEmpty());
Output :
<custom class="customClass commonClass">
Custom tag
</custom>
true
Deep copy DOM elements
elements.clone()
creates a deep copy of these elements.
elements = document.select("a");
Elements cloned = elements.clone();
System.out.println(cloned);
Output :
<a href="https://www/example.com/link1" class="hide">Link 1</a>
<a href="https://www/example.com/link2" class="normal">Link 2</a>
<a href="https://www/example.com/link3" class="normal">Link 3</a>
<a href="https://www/example.com/link4" class="normal">Link 4</a>
<a href="https://www/example.com/link5" class="normal">Link 5</a>
Check if specific element present
elements.contains(Element element)
returns true if the element list contains the specified element. More formally, with o being the specified element, it returns true if and only if this list contains at least one element e such that (o==null ? e==null : o.equals(e))
element = document.select("a").get(1);
elements = document.select("a");
System.out.println(elements.contains(element));
element = document.select("custom").get(0);
System.out.println(elements.contains(element));
Output :
true
false
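Because contains() relies on equals(), the result depends on how jsoup compares elements. The sketch below assumes a jsoup version in which equals() on a node is an object identity check (to my knowledge this holds for recent releases, including the 1.15.x line used here), so a clone with identical markup is not reported as contained. The class name `ContainsSketch` and the markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class ContainsSketch {
    public static Elements links() {
        return Jsoup.parse("<a href='/1'>Link 1</a><a href='/2'>Link 2</a>").select("a");
    }

    // The very same object that sits in the list.
    public static boolean containsOriginal() {
        Elements links = links();
        Element original = links.get(1);
        return links.contains(original);
    }

    // A deep copy with identical markup, but a different object.
    // Assumption: equals() is an identity check in the jsoup version in use.
    public static boolean containsClone() {
        Elements links = links();
        Element copy = links.get(1).clone();
        return links.contains(copy);
    }

    public static void main(String[] args) {
        System.out.println(containsOriginal());
        System.out.println(containsClone());
    }
}
```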
Check if all elements present
elements.containsAll(Collection<?> c)
returns true if the element list contains all of the elements in the specified collection.
elements = document.select("a");
cloned = document.select("a");
System.out.println(elements.containsAll(cloned));
Output :
true
Remove all child nodes from element
elements.empty()
removes all child nodes from each matched element. This is similar to setting the inner HTML of each element to empty.
elements = document.select("div");
elements.empty();
System.out.println(elements);
Output :
<div id="toc" class="toc first"></div>
<div id="index" class="toc"></div>
<div id="dynamic" class="toc"></div>
<div></div>
Get the nth element as an Elements object
elements.eq(int index)
gets the nth matched element (the index is zero-based) as an Elements object.
elements = document.select("a");
System.out.println(elements.eq(0));
Output :
<a href="https://www/example.com/link1" class="hide">Link 1</a>
Get the first matched element
elements.first()
gets the first matched element.
elements = document.select("a");
System.out.println(elements.first());
Output :
<a href="https://www/example.com/link1" class="hide">Link 1</a>
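eq(), get(), and first() look alike but differ in return type and in how they handle a missing element. The sketch below is a minimal comparison on made-up markup; the class name `EqVsFirst` is an assumption for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class EqVsFirst {
    public static Elements links() {
        return Jsoup.parse("<a href='/1'>Link 1</a><a href='/2'>Link 2</a>").select("a");
    }

    public static void main(String[] args) {
        Elements links = links();

        // eq(index) wraps the single match in a new Elements list.
        Elements second = links.eq(1);
        System.out.println(second.size() + " -> " + second.text());

        // first() and get(index) return the Element itself.
        Element first = links.first();
        System.out.println(first.text());

        // An out-of-range index gives an empty Elements from eq(), not an exception.
        System.out.println(links.eq(5).isEmpty());

        // first() on an empty selection returns null, so check before using it.
        System.out.println(Jsoup.parse("<p></p>").select("a").first() == null);
    }
}
```

Since eq() always returns a list, follow-up bulk calls such as addClass() or html() stay safe even when nothing matched, whereas the result of first() needs a null check.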
Check if at least one element has a specified attribute set
elements.hasAttr(String attributeKey)
checks if any of the matched elements have this attribute set
elements = document.select("div");
System.out.println(elements.hasAttr("class"));
Output :
true
Check if the specified class exists
elements.hasClass(String className)
checks if any of the matched elements have this class name set in their class attribute
elements = document.select("div");
System.out.println(elements.hasClass("toc"));
System.out.println(elements.hasClass("cot"));
Output :
true
false
Get combined inner HTML of elements
elements.html()
retrieves the combined inner HTML of all matched elements
elements = document.select("div");
System.out.println(elements.html());
Output :
Table of content
<table>
<tbody>
<tr>
<td>Name</td>
<td>Address</td>
</tr>
</tbody>
</table>
Original Content
<form action="/submit" name="myinputs" id="form">
<input type="text" value="Name">
</form>
Set inner HTML of all elements
elements.html(String html)
sets the inner HTML of each matched element.
elements = document.select("div");
System.out.println(elements.html("New HTML"));
Output :
<div id="toc" class="toc first">New HTML</div>
<div id="index" class="toc">New HTML</div>
<div id="dynamic" class="toc">New HTML</div>
<div>New HTML</div>