Introduction to Web Scraping With Java

Web scraping, or crawling, is the practice of fetching data from a third-party website by downloading and parsing its HTML code to extract the data you want.

Since not every website offers a clean API, or any API at all, web scraping can be the only way to extract information from a website. Many companies use it to track competitor prices, aggregate news, or collect email addresses in bulk.

Almost everything can be extracted from HTML; the only information that is “difficult” to extract is what sits inside images or other media.

In this post, we are going to look at basic techniques for fetching and parsing data in Java.

Let’s scrape Craigslist

For our first example, we are going to fetch items from Craigslist, since it doesn’t seem to offer an API, collect their names, prices, and images, and export them to JSON.

First, let’s take a look at what happens when you search for an item on Craigslist. Now open your favorite IDE; it is time to code. HtmlUnit needs a WebClient to make a request. The client has many options (proxy settings, browser, whether redirects are enabled, and so on).

We are going to disable JavaScript, since it’s not required for our example and disabling it makes the page load faster.
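As a minimal sketch of this setup, the class below creates a WebClient with JavaScript (and CSS) turned off and fetches a search page. Several things here are assumptions rather than details from this post: the class name CraigslistFetch, the search URL pattern, the example query, and the HtmlUnit 2.x package name com.gargoylesoftware.htmlunit (HtmlUnit 3.x moved to org.htmlunit).

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class CraigslistFetch {

    // Assumed URL pattern for a Craigslist search; check it against the
    // address your browser shows when you run a search yourself.
    static String searchUrl(String query) {
        return "https://newyork.craigslist.org/search/sss?sort=rel&query="
                + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    // Builds a client that skips CSS and JavaScript processing.
    static WebClient newClient() {
        WebClient client = new WebClient();
        client.getOptions().setCssEnabled(false);
        client.getOptions().setJavaScriptEnabled(false);
        return client;
    }

    public static void main(String[] args) throws IOException {
        // WebClient is AutoCloseable, so try-with-resources releases it.
        try (WebClient client = newClient()) {
            HtmlPage page = client.getPage(searchUrl("iphone 6s"));
            System.out.println(page.getTitleText());
        }
    }
}
```

Keeping the client configuration in its own method makes it easy to add other options later, such as a proxy, without touching the fetching code.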

Since there isn’t any ID we could use, we have to write an XPath expression to select the tags we want.

XPath is a query language for selecting XML nodes (HTML in our case).

First, we are going to select all the <p> tags that have the class “result-info”.
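To see what such an expression selects, here is a small sketch that runs //p[@class='result-info'] against a made-up, well-formed snippet. It uses the XPath API that ships with the JDK instead of HtmlUnit so that it runs without any dependency; the sample markup and the class name XPathSelectDemo are assumptions for illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XPathSelectDemo {

    // Made-up sample markup (an assumption, not real Craigslist HTML).
    static final String SAMPLE =
            "<ul>"
            + "<li><p class='result-info'><a href='/a.html'>iPhone 6s</a></p></li>"
            + "<li><p class='other'>not a result</p></li>"
            + "<li><p class='result-info'><a href='/b.html'>iPhone 7</a></p></li>"
            + "</ul>";

    // Parses the markup and returns every node matched by the expression.
    static NodeList select(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // "//p" matches <p> anywhere; the predicate keeps only one class.
        NodeList hits = select(SAMPLE, "//p[@class='result-info']");
        System.out.println(hits.getLength() + " matching <p> tags");
        for (int i = 0; i < hits.getLength(); i++) {
            System.out.println(hits.item(i).getTextContent());
        }
    }
}
```

Note that @class='result-info' compares the whole attribute value, so a tag carrying several classes would need a predicate such as contains(@class, 'result-info') instead.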

Then we will iterate through this list and, for each item, select the name, price, and URL, and print them.
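The loop can be sketched as follows, again with the JDK's built-in XPath API on made-up markup rather than HtmlUnit on the live page. The tag layout (an <a> for the name and URL, a <span class='result-price'> for the price), the "n/a" fallback for a missing price, and the class name ResultLoopDemo are all assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ResultLoopDemo {

    // Made-up sample markup; the second item deliberately has no price.
    static final String SAMPLE =
            "<ul>"
            + "<li><p class='result-info'>"
            + "<a href='https://example.org/1.html'>iPhone 6s</a>"
            + "<span class='result-price'>$250</span></p></li>"
            + "<li><p class='result-info'>"
            + "<a href='https://example.org/2.html'>iPhone 7</a></p></li>"
            + "</ul>";

    // Returns one "name | price | url" line per result item.
    static List<String> describeItems(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList items = (NodeList) xpath.evaluate(
                    "//p[@class='result-info']", doc, XPathConstants.NODESET);

            List<String> lines = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Node item = items.item(i);
                // Relative expressions are evaluated against this item only.
                String name = xpath.evaluate("a", item);
                String url = xpath.evaluate("a/@href", item);
                // evaluate() returns "" when nothing matches.
                String price = xpath.evaluate("span[@class='result-price']", item);
                lines.add(name + " | " + (price.isEmpty() ? "n/a" : price) + " | " + url);
            }
            return lines;
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalStateException("Could not parse the sample markup", e);
        }
    }

    public static void main(String[] args) {
        describeItems(SAMPLE).forEach(System.out::println);
    }
}
```

Evaluating the inner expressions relative to each item, instead of running three separate document-wide queries, keeps each name, price, and URL paired with the right result even when a field is missing.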

Then, instead of just printing the results, we are going to put them into JSON, using the Jackson library to map items to JSON format.

We need a POJO (Plain Old Java Object) to represent items.
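A possible shape for that POJO, together with the Jackson mapping, is sketched below. The field names (title, price, url), the use of BigDecimal for the price, and the helper class ItemJsonDemo are assumptions made for illustration; the only Jackson calls used are ObjectMapper.writeValueAsString and ObjectMapper.readValue from jackson-databind.

```java
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.math.BigDecimal;

// Plain data holder: private fields plus getters and setters,
// which Jackson discovers through reflection.
class Item {
    private String title;
    private BigDecimal price;
    private String url;

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public BigDecimal getPrice() { return price; }
    public void setPrice(BigDecimal price) { this.price = price; }

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
}

class ItemJsonDemo {

    // ObjectMapper is thread-safe once configured, so one instance is shared.
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Serializes an item to a JSON string.
    static String toJson(Item item) {
        try {
            return MAPPER.writeValueAsString(item);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads an item back from a JSON string.
    static Item fromJson(String json) {
        try {
            return MAPPER.readValue(json, Item.class);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Item item = new Item();
        item.setTitle("iPhone 6s");
        item.setPrice(new BigDecimal("250"));
        item.setUrl("https://example.org/1.html");
        System.out.println(toJson(item));
    }
}
```

In the scraper, an Item would be filled inside the loop over the results and passed to toJson in place of the print statement.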

Go further

This example is not perfect; there are many things that can be improved:

  • Multi-city search
  • Handling pagination
  • Multi-criteria search

You can find the code in this GitHub repo.

If you like web scraping and are tired of taking care of proxies, JS rendering, and captchas, you can check out our new web scraping API; the first 1000 API calls are on us.

Further reading

I recently wrote a blog post, Web Scraping without getting blocked, that explains the different techniques for hiding your scrapers. Check it out!

HtmlUnit is a great headless browser, but you may want to try Headless Chrome for better JavaScript rendering. We wrote a great introduction to Headless Chrome; don’t hesitate to take a look.

If you prefer Python over Java, don’t hesitate to look at our ultimate guide to web scraping with Python.

Admin
Known for his amazing writing and technical blogging skills, Edward Thompson is the admin of the Techenger. He joined in 2019, after moving from San Francisco to Chicago and switching from his role of staff writer to guest blogger, and he has never looked back. In a nutshell, he is a tech enthusiast who loves to write, read, test, evaluate, and spread knowledge about the growing technology that surrounds mankind.
