
Android XML Adventure – Parsing HTML using HtmlCleaner


Article Series: Android XML Adventure

Author: Pete Houston (aka. `xjaphx`)

TABLE OF CONTENTS

  1. What is the “Thing” called XML?
  2. Parsing XML Data w/ SAXParser
  3. Parsing XML Data w/ DOMParser
  4. Parsing XML Data w/ XMLPullParser
  5. Create & Write XML Data
  6. Compare: XML Parsers
  7. Parsing XML using XPath
  8. Parsing HTML using HtmlCleaner
  9. Parsing HTML using JSoup
  10. Sample Project 1: RSS Parser – using SAXParser
  11. Sample Project 1: RSS Parser – using DOM Parser
  12. Sample Project 1: RSS Parser – using XMLPullParser
  13. Sample Project 2: HTML Parser – using HtmlCleaner
  14. Sample Project 2: HTML Parser – using JSoup
  15. Finalization on the “Thing” called XML!

=========================================

It has been a long time, but I’d like to come back to this series. Sorry for making you all wait.

In this article, I will give a simple guide on how to use HtmlCleaner to parse HTML and query it with XPath.

You might already know what XPath is and how to use the XPath library on Android.

This time, we will use an XPath query to pull the value we want from an HTML page rather than an XML file. Interesting, isn’t it?

The HTML Page target is my blog: https://xjaphx.wordpress.com/

The data we want to parse is the “Statistics” widget, the number of views on my blog, which sits on the bottom-right side of the page. The current number is: 80,303 views.

The XPath for this is: "//div[@id='blog-stats']/ul/li"
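
To see what this query selects without touching the network, here is a minimal sketch that runs the same XPath against a tiny, well-formed snippet using the standard javax.xml.xpath API from the JDK. It is not HtmlCleaner: the SAMPLE markup and the XPathStatsDemo class are made up for illustration, and real pages are rarely well-formed XML, which is why we need HtmlCleaner for the actual blog page.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathStatsDemo {
    // Hypothetical, simplified stand-in for the blog markup; the real page is much larger.
    static final String SAMPLE =
        "<html><body>"
        + "<div id='sidebar'><ul><li>Not this one</li></ul></div>"
        + "<div id='blog-stats'><ul><li>80,303 views</li></ul></div>"
        + "</body></html>";

    // The same query used in the article.
    static final String XPATH_STATS = "//div[@id='blog-stats']/ul/li";

    static String queryStats(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // evaluate(String, InputSource) returns the text of the first matching node
        return xpath.evaluate(XPATH_STATS, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(queryStats(SAMPLE)); // prints "80,303 views"
    }
}
```

The query skips the first list because its parent div has a different id, and only the li under the div with id "blog-stats" is returned.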

First, get the HtmlCleaner library from here and set it up: http://htmlcleaner.sourceforge.net/

Open Eclipse and create a new project, then right-click the project in the left pane, select Properties, and go to Java Build Path.

HtmlCleaner Setup


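On the Libraries tab, click the “Add External JARs” button on the right side. A dialog for selecting JAR files opens; select the HtmlCleaner library and click Open. That’s all it takes to set up the library. (Some readers report in the comments that they also had to copy the JAR into a folder named libs inside the project to avoid a NoClassDefFoundError.)

Next is the layout of the application. I use the default one with a single TextView, which is just enough to confirm the value.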

Let’s get straight to the source code 🙂

package pete.android.study;

import java.net.URL;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

import android.app.Activity;
import android.os.Bundle;
import android.widget.TextView;

public class HtmlCleanerStudyActivity extends Activity {

	// HTML page
	static final String BLOG_URL = "https://xjaphx.wordpress.com/";
	// XPath query
	static final String XPATH_STATS = "//div[@id='blog-stats']/ul/li";

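    @Override
    public void onCreate(Bundle savedInstanceState) {
        // init view layout
        super.onCreate(savedInstanceState);
        setContentView(R.layout.main);

        final TextView tv = (TextView) findViewById(R.id.tv);

        // Network access must stay off the main thread; Android 3.0 and later
        // throw NetworkOnMainThreadException otherwise.
        new Thread(new Runnable() {
            @Override
            public void run() {
                String value;
                try {
                    value = getBlogStats();
                } catch (Exception ex) {
                    // show the actual exception instead of a bare "Error"
                    value = "Error: " + ex;
                }
                final String output = value;
                // views may only be touched from the UI thread
                runOnUiThread(new Runnable() {
                    @Override
                    public void run() {
                        tv.setText(output);
                    }
                });
            }
        }).start();
    }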

    /*
     * get blog statistics
     */
    public String getBlogStats() throws Exception {
    	String stats = "";

    	// config cleaner properties
    	HtmlCleaner htmlCleaner = new HtmlCleaner();
    	CleanerProperties props = htmlCleaner.getProperties();
    	props.setAllowHtmlInsideAttributes(false);
    	props.setAllowMultiWordAttributes(true);
    	props.setRecognizeUnicodeChars(true);
    	props.setOmitComments(true);

    	// create URL object
    	URL url = new URL(BLOG_URL);
    	// get HTML page root node
    	TagNode root = htmlCleaner.clean(url);

    	// query XPath
    	Object[] statsNode = root.evaluateXPath(XPATH_STATS);
    	// process data if found any node
    	if(statsNode.length > 0) {
    		// I already know there's only one node, so pick index at 0.
    		TagNode resultNode = (TagNode)statsNode[0];
    		// get text data from HTML node
    		stats = resultNode.getText().toString();
    	}

    	// return value
    	return stats;
    }
}
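
The value returned by getBlogStats() is plain text. Assuming it looks like "80,303 views" (the exact wording of the widget is an assumption), a small helper such as the hypothetical StatsText.parseViews below can turn it into a number. It uses only java.util.regex from the JDK, nothing from HtmlCleaner.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StatsText {
    // Matches the first run of digits, optionally grouped with commas, e.g. "80,303".
    private static final Pattern COUNT = Pattern.compile("\\d[\\d,]*");

    /** Returns the first number found in the text, or -1 if there is none. */
    public static long parseViews(String statsText) {
        Matcher m = COUNT.matcher(statsText);
        if (!m.find()) {
            return -1;
        }
        // drop the thousands separators before converting
        return Long.parseLong(m.group().replace(",", ""));
    }

    public static void main(String[] args) {
        System.out.println(parseViews("80,303 views")); // prints 80303
    }
}
```

Returning -1 for text without digits keeps the sketch short; throwing an exception would be just as reasonable.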

Also, remember to add the INTERNET permission (<uses-permission android:name="android.permission.INTERNET" />) to AndroidManifest.xml.

Run it, and get the result:

HtmlCleaner Output


This is the output on my phone, a Galaxy S II.

This library is simple and pretty fast, and I like using it. If you know of any better libraries, please let me know; I’d like to try them too.

In case you have any trouble, you can get the full source code: Get HtmlCleaner Sample Project

Cheers,

Pete Houston

Categories: Tutorials
  1. June 15, 2016 at 3:39 pm

    Thanks for this useful information…..

  2. August 27, 2014 at 8:30 pm

    Hey! I know this is somewhat off topic but I was wondering
    which blog platform are you using for this site?
    I’m getting fed up of WordPress because I’ve had problems with hackers and I’m
    looking at alternatives for another platform.
    I would be awesome if you could point me in the direction of a good platform.

  3. August 20, 2014 at 2:53 pm

    Thanks , I’ve just been looking for information about this topic for a while and yours is the best I have found out
    so far. However, what about the conclusion? Are you sure about the supply?

  4. July 10, 2014 at 2:54 pm

    An outstanding share! I’ve just forwarded this onto a coworker who has been doing a little homework on this.
    And he in fact bought me lunch because I stumbled upon it
    for him… lol. So allow me to reword this…. Thanks for the meal!!
    But yeah, thanx for spending some time to discuss this topic here on your web site.

  5. Aidar
    December 13, 2013 at 2:40 am

    Thanks for the good article. It works… but I have one question.
    How do I add AsyncTask and ProgressDialog to this code?

    P.S. Sorry for my English)))

  6. October 19, 2013 at 3:18 pm

    Woah this blog is excellent i like reading your articles. Stay up the great work! You understand, lots of people are hunting around for this info, you could help them
    greatly.

  7. October 18, 2013 at 1:23 pm

    I actually tend to agree with every aspect that is composed throughout “Android XML Adventure – Parsing HTML using HtmlCleaner | [ Android Newbie ]”.
    Thank you for all of the details.Thanks for your effort-Denisha

  8. September 19, 2013 at 1:13 pm

    Hi everyone, it’s my first visit at this web page, and piece
    of writing is genuinely fruitful designed for me, keep up posting these types of posts.

  9. Alexx
    September 19, 2013 at 12:58 am

    android.os.NetworkOnMainThreadException

  10. August 16, 2013 at 9:44 pm

    One of my relatives pointed me to read this specific blog and contact you.
    Maybe you are interested in partnership. Can I quote you on my web sites?

  11. Jaime
    July 17, 2013 at 7:53 am

    Hello
    This article is pretty good. There are other posts that did not interest me much; in any case, most of them are pretty good.

    😉

  12. Iluxa
    December 17, 2012 at 11:04 pm

    “Unfortunately, htmlcleanerstudy has stopped.” (
    I get this error on any AVD. Could anyone explain what the problem is, please?

  13. December 1, 2012 at 2:44 pm

    thank you so much …this code usefull for me …

  14. jojo
    June 30, 2012 at 12:42 am

    Hey guys,

    I have a problem with this line: TagNode root = htmlCleaner.clean(url);

    If I call System.out.println("bla bla") before that line, I get the string in LogCat. If I call System.out… after the line, nothing shows up in LogCat; I only see the error message in my TextView. I copied the code from this site. I am using 2.3.3 and my test device is a Galaxy Note 4.0.1.

    @coolsax: first, you have to create a folder called libs in your project. Then copy the library file into the libs folder manually and add the library with “Add JARs” under the project properties.

    • jojo
      June 30, 2012 at 2:25 am

      Hey guys… here is my code that resolves the NetworkOnMainThreadException problem:

      The MainActivity:

      package de.adcont.htmlcleaner;

      import android.app.Activity;
      import android.os.Bundle;
      import android.view.View;
      import android.view.View.OnClickListener;
      import android.widget.Button;
      import android.widget.TextView;

      public class MainActivity extends Activity implements OnClickListener {
          private Button checkBtn = null;

          static final String BLOG_URL = "https://xjaphx.wordpress.com";

          static final String XPATH_STATS = "//div[@id='blog-stats']/ul/li";

          @Override
          public void onCreate(Bundle savedInstanceState) {
              super.onCreate(savedInstanceState);
              setContentView(R.layout.main);

              checkBtn = (Button) findViewById(R.id.check);
              checkBtn.setOnClickListener(this);
          }

          private void startURL() {
              LoadAsyncTask loadTask = new LoadAsyncTask();
              loadTask.execute(BLOG_URL, XPATH_STATS, this);
          }

          private void initURL() {
          }

          public void showURL(String urlResult) {
              ((TextView) findViewById(R.id.tv)).setText(urlResult);
          }

          public void onClick(View view) {
              long clickedElement = view.getId();
              if (clickedElement == R.id.check) {
                  initURL();
                  startURL();
              }
          }
      }

      and here is my AsyncTask class:

      package de.adcont.htmlcleaner;

      import java.net.URL;

      import org.htmlcleaner.CleanerProperties;
      import org.htmlcleaner.HtmlCleaner;
      import org.htmlcleaner.TagNode;

      import android.os.AsyncTask;

      // Type parameters: Object params in, no progress updates, String result out.
      public class LoadAsyncTask extends AsyncTask<Object, Void, String> {
          MainActivity mainActivity;

          // Runs on a background thread, so network access is allowed here.
          @Override
          protected String doInBackground(Object... params) {
              String serviceUrl = (String) params[0];
              String xPath = (String) params[1];
              mainActivity = (MainActivity) params[2];
              try {
                  return getBlogStats(serviceUrl, xPath);
              } catch (Exception ex) {
                  return "Error";
              }
          }

          // Runs on the UI thread once doInBackground() has returned.
          @Override
          protected void onPostExecute(String response) {
              try {
                  mainActivity.showURL(response);
              } catch (Exception ex) {
                  System.out.println("What a crappy error");
              }
              super.onPostExecute(response);
          }

          @Override
          protected void onPreExecute() {
              super.onPreExecute();
          }

          public String getBlogStats(String incommingURL, String incommingXPath) throws Exception {
              String stats = "";

              HtmlCleaner htmlCleaner = new HtmlCleaner();
              CleanerProperties props = htmlCleaner.getProperties();
              props.setAllowHtmlInsideAttributes(false);
              props.setAllowMultiWordAttributes(true);
              props.setRecognizeUnicodeChars(true);
              props.setOmitComments(true);

              URL url = new URL(incommingURL);

              TagNode root = htmlCleaner.clean(url);
              System.out.println("The output: Asi3");

              Object[] statsNode = root.evaluateXPath(incommingXPath);

              if (statsNode.length > 0) {
                  TagNode resultNode = (TagNode) statsNode[0];
                  stats = resultNode.getText().toString();
              }

              return stats;
          }
      }

      Thanks so much Pete Houston 😉

  15. emilianorene
    May 25, 2012 at 3:59 am

    coolsax :
    I’m having a problem as soon as it hits “HtmlCleaner htmlCleaner = new HtmlCleaner();” when you first start to get the blog statistics. From LogCat it states “java.lang.NoClassDefFoundError: org.htmlcleaner.HtmlCleaner” yet I have it set up correctly in the properties. I’m using eclipse with the downloaded source code from your site. Whether it be an emulator with 2.3.3 or my phone with 2.3.5 it force closes. Any help would be appreciated.

    I have the same error… what is the solution? thanks

  16. April 28, 2012 at 2:49 am

    At first I also got the following errors: android java.lang.NoClassDefFoundError org.jsoup.jsoup (and the same for HtmlCleaner). After I renamed lib to libs, it all went well!
    I’m using Android 2.2, JSoup 1.6.2 and HtmlCleaner 2.1.1, and I also changed the Compiler Compliance Level from 1.5 to 1.6.

  17. Eddie
    April 24, 2012 at 9:41 am

    I also have the same problem as coolsax by using android 2.3.3. It can’t work. Could anyone help me? thanks

  18. April 3, 2012 at 4:23 am

    Hmm, seems the same thing happens with JSoup. I completed every step several times. Not sure why it’s not recognizing it.

  19. April 3, 2012 at 4:08 am

    I’m having a problem as soon as it hits “HtmlCleaner htmlCleaner = new HtmlCleaner();” when you first start to get the blog statistics. From LogCat it states “java.lang.NoClassDefFoundError: org.htmlcleaner.HtmlCleaner” yet I have it set up correctly in the properties. I’m using eclipse with the downloaded source code from your site. Whether it be an emulator with 2.3.3 or my phone with 2.3.5 it force closes. Any help would be appreciated.

  20. jeremy
    April 1, 2012 at 11:46 am

    I am using android simulator 4.0.3 btw

  21. jeremy
    April 1, 2012 at 11:46 am

    It’s not working. The TextView shows me the word “Error” instead of the total number of people who visited your blog. I did everything exactly according to your steps. Please advise what the likely cause could be. Thanks.

    • miracle
      May 13, 2012 at 11:07 pm

      If you modify the catch block like this:
      catch (Exception ex) {
          ((TextView) findViewById(R.id.tvAll)).setText(ex.toString());
      }
      you’ll see what the error is. Most probably it is a NetworkOnMainThreadException.

      • July 18, 2013 at 2:12 pm

        Yes, it is NetworkOnMainThreadException. But how do I get rid of it? Please help.
