Archive

Posts Tagged ‘jsoup’

A Note when Using Jsoup: User-Agent

January 29, 2013 1 comment

Several days ago, I tried running Jsoup on a mobile device to parse some data. My goal was to parse all the questions posted on stackoverflow.com.

However, the results were not what I expected.

My first attempt was this simple Android code:

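import java.util.ArrayList;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import android.app.Activity;
import android.os.Bundle;
import android.widget.ArrayAdapter;
import android.widget.ListView;

public class MainScreen extends Activity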
{
    ArrayList<String> mData =  new ArrayList<String>();
    ListView mListView;
    ArrayAdapter<String> mAdapter;

    @Override
    public void onCreate(Bundle savedInstanceState)
    {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.main);

        mListView = (ListView) findViewById(R.id.listView);

        processData();

        mAdapter = new ArrayAdapter<String>(this, android.R.layout.simple_list_item_1, android.R.id.text1, mData);
        mListView.setAdapter(mAdapter);
    }

    private void processData() {
        String URL = "http://stackoverflow.com/questions/tagged/android";
        try {
            Document doc = Jsoup.connect(URL).get();
            Elements questions = doc.select(".summary h3 a");
            for(Element question: questions) {
                mData.add(question.text());
            }

            if(mData.size() == 0) {
                mData.add("Empty result");
            }

        } catch (Exception ex) {
            ex.printStackTrace();
            mData.clear();
            mData.add("Exception: " + ex.toString());
        }
    }
}

The result was empty. Suspecting that something else was going on, I next printed the HTML from the “doc” object: it contained only part of the HTML I expected. I then tried the selector “div.nav li a”, which returned results, while “.summary h3 a” still returned nothing.

After two days of working with Jonathan Hedley on GitHub, we finally found the problem: the mobile user-agent differs from a desktop browser’s, so the server returns different HTML.
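The check described above can be sketched offline as follows. This is only an illustration: the class name `SelectorCheck`, the helper `countMatches`, and the sample markup are hypothetical, and the HTML is a stand-in for a trimmed-down response, not the real Stack Overflow page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectorCheck {

    // Hypothetical markup standing in for a reduced response;
    // it has a navigation list but no ".summary" blocks.
    static final String SAMPLE_HTML =
            "<html><body>"
            + "<div class=\"nav\"><ul>"
            + "<li><a href=\"/questions\">Questions</a></li>"
            + "<li><a href=\"/tags\">Tags</a></li>"
            + "</ul></div>"
            + "</body></html>";

    // Returns how many elements in the given HTML match the CSS selector.
    static int countMatches(String html, String cssSelector) {
        Document doc = Jsoup.parse(html);
        Elements matches = doc.select(cssSelector);
        return matches.size();
    }

    public static void main(String[] args) {
        // Dump the parsed HTML to see what was actually received.
        System.out.println(Jsoup.parse(SAMPLE_HTML).html());
        // Compare a selector that matches with one that does not.
        System.out.println(countMatches(SAMPLE_HTML, "div.nav li a"));  // prints 2
        System.out.println(countMatches(SAMPLE_HTML, ".summary h3 a")); // prints 0
    }
}
```

If a broad selector matches but the one you need does not, the markup you received is probably not the markup you inspected in your desktop browser.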

A note to mobile developers who use Jsoup:

+ always set a desktop user-agent

+ set a timeout

These are good practices for avoiding unexpected results.

This is the updated, working line:

Document doc = Jsoup.connect(URL).userAgent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X; de-de) AppleWebKit/523.10.3 (KHTML, like Gecko) Version/3.0.4 Safari/523.10").get();
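Both recommendations, a desktop user-agent and a timeout, could be combined roughly like this. It is a minimal sketch: the class `DesktopFetch`, its method names, and the 10-second timeout are my own assumptions, so pick a value that suits your network.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DesktopFetch {

    // The desktop user-agent string from the working line above.
    static final String DESKTOP_USER_AGENT =
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; de-de) "
            + "AppleWebKit/523.10.3 (KHTML, like Gecko) Version/3.0.4 Safari/523.10";

    // Assumed value for illustration: 10 seconds, in milliseconds.
    static final int TIMEOUT_MILLIS = 10000;

    // Builds a connection with the user-agent and timeout set.
    // Nothing is sent over the network until get() is called.
    static Connection desktopConnection(String url) {
        return Jsoup.connect(url)
                .userAgent(DESKTOP_USER_AGENT)
                .timeout(TIMEOUT_MILLIS);
    }

    // Fetches and parses the page; throws IOException on network errors or timeout.
    static Document fetch(String url) throws IOException {
        return desktopConnection(url).get();
    }
}
```

In the Activity above, `Jsoup.connect(URL).get()` would then become `DesktopFetch.fetch(URL)`.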

This issue was discussed on GitHub: https://github.com/jhy/jsoup/issues/287

 

Cheers,
Pete Houston

Categories: Tricks & Tips

Android XML Adventure – Parsing HTML using JSoup

February 4, 2012 22 comments

Article Series: Android XML Adventure

Author: Pete Houston (aka. `xjaphx`)

TABLE OF CONTENTS

  1. What is the “Thing” called XML?
  2. Parsing XML Data w/ SAXParser
  3. Parsing XML Data w/ DOMParser
  4. Parsing XML Data w/ XMLPullParser
  5. Create & Write XML Data
  6. Compare: XML Parsers
  7. Parsing XML using XPath
  8. Parsing HTML using HtmlCleaner
  9. Parsing HTML using JSoup
  10. Sample Project 1: RSS Parser – using SAXParser
  11. Sample Project 1: RSS Parser – using DOM Parser
  12. Sample Project 1: RSS Parser – using XMLPullParser
  13. Sample Project 2: HTML Parser – using HtmlCleaner
  14. Sample Project 2: HTML Parser – using JSoup
  15. Finalization on the “Thing” called XML!

=========================================

Another library commonly used for parsing HTML is JSoup.

Unlike HtmlCleaner, JSoup uses CSS-like selectors, based on tag names, IDs, classes, and attributes, to identify each node in the HTML tree.

I suggest learning the basic syntax of JSoup selectors before continuing: http://jsoup.org/cookbook/extracting-data/selector-syntax

As in the previous article, we will get the blog statistics, this time using JSoup.

The selector looks like this: “div#blog-stats ul li”

Literally, it means: select every <li> node inside a <ul> node that is itself a descendant of the <div> whose ID is “blog-stats”.
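To see how this selector behaves without touching the network, you can run it against an inline HTML string. The sketch below is only an illustration: the class `BlogStatsSelector`, the helper `firstBlogStat`, and the sample markup (including the hit count) are made up and are not the real blog sidebar.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

public class BlogStatsSelector {

    // Hypothetical markup: two similar lists, only one inside div#blog-stats.
    static final String SAMPLE_HTML =
            "<div id=\"blog-stats\"><ul><li>1,234 hits</li></ul></div>"
            + "<div id=\"other\"><ul><li>not selected</li></ul></div>";

    // Returns the text of the first <li> under div#blog-stats, or "" if none matches.
    static String firstBlogStat(String html) {
        Elements items = Jsoup.parse(html).select("div#blog-stats ul li");
        return items.isEmpty() ? "" : items.first().text();
    }

    public static void main(String[] args) {
        System.out.println(firstBlogStat(SAMPLE_HTML)); // prints "1,234 hits"
    }
}
```

The <li> inside the second <div> is ignored because its ancestor <div> does not have the ID “blog-stats”.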

Download the JSoup library and add it to your project via “External JARs”.

Head straight to the source code to get our desired value:

package pete.android.study;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import android.app.Activity;
import android.os.Bundle;
import android.widget.TextView;

public class JSoupStudyActivity extends Activity {

    // blog url
    static final String BLOG_URL = "https://xjaphx.wordpress.com/";

    @Override
    public void onCreate(Bundle savedInstanceState) {
        // set layout view
        super.onCreate(savedInstanceState);
        setContentView(R.layout.main);

        // show the blog statistics, or an error message if fetching fails
        TextView textView = (TextView) findViewById(R.id.tv);
        try {
            textView.setText(getBlogStats());
        } catch (Exception ex) {
            textView.setText("Error");
        }
    }

    protected String getBlogStats() throws Exception {
        String result = "";
        // fetch the page and parse it into a document
        Document document = Jsoup.connect(BLOG_URL).get();
        // select the <li> nodes under div#blog-stats
        Elements nodeBlogStats = document.select("div#blog-stats ul li");
        // take the text of the first match, if any
        if (nodeBlogStats.size() > 0) {
            result = nodeBlogStats.get(0).text();
        }

        return result;
    }
}

Remember to add the INTERNET permission (android.permission.INTERNET) to AndroidManifest.xml. Here is the result on my Galaxy S II phone, which is a little chocky-cocky:

JSoup Sample

Not so much different from XPath, is it?

Cheers,

Pete Houston

Categories: Tutorials

Android XML Adventure – What is the “Thing” called XML?

October 9, 2011 1 comment

I’m currently working on XML data storage for an Android application. It’s quite interesting, so I’ve decided to turn it into a series.

– What is the thing called XML?

Extensible Markup Language (XML) is a set of rules for encoding documents in machine-readable form. It is defined in the XML 1.0 Specification[4] produced by the W3C, and several other related specifications, all gratis open standards.[5]

The design goals of XML emphasize simplicity, generality, and usability over the Internet.[6] It is a textual data format with strong support via Unicode for the languages of the world. Although the design of XML focuses on documents, it is widely used for the representation of arbitrary data structures, for example in web services.

Many application programming interfaces (APIs) have been developed that software developers use to process XML data, and several schema systems exist to aid in the definition of XML-based languages.

As of 2009, hundreds of XML-based languages have been developed,[7] including RSS, Atom, SOAP, and XHTML. XML-based formats have become the default for most office-productivity tools, including Microsoft Office (Office Open XML), OpenOffice.org (OpenDocument), and Apple’s iWork.[8]

(Quoted from Wikipedia: http://en.wikipedia.org/wiki/XML)

– As you can see, XML is useful and is applied everywhere across the Internet nowadays, so you’d better get to know it.

– In Android, XML is used for resources such as layouts and strings (localization), for the predefined SharedPreferences, or as a custom database.

– In this series, “Android XML Adventure”, I will show you how to handle XML files in Android.

TABLE OF CONTENTS

  1. What is the “Thing” called XML?
  2. Parsing XML Data w/ SAXParser
  3. Parsing XML Data w/ DOMParser
  4. Parsing XML Data w/ XMLPullParser
  5. Create & Write XML Data
  6. Compare: XML Parsers
  7. Parsing XML using XPath
  8. Parsing HTML using HtmlCleaner
  9. Parsing HTML using JSoup
  10. Sample Project 1: RSS Parser – using SAXParser
  11. Sample Project 1: RSS Parser – using DOM Parser
  12. Sample Project 1: RSS Parser – using XMLPullParser
  13. Sample Project 2: HTML Parser – using HtmlCleaner
  14. Sample Project 2: HTML Parser – using JSoup
  15. Finalization on the “Thing” called XML!

Stay tuned; this series will make you fall in love with XML for real 🙂

Cheers,

Pete Houston