Wednesday 26 February 2014

Using WEB HARVEST for Content Scraping


"Web scraping" is the process of crawling over a web site, downloading web pages intended for human consumption, extracting information, and saving it in a machine-readable format.  With the advent of Web 2.0 and services-based architectures, web scraping is less often necessary, but it is still handy in situations such as this, where no machine-readable interface is available.

There are a number of web scraping tools available, varying in functionality and state of repair.  Many are frameworks or libraries intended to be embedded in languages like Python; others are commercial.  For my purposes, I wanted a stand-alone, open-source tool with fairly powerful features and a GUI.  I ended up settling on Web-Harvest.  Web-Harvest is written in Java, so it can be run on nearly any platform, and it can also be embedded into Java programs.


Here is a minimal configuration file that downloads a single page:

<?xml version="1.0" encoding="UTF-8"?>

<config>
    <var-def name="google">
        <html-to-xml>
            <http url="http://www.google.com"/>
        </html-to-xml>
    </var-def>
</config>


There are three commands (what Web-Harvest calls "processors") in this configuration file: var-def, html-to-xml, and http.  Reading these from the inside outwards, this is what they do:

  1. The innermost processor, http, fetches the web page given in the url attribute -- in this case, the Google home page.
  2. The next processor, html-to-xml, takes the web page, cleans it up a bit and converts it to XML.
  3. The last processor, var-def, defines a new Web-Harvest variable named google and gives it the value of the XML returned by the html-to-xml processor.
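The three processors thus form a simple pipeline: fetch, convert, store.  As a rough illustration of that pipeline -- not Web-Harvest itself, which is configured entirely in XML -- here is a Python sketch using only the standard library.  The `HtmlToXml` class and `scrape` function are hypothetical names invented for this sketch, and the fetch step is simulated with an in-memory HTML string rather than a live http request:

```python
# A minimal sketch of the Web-Harvest pipeline: fetch -> html-to-xml -> var-def.
# Illustration only; names and structure are this sketch's, not Web-Harvest's.
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class HtmlToXml(HTMLParser):
    """Rebuild possibly messy HTML as a well-formed ElementTree
    (a rough stand-in for the html-to-xml processor)."""
    VOID = {"br", "img", "meta", "link", "input", "hr"}  # tags with no close tag

    def __init__(self):
        super().__init__()
        self.root = ET.Element("html")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # HTMLParser may give attributes with a None value; normalize to "".
        el = ET.SubElement(self.stack[-1], tag,
                           dict((k, v or "") for k, v in attrs))
        if tag not in self.VOID:
            self.stack.append(el)

    def handle_endtag(self, tag):
        if len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            el = self.stack[-1]
            el.text = (el.text or "") + data

def scrape(raw_html, var_name, variables):
    # "http" step: in the real pipeline raw_html would be fetched from a URL.
    parser = HtmlToXml()
    parser.feed(raw_html)
    # "html-to-xml" step: serialize the cleaned tree as XML text.
    xml_text = ET.tostring(parser.root, encoding="unicode")
    # "var-def" step: bind the result to a named variable.
    variables[var_name] = xml_text
    return variables

variables = {}
scrape("<title>Google</title><br><p>Search", "google", variables)
print(variables["google"])
```

Note that the unclosed `<p>` in the input comes out properly closed in the XML, which is the main service html-to-xml performs.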
To see this in action, click the green "Run" arrow near the top of the GUI.  Web-Harvest will whir through the script and give you the message "Configuration 'Config 1' has finished execution."  Click OK.

Now click on the "http [1]" processor in the GUI to see its result: just the HTML for the Google home page -- what the http processor fetched from the Web.  Try clicking on the "html-to-xml [1]" processor to see how the same web page looks encoded as XML.  (Pretty much the same, in this case.)
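The GUI is convenient for development, but a finished configuration can also be run unattended from the command line.  A sketch, assuming the Web-Harvest 2.x all-in-one jar and its key=value argument style (the exact jar name and argument names may differ between versions -- check your release's usage message):

java -jar webharvest_all_2.jar config=google.xml workdir=/tmp/harvest

The working directory is where Web-Harvest writes its output and log files.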


Conclusion

So far I've shown how to get Web-Harvest installed and to create a simple script to download a web page.  Next time I'll go into some more detail about how to use the various Web-Harvest features to extract and save information from a web page.
Unknown Web Developer
