"Web scraping" is the process of crawling over a web site, downloading
web pages intended for human consumption, extracting information, and
saving it in a machine-readable format. With the advent of the Web 2.0
and services-based architectures, web scraping has largely fallen into
disuse, but it is still required/handy in situations such as this.
There are a number of web scraping tools available, in varying states of functionality and repair. Many are frameworks or libraries intended to be embedded in languages like Python; others are commercial. For my purposes, I wanted a stand-alone, open-source tool with fairly powerful features and a GUI. I ended up settling on Web-Harvest. Web-Harvest is written in Java, so it runs on nearly any platform, and it can also be embedded in Java programs. Here is a minimal Web-Harvest configuration file that simply fetches a page:
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <var-def name="google">
        <html-to-xml>
            <http url="http://www.google.com"/>
        </html-to-xml>
    </var-def>
</config>
There are three commands (what Web-Harvest calls "processors") in this configuration file: var-def, html-to-xml, and http. Reading these from the inside outwards, this is what they do:
- The innermost processor, http, fetches the web page given in the url attribute -- in this case, the Google home page.
- The next processor, html-to-xml, takes the web page, cleans it up a bit and converts it to XML.
- The last processor, var-def, defines a new Web-Harvest variable named google and gives it the value of the XML returned by the html-to-xml processor.
Run the configuration in the Web-Harvest GUI and click on the http processor to see its result: it's just the HTML for the Google home page, exactly what was fetched from the Web. Try clicking on the "html-to-xml [1]" processor to see how the same web page looks encoded as XML. (Pretty much the same, in this case.)
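The same inside-out pattern extends to actual extraction. As a rough sketch (the XPath expression, output file, and variable name below are illustrative choices of mine, though xpath, file, and var are standard Web-Harvest processors), the cleaned-up XML can be queried with XPath and the result written to a file:

<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!-- Fetch the page and clean it up into XML, exactly as before -->
    <var-def name="page">
        <html-to-xml>
            <http url="http://www.google.com"/>
        </html-to-xml>
    </var-def>

    <!-- Query the cleaned-up XML with XPath and write the result to a file -->
    <file action="write" path="title.txt">
        <xpath expression="//title/text()">
            <var name="page"/>
        </xpath>
    </file>
</config>

Reading from the inside outwards again: var retrieves the page variable defined above, xpath evaluates its expression against that XML, and file writes whatever xpath returns to title.txt.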
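And since Web-Harvest can be embedded in Java programs, a configuration like this doesn't have to be run from the GUI at all. The following is only a sketch (the file name "google.xml" and working directory "work" are placeholders of mine), using the ScraperConfiguration and Scraper classes from Web-Harvest's Java API:

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;

public class RunScraper {
    public static void main(String[] args) throws Exception {
        // Load the same XML configuration file used in the GUI
        ScraperConfiguration config = new ScraperConfiguration("google.xml");

        // The second argument is Web-Harvest's working directory, where
        // relative paths (such as file output) are resolved
        Scraper scraper = new Scraper(config, "work");

        scraper.setDebug(true);
        scraper.execute();
    }
}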