Wednesday 26 February 2014

Installing Web-Harvest


Installing Web-Harvest is trivial. Download the latest "Single self-executable JAR file" from the Web-Harvest website.
The download contains a single JAR file.  Put it somewhere on your computer and then double-click it.  Presuming you have Java correctly installed, after a few moments the Web-Harvest GUI will pop up.



Notice that you can download and open some examples.  Under the Help menu (or with F1) you'll find the Web-Harvest manual.  You can also read the manual online.

A Useful Note:  Version 2.0 of Web-Harvest has a memory leak bug.  This can cause the tool to use up all available memory and hang when downloading and processing a large number of web pages.  (Say, a whole season's worth of basketball games :-)  You can somewhat mitigate this problem by starting Java with a larger memory allocation, using the "-Xms" (initial heap size) and "-Xmx" (maximum heap size) options.  How to do this will vary slightly depending upon your operating system and where things are installed.  On my Windows machine, I use a command line that looks something like this:
C:\WINDOWS\system32\javaw.exe -Xms1024m -Xmx1024m -jar "webharvest_all_2.jar"
On Windows you can create a shortcut and set the "Target" to be the proper command line.  However, even with this workaround, Web-Harvest will eventually hang.  The only choice then is to quit and restart.
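If you'd rather not maintain a shortcut, the same command line can be assembled by a tiny launcher.  What follows is only a rough sketch under some assumptions: the class name HarvestLauncher and its methods are made up for illustration and are not part of Web-Harvest, and the JAR name is simply the one from my command line above.  It builds the argument list with matching "-Xms" and "-Xmx" values and also reports the maximum heap of the JVM it runs in, which is a quick way to check that a heap option actually took effect.

```java
import java.util.List;

// Hypothetical helper for illustration; not part of Web-Harvest.
final class HarvestLauncher {

    // Builds a command line that starts a JAR with the same initial and
    // maximum heap size, given in megabytes.
    static List<String> buildCommand(String javaExe, int heapMiB, String jarPath) {
        if (heapMiB <= 0) {
            throw new IllegalArgumentException("heap size must be positive");
        }
        return List.of(
                javaExe,
                "-Xms" + heapMiB + "m",   // initial heap size
                "-Xmx" + heapMiB + "m",   // maximum heap size
                "-jar",
                jarPath);
    }

    // Maximum heap the current JVM will try to use, in MiB.
    // This can be slightly lower than the -Xmx value that was requested.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Assumed values: adjust the executable and JAR path for your machine.
        List<String> cmd = buildCommand("javaw", 1024, "webharvest_all_2.jar");
        System.out.println(String.join(" ", cmd));
        System.out.println("Max heap of this JVM: " + maxHeapMiB() + " MiB");
        // To actually start the tool, something like this should work:
        // new ProcessBuilder(cmd).inheritIO().start();
    }
}
```

A launcher like this does nothing about the leak itself; it just makes it easier to try a different heap size before restarting.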

Initial Set-Up

After you've downloaded and installed Web-Harvest, there are one or two things you should set before continuing.  Open the Web-Harvest GUI as above, and on the Execution menu, select Preferences.  This should open the Preferences form.


First of all, use this form to set an "Output Path".  This is the folder (directory) where Web-Harvest will look for input files and write output files.  (You can use absolute path names as well, but if you don't, this is where Web-Harvest will try to find things.)  There's no way to change this within your Web-Harvest script, so if you need to change this for different scripts, you'll have to remember to do it here first before running your script.
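To make the relative-versus-absolute behavior concrete, here is a small sketch of the usual convention using Java's own path handling.  This is not Web-Harvest's code, and I am assuming (not asserting) that the tool behaves the same way internally; the class name OutputPathDemo and the file names are hypothetical.  A relative name is resolved against the Output Path, while an absolute name is used unchanged.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration; not taken from Web-Harvest.
final class OutputPathDemo {

    // Resolves a file name against a base folder.  Path.resolve returns the
    // file name unchanged when it is already absolute.
    static Path resolve(String outputPath, String fileName) {
        return Paths.get(outputPath).resolve(fileName);
    }

    public static void main(String[] args) {
        // Assumed base folder, standing in for the "Output Path" preference.
        String outputPath = Paths.get("harvest-output").toAbsolutePath().toString();

        // A relative name lands inside the Output Path.
        System.out.println(resolve(outputPath, "games.xml"));

        // An absolute name ignores the Output Path.
        String absolute = Paths.get("elsewhere", "games.xml").toAbsolutePath().toString();
        System.out.println(resolve(outputPath, absolute));
    }
}
```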

Second, if you need to use a proxy, this is where you can fill in that information.
