Reading a File
To begin with, let's look at how we can read information from a file into Web-Harvest. In this example, I'm going to assume we have a file in our working directory called "date.txt" and that file contains a single line with a date in the format YYYYMMDD, e.g., 20121110. Go to your working directory and create that file. Then open up Web-Harvest, start a new configuration file, and type in this script:
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!-- Read the contents of date.txt (e.g. 20121110) into datestring -->
    <var-def name="datestring">
        <file action="read" path="date.txt"></file>
    </var-def>
    <!-- Reformat YYYYMMDD as MM/DD/YYYY using three matched groups -->
    <var-def name="USdate">
        <regexp>
            <regexp-pattern>^(\d\d\d\d)(\d\d)(\d\d)</regexp-pattern>
            <regexp-source>
                <var name="datestring"></var>
            </regexp-source>
            <regexp-result>
                <template>${_2}/${_3}/${_1}</template>
            </regexp-result>
        </regexp>
    </var-def>
</config>
The file processor reads the contents of the "date.txt" file and provides that as a result to the outer processor. In this case, that's the var-def processor that is creating the "datestring" variable. The result is that the datestring variable will be created and its value will be the contents of the date.txt file. To see this, hit the green "Run" arrow, and then examine the datestring variable. It should hold the contents of the file, e.g., 20121110.
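To try the file-reading step in isolation, a stripped-down sketch containing only the first var-def from the script above might look like this (it assumes date.txt sits in the working directory, as described earlier):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!-- The file processor returns the contents of date.txt; -->
    <!-- var-def stores that result in the datestring variable -->
    <var-def name="datestring">
        <file action="read" path="date.txt"></file>
    </var-def>
</config>
```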
Regular Expressions
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (Jamie Zawinski)
Jamie Zawinski's famous and generally sound advice notwithstanding, regular expressions are a significant element in the Web-Harvest toolbox. This makes sense -- much of what we do in screen scraping is manipulating text, and regular expressions are very good at that task. It's beyond my interest (and probably, ability) to teach you regular expressions. You'll have to find other resources for that. But I recommend using a regular expression tester to help you debug your regular expressions. (Remember that Web-Harvest is implemented in Java, so it uses the Java regular expression syntax.)
For a simple example, we'll use a regular expression to pull the year out of the date we've read from the date.txt file. The year is the first four digits of the date, and digits in regular expressions are represented as \d. Here's the script to use a regular expression to pull the year out of the datestring variable and store it in a new variable called year:
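A minimal sketch of such a script, reconstructed from the description in this post; it assumes date.txt is in the working directory and repeats the file-reading step so it can run on its own:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!-- Read the contents of date.txt (e.g. 20121110) into datestring -->
    <var-def name="datestring">
        <file action="read" path="date.txt"></file>
    </var-def>
    <!-- Match the four leading digits and store the first group in year -->
    <var-def name="year">
        <regexp>
            <regexp-pattern>^(\d\d\d\d)</regexp-pattern>
            <regexp-source>
                <var name="datestring"></var>
            </regexp-source>
            <regexp-result>
                <template>${_1}</template>
            </regexp-result>
        </regexp>
    </var-def>
</config>
```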
This example introduces a couple of new processors. The first is the regexp processor, which has three parts: the regexp-pattern, the regexp-source, and the regexp-result. The regexp-pattern portion holds the regular expression we're trying to match. In this case, it is the expression "^(\d\d\d\d)" which means "a group of four digits at the beginning of a line". The regexp-source provides the string against which we'll try to match the pattern. In this case, it is the value of the datestring variable, which is the contents of the date.txt file from the previous step of the configuration file. Finally, the regexp-result portion determines what the result of the regular expression will be -- that is, what value it will feed back up to the next processor.
As you can see, inside of regexp-result we have another processor -- template. Template basically returns whatever is inside of it. So if you wrote <template>Test</template> the result would simply be the string "Test". However -- and this is the useful part -- anything enclosed inside ${ } will be evaluated in Javascript and the result of the Javascript will be injected into the template. So if you wrote <template>Today is the ${sys.datetime("dd")}th</template> you'd get back "Today is the 13th" (or whatever the current day is).
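As a small sketch of both cases, the configuration below stores a plain template and a template with an embedded expression in two variables. The variable names plain and today are made up for illustration; sys.datetime is the call mentioned above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!-- Plain text: the template returns its body unchanged, "Test" -->
    <var-def name="plain">
        <template>Test</template>
    </var-def>
    <!-- The ${ } part is evaluated and replaced with the day of the month -->
    <var-def name="today">
        <template>Today is the ${sys.datetime("dd")}th</template>
    </var-def>
</config>
```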
Web-Harvest defines a number of useful variables inside Javascript. One of these is _1, which is the value of the first matched group in a regular expression. Because the _1 in our template is enclosed in ${ } it is evaluated in Javascript and is replaced with the first matched group in the regular expression. So in this case, our template returns "2012".
Finally, the regexp processor returns the value of the regexp-result part, and the year variable gets set to "2012". (As you can see in the above screenshot.)
The USdate variable in the script at the top of this post is a slightly more complicated example: it uses a regular expression to reformat the date in US format. See if you can figure it out.