<?xml version="1.0" encoding="UTF-8"?>
<config>
<var-def name="datestring">
<file action="read" path="date.txt"></file>
</var-def>
<var-def name="webpage">
<html-to-xml>
<http url="http://scores.espn.go.com/ncb/scoreboard?date=20121124"/>
</html-to-xml>
</var-def>
<var-def name="duke">
<xpath expression="(//div[@class='team visitor'])[1]//a[@title]/text()">
<var name="webpage"></var>
</xpath>
</var-def>
</config>
Today I'll show how to pull information out of the webpage and save it
-- specifically, we'll pull the teams and scores out of the page and
save them off for later use.
Find the Information
The first step is to figure out where the information we want is in the
web page. This is easier said than done on modern web pages, which tend
to be impenetrable morasses of JavaScript, HTML and CSS. One way to
get started is to use the "View Source" option on your web browser (or
save the web page onto your computer and view it with your favorite text
editor) and then search for text you can see from the web page. For
example, if we go to the ESPN Scoreboard page for 11/24/2012, we can see
that the first listed game is Duke versus Louisville. If we do "View
Source" and search for "Duke", we find this as the first reference:
<a title="Duke" href="http://espn.go.com/mens-college-basketball/team/_/id/150/duke-blue-devils">Duke</a>
And this, in fact, is the HTML code that creates the "Duke" text in the
Duke vs. Louisville scoreboard. With some more digging, reformatting
and so on, we can eventually see that the information about Duke is in
an HTML structure that looks like this:
<div class="team visitor">
<div class="team-capsule">
<span id="323290097-aTeamName">
<a title="Duke">
Duke</a>
</span>
</div>
<ul id="323290097-aScores" class="score" style="display:block">
<li class="final" id="323290097-awayHeaderScore">
76</li>
</ul>
</div>
The information about Louisville is in a structure that is identical
except that it starts with "team home" instead of "team visitor."
So now that we know where the information is, we need to pluck it out and put it to use.
The Power of XPath
Web-Harvest uses XPath extensively to dig information out of webpages.
XPath is a notation for specifying where to find something in an XML
file. It's a "path" from the top level of the XML down
to some particular piece (or pieces) of the XML. It's very useful and
very powerful but, like regular expressions, can be confusing and
difficult to use. If you don't know anything about XPath, you might
want to go off and read a tutorial about it to familiarize yourself
with how it works. It's also very useful to have an XPath tester
for working out the correct paths for the information you're trying to get.
In fact, Web-Harvest itself provides a very handy XPath tester. To see
its use, run the above script to fetch the ESPN page, and then use the
left-hand pane to see the value of the "webpage" variable (also as shown
above). Now click on the magnifier icon to the right of the "[Value]"
box and you'll get a pop-up window showing the text of the webpage:
Notice the "View as:" option in the top left of the pop-up. Click here
and select XML. This will show the webpage in XML format:
This view has a couple of handy features. First, you can use the
"Pretty-Print" button at the top to reorganize and cleanup the XML for
easier viewing. Second, you'll see a box at the bottom labeled "XPath
expression." If you type an XPath into this box, Web-Harvest will run
that XPath against the displayed XML and show the result. For example,
try typing the XPath
//div[@class="team visitor"] into the box. This expression finds all
div elements in the page that have the class
"team visitor":
This matches a total of 15 div elements on this page, the first of which is the Duke entry we found above.
When an XPath returns a list of items, we can pick items out of the list
in various ways, including using an index. To pick out the first
element of this list, we use
(//div[@class="team visitor"])[1].
That gives us the entire block of HTML for Duke that I showed earlier.
(XPath indexes start at 1, and the parentheses make the index apply to
the whole list of matches rather than to each div's position under its
own parent.) If you look up there, you'll see the team name is within an
<a title="Duke"> tag. We can pull that out by extending our XPath to say
(//div[@class="team visitor"])[1]//a[@title]
which essentially says "Give me all the <a> elements with a title
attribute that are within the first div element with a class of team
visitor". Try that out:
We've now narrowed the XPath down to just the <a> element
containing the team name. We can extract the actual name by appending
/text() to the end of our XPath. This step selects
whatever text it finds directly inside the element selected by the XPath:
Here's how we'd use that same XPath within Web-Harvest to pull out the name and save it in a variable:
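For reference, this is the var-def from the script at the top of this post. The variable name "duke" is the one used there, and the fragment assumes the "webpage" variable has already been defined as in that script:

```xml
<!-- Fragment: belongs inside <config>, after the "webpage" var-def -->
<var-def name="duke">
    <!-- first visitor block on the page, then the text of its titled link -->
    <xpath expression="(//div[@class='team visitor'])[1]//a[@title]/text()">
        <var name="webpage"/>
    </xpath>
</var-def>
```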
You can experiment with creating the XPaths to pull out the home team's name and the final scores of the game.
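As a starting point for that experiment, here is one possible pair of var-defs. Treat them as a sketch based only on the HTML structure shown earlier (a "team home" div that mirrors the visitor block, and an li with class "final" holding the score). The variable names are made up for illustration, and the paths have not been checked against the live page:

```xml
<!-- Fragment: belongs inside <config>, after the "webpage" var-def -->

<!-- hypothetical name: home team of the first game on the page -->
<var-def name="firstHome">
    <xpath expression="(//div[@class='team home'])[1]//a[@title]/text()">
        <var name="webpage"/>
    </xpath>
</var-def>

<!-- hypothetical name: final score inside the first visitor block -->
<var-def name="firstVisitorScore">
    <xpath expression="(//div[@class='team visitor'])[1]//li[@class='final']/text()">
        <var name="webpage"/>
    </xpath>
</var-def>
```

The XPath tester described above is a quick way to confirm whether each path returns what you expect before putting it in a script.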
Looping
The XPath example above works on the first element in the list of
visitor team names, but what we really want to do is capture the team
names and scores for all the games on the page. To do that, we will
loop over each of the game sections in turn. Web-Harvest provides a
processor for this called <loop>, which works about as you would
imagine. It takes a list of elements and loops over them one at a time,
and returns a list of the results. Here's the skeleton for looping
over each of the games in turn:
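The skeleton can be reconstructed from the description that follows and from the complete script at the end of this post. Treat it as a sketch in which the body simply returns the current item:

```xml
<!-- Fragment: belongs inside <config>, after the "webpage" var-def -->
<loop item="currGame">
    <list>
        <!-- one div per finished game on the page -->
        <xpath expression="(//div[contains(@class,'final-state')])">
            <var name="webpage"/>
        </xpath>
    </list>
    <body>
        <!-- return the current game unchanged -->
        <var name="currGame"/>
    </body>
</loop>
```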
The <loop> processor has two parts. The first part is a
<list> of items to loop over. The second part is a <body>
that will be executed for each element of the list. Each time the
<body> is executed, a variable called currGame (which is specified
as "item" in the <loop> tag) will be set to the current element
of the list. In this case, each <body> execution just returns the
current item, so the result of the loop is just the list.
Notice that the <list> of items is given by the XPath
"(//div[contains(@class,'final-state')])". That XPath returns a list of
div elements. There's one div element for each game on the page, and
the div has the team names and scores inside of it. (The visiting team
name we pulled out earlier is inside this div.)
So now, each time through the loop we need to pull out the team names and
scores for currGame. currGame contains a chunk of XML, so we can once
again use XPath to do this. Then we'll store each item in its own
variable:
<?xml version="1.0" encoding="UTF-8"?>
<config>
<var-def name="datestring">
<file action="read" path="date.txt"></file>
</var-def>
<var-def name="webpage">
<html-to-xml>
<http url="http://scores.espn.go.com/ncb/scoreboard?date=20121124"/>
</html-to-xml>
</var-def>
<loop item="currGame">
<list>
<xpath expression="(//div[contains(@class,'final-state')])">
<var name="webpage"/>
</xpath>
</list>
<body>
<var-def name="visitor">
<xpath expression="(//div[@class='team visitor'])[1]//a[@title]/text()">
<var name="currGame"/>
</xpath>
</var-def>
<var-def name="visitorScore">
<xpath expression="(//li[@class='final'])[2]//text()">
<var name="currGame"/>
</xpath>
</var-def>
<var-def name="home">
<xpath expression="(//div[@class='team home'])[1]//a[@title]/text()">
<var name="currGame"/>
</xpath>
</var-def>
<var-def name="homeScore">
<xpath expression="(//li[@class='final'])[3]//text()">
<var name="currGame"/>
</xpath>
</var-def>
</body>
</loop>
</config>
Each var-def in the body of the loop uses an XPath expression to pull
out a particular piece of the data. You might want to experiment with
the XPaths to see how each of them finds the right piece of information.
If you run this and look at the value of the loop after it is complete you'll see this:
The value of the loop is a list of all the values of the body as it is
executed, and the value of each body is just the list of the values of
the processors in the body (four var-def processors in this case). It
all gets mashed together and you end up with a long list of team names
and scores.
Format and Output
To make this more useful, let's clean up the format of the game data and
write it out to a file. We can format using the <template>
processor that we saw last time, and to output we use the same
<file> processor we used to read a file. Every time through the
loop we'll add a line to the file for the game we just processed:
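The file step below is taken from the complete script at the end of this post. It goes at the end of the loop body, after the four var-defs:

```xml
<!-- Fragment: last processor inside the loop's <body> -->
<file action="append" type="text" path="scores.txt">
    <template>
        ${visitor} ${visitorScore} ${home} ${homeScore} ${sys.cr}${sys.lf}
    </template>
</file>
```

Because the action is "append", each pass through the loop adds to scores.txt instead of overwriting it.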
The ${sys.cr} and ${sys.lf} are built-in Web-Harvest constants that put a
carriage-return/line-feed at the end of every line. The output file
looks like this:
Conclusion
This tutorial should give a general idea of how Web-Harvest works and
some of the basic tools it offers for scraping information out of web
pages. More help can be found online at the Web-Harvest documentation as well as the Web-Harvest forums.
Here is the completed Web-Harvest script, for cut & paste purposes:
<?xml version="1.0" encoding="UTF-8"?>
<config>
<var-def name="datestring">
<file action="read" path="date.txt"></file>
</var-def>
<var-def name="webpage">
<html-to-xml>
<http url="http://scores.espn.go.com/ncb/scoreboard?date=20121124"/>
</html-to-xml>
</var-def>
<loop item="currGame">
<list>
<xpath expression="(//div[contains(@class,'final-state')])">
<var name="webpage"/>
</xpath>
</list>
<body>
<var-def name="visitor">
<xpath expression="(//div[@class='team visitor'])[1]//a[@title]/text()">
<var name="currGame"/>
</xpath>
</var-def>
<var-def name="visitorScore">
<xpath expression="(//li[@class='final'])[2]//text()">
<var name="currGame"/>
</xpath>
</var-def>
<var-def name="home">
<xpath expression="(//div[@class='team home'])[1]//a[@title]/text()">
<var name="currGame"/>
</xpath>
</var-def>
<var-def name="homeScore">
<xpath expression="(//li[@class='final'])[3]//text()">
<var name="currGame"/>
</xpath>
</var-def>
<file action="append" type="text" path="scores.txt">
<template>
${visitor} ${visitorScore} ${home} ${homeScore} ${sys.cr}${sys.lf}
</template>
</file>
</body>
</loop>
</config>