Parse HTML Text Object Version 2.00
Description
The Filter method is a complex method designed to simplify the process of extracting data (text and/or tags) from an HTML document. Its advantage is that filter scripts can be stored outside of your code, so they can be updated whenever the HTML data source changes.
Syntax
object.Filter filter script, dictionary object
The filter script is a string consisting of filter script commands separated by semicolons.
The dictionary object is a previously defined dictionary object used to return data values found in the HTML.
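To illustrate the script format only, the following Python sketch splits a filter script string into its individual commands. It is not the component's own code: the function name split_commands is hypothetical, and trimming whitespace around each command is an assumption. Under this sketch a text command cannot itself contain a semicolon.

```python
def split_commands(script):
    """Split a filter script on semicolons, dropping empty entries.

    Assumption: whitespace around each command is not significant.
    """
    return [part.strip() for part in script.split(";") if part.strip()]


script = "<table>; </tr>; Do Until </table>; <tr>; <td>; $Name; </td>; Loop;"
for command in split_commands(script):
    print(command)
```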
Remarks
The Filter method is best used in one of the following scenarios.
1. Extracting Specific Data from a Single HTML Page.
Suppose you want to extract a single stock price from a particular stock service. Your application would use one of the many available HTTP components to get the page. Such code should be written in a dynamic manner, with the URL, page method (GET or POST) and any necessary name/value pairs stored in an external database. This practice accommodates changes the stock service makes to its Web site and lets you switch to an alternate service without modifying your code. Once the page is retrieved, the Filter method would be called with a filter script that extracts the stock price and places it into the dictionary object for use by your application. Again, it is best that the filter script also be stored in an external database.
2. Extracting Specific Data from Any Page.
Suppose you want to extract information for a search engine from any page on the Internet. In this case you would write an application that finds and retrieves HTML pages from the sites you spider. Once a page is retrieved, you would use the Filter method to apply a filter script that extracts the title, all meta tags (for locating the keyword and description information), all text for indexing, and all anchor tags to continue spidering the site.
3. Extracting Similar Data from Multiple Pages.
Suppose you want to build a site that compares prices for the same product among different Web sites on the Internet. Your application would use a database to store the many sites you wish to search. When a user enters a specific product, you would retrieve similar data, such as the product offered and its price, from each site. Since each site will present this data in a different HTML format, it will be necessary to store a similar but different filter script for each site URL.
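The practice of storing one filter script per site URL, recommended in the scenarios above, can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module with an in-memory database. The table name site_filters, the example URLs and the sample filter scripts are all hypothetical and are not taken from the component.

```python
import sqlite3

# In-memory database standing in for the external database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE site_filters (url TEXT PRIMARY KEY, filter_script TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO site_filters (url, filter_script) VALUES (?, ?)",
    [
        # Hypothetical sites, each needing a similar but different script.
        ("http://shop-a.example/search", "<table>;<td>;$Product;</td>;<td>;$Price;</td>"),
        ("http://shop-b.example/find", "<ul>;<li>;$Product;<b>;$Price;</b>"),
    ],
)


def load_filter_script(conn, url):
    """Return the filter script stored for a URL, or None if there is none."""
    row = conn.execute(
        "SELECT filter_script FROM site_filters WHERE url = ?", (url,)
    ).fetchone()
    return row[0] if row else None


print(load_filter_script(conn, "http://shop-b.example/find"))
```

When a site changes its HTML layout, only the stored row needs to be updated; the application code stays the same.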
The Filter method steps through an HTML document sequentially, processing each text and tag component according to the current filter script command. Because HTML is a formatting language, it (in particular its tags) can be used like street signs to locate data in a document. This holds true even for dynamic pages, which use code to output a consistently structured page of information. Filter scripts use the following commands, separated by semicolons.
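The idea of a document as a sequence of text and tag components can be sketched in Python as shown here. This is only an approximation for illustration: the regular expression, the tokenize function and the choice to skip whitespace-only text are assumptions, and the component's real parsing rules are not documented in this topic.

```python
import re

# A tag is anything between "<" and ">"; text is everything else.
TOKEN_RE = re.compile(r"<[^>]*>|[^<]+")


def tokenize(html):
    """Split HTML into ("tag", value) and ("text", value) components."""
    tokens = []
    for match in TOKEN_RE.finditer(html):
        value = match.group(0)
        if value.startswith("<"):
            tokens.append(("tag", value))
        elif value.strip():  # assumption: whitespace-only text is skipped
            tokens.append(("text", value))
    return tokens


for kind, value in tokenize("<p>Hi <b>there</b></p>"):
    print(kind, repr(value))
```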
<tag>[#key]
The tag command causes the Filter method to begin searching for the next tag with the tag value. If followed by the optional tag variable command, the entire tag, including the delimiting less-than and greater-than symbols, is stored as the item value associated with the dictionary key when the tag is found. Once the tag is found, execution proceeds to the next filter script command.
For example, to search for the next table tag you would use the command <table>. To find the next anchor tag and store its value under the key name AnchorTag you would use the command <a>#AnchorTag.
text[$key]
The text command causes the Filter method to begin searching for the next text component that contains the text value. If followed by the optional text variable command, the entire text is stored as the item value associated with the dictionary key when found. Once the text is found, execution proceeds to the next filter script command.
For example, to search for text containing the word 'rock' you would use the command rock. To find the next text string containing the phrase 'the quick brown fox' and store its value under the key name TextString you would use the command the quick brown fox$TextString.
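The behavior of the tag and text commands can be approximated with the following Python sketch. It emulates the documented behavior rather than calling the component. The helper names (tokenize, tag_name, find_tag, find_text), the case-insensitive match on the tag name and the sample HTML are all assumptions made for illustration.

```python
import re

TOKEN_RE = re.compile(r"<[^>]*>|[^<]+")


def tokenize(html):
    """Split HTML into ("tag", value) and ("text", value) components."""
    tokens = []
    for match in TOKEN_RE.finditer(html):
        value = match.group(0)
        if value.startswith("<"):
            tokens.append(("tag", value))
        elif value.strip():
            tokens.append(("text", value))
    return tokens


def tag_name(tag):
    """Return 'a' for '<a href=...>' and '/a' for '</a>' (lowercased)."""
    m = re.match(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)", tag)
    return (m.group(1) + m.group(2)).lower() if m else ""


def find_tag(tokens, start, command):
    """Index of the next tag named like command (e.g. '<a>'), or -1."""
    wanted = tag_name(command)
    for i in range(start, len(tokens)):
        kind, value = tokens[i]
        if kind == "tag" and tag_name(value) == wanted:
            return i
    return -1


def find_text(tokens, start, phrase):
    """Index of the next text component containing phrase, or -1."""
    for i in range(start, len(tokens)):
        kind, value = tokens[i]
        if kind == "text" and phrase in value:
            return i
    return -1


tokens = tokenize('<p>A rock band</p><a href="/fox">the quick brown fox jumps</a>')
results = {}
i = find_tag(tokens, 0, "<a>")                       # <a>#AnchorTag
results["AnchorTag"] = tokens[i][1]                  # the entire tag is stored
j = find_text(tokens, i + 1, "the quick brown fox")  # the quick brown fox$TextString
results["TextString"] = tokens[j][1]                 # the entire text is stored
print(results)
```

Note that each search starts where the previous one stopped, which mirrors the sequential processing described above.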
@key
The all variable command indicates that all text and tags found up to the next tag or text command are to be stored as the item value associated with the dictionary key.
For example, to save all of the text and tags found before the next tag or text command is matched under the dictionary key AllData you would use the command @AllData.
#key
The tag variable command indicates that all tags found up to the next tag or text command are to be stored as the item value associated with the dictionary key.
For example, to save all of the tags found before the next tag or text command is matched under the dictionary key TagData you would use the command #TagData.
$key
The text variable command indicates that all text found up to the next tag or text command is to be stored as the item value associated with the dictionary key.
For example, to save all of the text found before the next tag or text command is matched under the dictionary key TextData you would use the command $TextData.
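The difference between the @, # and $ variable commands can be shown with a small Python sketch. It is an emulation under stated assumptions: the collect function is hypothetical, and joining the collected components by plain concatenation is a guess, since this topic does not say how the component combines them.

```python
import re

TOKEN_RE = re.compile(r"<[^>]*>|[^<]+")

# Which component kinds each variable command keeps.
KINDS = {"@": ("tag", "text"), "#": ("tag",), "$": ("text",)}


def tokenize(html):
    """Split HTML into ("tag", value) and ("text", value) components."""
    tokens = []
    for match in TOKEN_RE.finditer(html):
        value = match.group(0)
        if value.startswith("<"):
            tokens.append(("tag", value))
        elif value.strip():
            tokens.append(("text", value))
    return tokens


def collect(tokens, start, end, sigil):
    """Join the components in tokens[start:end] selected by @, # or $."""
    wanted = KINDS[sigil]
    return "".join(value for kind, value in tokens[start:end] if kind in wanted)


tokens = tokenize("<td><b>Jane</b> Doe</td>")
# tokens[0] is <td> and tokens[5] is </td>; capture what lies between them.
print(collect(tokens, 1, 5, "@"))  # <b>Jane</b> Doe
print(collect(tokens, 1, 5, "#"))  # <b></b>
print(collect(tokens, 1, 5, "$"))  # Jane Doe
```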
Do Until <tag>, Loop
The do loop command allows the method to loop through repeated data and is best suited to processing table information. All filter script commands between the Do Until <tag> and Loop command lines are repeated until the ending tag value is found. Inside a loop, each variable command stores its item value under the dictionary key with an array index (x) appended, beginning at zero (i.e. key(x)).
For example, if you had an HTML document containing a table whose rows hold a name in the first column and a phone number in the second, you could extract the data with the following filter script.
<table>;
</tr>;
Do Until </table>;
<tr>;
<td>;
$Name;
</td>;
<td>;
$PhoneNumber;
</td>;
</tr>;
Loop;
This filter script first locates the beginning of the table (<table>) and then skips the table heading in the first row (</tr>). It then loops through each row, saving the text in the first column under the dictionary keys Name(0)...Name(x) and the text in the second column under the dictionary keys PhoneNumber(0)...PhoneNumber(x).
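To make the flow of this script concrete, the following Python sketch emulates it on a small sample table. It is not the component's implementation. The run_filter function and the sample HTML are hypothetical, and the sketch handles only tag commands, the @, # and $ variable commands and a single unnested Do Until/Loop. It omits text commands and the optional #key suffix on tag commands, does no script validation, and assumes a loop ends when its ending tag is met before the tag being searched for.

```python
import re

TOKEN_RE = re.compile(r"<[^>]*>|[^<]+")
KINDS = {"@": ("tag", "text"), "#": ("tag",), "$": ("text",)}


def tokenize(html):
    """Split HTML into ("tag", value) and ("text", value) components."""
    tokens = []
    for match in TOKEN_RE.finditer(html):
        value = match.group(0)
        if value.startswith("<"):
            tokens.append(("tag", value))
        elif value.strip():
            tokens.append(("text", value))
    return tokens


def tag_name(tag):
    """Return 'td' for '<td ...>' and '/td' for '</td>' (lowercased)."""
    m = re.match(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)", tag)
    return (m.group(1) + m.group(2)).lower() if m else ""


def run_filter(script, html):
    commands = [c.strip() for c in script.split(";") if c.strip()]
    tokens = tokenize(html)
    results = {}
    pos = 0         # index of the next unread token
    pc = 0          # index of the current command
    pending = None  # (sigil, key, start) awaiting the next find command
    until = None    # tag name that ends the active loop, if any
    loop_start = loop_end = row = 0
    while pc < len(commands):
        cmd = commands[pc]
        low = cmd.lower()
        if low.startswith("do until"):
            until = tag_name(cmd[len("do until"):].strip())
            loop_start, row = pc + 1, 0
            loop_end = next(j for j in range(pc, len(commands))
                            if commands[j].lower() == "loop")
            pc += 1
        elif low == "loop":
            row += 1          # next pass stores key(row)
            pc = loop_start
        elif cmd[0] in KINDS:
            pending = (cmd[0], cmd[1:], pos)
            pc += 1
        else:                 # tag command: scan forward for the tag
            wanted = tag_name(cmd)
            i = pos
            while i < len(tokens) and not (
                tokens[i][0] == "tag" and tag_name(tokens[i][1]) in (wanted, until)
            ):
                i += 1
            if i == len(tokens):
                break         # nothing left to match: stop the script
            if tag_name(tokens[i][1]) == wanted:
                if pending:
                    sigil, key, start = pending
                    name = "%s(%d)" % (key, row) if until else key
                    results[name] = "".join(
                        v for k, v in tokens[start:i] if k in KINDS[sigil])
                pc += 1
            else:
                # The loop's ending tag came first: resume after "Loop".
                pc, until = loop_end + 1, None
            pending = None
            pos = i + 1
    return results


script = ("<table>; </tr>; Do Until </table>; <tr>; <td>; $Name; </td>; "
          "<td>; $PhoneNumber; </td>; </tr>; Loop;")
html = """<table>
<tr><th>Name</th><th>Phone</th></tr>
<tr><td>Ann</td><td>555-0100</td></tr>
<tr><td>Bob</td><td>555-0101</td></tr>
</table>"""
print(run_filter(script, html))
```

For the sample table, the sketch returns the keys Name(0), PhoneNumber(0), Name(1) and PhoneNumber(1), matching the key(x) naming described for loops.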
Copyright (c) 1999-2002 by Cimarron Ravine, L.L.C. All Rights Reserved.