Screen Scraping the Regular Way
Recently, I was demonstrating an RSS client for one of our partners with the idea that attorneys might be interested in being notified when updates are detected on certain websites pertinent to our legal practice. He seemed intrigued by the publish/subscribe metaphor being demonstrated and asked about using it to be notified when two particular pages of the SEC's website are updated. One is for proposed rules (http://www.sec.gov/rules/proposed.shtml) and one is for final rules (http://www.sec.gov/rules/final.shtml).
After poking around on the SEC's site, I determined - much to my consternation - that this info was not being published via RSS or any other syndication format, nor was the information available through any web service that I could find. So after exhausting the elegant possibilities, I resigned myself to the fact that I'd have to resort to screen scraping.
Reading the HTML pages is simple enough with the .NET Framework. The following snippet puts the HTML into a string:
// Requires: using System.IO; using System.Net;
string HTML;
HttpWebRequest Request = (HttpWebRequest) WebRequest.Create(url);
// Dispose the response, stream, and reader even if reading throws.
using (HttpWebResponse Response = (HttpWebResponse) Request.GetResponse())
using (Stream Stm = Response.GetResponseStream())
using (StreamReader Reader = new StreamReader(Stm)) {
    HTML = Reader.ReadToEnd();
}
The next step was to parse the HTML page and find the data I was interested in, which was stored in an HTML table. My first thought was to use Microsoft's MSHTML COM object to read the page into the HTML Document Object Model (DOM), and traverse the DOM until I found the data of interest. I had some difficulty determining how to load up the DOM (I've only used it from within DHTML scripting) and turned to Usenet for a quick answer. I fired up some Google queries that mentioned screen scraping - under the assumption that others might have taken this approach - and instead of information about the DOM, I came across posts about regular expressions.
This struck a resonant chord. I am relatively proficient with regular expressions, and had remembered that the framework provided good regular expression support, but I hadn't previously had the opportunity to make use of it. But this seemed like the right job for regular expressions.
The pattern I came up with to match my data table within the SEC's HTML page was:
<tr>\r\n<td.*>(\r\n)?<a href="(?<url>.*?)">(?<releaseno>.*)</a></td>(\r\n)?<td.*>(?<date>.*20[0-9][0-9])</td>(\r\n)?<td.*?><b.*?>(?<title>.*?)</b>
I had to brush up on the particulars of the framework's regex syntax, and it took a little trial and error, but this did the trick. The key is to "capture" the four columns of data and "label" them as url, releaseno, date, and title. These are known as named groups. When a row is matched, the framework makes these four pieces of data available because I have delineated them within the expression string. The steps are to:
1. Construct a Regex instance, passing this regular expression to the constructor.
2. Call the Matches method of this instance, passing in the HTML string representing the page.
3. Loop over each match. Each match is a match of the entire expression - one table row in this instance.
4. For each match, iterate through the "groups". In my expression, I have four named groups corresponding to url, releaseno, date, and title.
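The four steps above can be sketched roughly as follows. This is a minimal illustration rather than the code I actually wrote: the class name RowMatchSketch, the method Describe, and the shortened two-group pattern are all made up for the example, and the real SEC pattern shown earlier would be dropped in the same way.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RowMatchSketch
{
    // Hypothetical, simplified pattern with two named groups
    // (the real pattern above has four).
    public const string Pattern =
        "<td><a href=\"(?<url>.*?)\">(?<releaseno>.*?)</a></td>";

    public static string Describe(string html)
    {
        // Step 1: construct the Regex from the pattern string.
        Regex rowRegex = new Regex(Pattern);
        string result = "";

        // Steps 2 and 3: Matches returns every match in the page; loop over them.
        foreach (Match row in rowRegex.Matches(html))
        {
            // Step 4: read each captured value by its group name.
            result += row.Groups["releaseno"].Value
                    + " -> "
                    + row.Groups["url"].Value
                    + "\n";
        }
        return result;
    }
}
```

Calling RowMatchSketch.Describe on a page fragment returns one line per matched row, pairing the release number with its link.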
My goal was to parse the data into something more structured - XML. So I wrote code that created the following XML from the page using the technique outlined above:
<row>
  <url>/rules/final/34-50295.htm</url>
  <releaseno>34-50295</releaseno>
  <date>Aug. 31, 2004</date>
  <title>Rule 15c3-3 Reserve Requirements for Margin Related to Security Futures Products</title>
</row>
So far so good. But it irked me that this solution was so tied to these two particular web pages. I wanted to make the code a bit more generic. After all, other web pages with embedded tabular data might lend themselves to this type of scraping. Ideally, the caller would only have to specify the regular expression that finds the data. Since the regular expression already contains the "code" to parse out and name the data columns, why not parse the regular expression string itself for those data column names?
How to parse the regular expression string passed to the method? With a regular expression, of course:
\(\?<(\w+)>.*?\)
This expression allows me to parse the caller's regular expression for the data column names (which are the names of the regex groups described above). I do this once at the top of the method, storing these group names in a simple ArrayList. When I loop through the matches, I create an XmlElement for each of these names and set its value by referencing the group by name from within the Match.
foreach (string FieldName in FieldNames) {
    Field = Doc.CreateElement(FieldName);
    Field.InnerText = Entry.Groups[FieldName].Value;
    …
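Pulled together, the generic approach might look something like the sketch below. It is an approximation under stated assumptions, not my actual method: the names TableScraperSketch, GetFieldNames, and Scrape are invented here, the "rows" root element is an assumption (only the "row" element appears in the sample XML above), and a List<string> stands in for the ArrayList.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;

public static class TableScraperSketch
{
    // The meta-pattern from the text: captures the name of each (?<name>...) group.
    private static readonly Regex GroupNameFinder =
        new Regex(@"\(\?<(\w+)>.*?\)");

    // Parses the caller's pattern once and returns the named groups in order.
    public static List<string> GetFieldNames(string pattern)
    {
        List<string> names = new List<string>();
        foreach (Match m in GroupNameFinder.Matches(pattern))
        {
            names.Add(m.Groups[1].Value);
        }
        return names;
    }

    // Builds one <row> per match, with one child element per named group.
    public static XmlDocument Scrape(string html, string pattern)
    {
        List<string> fieldNames = GetFieldNames(pattern);
        XmlDocument doc = new XmlDocument();
        XmlElement root = doc.CreateElement("rows"); // root name is an assumption
        doc.AppendChild(root);

        foreach (Match entry in Regex.Matches(html, pattern))
        {
            XmlElement row = doc.CreateElement("row");
            foreach (string fieldName in fieldNames)
            {
                XmlElement field = doc.CreateElement(fieldName);
                field.InnerText = entry.Groups[fieldName].Value;
                row.AppendChild(field);
            }
            root.AppendChild(row);
        }
        return doc;
    }
}
```

One caveat with the meta-pattern: because it stops at the first closing parenthesis, a named group nested inside another named group would be skipped. The framework's Regex.GetGroupNames method is another way to list group names, though it also returns the numbered groups.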
Now that I've tamed the HTML and put it into well-behaved XML, all that's left is to write an application that polls periodically, stores the results, and compares them to those of the previous poll. That makes it easy to detect when updates have occurred and what they are, and to fire off a message to interested parties. The framework's excellent regular expression support did all the heavy lifting.
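I haven't written the polling application yet, but the comparison step could be sketched along these lines. Everything here is hypothetical: the class PollDiffSketch and the method FindNewRows are invented names, the documents are assumed to hold row elements shaped like the sample XML above, and the url value is assumed to identify a row uniquely. The sketch only finds rows that are new since the previous poll; it would not notice a row whose title or date changed.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class PollDiffSketch
{
    // Returns the <row> elements in 'current' whose <url> text
    // did not appear in 'previous'.
    public static List<XmlElement> FindNewRows(XmlDocument previous, XmlDocument current)
    {
        // Collect every url seen in the previous poll.
        HashSet<string> seenUrls = new HashSet<string>();
        foreach (XmlNode urlNode in previous.SelectNodes("//row/url"))
        {
            seenUrls.Add(urlNode.InnerText);
        }

        // Keep only the current rows whose url is not in that set.
        List<XmlElement> newRows = new List<XmlElement>();
        foreach (XmlNode rowNode in current.SelectNodes("//row"))
        {
            XmlNode urlNode = rowNode.SelectSingleNode("url");
            if (urlNode != null && !seenUrls.Contains(urlNode.InnerText))
            {
                newRows.Add((XmlElement) rowNode);
            }
        }
        return newRows;
    }
}
```

Each returned row already carries the release number, date, and title, so it has what a notification message would need.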