Tuesday, September 28, 2004

Screen Scraping the Regular Way

Recently, I was demonstrating an RSS client for one of our partners with the idea that attorneys might be interested in being notified when updates are detected on certain websites pertinent to our legal practice. He seemed intrigued by the publish/subscribe metaphor being demonstrated and asked about using it to be notified when two particular pages of the SEC's website are updated. One is for proposed rules (http://www.sec.gov/rules/proposed.shtml) and one is for final rules (http://www.sec.gov/rules/final.shtml).

After poking around on the SEC's site, I determined - much to my consternation - that this info was not being published via RSS or any other syndication format, nor was the information available through any web service that I could find. So after exhausting the elegant possibilities, I resigned myself to the fact that I'd have to resort to screen scraping.

Reading the HTML pages is simple enough with the framework. The following snippet puts the HTML into a string:

// Requires System.Net and System.IO.
string HTML;
HttpWebRequest Request = (HttpWebRequest) WebRequest.Create(url);
HttpWebResponse Response = (HttpWebResponse) Request.GetResponse();

// Read the entire response body into a string.
Stream Stm = Response.GetResponseStream();
using (StreamReader Reader = new StreamReader(Stm)) {
    HTML = Reader.ReadToEnd();
}
Response.Close();

The next step was to parse through the HTML page and find the data I was interested in, which was stored in an HTML table. My first thought was to use Microsoft's MSHTML COM object to read the page into the HTML Document Object Model (DOM), and traverse the DOM until I found the data of interest. I had some difficulty determining how to load up the DOM (I've only used it from within DHTML scripting) and turned to Usenet for a quick answer. I fired up some Google queries that mentioned screen scraping - under the assumption that others might have taken this approach - and instead of information about the DOM, I came across posts about regular expressions.

This struck a resonant chord. I am relatively proficient with regular expressions, and had remembered that the framework provided good regular expression support, but I hadn't previously had the opportunity to make use of it. But this seemed like the right job for regular expressions.

The pattern I came up with to match my data table within the SEC's HTML page was:

<tr>\r\n<td.*>(\r\n)?<a href="(?<url>.*?)">(?<releaseno>.*)</a></td>(\r\n)?<td.*>(?<date>.*20[0-9][0-9])</td>(\r\n)?<td.*?><b.*?>(?<title>.*?)</b>

I had to brush up on the particulars of the framework's regex syntax, and it took a little trial and error, but this did the trick. The key is to "capture" the four columns of data and "label" them as url, releaseno, date, and title. These are known as regular expression groups. So when a row is matched, the framework makes these four pieces of data available because I have delineated them appropriately within the expression string. The steps (sketched in code after this list) are to:

1. Construct a Regex instance, passing this regular expression to the constructor.
2. Call the Matches method of this instance, passing in the HTML string representing the page.
3. Loop over each Match. Each Match is a match of the entire expression - one table row in this case.
4. For each Match, iterate through the groups. In my expression, I have four groups corresponding to url, releaseno, date, and title.
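
Here is a minimal sketch of those four steps. The variable names are mine, and it assumes the pattern above is held in a string named pattern and that System.Text.RegularExpressions is imported:

Regex RowPattern = new Regex(pattern);            // 1. construct the Regex
MatchCollection Rows = RowPattern.Matches(HTML);  // 2. match it against the page

foreach (Match Row in Rows) {                     // 3. one Match per table row
    // 4. pull each named group out of the Match
    string Url = Row.Groups["url"].Value;
    string ReleaseNo = Row.Groups["releaseno"].Value;
    string Date = Row.Groups["date"].Value;
    string Title = Row.Groups["title"].Value;
}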

My goal was to parse the data into something more structured - XML. So I wrote code that created the following XML from the page using the technique outlined above:

<row>
  <url>/rules/final/34-50295.htm</url>
  <releaseno>34-50295</releaseno>
  <date>Aug. 31, 2004</date>
  <title>Rule 15c3-3 Reserve Requirements for Margin Related to Security Futures Products</title>
</row>

So far so good. But it irked me that this solution was so tied to these two particular web pages. I wanted to make the code a bit more generic. After all, there may be other web pages with embedded tabular data that lend themselves to this type of scraping. Ideally, it would be nice to only have to specify the regular expression that finds the data. Since the regular expression contains the "code" to parse out and name the data columns, why not parse the regular expression string itself for those data column names?

How to parse the regular expression string passed to the method? With a regular expression, of course:

\(\?<(\w+)>.*?\)

This expression allows me to parse the caller's regular expression for the data column names (which are the names of the regex groups described above). I do this once at the top of the method, storing the group names in a simple ArrayList. When I loop through the matches, I create an XmlElement for each of these names and set its value by referencing the group by name from within the Match.
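
First, the one-time pass that harvests the group names - a sketch under my own naming assumptions (callerPattern stands in for the caller's regular expression string, and it assumes System.Collections and System.Text.RegularExpressions are imported):

// Harvest the named groups from the caller's pattern into FieldNames.
ArrayList FieldNames = new ArrayList();
Regex GroupNamePattern = new Regex(@"\(\?<(\w+)>.*?\)");
foreach (Match GroupName in GroupNamePattern.Matches(callerPattern)) {
    FieldNames.Add(GroupName.Groups[1].Value);
}

Then, inside the loop over the matches: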

foreach (string FieldName in FieldNames) {
    Field = Doc.CreateElement(FieldName);
    Field.InnerText = Entry.Groups[FieldName].Value;
    // ... the new element is then appended to the <row> element being built
}

Now that I've tamed the HTML and put it into well-behaved XML, all that's left is to write an application to poll periodically, store the results, and compare them to those of the previous poll. I'll easily detect when and what updates have occurred, and I'll fire off a message to interested parties. The framework's excellent regular expression support did all the heavy lifting.
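
For the comparison between polls, I'm picturing something along these lines - a rough sketch only; the file names and the use of releaseno as the key are assumptions, and it leans on System.Xml and System.Collections:

// Compare the current scrape against the previous one, keyed on releaseno.
XmlDocument Previous = new XmlDocument();
Previous.Load("previous.xml");
XmlDocument Current = new XmlDocument();
Current.Load("current.xml");

// Remember every release number we have already seen.
Hashtable Seen = new Hashtable();
foreach (XmlNode OldRow in Previous.SelectNodes("//row"))
    Seen[OldRow["releaseno"].InnerText] = true;

// Anything in the current scrape that we haven't seen is an update.
foreach (XmlNode NewRow in Current.SelectNodes("//row"))
    if (!Seen.ContainsKey(NewRow["releaseno"].InnerText))
        Console.WriteLine("New rule: " + NewRow["title"].InnerText);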

Sunday, September 05, 2004

Parsimonious parsing and System.Configuration

Not long after I got into .NET development, I learned of the System.Configuration namespace. Initially, I simply used the <appSettings> section and the corresponding System.Configuration.ConfigurationSettings.AppSettings NameValueCollection. For small amounts of application configuration, this has great appeal.
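
For example (the key name here is purely illustrative), the section is just name/value pairs:

<appSettings>
  <add key="pollIntervalMinutes" value="30" />
</appSettings>

which you read back with a one-liner (assuming System.Configuration is imported):

string interval = ConfigurationSettings.AppSettings["pollIntervalMinutes"];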

Later explorations led me to the use of custom sections within the configuration file. If you include a <configSections> section in your configuration file, you can list your custom configuration sections with child <section> elements. Each <section> element indicates a name for the section and a string that identifies the class (and containing assembly) that can interpret the section. The actual custom section can then appear lower down in the configuration file, and it can include much more complex data, since it's simply XML and you provide the class to interpret it.
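
As a sketch, the declaration and the custom section might look something like this - the section name and the type string are assumptions modeled on the handler shown later, not the actual names from my project:

<configuration>
  <configSections>
    <section name="checkerConfiguration"
             type="Checker.CheckerConfigurationHandler, Checker" />
  </configSections>

  <checkerConfiguration>
    <!-- arbitrary XML that your handler class knows how to interpret -->
  </checkerConfiguration>
</configuration>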

So what’s involved in providing the class to read the custom section? Not much. It has to implement IConfigurationSectionHandler and provide its sole member, Create. When your code calls System.Configuration.ConfigurationSettings.GetConfig(<your section name>), the framework invokes your Create method to hydrate an object from the xml that is passed via the XmlNode section argument.
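
Retrieving it is then a one-liner (using the hypothetical section name from the sketch above):

CheckerConfiguration Config =
    (CheckerConfiguration) ConfigurationSettings.GetConfig("checkerConfiguration");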

My early implementations of a Create method were crude. In a nutshell, the code I wrote traversed the XML DOM, interpreted the data therein, and stored it in a custom class that was the method’s return value. There is nothing wrong with doing it this way, but it is tedious and, as I later learned, unnecessary.

One of the fundamental principles I've learned over the years is that if you find yourself doing something tedious, it's time to reexamine it and look for an easier (and usually more elegant) way. In the case of implementing IConfigurationSectionHandler.Create(), it is far easier to leverage the power of XML serialization than it is to parse the data yourself. I believe I first caught wind of this technique in an MSDN article, and if I can find it again, I'll add it to this entry. The gist of the idea is that if your custom configuration data can be stored in a class that can be serialized to XML, then it can be deserialized from the XML fragment that is your custom configuration section.

Rather than create the class to hold my configuration information, I find it simpler to define the XML schema (xsd) file that describes it. The class definition can then be generated with

Xsd /classes config.xsd
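
As an illustration only (the element names are made up, not my actual schema), a minimal config.xsd might look like:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="CheckerConfiguration">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="pollIntervalMinutes" type="xs:int" />
        <xs:element name="notifyAddress" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>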

Xsd generates a config.cs file for you containing the class that can be deserialized from the XML section. Here is a snippet that shows how simple it is to implement the Create method using XML deserialization.

public object Create(object parent, object configContext, System.Xml.XmlNode section) {
    CheckerConfiguration Result;
    XmlSerializer Hydrator;
    XmlReader Reader;

    // Deserialize the custom section directly into the generated class.
    Hydrator = new XmlSerializer(typeof(CheckerConfiguration));
    Reader = new XmlNodeReader(section);
    try {
        Result = (CheckerConfiguration) Hydrator.Deserialize(Reader);
    }
    finally {
        Reader.Close();
    }
    return Result;
}


Digression

Although my blog is intended to be of a technical nature, I feel compelled to provide this glimpse into my life. I am a lifelong New York Giants football fan. Not a true fanatic -- just a run-of-the-mill fan. For those of you who don't know the NFL, the Giants play in the NFC East, and our traditional rivals are the Philadelphia Eagles, the Dallas Cowboys, and the Washington Redskins.

I met my wife Pilar twenty years ago in Washington, DC. And unlike many American wives, she watches football with me. She embraces football. She LOVES football. One would think that any red-blooded American husband would look at this as a good thing. But sometimes it's hard to live with the "biggest Redskins fan known to man"!

I've already told my children that they shouldn't plan on college because Pilar is turning my basement into a Redskins sports bar -- one item at a time -- through eBay auctions. Each night, after I return from work, she enthusiastically relates to me each of the day's items that she "won"! I'm including a picture of this work in progress here.


Most weeks we go to a sports bar to watch the 'Skins because we are in NY and her games are typically not televised. So in we walk, me with my Giants jersey, and Pilar with her Redskins jersey, jacket, hat, earrings, rings, bracelets, key chain, beer cooler, etc. We watch the games together and typically empathize with each other about our respective teams' fortunes, but when the 'Skins play the Giants -- normally twice a year -- we're talking about a whole different dynamic. That involves maintaining a safety buffer of at least four or five bar stools and a subsequent twenty-four hour cooling-off period.

Anyway, another man might feel emasculated by the fact that his wife was turning his home and castle into a shrine to his team's nemesis. But not me. I have refused to allow her or my two beloved daughters to prevent me from wearing the pants in my family. As proof of that, I give you our family dog:


Man's Dog
If you want to look thinner, always photograph yourself with a large marine mammal.

Shamu and I at Tech-Ed San Diego

Saturday, September 04, 2004

Exploring SQL Server 2000 Reporting Services

In creating my new Conflict Report, I had anticipated the need to produce multiple formats. SQL Server 2000 Reporting Services -- which is used to create the report -- allows for multiple output formats. The initial report was designed to look very much like the one in use now at my firm. That is, it is designed for the printed page, and the PDF output format fits the bill. Since the report submission process is now streamlined to the point where requests can come in via emails initiated on Blackberries, the next logical step is to produce a version of the Conflict Report suitable for display on a Blackberry. My first inclination was to simply create another report project where the output was flattened into a simple

Label: Value

format. And I certainly can do that. I would render this in PDF via Reporting Services, and the Blackberry user could then view the PDF (or, alternately, I might render it as an Excel file, since those render reasonably well on Blackberries). However, it occurred to me that it might be easier and more elegant to leverage Reporting Services' ability to render my existing report into XML. An XSL transform could then be used to produce a format suitable for the Blackberry...
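
If I do go the XML route, the transform step itself should be simple; here's a minimal sketch using the framework's XslTransform class (the file names are placeholders, and it assumes System.Xml.Xsl is imported):

// Transform the XML rendering of the report into Blackberry-friendly markup.
XslTransform Transform = new XslTransform();
Transform.Load("ConflictReportToBlackberry.xsl");
Transform.Transform("ConflictReport.xml", "ConflictReport.html");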

Impersonating someone who understands ASP.NET Web Service Security

Wednesday, I was briefly working on a web service to deal with a rare problem with KDocs document returns. I decided to write a utility, potentially available to domain users, that would correct the situation after the fact. It would connect to the appropriate database server, confirm that the symptoms matched those known for this particular problem, and then correct it. That requires that the web service connect to the database as a semi-privileged SQL user. The easiest way would be to use SQL authentication. However, I didn't think of that right away and had decided to rely on an Integrated Security connection.

When the web service runs, it runs under the local ASPNET account by default, and I got errors connecting to the database with that identity.

The solution was:
<identity impersonate="true" userName="domain\priveduser" password="secret" />

The <identity> tag, a child of <system.web>, does not require the userName and password attributes. They are only required if you want to impersonate someone other than the caller.

Which brings me to my next lesson. The next morning, while riding the train, I worked on the very same project on my laptop. I was thinking that since only domain users should ever be able to invoke this web service, I should turn off Anonymous Access and leave only Windows Authentication enabled. The next time I ran the service (from my WinForms test application), I got an HTTP 401 - Access Denied error, and I struggled with it for the entire ride. After all, I am in my laptop's local Administrators group. So what was it? At first I thought it was mis-set ACLs somewhere in the c:\inetpub tree, and indeed I did find entries for laptop\Pinsley that seemed to allow no access. But the problem persisted even after I corrected what I now perceive to have been a red herring.

As I pulled into Grand Central, I came upon the solution. You have to explicitly set the network credentials on the web service proxy. The magic line was:

FixTool = new localhost.Fixer();
FixTool.Credentials = System.Net.CredentialCache.DefaultCredentials;

That did the trick. So if you want to disable Anonymous access to your web service, remember to set your credentials! Live and learn...

Welcome to my Blog

As I delve into the world of blogging, I envision it, at least initially, as a journal primarily for my own consumption. I intend to use it to record the projects and tasks that occupy my day, and I hope to draw upon it later to learn from my past. To start things off, I'll note that yesterday I found that code I thought was properly installed on a Lotus Notes server was behaving erratically: sometimes it would run correctly, and other times it would run as if the recent changes I had made weren't installed. It turned out that the process invoking the COM proxy was the Notes Agent -- and there were four such processes allocated by Notes.

So the steps I had to take were:
Copy the modified assembly to the target folder.

gacutil -u KDConflictProxy
gacutil -i KDConflictProxy.dll
regasm /tlb:KDConflictProxy.tlb KDConflictProxy.dll

then...

tell amgr quit
load amgr

One of the agent processes had held open an older version of the component; it was invoked seemingly randomly and the resulting intermittent behavior was driving me crazy.