Monday, March 16, 2009

Getting Closure Calling SQL Stored Procedures – Keeping it DRY


When a new technology, tool, or concept emerges on my horizon, I will note it for future reference, but typically not take the time to learn it unless and until I have a real world problem to solve. Such was the case with lambda expressions and generics. I read about them but never fully appreciated their usefulness and expressiveness. When I found a real world use in my own code, I finally felt that the concepts were sinking in; I don’t think I could have gotten there with a contrived example.

For me, the opportunity to solve a problem presented itself during the coding of a Data Access Layer (DAL) for my latest project. Although the latest technologies from Microsoft – LINQ to SQL and the Entity Framework – should eliminate much of the programmer’s burden with respect to developing a DAL, for one large project, I am still using the time-tested technique of passing strongly typed DataTable and DataRow objects from my DAL to my business layer. In addition, SQL stored procedures are used throughout the DAL.

Visual Studio has some terrific, easy-to-use features that make creating a DAL simple. Need a stored procedure to return a DataTable object? No problem. Drop a SqlDataAdapter onto the design surface, create a quick query – visually if you like – and you’re a few clicks away from generated code both for the stored procedure itself and for its invocation complete with parameter passing and the code necessary to fill your strongly typed DataTable. However, I felt the generated code lacked one important feature.

When my business layer makes a call into the DAL, I find it useful to signal various error conditions with custom exceptions. I typically write my DAL’s stored procedures so that I can signal various error conditions by returning distinct numeric values via the procedure’s return statement. The general pattern is to first call RAISERROR with a meaningful message for the user, and then return a status code as illustrated in this example:

if (not exists(select * from dbo.Repositories where Repository_ID = @Repository_ID))
begin
    raiserror('Repository %d does not exist.', 16, 1, @Repository_ID)
    return 1001 --Special return value: NO_SUCH_REPOSITORY
end

When the procedure signals an error condition in this manner, the calling code will get a SqlException exception. I write my DAL code to catch SqlExceptions, retrieve the procedure’s return value through the SqlCommand object – available if you pass in a parameter with its Direction property set to ParameterDirection.ReturnValue -- and potentially use that value to re-throw a more meaningful custom exception with the original SqlException embedded within the innerException property. Here is the catch block with two helper methods:

catch (SqlException E) {
    Exception TransformedException;
    if ((TransformedException = TransformSqlException(E, Cmd)) != null)
        throw TransformedException;
    else
        throw;
}

//Get the procedure’s return value and, if it is one of our custom values, return a
//more expressive exception type
private static Exception TransformSqlException(SqlException E, SqlCommand Cmd) {
    Exception TransformedException = null;
    if (HasValidReturnValue(Cmd)) {
        int ReturnValue = (int)Cmd.Parameters["@RETURN_VALUE"].Value;
        switch (ReturnValue) {
            case (int)StoredProcCodes.NO_SUCH_REPOSITORY:
                TransformedException = new NoSuchRepositoryException(E.Message, E);
                break;
            case (int)StoredProcCodes.NO_JOBS_ELIGIBLE:
                TransformedException = new NoJobsEligibleException(E.Message, E);
                break;
            case (int)StoredProcCodes.NO_SUCH_REQUEST:
                TransformedException = new NoSuchRequestException(E.Message, E);
                break;
        }
    }
    return TransformedException;
}

//See if a return value parameter is available. By convention, I always
//name mine @RETURN_VALUE. (Note: the SqlParameterCollection indexer throws
//if the name is absent, so check Contains first.)
public static bool HasValidReturnValue(SqlCommand Cmd) {
    if (!Cmd.Parameters.Contains("@RETURN_VALUE"))
        return false;
    return Cmd.Parameters["@RETURN_VALUE"].Value != null;
}

I find that adhering to the discipline of throwing custom exceptions leads to a more robust interface between the business layer and the DAL. The code generated by Visual Studio to invoke stored procedures does not provide a way to get to the procedure’s return value, and thus offers no opportunity to throw anything other than the original SqlException. (If I’m wrong about that, I’d love to be corrected.) Consequently, I write the code to invoke the stored procedures myself, and while writing it for the umpteenth time, I realized there must be a better way.

These DAL procedures come in two flavors: those that take input parameters and return a rowset used to fill either a DataTable or a DataRow object, and those that take input parameters, do some work, and return one or more output parameters. In each case, the code that invokes the procedure is highly repetitive. And repetitive code has a “smell”: it violates the DRY principle (Don’t Repeat Yourself) and cries out for refactoring.

To illustrate, take the case of calling a stored procedure to return a single row of data. The DAL code should return a strongly typed DataRow object or throw a custom exception indicating what the problem is. Here is a sample of such a method that takes two input arguments and is responsible for returning a DataRow object encapsulating a single row.

private static JobAssignmentRow GetJobAssignment(int EngineId, string OfficeAbbr) {

    var ResultTable = new JobAssignmentDataTable();

    using (SqlConnection Connection = GetConnection()) {
        using (SqlCommand Cmd = new SqlCommand("dbo.GetJobAssignment", Connection)) {
            Connection.Open();
            Cmd.CommandType = CommandType.StoredProcedure;

            Cmd.Parameters.Add("@RETURN_VALUE", SqlDbType.Int).Direction = ParameterDirection.ReturnValue;
            Cmd.Parameters.Add("@EngineId", SqlDbType.Int).Value = EngineId;
            Cmd.Parameters.Add("@OfficeAbbr", SqlDbType.VarChar, 3).Value = OfficeAbbr;

            try {
                using (SqlDataAdapter Adapter = new SqlDataAdapter(Cmd)) {
                    Adapter.Fill(ResultTable);
                }
                return (JobAssignmentRow)ResultTable.Rows[0];
            }
            catch (SqlException E) {
                Exception TransformedException;
                if ((TransformedException = TransformSqlException(E, Cmd)) != null)
                    throw TransformedException;
                else
                    throw;
            }
        }
    }
}

But what is truly variable here? There’s a lot of boilerplate code. The using blocks for the allocation of a SqlConnection, a SqlCommand, and a SqlDataAdapter are common to all these types of methods, and so is the exception handling. The main variability is in setting the stored procedure’s parameter values – and even there, by convention, I always set the @RETURN_VALUE parameter. The other piece of variability is the return type – in this case a strongly typed DataRow descendant – JobAssignmentRow. I needed to find a way to factor out the variant from the invariant code; and for this, two concepts worked perfectly – generics and lambda expressions.

Generics allow me to generalize the return type. In the use case in question, the code uses a SqlDataAdapter to fill a DataTable object and return its first row – a DataRow object. So it was clear that I had to specify two type parameters. My first attempt at refactoring looked like this.

private static RowType ReturnDataObject<RowType,TableType>(string StoredProcedure) {

    var ResultTable = new TableType();

    using (SqlConnection Connection = GetConnection()) {
        using (SqlCommand Cmd = new SqlCommand(StoredProcedure, Connection)) {
            Connection.Open();
            Cmd.CommandType = CommandType.StoredProcedure;

            //PARAMETERS need to be set here

            try {
                using (SqlDataAdapter Adapter = new SqlDataAdapter(Cmd)) {
                    Adapter.Fill(ResultTable);
                }
                return (RowType)ResultTable.Rows[0];
            }
            catch (SqlException E) {
                Exception TransformedException;
                if ((TransformedException = TransformSqlException(E, Cmd)) != null)
                    throw TransformedException;
                else
                    throw;
            }
        }
    }
}

I created a helper method called ReturnDataObject that declares two type parameters and a regular parameter for the caller to specify the name of the stored procedure to invoke. Aside from the issue of setting parameters (discussed below), this looks pretty clean. However, it didn’t compile, producing the following four errors:

error CS0304: Cannot create an instance of the variable type 'TableType' because it does not have the new() constraint

error CS1502: The best overloaded method match for 'System.Data.Common.DbDataAdapter.Fill(System.Data.DataTable)' has some invalid arguments

error CS1503: Argument '1': cannot convert from 'TableType' to 'System.Data.DataTable'

error CS1061: 'TableType' does not contain a definition for 'Rows' and no extension method 'Rows' accepting a first argument of type 'TableType' could be found

The errors are all complaints from the compiler about my usages of the type parameters – essentially assumptions about these types that I’m making but not telling the compiler about. The first line of my code:

var ResultTable = new TableType();

is calling a parameterless constructor of the TableType type parameter. The compiler wants to ensure that there is such a constructor to call on the concrete type that will be specified by the caller of this method. Additionally, the call to the SqlDataAdapter’s Fill method requires a DataTable object. We have to assure the compiler that we are giving it one by restricting usage of this method to callers that specify a compatible concrete type. Similarly, the compiler doesn’t know that there is a valid Rows property because, again, we haven’t indicated that the second type parameter is a DataTable descendant. All of these problems are solved with C#’s generic constraints, specified in a where clause. The declaration of this method now looked like this:

private static RowType ReturnDataObject<RowType,TableType>(string StoredProcedure)
where RowType : DataRow
where TableType : DataTable, new() {

This indicates to the compiler that the RowType type must descend from DataRow and the TableType type must descend from DataTable. The new() clause ensures that the TableType class has an accessible parameterless (default) constructor. The method now compiled cleanly.
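The same constraint mechanics can be seen in a stripped-down, self-contained sketch (the Animal and Dog types here are invented purely for illustration and have nothing to do with the DAL):

```csharp
using System;

class GenericConstraintDemo {
    class Animal { public string Name = "generic animal"; }
    class Dog : Animal { public Dog() { Name = "dog"; } }

    // T must descend from Animal AND have a parameterless constructor,
    // mirroring the where clauses on ReturnDataObject. Without new(),
    // the 'new T()' below would fail with CS0304, just as in the post.
    static T MakeOne<T>() where T : Animal, new() {
        return new T();
    }

    static void Main() {
        Console.WriteLine(MakeOne<Dog>().Name); // prints "dog"
    }
}
```

Because of the Animal constraint, the compiler also lets MakeOne’s body access members of Animal (here, Name) on a value of type T, just as the DataTable constraint unlocks Rows in ReturnDataObject.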

But what about the last piece of the puzzle? How was the caller going to specify the input parameters unique to the invoked procedure? And in the case of a procedure that sets output parameters, how was the caller going to get at them?

It was apparent that I had to pass in a delegate type for a method that takes a SqlCommand object as a parameter. In my new ReturnDataObject implementation, I would call back through the delegate to allow the caller to set the input parameters appropriately. A second delegate could be used to call back when output parameters might be accessed as would be the case after a call to SqlCommand.ExecuteNonQuery (although that technique is not used in the ReturnDataObject method).

The delegate type I used looked like this:

delegate void CmdAccessProc(SqlCommand Cmd);

and my ReturnDataObject method now took its final form:

private static RowType ReturnDataObject<RowType,TableType>(string StoredProcedure, CmdAccessProc FillParamsMethod)
    where RowType : DataRow
    where TableType : DataTable, new() {

    var ResultTable = new TableType();

    using (SqlConnection Connection = GetConnection()) {
        using (SqlCommand Cmd = new SqlCommand(StoredProcedure, Connection)) {
            Connection.Open();
            Cmd.CommandType = CommandType.StoredProcedure;

            Cmd.Parameters.Add("@RETURN_VALUE", SqlDbType.Int).Direction = ParameterDirection.ReturnValue;
            FillParamsMethod(Cmd);

            try {
                using (SqlDataAdapter Adapter = new SqlDataAdapter(Cmd)) {
                    Adapter.Fill(ResultTable);
                }
                return (RowType)ResultTable.Rows[0];
            }
            catch (SqlException E) {
                Exception TransformedException;
                if ((TransformedException = TransformSqlException(E, Cmd)) != null)
                    throw TransformedException;
                else
                    throw;
            }
        }
    }
}

Since I always create a parameter to retrieve the procedure’s return value, I put this into the ReturnDataObject method and relieved all the method’s callers of that burden. The CmdAccessProc delegate – parameter FillParamsMethod – is invoked after the SqlCommand object is constructed and before it is used to construct the SqlDataAdapter that fills the result table. This method now looks totally generic.

What would the code look like to call this method? Although you could create a delegate method and pass it in, I found the new C# lambda syntax to be the most natural fit.

Lambdas provide a concise way to define and pass an anonymous delegate into a method. I see three advantages to using them:

1. You can define the method concisely, right where the delegate would normally be passed as an argument.
2. The code is smaller and tighter because the compiler makes inferences about the lambda expression’s parameter types.
3. They provide the powerful closure property (more on that below).
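To make the first two points concrete, here is a minimal, self-contained sketch (the delegate and method names are invented for illustration; they are not part of the DAL) showing the same callback passed first as a named method and then as a lambda:

```csharp
using System;

class LambdaDemo {
    // A delegate type analogous to CmdAccessProc, but returning a string
    // so the result is easy to check.
    delegate string TextProc(string text);

    // The "old" way: a separate named method matching the delegate signature.
    static string Shout(string text) {
        return text.ToUpper() + "!";
    }

    static string Apply(TextProc proc, string input) {
        return proc(input);
    }

    static void Main() {
        // A named method and a lambda are interchangeable wherever the
        // delegate is expected; the lambda's parameter type is inferred.
        Console.WriteLine(Apply(Shout, "hello"));                        // HELLO!
        Console.WriteLine(Apply(text => text.ToUpper() + "!", "hello")); // HELLO!
    }
}
```

The lambda version needs no separate method declaration and no explicit type on its parameter, which is exactly what keeps the DAL call sites small.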

Here is the code to invoke a stored procedure that takes two input parameters and returns a DataRow descendant type – JobAssignmentRow.

public static JobAssignmentRow GetJobAssignment(int EngineId, string OfficeAbbr) {

    return ReturnDataObject<JobAssignmentRow,JobAssignmentDataTable>
        ("dbo.GetJobAssignment",
         Cmd => {
             Cmd.Parameters.Add("@EngineId", SqlDbType.Int).Value = EngineId;
             Cmd.Parameters.Add("@OfficeAbbr", SqlDbType.VarChar, 3).Value = OfficeAbbr;
         }
        );
}

Note that the two concrete class types are specified in the angle brackets in the call to ReturnDataObject. The first argument to the method is the procedure name, and the second argument is the delegate – defined inline using the new lambda expression syntax. The compiler treats the argument(s) to the left of the => as inputs to the delegate; its implementation appears to the right of the =>. The compiler is smart enough to infer that Cmd is the input argument to the delegate method and is of type SqlCommand. It does this by examining the argument types of the ReturnDataObject method and noting that the second argument is a CmdAccessProc delegate, and that a method with that signature takes a SqlCommand as its only parameter. So the compiler creates this anonymous delegate and, when it is called, assigns the passed SqlCommand object to the Cmd variable.

As you can see in the example, the lambda’s body, which appears to the right of the =>, sets the input parameters – in this case for a stored procedure called “GetJobAssignment”. Something subtle but very powerful is going on here. When setting the parameters of the SqlCommand object, the anonymous delegate references the variables EngineId and OfficeAbbr – both input parameters of the GetJobAssignment method. These variables have local scope – that is, they are only accessible within the GetJobAssignment method. Yet they are being referenced within the anonymous delegate. This is possible because lambda expressions have the property of closure: the scope of the creating method is carried through to the lambda expression’s implementation, so any local variables of the outer method become accessible within the body of the lambda expression.
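The closure property can be seen in isolation with a toy example (the names here are invented for illustration and are not part of the DAL):

```csharp
using System;

class ClosureDemo {
    // Invokes the supplied delegate a fixed number of times.
    static void Repeat(int times, Action body) {
        for (int i = 0; i < times; i++)
            body();
    }

    static void Main() {
        int total = 0;                 // a local variable of Main
        Repeat(3, () => total += 10);  // the lambda captures 'total' from Main's scope
        Console.WriteLine(total);      // prints 30
    }
}
```

Even though the lambda executes inside Repeat, it reads and writes total, a local of Main – exactly the mechanism that lets the DAL lambdas reference EngineId and OfficeAbbr.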

As a final example, consider the second flavor of stored procedures – those that take input parameters and set one or more output parameters. To handle this case, I use this method:

private static void InvokeNonTabularStoredProcedure(string StoredProcedure, CmdAccessProc FillParamsMethod,
                                                    CmdAccessProc AccessOutputParamsMethod) {
    using (SqlConnection Connection = GetConnection()) {
        using (SqlCommand Cmd = new SqlCommand(StoredProcedure, Connection)) {
            Connection.Open();
            Cmd.CommandType = CommandType.StoredProcedure;

            Cmd.Parameters.Add("@RETURN_VALUE", SqlDbType.Int).Direction = ParameterDirection.ReturnValue;
            FillParamsMethod(Cmd);

            try {
                Cmd.ExecuteNonQuery();
                if (AccessOutputParamsMethod != null)
                    AccessOutputParamsMethod(Cmd);
            }
            catch (SqlException E) {
                Exception TransformedException;
                if ((TransformedException = TransformSqlException(E, Cmd)) != null)
                    throw TransformedException;
                else
                    throw;
            }
        }
    }
}

As I’m not returning DataTable or DataRow objects, there is no need for generics. But the method now accepts two delegates when called – one to set the input parameters and one to optionally access output parameters. The first delegate, FillParamsMethod, serves the same purpose as it did in the ReturnDataObject method. It is called prior to invoking the stored procedure via the SqlCommand’s ExecuteNonQuery method. The second delegate is optional; if non-null, it gives the caller an opportunity to access the procedure’s output parameters. This is illustrated in this sample call to InvokeNonTabularStoredProcedure:

//Submit a document and return the assigned request id.
public static int SubmitDocPushByTag(string Requestor, int Case_ID, int DBId,
                                     string TagName, int Category_ID, int ExpectedDocCount, int Priority) {

    int PushByTagRequest_ID = 0;

    InvokeNonTabularStoredProcedure("dbo.usp_Documents_SubmitDocPushByTag",

        Cmd => {
            Cmd.Parameters.Add("@Requestor", SqlDbType.VarChar, 255).Value = Requestor;
            Cmd.Parameters.Add("@Case_ID", SqlDbType.Int).Value = Case_ID;
            Cmd.Parameters.Add("@DBId", SqlDbType.Int).Value = DBId;
            Cmd.Parameters.Add("@TagName", SqlDbType.VarChar, 255).Value = TagName;
            Cmd.Parameters.Add("@Category_ID", SqlDbType.Int).Value = Category_ID;
            Cmd.Parameters.Add("@ExpectedDocCount", SqlDbType.Int).Value = ExpectedDocCount;
            Cmd.Parameters.Add("@Priority", SqlDbType.Int).Value = Priority;

            Cmd.Parameters.Add("@PushByTagRequest_ID", SqlDbType.Int).Direction = ParameterDirection.Output;
        },

        Cmd => {
            PushByTagRequest_ID = (int)Cmd.Parameters["@PushByTagRequest_ID"].Value;
        }
    );

    return PushByTagRequest_ID;
}

Not counting the @RETURN_VALUE parameter automatically supplied by InvokeNonTabularStoredProcedure, seven input parameters are created and added to the SqlCommand object. The eighth is an output parameter that must be referenced after the call to ExecuteNonQuery. Once again, you can see the power of the closure property. Notice that in the body of the second delegate, the value of the output parameter is assigned to a variable called PushByTagRequest_ID. That variable is in the scope of the SubmitDocPushByTag method – but it is available to be set within the body of the lambda expression. In this particular instance, the method simply returns that value, but you can of course use the technique to do more elaborate things – perhaps constructing some object from the values of the procedure’s output parameters and returning that.

I had used generics before, having grown to love and rely upon the System.Collections.Generic namespace. But it wasn’t until I found a use for them in my own code that I came to appreciate their power. Similarly, lambda expressions and their esoteric closure property were simply interesting concepts in the many LINQ examples I had seen. Again, finding a place for them in my own code was the key to gaining an appreciation of their expressiveness. So be on the lookout for opportunities to try out these and other powerful constructs in your own code. You’ll be a better programmer for it.

Saturday, September 13, 2008

Blast from the Past

Has it been six years since .NET came on the scene? Microsoft's DNA architecture seems so '90s... I had to take an unscheduled trip to visit some code I had written back in 1999, and I thought I'd chronicle the problem and the solution.

I wrote my firm's document management system back in 1999. It uses a three tier architecture -- a Visual Basic thick client speaking DCOM to middle tier COM+ components written in C++ using ATL. It's been very successful, has scaled extremely well, and is run in all eight of our offices. Each office hosts a back-end server that contains the COM+ application (KDocs -- consisting of 18 separate components) and the SQLServer. The SQLServer is typically on the same back-end server, but need not be.

We recently migrated the back-end server in our largest office -- New York -- from a MSC cluster to a new virtual machine hosted on VMWare's ESX technology. Since the location of the COM+ application had moved from the old server to a new one with a different name, I had to redirect all the clients so that they activated the COM+ application on the new server. The procedure was old hat as I had done essentially the same thing for several of my smaller offices that had gone through similar infrastructure upgrades.

All seemed routine, and on Monday morning the entire office -- about 1,000 Windows XP workstations -- was running without incident on the new server. But then the call came from my mobile group -- an attorney working from home over a VPN connection was getting a strange error after being redirected to the new server:

Error on FillTreeView2 - The stub received bad data.

Huh? I had never seen this error message before. Was it the new server? But all the workstations in the office were working fine. I told the mobile group to switch the attorney back to the old server (which was still up), and the error disappeared. So what was the difference? Turns out this attorney was running Vista at home.

We don't run Vista in any of our offices, but we do have some attorneys that run Vista at home (certainly some in my New York office). I do as well and I've never seen this problem. To confirm that there was an issue, I fired up my Vista laptop, pointed it to the new server, and got the same error. I pointed it back to the old server, and it worked fine. Clearly there was some problem with Vista and the components on the new server -- a problem that did not seem to affect XP clients. What could it be?

Next stop -- the application error log on my laptop. This yielded more information on the error:
Source: Microsoft-Windows-RPC-Events
Date: 9/2/2008 11:56:07 AM
Event ID: 10
Level: Error
Computer: DevLaptop
Description: Application has failed to complete a COM call because an incorrect interface ID was passed as a parameter.

The expected Interface ID was 00000555-0000-0010-8000-00aa006d2ea4,
The Interface ID returned was 00000556-0000-0010-8000-00aa006d2ea4.

User Action - Contact the application vendor for updated version of the application.

I love the user action! All I needed to do was contact my vendor... :-)

The interface ids provided the clue I needed to unravel the mystery. The "expected" interface id identifies MDAC's Recordset interface -- specifically version 2.1 of that interface. The "returned" interface corresponds to a later version of Recordset (version 2.5 which differs from version 2.1 by the inclusion of one additional entry at the end of the vtable -- method Save).

Indeed my component's interfaces expose many methods that pass Recordset as an output parameter. So were they suddenly returning a later version of Recordset -- with a different interface id? It certainly appeared to be the case. And then I thought: why should it matter? The vtable looks the same to clients of the older interface. Indeed, I suspect that if we were talking about in-process COM, and not DCOM, this apparently innocuous impedance mismatch would have been silently ignored and would have caused no issues.

Of course, when process and machine boundaries come into play, there is a proxy and a stub between the client and the server. In this case, I was using type library marshaling with the free threaded marshaller. So there were two mysteries to solve:
  1. Why was I returning a different interface in the output parameters from methods on my new server?

  2. Why did this affect only Vista clients?
As my server software was hosted on servers at each of my eight offices, I decided to try pointing my Vista client at all of them in sequence to see which had problems with Vista and which didn't. Illuminating test. Some of the older servers still worked with Vista but the newer ones did not. Although some of the older servers were still running Windows 2000 while the newer ones were at 2003, that did not seem to be the issue.

After comparing the dates of the component DLLs, it appeared that whenever the client pointed to servers with component DLLs dated before 2003, Vista was fine; servers with DLLs dated after 2003 were problematic. Believe it or not, there were no (or at least no significant) changes to the code of the server components in many years. Apparently the differing dates were simply due to recompiles of my components on my development machine(s). And it appeared that one of those recompiles happened in 2003.

The light bulb went on. When passing Recordsets back from server to client, my ATL C++ components refer to the interface as _Recordset. This symbol comes from the type library embedded within msado15.dll. This is the line I had in the C++ code:

#import "c:\Program Files\Common Files\System\ADO\msado15.dll" no_namespace rename ( "EOF", "adoEOF" )

Don't be deceived by the 15 in msado15.dll. Apparently this DLL has not changed name over the long series of MDAC versions.

When I compiled the application back in the day, the version of MDAC was 2.1. So _Recordset compiled with the 2.1 interface id and that is the interface returned by the servers running those components.

All the clients use the COM+ application proxy that was generated (I believe) back in 1999. The type library that defines my interfaces includes the line:

importlib("msado21.tlb");

which explains why they expect version 2.1 of Recordset in my method's output parameters. Clearly the problem was with my 2003 recompile and the fact that at that time the _Recordset symbol no longer corresponded to version 2.1. Indeed _Recordset corresponded to the 2.5 version with its distinct interface id. The solution for me was to change all references from _Recordset to Recordset21 in my C++ code. I rebuilt the components and deployed them to the new server. Voila -- the clients seemed happy again.

Two nagging questions remain for me.
  1. Why does the proxy/stub infrastructure seem to behave differently with Vista clients? It appears that Vista makes stricter checks of the interface ids coming back from method parameters than XP does.

  2. How should I have coded this differently back in 1999 so that this would not have happened? Interfaces are supposed to be immutable, and when I recompiled under a newer version of MDAC, I inadvertently changed my interface because the methods now returned a different Recordset interface as an output parameter. As far as I know, the type library back then did not have a version-specific symbol -- that is, later versions of the MDAC type libraries define Recordset21, but that symbol was not available back in the 2.1 type library.
In conclusion, this saga has reminded me how much I don't miss the world of COM. Having said that, the system has stood the test of time and I'm sure there are a lot of other DNA applications still out in the wild.

Saturday, October 23, 2004

Under Fire

I’d like to have the last two weeks of my life back. I intend to post more about this topic in the future when the dust settles, but I thought I’d mention that our Manhattan office building had an electrical fire on Saturday morning, October 9th, that destroyed the entire electrical infrastructure and rendered the building unusable. This displaced the 350 attorneys that work there as well as the entire New York-based support staff.

We activated our disaster recovery facility over the river in Queens (SunGuard) where we continually replicate mail, our document management system, and other miscellaneous resources. Mail and documents were available to our attorneys at home within a few hours.

In addition to space at SunGuard, the firm obtained additional space at the Drake Hotel, 630 5th Avenue, 30 Rockefeller Plaza, and 1133 6th Avenue. Through the efforts of many we have succeeded in keeping everyone connected and working. Much of the credit for this has to go to my boss who has an amazing grasp of both technology and logistics.

As I said, I intend to post a much more thorough analysis of all that was, and still is, to be learned. After all, we still have to move back!

Tuesday, September 28, 2004

Screen Scraping the Regular Way

Recently, I was demonstrating an RSS client for one of our partners with the idea that attorneys might be interested in being notified when updates are detected on certain websites pertinent to our legal practice. He seemed intrigued by the publish/subscribe metaphor being demonstrated and asked about using it to be notified when two particular pages of the SEC's website are updated. One is for proposed rules (http://www.sec.gov/rules/proposed.shtml) and one is for final rules (http://www.sec.gov/rules/final.shtml).

After poking around on the SEC's site, I determined - much to my consternation - that this info was not being published via RSS or any other syndication format, nor was the information available through any web service that I could find. So after exhausting the elegant possibilities, I resigned myself to the fact that I'd have to resort to screen scraping.

Reading the HTML pages is simple enough with the framework. The following snippet puts the HTML into a string:

String HTML;
HttpWebRequest Request = (HttpWebRequest)WebRequest.Create(url);

using (HttpWebResponse Response = (HttpWebResponse)Request.GetResponse())
using (Stream Stm = Response.GetResponseStream())
using (StreamReader Reader = new StreamReader(Stm)) {
    HTML = Reader.ReadToEnd();
}

The next step was to parse through the HTML page and find the data I was interested in, which was stored in an HTML table. My first thought was to use Microsoft's MSHTML COM object to read this into the HTML Document Object Model (DOM) and traverse the DOM until I found the data of interest. I had some difficulty determining how to load up the DOM (I've only used it from within DHTML scripting) and turned to Usenet for a quick answer. I fired up some Google queries that mentioned screen scraping - under the assumption that others might have taken this approach - and instead of information about the DOM, I came across posts about regular expressions.

This struck a resonant chord. I am relatively proficient with regular expressions, and had remembered that the framework provided good regular expression support, but I hadn't previously had the opportunity to make use of it. But this seemed like the right job for regular expressions.

The pattern I came up with to match my data table within the SEC's HTML page was:

<tr>\r\n<td.*>(\r\n)?<a href="(?<url>.*?)">(?<releaseno>.*)</a></td>(\r\n)?<td.*>(?<date>.*20[0-9][0-9])</td>(\r\n)?<td.*?><b.*?>(?<title>.*?)</b>

I had to brush up on the particulars of the framework's regex syntax, and it took a little trial and error, but this did the trick. The key is to "capture" the four columns of data and "label" them as url, releaseno, date, and title. These are known as regular expression groups. So when a row is matched, the framework will make these four pieces of data available because I have delineated them appropriately within the expression string. The steps are to:

1. Construct a regex instance passing this regular expression to the constructor
2. Call the matches method of this instance passing in the HTML string representing the page.
3. Loop over each match. Each match is a match of the entire expression - one table row in this instance.
4. For each match, iterate through the "groups". In my expression, I have four groups corresponding to url, releaseno, date, and title.
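Those four steps can be sketched in C# against a simplified stand-in for one of the SEC page's table rows (the pattern below is deliberately abbreviated for the sketch; only the full pattern shown earlier matches the actual pages):

```csharp
using System;
using System.Text.RegularExpressions;

class ScrapeDemo {
    static void Main() {
        // A simplified fragment standing in for one row of the SEC page's table.
        string html = "<td><a href=\"/rules/final/34-50295.htm\">34-50295</a></td>"
                    + "<td>Aug. 31, 2004</td>";

        // 1. Construct a Regex instance, passing the expression (with named
        //    groups) to the constructor.
        Regex rowPattern = new Regex(
            "<td><a href=\"(?<url>.*?)\">(?<releaseno>.*?)</a></td><td>(?<date>.*?)</td>");

        // 2. Call Matches, passing in the HTML string.
        // 3. Loop over each match; each match is one table row.
        foreach (Match row in rowPattern.Matches(html)) {
            // 4. Pull each captured group out of the match by name.
            Console.WriteLine(row.Groups["url"].Value);       // /rules/final/34-50295.htm
            Console.WriteLine(row.Groups["releaseno"].Value); // 34-50295
            Console.WriteLine(row.Groups["date"].Value);      // Aug. 31, 2004
        }
    }
}
```

The named groups are what make the later, more generic version of the scraper possible: the data column names live inside the pattern itself.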

My goal was to parse the data into something more structured - XML. So I wrote code that created the following XML from the page using the technique outlined above:

<row>
<url>/rules/final/34-50295.htm</url>
<releaseno>34-50295</releaseno>
<date>Aug. 31, 2004</date>
<title>Rule 15c3-3 Reserve Requirements for Margin Related to Security Futures Products</title>
</row>

So far so good. But it irked me that this solution was so tied to these two particular web pages. I wanted to make the code a bit more generic. After all, other web pages may have embedded tabular data that lends itself to this type of scraping. Ideally, it would be nice to specify only the regular expression that finds the data. Since the regular expression already contains the "code" to parse out and name the data columns, why not parse the regular expression string itself for those column names?

How to parse the regular expression string passed to the method? With a regular expression, of course:

\(\?<(\w+)>.*?\)

This expression allows me to parse the caller's regular expression for the data column names (which are the names of the regex groups described above). I do this once at the top of the method, storing the group names in a simple ArrayList. When I loop through the matches, I create an XmlElement for each of these names and set its value by referencing the group by name from within the Match.

foreach (string FieldName in FieldNames) {
    Field = Doc.CreateElement(FieldName);
    Field.InnerText = Entry.Groups[FieldName].Value;
    Row.AppendChild(Field);   // attach the field to the enclosing <row> element
}

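The same meta-parse is easy to sketch in Python. (Python's re also exposes the group names directly via Pattern.groupindex, which would make the meta-regex unnecessary there, but the string-parsing trick works in any flavor.)

```python
import re

# A hypothetical caller-supplied pattern with four named groups.
user_pattern = (r'<a href="(?P<url>.*?)">(?P<releaseno>.*?)</a>'
                r'(?P<date>.*?)(?P<title>.*?)')

# Meta-regex: scan the pattern string itself for group names, mirroring
# the post's \(\?<(\w+)>.*?\) trick; the P? accepts Python's (?P<name>)
# spelling as well as .NET's (?<name>).
field_names = re.findall(r'\(\?P?<(\w+)>', user_pattern)
print(field_names)  # ['url', 'releaseno', 'date', 'title']
```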
Now that I've tamed the HTML and put it into well behaved XML, all that's left is to write an application to poll periodically, store the results, and compare them to those of the previous poll. I'll easily detect when and what updates have occurred and I'll fire off a message to interested parties. The framework's excellent regular expression support did all the heavy lifting.

Sunday, September 05, 2004

Parsimonious parsing and System.Configuration

Not long after I got into .NET development, I learned of the System.Configuration namespace. Initially, I simply used the <appSettings> section and the corresponding System.Configuration.ConfigurationSettings.AppSettings NameValueCollection. For small amounts of application configuration, this has great appeal.

Later explorations led me to the use of custom sections with the configuration file. If you include a <configSections> section within your configuration file, you can list your custom configuration sections with child <section> elements. Each <section> element indicates a name for the section and a string that identifies the class (and containing assembly) that can interpret the section. The actual custom section can then appear lower down in the configuration file and can include much more complex data since it’s simply XML and you provide the class to interpret it.
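A minimal sketch of the shape, with a hypothetical section name and handler type:

```xml
<configuration>
  <configSections>
    <!-- name: the custom section's element name below;
         type: "Namespace.Class, Assembly" of the class that interprets it -->
    <section name="checkerConfiguration"
             type="MyApp.CheckerSectionHandler, MyApp" />
  </configSections>

  <!-- The custom section itself: arbitrary XML that your handler understands -->
  <checkerConfiguration>
    <pollIntervalMinutes>30</pollIntervalMinutes>
  </checkerConfiguration>
</configuration>
```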

So what’s involved in providing the class to read the custom section? Not much. It has to implement IConfigurationSectionHandler and provide its sole member, Create. When your code calls System.Configuration.ConfigurationSettings.GetConfig("<your section name>"), the framework invokes your Create method to hydrate an object from the XML that is passed via the XmlNode section argument.

My early implementations of a Create method were crude. In a nutshell, the code I wrote traversed the XML DOM, interpreted the data therein, and stored it in a custom class that was the method’s return value. There is nothing wrong with doing it this way, but it is tedious and, as I later learned, unnecessary.

One of the fundamental principles I’ve learned over the years is that if you find yourself doing something tedious, it’s time to reexamine it and look for an easier (and usually more elegant) way. In the case of implementing IConfigurationSectionHandler::Create(), it is far easier to leverage the power of XML serialization than it is to parse the data yourself. I believe I first caught wind of this technique in an MSDN article, and if I can find it again, I’ll add it to this entry. The gist of the idea is that if your custom configuration data can be stored in a class that can be serialized to XML, then it can be deserialized from the XML fragment that is your custom configuration section.

Rather than create the class to hold my configuration information, I find it simpler to define the XML schema (xsd) file that describes it. The class definition can then be generated with

Xsd /classes config.xsd

Xsd generates config.cs for you, containing the class that can be deserialized from the XML section. Here is a snippet that shows how simple it is to implement the Create method using XML deserialization.

public object Create(object parent, object configContext, System.Xml.XmlNode section) {
    CheckerConfiguration Result;
    XmlSerializer Hydrator;
    XmlReader Reader;

    Hydrator = new XmlSerializer(typeof(CheckerConfiguration));
    Reader = new XmlNodeReader(section);
    try {
        Result = (CheckerConfiguration) Hydrator.Deserialize(Reader);
    }
    finally {
        Reader.Close();
    }
    return Result;
}


Digression

Although my blog is intended to be of a technical nature, I feel compelled to provide this glimpse into my life. I am a lifelong New York Giants football fan. Not a true fanatic -- just a run-of-the-mill fan. For those of you who don't know the NFL, the Giants play in the NFC East, and our traditional rivals are the Philadelphia Eagles, the Dallas Cowboys, and the Washington Redskins.

I met my wife Pilar twenty years ago in Washington DC. And unlike many American wives, she watches football with me. She embraces football. She LOVES football. One would think that any red-blooded American husband would look at this as a good thing. But sometimes it's hard to live with the "biggest Redskins fan known to man"!

I've already told my children that they shouldn't plan on college because Pilar is turning my basement into a Redskins sports bar -- one item at a time -- through EBay auctions. Each night, after I return from work, she enthusiastically relates to me each of the day's items that she "won"! I'm including a picture of this work in progress here.


Most weeks we go to a sports bar to watch the 'Skins because we are in NY and her games are typically not televised. So in we walk, me with my Giants jersey, and Pilar with her Redskins jersey, jacket, hat, earrings, rings, bracelets, key chain, beer cooler, etc. We watch the games together, and typically empathize with each other about our respective teams' fortunes, but when the 'Skins play the Giants -- normally twice a year -- we're talking about a whole different dynamic. That involves maintaining a safety buffer of at least four or five bar stools and a subsequent twenty-four hour cooling off period.

Anyway, another man might feel emasculated by the fact that his wife was turning his home and castle into a shrine to his team's nemesis. But not me. I have refused to allow her or my two beloved daughters to prevent me from wearing the pants in my family. As proof, I give you our family dog:


Man's Dog
If you want to look thinner, always photograph yourself with a large marine mammal.

Shamu and me at Tech-Ed San Diego

Saturday, September 04, 2004

Exploring SQL Server 2000 Reporting Services

In creating my new Conflict Report I had anticipated the need to produce multiple formats. SQL 2000 Reporting Services -- which is used to create the report -- allows for multiple output formats. The initial report was designed to look very much like the one in use now at my firm. That is, it is designed for the printed page, and the PDF output format fits the bill. Since the report submission process is now streamlined to the point where requests can come in via emails initiated on Blackberries, the next logical step is to produce a version of the Conflict report suitable for display on a Blackberry. My first inclination was to simply create another report project where the output was flattened into a simple

Label: Value

format. And I certainly can do that. I would also render this in PDF via Reporting Services, and the Blackberry user could then view the PDF (or, alternately, I might render it as an Excel file, since those render reasonably well on Blackberries). However, it occurred to me that it might be easier and more elegant to leverage Reporting Services' ability to render my existing report into XML. An XSL transform could then be used to produce a format suitable for the Blackberry...
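To see what such a flattening would produce, here is a sketch in Python rather than XSLT (the element names are hypothetical stand-ins for whatever the real report's XML rendering contains):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of the report rendered to XML by Reporting
# Services; the real element names come from the report definition.
report_xml = """
<report>
  <row><Client>Acme Corp</Client><Matter>Merger Review</Matter></row>
  <row><Client>Globex</Client><Matter>Patent Dispute</Matter></row>
</report>
"""

# Flatten each row into simple "Label: Value" lines for a small screen.
lines = []
for row in ET.fromstring(report_xml).findall('row'):
    for field in row:
        lines.append(f'{field.tag}: {field.text}')
    lines.append('')  # blank line between rows

print('\n'.join(lines))
```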

Impersonating someone who understands ASP.NET Web Service Security

Wednesday, I was briefly working on a web service to deal with a rare problem with KDocs document returns. I decided to write a utility, potentially available to domain users, that would correct the situation after the fact. It would connect to the appropriate database server, confirm that the symptoms matched those known for this particular problem, and then correct it. That requires that the web service connect to the database as a semi-privileged SQL user. The easiest way would be to use SQL authentication. However, I didn't think of that right away, and had decided to rely on an Integrated Security connection.

When the web service runs, it does so under the local ASPNET account, and I got errors connecting with that identity.

The solution was:
<identity impersonate="true" userName="domain\priveduser" password="secret" />

The <identity> tag, a child of <system.web>, does not require the userName and password attributes. They are only required if you want to impersonate someone other than the caller.

Which brings me to my next lesson. The next morning, while riding the train, I worked on the very same project on my laptop. Since only domain users should ever be able to invoke this web service, I decided to turn off Anonymous Access and leave only Windows Authentication enabled. The next time I ran the service (from my WinForms test application), I got an HTTP 404 - Access Denied error, and I struggled with it for the entire ride. After all, I am in my laptop's local Administrators group. So what was it? At first I thought it was mis-set ACLs somewhere in the c:\inetpub tree. And indeed I did find entries for laptop\Pinsley that seemed to allow no access. But the problem persisted even after I corrected what I now perceive to have been a red herring.

As I pulled into Grand Central, I came upon the solution. You have to explicitly set the network credentials on the web service proxy. The magic lines were:

FixTool = new localhost.Fixer();
FixTool.Credentials = System.Net.CredentialCache.DefaultCredentials;

That did the trick. So if you want to disable Anonymous access to your web service, remember to set your credentials! Live and learn...

Welcome to my Blog

As I delve into the world of blogging, I envision it, at least initially, as a journal primarily for my own consumption. I intend to use it to record the projects and tasks that occupy my day. I hope to draw upon it later to learn from my past. To start things off, I'll note that yesterday I found that code I had thought was properly installed on a Lotus Notes server was behaving erratically: sometimes it would run correctly, and other times it would run as if the recent changes I had made weren't installed. It turned out that the process invoking the COM proxy was the Notes Agent Manager -- and there were four such processes allocated by Notes.

So the steps I had to take were:
Copy the modified assembly to the target folder.

gacutil -u KDConflictProxy
gacutil -i KDConflictProxy.dll
regasm /tlb:KDConflictProxy.tlb KDConflictProxy.dll

then...

tell amgr quit
load amgr

One of the agent processes had held open an older version of the component; it was invoked seemingly randomly and the resulting intermittent behavior was driving me crazy.