Monday, November 14, 2005

Data warehousing - finally!

I finally helped some friends set up a data cube. It mostly involved creating a specialized view that had one numeric field (in our case - a count), a date field, and a number of categorical "dimension" fields. We used Seagate Info, now from Business Objects to actually work with the data. So no, I didn't write my own data cube explorer. The client couldn't afford that, and I didn't really want to do it.

What did I do? I took a conglomerated (non-normalized) survey table and built a set of relational tables for surveys, questions, answers, results (the actual answers), etc. Then I wrote an "importer" to smartly read the "scanned surveys" table and populate my relational data set with any "new" surveys, based on an "IsProcessed" flag, and whether that ScannedSurvey_ID existed in the relational table collection. This allows their OCR process to add records at any time, and a daily process runs under SQL Agent to import new surveys - a "hands off" approach. Oh yeah ... I didn't forget to wrap each record conversion in a transaction.

To give you an idea about the scanned surveys, the table looked like:

ID int identity(1,1) NOT NULL,
SurveyCode char(5) NULL,
A1 char(5) NULL,
A2 char(5) NULL,
A3 char(5) NULL,
A4 char(5) NULL,
A5 char(5) NULL,
A68 char(5) NULL

In my implementation, I pre-scanned for errors, then used a cursor, walking through surveys and questions (not survey responses) and importing all non-empty responses for each question. Ok, since I was trying for SOME speed, I wrapped the whole import in a transaction. I was worried about speed because I had to use the cursor and even dynamic SQL to import the data. I had to use Dynamic SQL so they could add questions without requiring me to come in and update the importer. When I ran into OCR errors, where someone's age = "YES", I set the IsProcessed flag to NULL. Occasionally, a data specialist could review the (rare) problems and erase invalid survey responses, or interpret the survey sheet using good old human intelligence.

While I was working on this project, I was reminded that temp tables participate in transactions, so when I ran into errors, I couldn't set IsProcessed = NULL within the transaction, and I couldn't use a temp table to hold Scanned Survey IDs. In SQL 2000, I could create a "Table Variable", but in SQL7 I had to declare a very large string, concatenate IDs into it, roll back, and then update where CHARINDEX( CONVERT(varchar, ID)+',', @MyBadIDs ) > 0. If I wasn't so lazy, I might have created a temp table and populated it using a while loop, then joined and used a normal update. Using a string to hold bad IDs meant that I could only hold about 1000 to 1500 IDs, but we rarely had more than 3 or 4 errors in a dataset.

By the way, I was surprised when I discovered that table variables don't participate in transactions. I was even more surprised when I found out that the production server was SQL 7.0! That's why I had to use the cheap CHARINDEX hack. Notice that I appended a delimiter to the ID, so an ID of 14 would not match 140 through 149, 1400 through 1499, etc.


Post a Comment

Subscribe to Post Comments [Atom]

Links to this post:

Create a Link

<< Home