I finally helped some friends set up a data cube. It mostly involved creating a specialized view that had one numeric field (in our case, a count), a date field, and a number of categorical "dimension" fields. We used Seagate Info, now from Business Objects, to actually work with the data. So no, I didn't write my own data cube explorer. The client couldn't afford that, and I didn't really want to do it.
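To make that concrete, the view was shaped roughly like the sketch below. Every object and column name in it (vwSurveyCube, Results, Questions, Region, and so on) is my own placeholder; only the shape - a single count, a date, and a handful of categorical dimensions - reflects what we actually built.

CREATE VIEW vwSurveyCube
AS
-- One numeric measure (a count), one date field, and categorical dimensions.
-- All table and column names here are placeholders, not the client's schema.
SELECT COUNT(*)       AS ResponseCount,  -- the single numeric field
       r.SurveyDate   AS SurveyDate,     -- the date field
       q.QuestionText AS Question,       -- dimension
       r.Answer       AS Answer,         -- dimension
       r.Region       AS Region          -- dimension
FROM Results r
JOIN Questions q ON q.Question_ID = r.Question_ID
GROUP BY r.SurveyDate, q.QuestionText, r.Answer, r.Region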
What did I do? I took a conglomerated (non-normalized) survey table and built a set of relational tables for surveys, questions, answers, results (the actual responses given), and so on. Then I wrote an "importer" that reads the "scanned surveys" table and pulls any "new" surveys into my relational data set, based on an "IsProcessed" flag and on whether that ScannedSurvey_ID already exists in the relational tables. This lets their OCR process add records at any time, while a daily job under SQL Agent imports the new surveys - a "hands-off" approach. Oh yeah ... I didn't forget to wrap each record conversion in a transaction.
To give you an idea about the scanned surveys, the table looked like:
CREATE TABLE ScannedSurveys (  -- table name assumed; the original listing omitted it
ID int IDENTITY(1,1) NOT NULL,
SurveyCode char(5) NULL,
IsProcessed bit NULL,          -- processing flag referenced in the text; column type assumed
A1 char(5) NULL,
A2 char(5) NULL,
A3 char(5) NULL,
A4 char(5) NULL,
A5 char(5) NULL,
-- ... A6 through A67, all char(5) NULL ...
A68 char(5) NULL
)
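With that table in place, the "new survey" check works out to something like the sketch below - assuming the relational side has a Surveys table carrying a ScannedSurvey_ID column, and that IsProcessed starts out at 0 (both of those details are my guesses here, not gospel).

-- Scanned rows that haven't been imported and aren't already in the
-- relational tables. Flag semantics assumed: 0 = new, 1 = imported,
-- NULL = set aside for manual review.
SELECT ss.ID, ss.SurveyCode
FROM ScannedSurveys ss
WHERE ss.IsProcessed = 0
  AND NOT EXISTS ( SELECT 1
                   FROM Surveys s
                   WHERE s.ScannedSurvey_ID = ss.ID )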
In my implementation, I pre-scanned for errors, then used a cursor, walking through surveys and questions (not individual survey responses) and importing all non-empty responses for each question. OK, since I was trying for SOME speed, I wrapped the whole import in a single transaction. I was worried about speed because I had to use the cursor and even dynamic SQL to import the data; the dynamic SQL was there so the client could add questions without requiring me to come back and update the importer. When I ran into OCR errors - say, someone's age coming back as "YES" - I set the IsProcessed flag to NULL. Occasionally, a data specialist could review the (rare) problems and either erase the invalid survey responses or interpret the original survey sheet using good old human intelligence.
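A stripped-down sketch of that cursor-plus-dynamic-SQL step follows. To keep it short it collapses the per-survey loop and walks only the question list, and the Questions table, its AnswerColumn field (mapping each question to its A1..A68 column), and the Results table are all placeholders of mine.

DECLARE @QuestionID int, @Col sysname, @sql varchar(2000)

-- Walk the question list; AnswerColumn maps each question to its Annn column.
DECLARE curQuestions CURSOR FOR
SELECT Question_ID, AnswerColumn FROM Questions

OPEN curQuestions
FETCH NEXT FROM curQuestions INTO @QuestionID, @Col

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Dynamic SQL: a new question (a new Annn column) needs no importer change.
    SET @sql = 'INSERT INTO Results (ScannedSurvey_ID, Question_ID, Answer) '
             + 'SELECT ss.ID, ' + CONVERT(varchar, @QuestionID) + ', ss.' + @Col + ' '
             + 'FROM ScannedSurveys ss '
             + 'WHERE ss.IsProcessed = 0 AND ss.' + @Col + ' IS NOT NULL '
             + 'AND LTRIM(ss.' + @Col + ') <> ''''' -- skip empty responses
    EXEC (@sql)

    FETCH NEXT FROM curQuestions INTO @QuestionID, @Col
END

CLOSE curQuestions
DEALLOCATE curQuestions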
While I was working on this project, I was reminded that temp tables participate in transactions: when I hit errors and had to roll back, any IsProcessed = NULL updates made inside the transaction would be undone, and a temp table holding the bad Scanned Survey IDs would lose its rows right along with everything else. In SQL 2000 I could have used a "table variable," but on SQL 7 I had to declare a very large string, concatenate the bad IDs into it, roll back, and then update where CHARINDEX( CONVERT(varchar, ID) + ',', @MyBadIDs ) > 0. If I weren't so lazy, I might have created a temp table after the rollback, populated it from the string with a WHILE loop, and then joined to it for a normal update. Using a string to hold bad IDs meant I could only hold about 1,000 to 1,500 of them, but we rarely had more than 3 or 4 errors in a dataset.
By the way, I was surprised when I discovered that table variables don't participate in transactions. I was even more surprised when I found out that the production server was SQL 7.0! That's why I had to use the cheap CHARINDEX hack. Notice that I appended a delimiter to the ID, so an ID of 14 would not match 140 through 149, 1400 through 1499, etc.
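Here's a rough sketch of that trick, using the names from above. One note on the delimiter: the sketch wraps the ID in commas on both sides rather than just appending one, which also keeps 14 from matching inside 114.

DECLARE @MyBadIDs varchar(8000)
SET @MyBadIDs = ','   -- leading delimiter so every ID ends up wrapped in commas

BEGIN TRAN

-- ... the import runs here; whenever a survey fails, remember its ID:
--     SET @MyBadIDs = @MyBadIDs + CONVERT(varchar, @BadID) + ','

-- Undo the import. The variable is untouched by the rollback.
ROLLBACK TRAN

-- Flag the offenders for manual review.
UPDATE ScannedSurveys
SET IsProcessed = NULL
WHERE CHARINDEX(',' + CONVERT(varchar, ID) + ',', @MyBadIDs) > 0

At six or seven characters per comma-separated ID, an 8,000-character string tops out right around the 1,000 to 1,500 IDs mentioned above.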