Skip to main content

Data warehousing - finally!

I finally helped some friends set up a data cube. It mostly involved creating a specialized view that had one numeric field (in our case - a count), a date field, and a number of categorical "dimension" fields. We used Seagate Info, now from Business Objects to actually work with the data. So no, I didn't write my own data cube explorer. The client couldn't afford that, and I didn't really want to do it.

What did I do? I took a conglomerated (non-normalized) survey table and built a set of relational tables for surveys, questions, answers, results (the actual answers), etc. Then I wrote an "importer" to smartly read the "scanned surveys" table and populate my relational data set with any "new" surveys, based on an "IsProcessed" flag, and whether that ScannedSurvey_ID existed in the relational table collection. This allows their OCR process to add records at any time, and a daily process runs under SQL Agent to import new surveys - a "hands off" approach. Oh yeah ... I didn't forget to wrap each record conversion in a transaction.

To give you an idea about the scanned surveys, the table looked like:

CREATE TABLE (
ID int identity(1,1) NOT NULL,
SurveyCode char(5) NULL,
A1 char(5) NULL,
A2 char(5) NULL,
A3 char(5) NULL,
A4 char(5) NULL,
A5 char(5) NULL,
...
A68 char(5) NULL
)

In my implementation, I pre-scanned for errors, then used a cursor, walking through surveys and questions (not survey responses) and importing all non-empty responses for each question. Ok, since I was trying for SOME speed, I wrapped the whole import in a transaction. I was worried about speed because I had to use the cursor and even dynamic SQL to import the data. I had to use Dynamic SQL so they could add questions without requiring me to come in and update the importer. When I ran into OCR errors, where someone's age = "YES", I set the IsProcessed flag to NULL. Occasionally, a data specialist could review the (rare) problems and erase invalid survey responses, or interpret the survey sheet using good old human intelligence.

While I was working on this project, I was reminded that temp tables participate in transactions, so when I ran into errors, I couldn't set IsProcessed = NULL within the transaction, and I couldn't use a temp table to hold Scanned Survey IDs. In SQL 2000, I could create a "Table Variable", but in SQL7 I had to declare a very large string, concatenate IDs into it, roll back, and then update where CHARINDEX( CONVERT(varchar, ID)+',', @MyBadIDs ) > 0. If I wasn't so lazy, I might have created a temp table and populated it using a while loop, then joined and used a normal update. Using a string to hold bad IDs meant that I could only hold about 1000 to 1500 IDs, but we rarely had more than 3 or 4 errors in a dataset.

By the way, I was surprised when I discovered that table variables don't participate in transactions. I was even more surprised when I found out that the production server was SQL 7.0! That's why I had to use the cheap CHARINDEX hack. Notice that I appended a delimiter to the ID, so an ID of 14 would not match 140 through 149, 1400 through 1499, etc.

Comments

Popular posts from this blog

Updated ActiveRecord Code Generator

Today, I updated the ActiveRecord Code Generator a bit. I checked in changes to use primary and foreign key details from INFORMATION_SCHEMA. The original code used naming conventions to decide what various fields were used for - ID = Primary Key, Field_ID = Foreign Key to table Fields. If you want to use naming conventions, let me know and I can add a setting in App.Config to allow this (along with any "real" key constraints).

How does Rails scaffolding select HTML input tags?

Recently, a reader saw my fix for SQL Server booleans, and asked me a followup question: why does Rails display a yes/no selection instead of a checkbox? The short answer is look in {RUBY_HOME} /lib/ruby/gems/1.8 /gems/actionpack-1.10.2 /lib/action_view/helpers, but your path may vary depending on whether you are using gem, "edge rails", etc. Anyway, look in the file "active_record_helper.rb" for a method called "all_input_tags", and notice that it calls "default_input_block" if you don't supply an input_block. Now notice that "default_input_block" creates a label and calls "input(record, column.name)" which in turn calls "InstanceTag#to_tag" which finally looks at the datatype and maps boolean to a select tag. Perhaps a wiser Rails explorer can provide us with the rationale for this, but I guess we could add a MixIn for InstanceTag that redefines the to_tag() method, or just do a dirty and unmaintainable hack l...

Features of the Code Generator

I just updated my code generator to optionally generate validation attributes. This simple change includes App.config file entries for all check boxes, and a new checkbox for "Validation" - aka validation attibute generation. While I was making this change, I realized that I really need to pass a CodeGenerationContext object to the DbTable, DbField and ModelGenerator classes. The requester can populate the context, and pass it to the code generator. Anyway, enough about the code, let's talk about the templates. I made a simple template this weekend to generate a DataGridView column array, suitable for databinding. I'm sure my new template will need some tweaks to handle Foreign Keys better (it currently just displays them as TextBox). Let's look at a template. ##FILENAME:PR_${table.GetClassName()}_Insert.sql ## ## Generate a SQL stored procedure to insert a record into the ## specified table and return the newly created primary key ## ## FUTURE: The generat...