Monday, April 11, 2011

Generic version control strategy for select table data within a heavily normalized database

Hi

Sorry for the long-winded title, but the requirement/problem is rather specific.

With reference to the following sample (but very simplified) structure (in pseudo SQL), I hope to explain it a bit better.

TABLE StructureName {
  Id GUID PK,
  Name varchar(50) NOT NULL
}

TABLE Structure {
  Id GUID PK,
  ParentId GUID (FK to Structure),
  NameId GUID (FK to StructureName) NOT NULL
}

TABLE Something {
  Id GUID PK,
  RootStructureId GUID (FK to Structure) NOT NULL
}

As one can see, Structure is a simple tree structure (not worried about ordering of children for the problem). StructureName is a simplification of a translation system. Finally 'Something' is simply something referencing the tree's root structure.

This is just one of many tables that need to be versioned, but this one serves as a good example for most cases.

There is a requirement to version any changes to the name and/or the tree 'layout' of the Structure table. Previous versions should always be available.

There seem to be a few possibilities to tackle this issue, like copying the entire structure, but most approaches cause one to 'lose' referential integrity. For example, if one followed this approach, one would have to make a duplicate of the 'Something' record, given that the root structure will be a new record with a new ID.

Other avenues of possible solutions are looking into how wikis handle this, or going a lot further and looking at how proper version control systems work.

Currently, I feel a bit clueless about how to proceed on this in a generic way.

Any ideas will be greatly appreciated.

Thanks

leppie

From stackoverflow
  • The data warehousing folks have several algorithms for "slowly-changing dimensions".

    The more sophisticated algorithms provide date ranges around a dimension value to indicate when it's valid.

    Depending on your versioning requirements you could do one of these things, cribbed from Kimball's The Data Warehouse Toolkit.

    1. Assign a version number to rows of the structure table. This means you have to do some reasoning to collect a complete structure: it consists of the rows with the selected version number, unioned with rows that are unchanged from an earlier version.

    2. Assign a date range or version range to rows of the structure table. This means that some rows have start dates and end dates; some rows will have end dates at some epoch in the impossible future. Or, if you use version numbers, you'll have a start-end pair or a start-infinity pair that indicates this row is still current. You can then trivially query the rows that are valid "today" or apply to the requested version.

    3. Clone the structure for each version. This is unpleasant because the clone operation is costly. The queries, however, are trivial because the entire structure is available with a single, consistent version number.
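
To make option 2 more concrete, here is a minimal sketch using Python's sqlite3 module. The table name structure_version, its columns, and the helper tree_at are all assumptions made for illustration, not something from the answer: each row keeps a stable logical id and carries a half-open version range, with a NULL end_version meaning "still current".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE structure_version (      -- hypothetical table name
    id            TEXT NOT NULL,      -- stable logical id of the node
    parent_id     TEXT,               -- logical id of the parent, NULL for the root
    name          TEXT NOT NULL,
    start_version INTEGER NOT NULL,   -- first version in which the row is valid
    end_version   INTEGER,            -- first version in which it is no longer valid
    PRIMARY KEY (id, start_version)
);
""")

rows = [
    ("root", None, "Root", 1, None),
    ("a", "root", "Child A", 1, 2),        # renamed in version 2
    ("a", "root", "Child A v2", 2, None),
    ("b", "root", "Child B", 1, 3),        # moved under 'a' in version 3
    ("b", "a", "Child B", 3, None),
]
conn.executemany("INSERT INTO structure_version VALUES (?, ?, ?, ?, ?)", rows)


def tree_at(version):
    """Return (id, parent_id, name) for every row valid in the given version."""
    return conn.execute(
        """
        SELECT id, parent_id, name
        FROM structure_version
        WHERE start_version <= :v
          AND (end_version IS NULL OR end_version > :v)
        ORDER BY id
        """,
        {"v": version},
    ).fetchall()
```

Because the logical id of a node never changes in this sketch, a row in 'Something' could keep pointing at the same root id across versions. The trade-off is that id alone is no longer unique in this table, so a declarative foreign key to it would need some extra identity table.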

    leppie : I can't give both the correct answer, and you have enough :)
  • Some quick ideas:

    full copy: Create a copy of the structure, but for every table add a version_id column to the PK and all FKs, so that you can create copies of the live data with complete referential integrity. pro: easy to query the history. con: a large amount of redundant data is copied.

    change copy: Only copy the stuff that actually changes, along with valid_from / valid_to data. pro: low volume of copied data. con: hard to query, because one has to join on intervals.

    variation: this applies to both schemes. Instead of creating a copy of the structure, you might keep the current record in the same table as the old versions, but tag it as current. pro: fewer tables, easier mixing of history and current information. con: normal operation works on much bigger tables, which will hurt performance.
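
The "full copy" idea above could look roughly like the following sketch, again using Python's sqlite3 module. The lower-case table names, the simplified columns, and the clone_version helper are assumptions for illustration. Every primary key and foreign key includes version_id, so a cloned version only ever references rows of the same version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE structure (
    id         TEXT NOT NULL,
    version_id INTEGER NOT NULL,
    parent_id  TEXT,
    name       TEXT NOT NULL,
    PRIMARY KEY (id, version_id),
    FOREIGN KEY (parent_id, version_id) REFERENCES structure (id, version_id)
);
CREATE TABLE something (
    id                TEXT NOT NULL,
    version_id        INTEGER NOT NULL,
    root_structure_id TEXT NOT NULL,
    PRIMARY KEY (id, version_id),
    FOREIGN KEY (root_structure_id, version_id) REFERENCES structure (id, version_id)
);
INSERT INTO structure VALUES ('root', 1, NULL, 'Root');
INSERT INTO structure VALUES ('a', 1, 'root', 'Child A');
INSERT INTO something VALUES ('s1', 1, 'root');
""")


def clone_version(src, dst):
    """Copy every row of version src into a new version dst (hypothetical helper)."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT INTO structure "
            "SELECT id, ?, parent_id, name FROM structure WHERE version_id = ?",
            (dst, src),
        )
        conn.execute(
            "INSERT INTO something "
            "SELECT id, ?, root_structure_id FROM something WHERE version_id = ?",
            (dst, src),
        )


# Create version 2 as a full copy, then change only the copy.
clone_version(1, 2)
with conn:
    conn.execute("UPDATE structure SET name = 'Renamed' WHERE id = 'a' AND version_id = 2")
```

Querying a historical version is then just a matter of filtering every table on the same version_id, at the cost of copying all rows for each new version.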

    auditing log: depending on your actual requirements it might be sufficient to just create an audit trail like this:

    id timestamp changed_table changed_column old_value new_value changed_by

    you might extend that to a full table structure: transaction, table_change, changed_column

    pro: generic, hence easy to implement for a large number of tables. con: if you need to reconstruct the state of a set of records at a given time, querying will become a nightmare.
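
A minimal sketch of such an audit trail, using Python's sqlite3 module, might look like this. The audit_log columns follow the list above, except for row_id, which is an assumption added here so that a change can be tied to a specific record. The structure_name table, the rename and name_at helpers, and the integer timestamps are likewise illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE structure_name (id TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE audit_log (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp      INTEGER NOT NULL,   -- logical clock, to keep the sketch deterministic
    changed_table  TEXT NOT NULL,
    row_id         TEXT NOT NULL,      -- assumption: not in the original column list
    changed_column TEXT NOT NULL,
    old_value      TEXT,
    new_value      TEXT,
    changed_by     TEXT NOT NULL
);
INSERT INTO structure_name VALUES ('n1', 'Root');
""")


def rename(name_id, new_name, ts, user):
    """Update the name and record the change in the audit log in one transaction."""
    with conn:
        (old,) = conn.execute(
            "SELECT name FROM structure_name WHERE id = ?", (name_id,)
        ).fetchone()
        conn.execute("UPDATE structure_name SET name = ? WHERE id = ?", (new_name, name_id))
        conn.execute(
            "INSERT INTO audit_log (timestamp, changed_table, row_id, changed_column,"
            " old_value, new_value, changed_by)"
            " VALUES (?, 'structure_name', ?, 'name', ?, ?, ?)",
            (ts, name_id, old, new_name, user),
        )


def name_at(name_id, ts):
    """Reconstruct the name as it was at logical time ts."""
    where = "changed_table = 'structure_name' AND row_id = ? AND changed_column = 'name'"
    # Latest change at or before ts: its new_value was in effect at ts.
    row = conn.execute(
        f"SELECT new_value FROM audit_log WHERE {where} AND timestamp <= ?"
        " ORDER BY timestamp DESC, id DESC LIMIT 1",
        (name_id, ts),
    ).fetchone()
    if row:
        return row[0]
    # No change yet at ts: the old_value of the first later change was in effect.
    row = conn.execute(
        f"SELECT old_value FROM audit_log WHERE {where} AND timestamp > ?"
        " ORDER BY timestamp ASC, id ASC LIMIT 1",
        (name_id, ts),
    ).fetchone()
    if row:
        return row[0]
    # Never changed: the current value applies.
    return conn.execute(
        "SELECT name FROM structure_name WHERE id = ?", (name_id,)
    ).fetchone()[0]


rename("n1", "Root v2", ts=10, user="leppie")
rename("n1", "Root v3", ts=20, user="leppie")
```

Even for a single column of a single row, the reconstruction needs three lookups, which hints at why rebuilding a whole tree at a point in time from such a log gets painful.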

    I wrote a blog post about various approaches to versioning; be warned: it's in German.
