Wednesday, June 6, 2018

Data Architecture - Document Model Design

Relational modeling has drawbacks to design implementation and to access speed. One differentiator is our common language around column based design patterns. Codd's normal forms still provide some guidance for us when it comes to document model design principals. A fundamental design guideline is ease of access is sometimes aided by denormalization, but more often aided by atomicity.

Below assume a document attribute is the same as a column and that a row is synonymous with a document.

NOTE: This is consistent across most engines except PostgreSQL, which has some designs that externalize highly read JSONB document elements into individual columns in the same row as the document for old school indexing purposes pre-GIN indexing.

1NF

Each column (or attribute) has no more than one value, each row (or document) is unique.
Can identify a unique candidate key.

2NF

Each column (or attribute) has no more than one value, each row (or document) is unique.
Can identify a unique candidate key.
All attributes or columns are dependent on the candidate key.

3NF

Each column (or attribute) has no more than one value, each row (or document) is unique.
Can identify a unique candidate key.
All attributes or columns are dependent on the candidate key.
No document or row creates a transitive association
  • A -> B -> C where A !-> C  
  • Driver -> Car -> Wheel Count where Driver has no Wheels. 
In columns this is easy. If one property should have its own unique id in the table, it moves to a new table. This is much more complicated in a document model. Below are different ways to model this.The first three create transitive associations. I can select driver and wheels where id = 1 and it will return the transitive value “Bob has four wheels.” The fourth creates a child document, but embeds it. 
  1. {id: “1”, [“Bob”, “car”, 4]}
  2. {id: “1”, “driver”:”Bob”, “vehicle”: “car”, “wheels”: “4”}
  3. {id: “1”, [“driver”:”Bob”, “vehicle”: “car”, “wheels”: “4”]}
  4. {id: “1”, “driver”:”Bob”, “car”:{id: “1”, "type“: "wheels”, "count": 4}}
  5. {id: “1”, “driver”:”Bob”, “car”:[{id: “1”, "type“: "wheels”, "count": 4}, {id: “2”, "type“: "hubs”, "count": 4},{id: “3”, "type“: "lug nuts”, "count": 16}]}

    Examples four and five are different forms of the same child document embedding. This form of normalization is faster to access and has less data access code to write. Well, it is faster as long as the sub-document relationship is bounded in its many-ness (somewhere between 1:1 and 1:5).

    Areas of Concern in Embedding


    When the set of possible children can balloon out the size of the document beyond reasonable access speed, it no longer represents a well defined single entity. It defines an entity relationship, one that requires multiple entities to allow for atomicity. Large heterogeneous documents are really poorly designed entity relationships hiding in an elephant costume.

    Many to many relationships also require externalization of child documents. 

    Data with different volatility requirements (write, update, delete operations) should be considered for externalized sub documnet storage.