Friday, March 6, 2009

“One Size Fits All”: An Idea Whose Time Has Come and Gone

I have always enjoyed reading papers from Stonebraker, the database visionary. He often places a problem in its historical context and makes it simple and clear. In his not-so-recent paper, "One Size Fits All": An Idea Whose Time Has Come and Gone, he gives a brief overview of how database architecture has evolved to meet ever-changing use scenarios, and argues why "one size fits all" no longer works and why special-purpose DB engines, such as stream DBs and column-store DBs, will prevail.

Here is a brief summary of what I took from the paper.
  
1970 - RDBMS emerges, e.g. System R
1980 - major DB vendors take a "one size fits all" strategy to push RDBMS into the mainstream market
1990 - data warehouse: consolidate multiple operational DBs into one data warehouse for business intelligence
Use scenario: differs from OLTP. Operational DBs are usually optimized for updates, while a warehouse typically
*loads data from the operational DBs periodically, and
*serves complex ad hoc queries, e.g. historical trends or correlations across data from different operational DBs
Common data schema: fact and dimension tables, star schema
Index: prefers bitmap indexes (good when the data has low cardinality or is not frequently updated) over B-trees
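
To make the bitmap index point concrete, here is a minimal sketch in Python. It is my own illustration, not code from the paper; the helper names (build_bitmap_index, rows_of) and the sample columns are made up, and a real engine would use compressed bitmaps rather than Python integers.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)

def rows_of(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Two low-cardinality columns of a hypothetical fact table.
region = ["east", "west", "east", "north", "west", "east"]
status = ["open", "open", "closed", "open", "closed", "open"]

region_idx = build_bitmap_index(region)
status_idx = build_bitmap_index(status)

# WHERE region = 'east' AND status = 'open' becomes a single bitwise AND.
east_open = rows_of(region_idx["east"] & status_idx["open"])
```

With few distinct values there are only a few bitmaps to keep, and predicates combine with cheap bitwise operations. The flip side is that changing one row means touching bitmaps, which is why this suits a periodically loaded warehouse better than an update-heavy operational DB.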

Entering the 2000s, special-purpose DB engines emerge.

*StreamDB, motivated by the fast-arriving data streams of monitoring applications
DB model: outbound processing for RDBMS (process after store); inbound processing for StreamDB (process before an optional store)
Three reasons the existing DBMSs cannot deal with data streams:
  1. An RDBMS cannot be optimized for inbound processing, because triggers were added to the existing design as an afterthought.
  2. It lacks low-level primitives such as time windows.
  3. An RDBMS separates DB processing from application logic with a client/server architecture, while a stream DB needs seamless integration between the two.
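
As a rough illustration of the "process before store" model and of a time-window primitive, here is a small Python sketch. It is my own toy example, not how any particular stream engine is implemented; the class name SlidingWindowAverage is hypothetical, and it assumes timestamps arrive in non-decreasing order.

```python
from collections import deque

class SlidingWindowAverage:
    """Running average over the events of the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def push(self, timestamp, value):
        # Process the event on arrival; nothing is written to storage first.
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have fallen out of the time window.
        while self.events[0][0] <= timestamp - self.window_seconds:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)

window = SlidingWindowAverage(window_seconds=10)
readings = [(0, 10.0), (5, 20.0), (12, 30.0), (30, 40.0)]
averages = [window.push(ts, value) for ts, value in readings]
```

The application logic (the averaging) and the data handling live in the same process here, which is the kind of integration that the client/server split of an RDBMS makes awkward.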
*Column-store DB (for extremely large data warehouses)
Data is stored by column, not by row; this is optimized for read-intensive applications, while row-store DBs are good for write-intensive applications.
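
A minimal Python sketch of the two layouts, assuming a made-up three-row orders table (the table, its column names, and the to_columns helper are my own, for illustration only):

```python
# Row store: each record is kept together, so adding a record is one append.
rows = [
    {"order_id": 1, "customer": "ann", "amount": 30},
    {"order_id": 2, "customer": "bob", "amount": 45},
    {"order_id": 3, "customer": "ann", "amount": 25},
]

def to_columns(rows):
    """Pivot a list of records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

# Column store: each attribute is kept together.
columns = to_columns(rows)

# SELECT SUM(amount): the row layout walks every whole record,
# while the column layout reads only the one list it needs.
total_row_store = sum(row["amount"] for row in rows)
total_col_store = sum(columns["amount"])
```

Both totals are the same; the difference is how much data has to be read to get there. Conversely, inserting one order into the column layout means appending to every column list, which hints at why row stores suit write-intensive workloads.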

*DB for search engines, represented by Google Bigtable
Use scenario: processing inbound stream data (from crawlers) and ad hoc lookups on the existing index; writes are append-only; reads are sequential.
Requirement: fast response and high availability (through replication and fast recovery)

*XML DB - whether it is needed is still under debate.
