Wednesday, March 18, 2009

Clustera: An Integrated Computation And Data Management System

This is an ongoing effort from Wisconsin database group on integrating various computing intensive and data intensive applications and database queries into a unified cluster computing system. It is nicely written, and the authors tried hard to convince us that (1) a general-purpose cluster computing system can compete with special computing task oriented framework such as MR; (2) many of database concepts and techniques such as data inter-dependency and partition-aware computation can be used to accelerate computation.

In details, they divide the current applications on clusters into three categories:
1. Computationally intensive tasks -independent tasks run on different nodes; example systems: Condor
2. Data intensive tasks ( addressed by MapReduce so far )
3. run SQL queries in parallel on structured datasets ( was in the research field of parallel database )

Their goal is to have a system to address all the above three type of applications. Different from the traditional "push" style cluster architecture ( Job scheduler matches the available nodes and push the job to the daemon to run), their system called Clustera has the following features:
1. Deploy the scheduler on application server due to its scalibility and auto pooling service for db servers.
2. Nodes are clients communicates through SOAP over http; ping periodically to pull the job.
3. All job info are stored in RDBMS: users/files/jobs
4. Scheduling is essentially to find matching between jobs and nodes to max certain benefit func
5. User interacts with sys in terms of logical files, while inside of the systems, nodes and jobs deal with concrete files.
6. Abstract job scheduler translate abstract jobs with logical files into concretes jobs with concretes files, and submit them to the scheduler.

They mentioned that the performance gain compared to standard architecture is from Clustera's handling of inter-job dependencies ( enabling more in-memory piping of intermediate data files and finer co-execution granularity). Their experiment result suggests that intermediate data shuffle and transferring seems to be the bottleneck.

" In Map-Reduce, map and reduce process are almost independent with the problem size; the time spent in the shuffle phase( i.e., transferring files), however, increase linearly with the problem size. This shows that intermediate data transferring is a bottleneck in MR computation - with network bandwidth serving as the limiting factor"
And in all the cases, a partition-aware scheduler is able to avoid all data transfer costs, and thus achieves much better performance.

In terms of future direction, they think that
It's the early stage of the cloud computing revolution, in which large clusters of processors are exploited to performs various computing tasks. Traditionally, large-scale parallel database systems use a model of dedicated, single-use clusters very different from that in cloud computing . Clustera opens the door to integrated systems ( through a general-purpose cluster management system) that can run SQL queries on the same platform used to run other data intensive and compute-intensive applications.

Friday, March 6, 2009

Mapreduce Debate

It is amused to read David Dewitt and Micheal Stonebraker's attack on Google's Map-reduce framework, and even more fun to read all the comments followed.

It's a typical fight between academia and industry. I particularly agreed that Joe. M. Hellerstein's comment that you'd have to admit "you lose" under the test of the real market. Rather than spending energy on arguing who is more advanced, I think a more fruitful way is to think why it happens and how to improve it.

Also liked what one of the commenter said:

"I really liked Feigenbaum's approach to these sorts of things. His group was doing things in the 80's that are only today being widely understood and adopted. But instead of arguing about how primitive modern techniques were compared to his work, he always worked with the young upstarts who were exploring an area that was new to them but old to academia, and gently guided them in the right direction without judging or bragging."

That's sth academia needed.

“One Size Fits All”: An Idea Whose Time Has Come and Gone

I always enjoyed reading papers from Stonebraker, the database visionary. He can often locates the problem in historic context, and make it simple and clear. In his not-so-recent paper "One Size Fits All": An Idea Whose Time Has Come and Gone. He gives a brief overview how database architecture evolves to meet the ever-changing use senario, and aruges why "one size fits all" does not work any more, and special purpose DB engine, like stream DB, column-store DB, and etc,  will prevail.

Here is a brief summary what I've got from the paper.
  
1970 - RDBMS emerges, i.e. SYSTEM R
1980 - major DB vendors take "one size fit all" strategy to push RDBMS to the mainstream market
1990 - data warehouse: put multiple operational db into a dataware house for business intelligence
Use senario: different OLTP, often optimized for updates, warehouse often
*load the data from operational db periodically, and
*complex adhoc query, i.e. historical trend, correlation between diff op db data
Common data schema: fact and dimensional table,  star schema
Index: Prefer bitmap index( good when data has low cardinality or not frequently updated) over B-Tree

Entering 2000, special-purpose DB engine emerges.

*StreamDB, motivated by fast approaching data streams in monitoring applications
DB Model: in-bound processing for RDBMS ( process-after-store); outbound for StreamDB ( process before (optional) store )
Three reasons that the exiting DBMS can not deal with data streams
  1. RDBMS can not be optimized for in-bound process as triggers are incorporated to the existing design as an after-thought.
  2. lack of low-layer primitives like time-window
  3. RDBMS separate db process and application logic using C/S arch, while stream db need seamless integration between the two.
*Column-store DB ( for extremely large data warehouse )
Data are stored by column, not by row; optimized for "read-intensive" applications, while row-store db are good for write-intensive application.

*DB for Search Engine, represented by Google Bigtable
Use scenario: inbound stream data ( from crawlers) processing, and ad-hoc lookup on existing index; write operation append-only; read-operation sequential.
Requirement: fast response and high availability ( through replication and fast recovery)

*XML DB - still under onging debate whether it is needed.