In details, they divide the current applications on clusters into three categories:
1. Computationally intensive tasks -independent tasks run on different nodes; example systems: Condor
2. Data intensive tasks ( addressed by MapReduce so far )
3. run SQL queries in parallel on structured datasets ( was in the research field of parallel database )
Their goal is to have a system to address all the above three type of applications. Different from the traditional "push" style cluster architecture ( Job scheduler matches the available nodes and push the job to the daemon to run), their system called Clustera has the following features:
1. Deploy the scheduler on application server due to its scalibility and auto pooling service for db servers.
2. Nodes are clients communicates through SOAP over http; ping periodically to pull the job.
3. All job info are stored in RDBMS: users/files/jobs
4. Scheduling is essentially to find matching between jobs and nodes to max certain benefit func
5. User interacts with sys in terms of logical files, while inside of the systems, nodes and jobs deal with concrete files.
6. Abstract job scheduler translate abstract jobs with logical files into concretes jobs with concretes files, and submit them to the scheduler.
They mentioned that the performance gain compared to standard architecture is from Clustera's handling of inter-job dependencies ( enabling more in-memory piping of intermediate data files and finer co-execution granularity). Their experiment result suggests that intermediate data shuffle and transferring seems to be the bottleneck.
And in all the cases, a partition-aware scheduler is able to avoid all data transfer costs, and thus achieves much better performance.
" In Map-Reduce, map and reduce process are almost independent with the problem size; the time spent in the shuffle phase( i.e., transferring files), however, increase linearly with the problem size. This shows that intermediate data transferring is a bottleneck in MR computation - with network bandwidth serving as the limiting factor"
In terms of future direction, they think that
It's the early stage of the cloud computing revolution, in which large clusters of processors are exploited to performs various computing tasks. Traditionally, large-scale parallel database systems use a model of dedicated, single-use clusters very different from that in cloud computing . Clustera opens the door to integrated systems ( through a general-purpose cluster management system) that can run SQL queries on the same platform used to run other data intensive and compute-intensive applications.
