Friday, April 3, 2009

The unreasonable effectiveness of data


Google recently published an artical on explaining the unreasonable effectiveness of data they have observed in various application of maching learning to Web data: simple n-gram models or linear classifiers based on millions of specific features perform better than elaborate models that tries to discover general rules.

So why generative rules fails?
"A small number of rules simply can not capture the complexity of the variety of vocabulary words and grammar constructions."
Why simple statistics works?
" We know that the number of grammatical English sentences is theoretically infinite. However, in practice we humans care to make only a finite number of distinctions. For many tasks, once we have a billion or so examples, we essentially have a closed set that represents what we need without generative rules.  For many tasks, words and word combinations provide all the representational machinery we need to learn from text."  Plus, "statistics methods are natural scalable as most of data analysis can be performed in parallel. "
And their conclusion is

"Choose a representation that can use unsupervised learning with unlabeled data, which is so much more plentiful; represent a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of details. For natural language applications, trust that the human language has evolved words for the important concepts. See how far you can go by tying those words that are already there rather than inventing with clusters of words. "
 

Wednesday, March 18, 2009

Clustera: An Integrated Computation And Data Management System

This is an ongoing effort from Wisconsin database group on integrating various computing intensive and data intensive applications and database queries into a unified cluster computing system. It is nicely written, and the authors tried hard to convince us that (1) a general-purpose cluster computing system can compete with special computing task oriented framework such as MR; (2) many of database concepts and techniques such as data inter-dependency and partition-aware computation can be used to accelerate computation.

In details, they divide the current applications on clusters into three categories:
1. Computationally intensive tasks -independent tasks run on different nodes; example systems: Condor
2. Data intensive tasks ( addressed by MapReduce so far )
3. run SQL queries in parallel on structured datasets ( was in the research field of parallel database )

Their goal is to have a system to address all the above three type of applications. Different from the traditional "push" style cluster architecture ( Job scheduler matches the available nodes and push the job to the daemon to run), their system called Clustera has the following features:
1. Deploy the scheduler on application server due to its scalibility and auto pooling service for db servers.
2. Nodes are clients communicates through SOAP over http; ping periodically to pull the job.
3. All job info are stored in RDBMS: users/files/jobs
4. Scheduling is essentially to find matching between jobs and nodes to max certain benefit func
5. User interacts with sys in terms of logical files, while inside of the systems, nodes and jobs deal with concrete files.
6. Abstract job scheduler translate abstract jobs with logical files into concretes jobs with concretes files, and submit them to the scheduler.

They mentioned that the performance gain compared to standard architecture is from Clustera's handling of inter-job dependencies ( enabling more in-memory piping of intermediate data files and finer co-execution granularity). Their experiment result suggests that intermediate data shuffle and transferring seems to be the bottleneck.

" In Map-Reduce, map and reduce process are almost independent with the problem size; the time spent in the shuffle phase( i.e., transferring files), however, increase linearly with the problem size. This shows that intermediate data transferring is a bottleneck in MR computation - with network bandwidth serving as the limiting factor"
And in all the cases, a partition-aware scheduler is able to avoid all data transfer costs, and thus achieves much better performance.

In terms of future direction, they think that
It's the early stage of the cloud computing revolution, in which large clusters of processors are exploited to performs various computing tasks. Traditionally, large-scale parallel database systems use a model of dedicated, single-use clusters very different from that in cloud computing . Clustera opens the door to integrated systems ( through a general-purpose cluster management system) that can run SQL queries on the same platform used to run other data intensive and compute-intensive applications.

Friday, March 6, 2009

Mapreduce Debate

It is amused to read David Dewitt and Micheal Stonebraker's attack on Google's Map-reduce framework, and even more fun to read all the comments followed.

It's a typical fight between academia and industry. I particularly agreed that Joe. M. Hellerstein's comment that you'd have to admit "you lose" under the test of the real market. Rather than spending energy on arguing who is more advanced, I think a more fruitful way is to think why it happens and how to improve it.

Also liked what one of the commenter said:

"I really liked Feigenbaum's approach to these sorts of things. His group was doing things in the 80's that are only today being widely understood and adopted. But instead of arguing about how primitive modern techniques were compared to his work, he always worked with the young upstarts who were exploring an area that was new to them but old to academia, and gently guided them in the right direction without judging or bragging."

That's sth academia needed.

“One Size Fits All”: An Idea Whose Time Has Come and Gone

I always enjoyed reading papers from Stonebraker, the database visionary. He can often locates the problem in historic context, and make it simple and clear. In his not-so-recent paper "One Size Fits All": An Idea Whose Time Has Come and Gone. He gives a brief overview how database architecture evolves to meet the ever-changing use senario, and aruges why "one size fits all" does not work any more, and special purpose DB engine, like stream DB, column-store DB, and etc,  will prevail.

Here is a brief summary what I've got from the paper.
  
1970 - RDBMS emerges, i.e. SYSTEM R
1980 - major DB vendors take "one size fit all" strategy to push RDBMS to the mainstream market
1990 - data warehouse: put multiple operational db into a dataware house for business intelligence
Use senario: different OLTP, often optimized for updates, warehouse often
*load the data from operational db periodically, and
*complex adhoc query, i.e. historical trend, correlation between diff op db data
Common data schema: fact and dimensional table,  star schema
Index: Prefer bitmap index( good when data has low cardinality or not frequently updated) over B-Tree

Entering 2000, special-purpose DB engine emerges.

*StreamDB, motivated by fast approaching data streams in monitoring applications
DB Model: in-bound processing for RDBMS ( process-after-store); outbound for StreamDB ( process before (optional) store )
Three reasons that the exiting DBMS can not deal with data streams
  1. RDBMS can not be optimized for in-bound process as triggers are incorporated to the existing design as an after-thought.
  2. lack of low-layer primitives like time-window
  3. RDBMS separate db process and application logic using C/S arch, while stream db need seamless integration between the two.
*Column-store DB ( for extremely large data warehouse )
Data are stored by column, not by row; optimized for "read-intensive" applications, while row-store db are good for write-intensive application.

*DB for Search Engine, represented by Google Bigtable
Use scenario: inbound stream data ( from crawlers) processing, and ad-hoc lookup on existing index; write operation append-only; read-operation sequential.
Requirement: fast response and high availability ( through replication and fast recovery)

*XML DB - still under onging debate whether it is needed.

Tuesday, February 10, 2009

PNUTS: Yahoo!’s Hosted Data Serving Platform

In this VLDB paper, folks in Yahoo described their distributed database -- PNUTS, the one with richer database features than Google BigTable, while still manages achieves low latency at a massive scale.

Here are several important design desicion that differentiate their system with others:

1. Consistency model: They choose sth in between two extremes: the general serializability guaranteed by RDBMS, and the eventual consistency delieved by many modern distributed database, such Amazon dynamo. The model, called per-record timeline consistency, guarantees that all replicas of a given record apply all updates to the record in the same order.

2. The use of pub/sub system for both replication updates between regions, and redo log of the database.

3. A flexible mapping of tablets to storage units to support failover and load balancing

It also interests me on some applications PNUTS targets for:
1. User database: frequent read and write; the user must seen his own changes, but it's fine that other users read stable data.
2. Social applications: lots of small updates. Ability of high write rates is key, but dessemination of the writes are not essential. A typical relationship table can be maintained with the primary key as (friend1, friend2).

Monday, January 12, 2009

23d CNNIC Chinese Internet Annual Report

Below are some notable trends from 23rd CNNIC Chinese Internet Annual Report:

* fast growth of internet users, in particular the users from chinese rural area: almost 0.3 billion internet users, 22.6% of the population, 41.9% annual growth
* IPv4 resource becomes a bottleneck of further growth
* fast growth on news and user created conent such as blogs
* Internet games are popular among kids, while music, instant message, and news are the major applications among university students.
* Research shows that, the more use of internet, the more trust users have on internet.





Tuesday, December 16, 2008

You’re Leaving a Digital Trail. What About Privacy?

NYTimes: You're Leaving a Digital Trail. What About Privacy?

: "Such a complete and constantly updated picture will undoubtedly redefine traditional notions of privacy."