Friday, April 3, 2009

The unreasonable effectiveness of data


Google recently published an article explaining the unreasonable effectiveness of data they have observed in various applications of machine learning to Web data: simple n-gram models or linear classifiers based on millions of specific features perform better than elaborate models that try to discover general rules.

So why do generative rules fail?
"A small number of rules simply can not capture the complexity of the variety of vocabulary words and grammar constructions."
Why do simple statistics work?
"We know that the number of grammatical English sentences is theoretically infinite. However, in practice we humans care to make only a finite number of distinctions. For many tasks, once we have a billion or so examples, we essentially have a closed set that represents what we need without generative rules. For many tasks, words and word combinations provide all the representational machinery we need to learn from text." Plus, "statistics methods are natural scalable as most of data analysis can be performed in parallel."
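To make the point about word combinations and parallelism concrete, here is a minimal sketch of a bigram count model. It is not taken from the article: the toy sentences and the function names (bigram_counts, next_word_prob) are made up for illustration, and a real system would use a vastly larger corpus and some form of smoothing. Each chunk of text is counted on its own and the counts are simply added together, which is why this kind of statistic is easy to compute in parallel.

```python
from collections import Counter


def bigram_counts(text):
    """Count adjacent word pairs in one chunk of text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def next_word_prob(counts, prev, word):
    """Relative frequency of `word` following `prev` (no smoothing)."""
    total = sum(c for (p, _), c in counts.items() if p == prev)
    return counts[(prev, word)] / total if total else 0.0


# Hypothetical toy "shards"; in practice each would be a large slice of a corpus.
shards = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

# Each shard can be counted independently (e.g. on separate machines),
# and the partial counts are merged by plain addition.
merged = sum((bigram_counts(s) for s in shards), Counter())

print(merged[("the", "cat")])
print(next_word_prob(merged, "the", "cat"))
```

The model here is nothing more than the table of counts itself: there are no rules to write, and adding more text only adds more rows.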
And their conclusion is:

"Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words."
 
