Ewan Birney's recent blog post, Five statistical things I wished I had been taught 20 years ago, is about how important statistics is to biology.

It opens with R.A. Fisher, arguing that biology is, at its core, statistics. Fisher worked on agricultural experiments, and the statistical methods he established all grew out of biological problems.


1. Non-parametric statistics.
These are statistical tests which make a bare minimum of assumptions about underlying distributions; in biology we are rarely confident that we know the underlying distribution, and hand waving about the central limit theorem can only get you so far. Wherever possible you should use a non-parametric test. This is Mann-Whitney (or Wilcoxon if you prefer) for testing "medians" (medians is in quotes because this is not quite true: they test something which is closely related to the median) of two distributions, Spearman's rho (rather than Pearson's r2) for correlation, and the Kruskal-Wallis test rather than ANOVA (though, if I have this right, with Kruskal-Wallis you can't do the more sophisticated nested models you can do with ANOVA). Finally, don't forget the rather wonderful Kolmogorov-Smirnov test (I always think it sounds like really good vodka) of whether two sets of observations come from the same distribution. All of these methods have a basic theme of doing things on the rank of items in a distribution, not the actual level. So, if in doubt, do things on the rank of the metric, rather than the metric itself.


For this year's phosphoproteomics paper we used immunohistochemistry; the experimental results were quantified as scores assigned by pathologists, giving two groups of data, tumor tissue and adjacent non-tumor tissue. Immunohistochemistry data like this cannot be analysed with parametric statistics, so I tested that result with the Wilcoxon signed rank test. As for the other non-parametric methods mentioned in the post, I don't know any of them. Embarrassing.
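As a minimal sketch (toy data only, not the scores from the paper), the tests Birney lists map onto base-R functions roughly like this:

```r
## Toy data: ordinal immunohistochemistry-style scores (0-3)
set.seed(1)
tumor    <- sample(0:3, 30, replace = TRUE, prob = c(0.1, 0.2, 0.3, 0.4))
adjacent <- sample(0:3, 30, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1))

## Wilcoxon signed rank test: paired tumor vs adjacent-tissue scores
## (ties in ordinal scores trigger warnings about approximate p-values)
wilcox.test(tumor, adjacent, paired = TRUE)

## Mann-Whitney / Wilcoxon rank sum test: two independent groups
wilcox.test(tumor, adjacent)

## Spearman's rho rather than Pearson's r
cor.test(tumor, adjacent, method = "spearman")

## Kruskal-Wallis test rather than a one-way ANOVA (three toy groups)
score <- c(tumor, adjacent, sample(0:3, 30, replace = TRUE))
group <- factor(rep(c("tumor", "adjacent", "other"), each = 30))
kruskal.test(score ~ group)

## Kolmogorov-Smirnov test: do two samples come from the same distribution?
ks.test(tumor, adjacent)
```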

A related paper on how non-normality affects Pearson's correlation coefficient: On the Effects of Non-Normality on the Distribution of the Sample Product-Moment Correlation Coefficient (Kowalski, 1975)

There is also a paper arguing that Kendall's tau is preferable to Spearman's rho:
Newson R. Parameters behind "nonparametric" statistics: Kendall's tau, Somers' D and median differences. Stata Journal 2002; 2(1):45-64.
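For what it's worth, switching between the two rank correlations in R is a one-word change (toy data):

```r
set.seed(42)
x <- rnorm(100)
y <- x + rnorm(100)

cor.test(x, y, method = "spearman")  # Spearman's rho
cor.test(x, y, method = "kendall")   # Kendall's tau
```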



2. R (or I guess S).
R is a cranky, odd statistical language/system with a great scientific plotting package. It's a package written mainly by statisticians for statisticians, and is rather unforgiving the first time you use it. It is definitely worth persevering. It's basically a combination of Excel spreadsheets on steroids (with no data entry; an R data frame is really the same logical set as an Excel workbook, able to handle millions of points, not thousands), a statistical methods compendium (it's usually the case that statistical methods are written first in R, and you can almost guarantee that there are no bugs in the major functions, unlike many other scenarios) and a graphical data exploration tool (in particular the lattice and ggplot packages). The syntax is inconsistent, the documentation sometimes wonderful, often awful, and the learning curve is like the face of the Eiger. But once you've met p.adjust(), xyplot() and apply(), you can never turn back.
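A minimal sketch of the three functions named above, on made-up data (lattice ships with R):

```r
library(lattice)

## p.adjust(): correct a vector of raw p-values for multiple testing
pvals <- c(0.001, 0.01, 0.04, 0.2, 0.5)
p.adjust(pvals, method = "BH")

## apply(): run a function over the rows (or columns) of a matrix
m <- matrix(rnorm(20), nrow = 4)
apply(m, 1, mean)   # row means

## xyplot(): lattice scatter plot, conditioned on a grouping factor
d <- data.frame(x = rnorm(60), g = rep(c("a", "b", "c"), each = 20))
d$y <- d$x + rnorm(60)
xyplot(y ~ x | g, data = d)
```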



As for plotting, the post mentions lattice and ggplot. lattice is probably the most elaborate graphics package in R at the moment; it is said to be far more powerful than ggplot and faster at rendering, but I have never used it. I only learned ggplot, because ggplot's grammar is more human friendly. In my view, once you have learned ggplot you will fall in love with making plots =,=
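A minimal ggplot2 sketch, just to show the flavour of the grammar (toy data):

```r
library(ggplot2)

d <- data.frame(x     = rnorm(200),
                group = rep(c("tumor", "adjacent"), each = 100))
d$y <- d$x + rnorm(200)

## One panel per group, points plus a fitted linear trend
ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ group)
```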

3. The problem of multiple testing, and how to handle it, either with the expected value, or FDR, and the backstop of many a piece of bioinformatics: large scale permutation.
Large scale permutation is sometimes frowned upon by more maths/distribution purists, but it is often the only way to get a sensible sense of whether something is likely "by chance" (whatever the latter phrase means; it's a very open question) given the complex, heterogeneous data we have. 10 years ago, perhaps, the lack of large scale compute resources meant this option was less open to people, but these days basically everyone should be working out how to appropriately permute the data to allow a good estimate of the "surprisingness" of an observation.
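A hedged sketch of the idea: shuffle the group labels many times and ask how often a difference at least as large as the observed one turns up by chance (simulated data, not Birney's code):

```r
set.seed(123)
a <- rnorm(50, mean = 0.3)   # group A measurements (toy)
b <- rnorm(50, mean = 0.0)   # group B measurements (toy)
obs <- mean(a) - mean(b)     # observed difference in means

pooled <- c(a, b)
n_perm <- 10000
perm_diffs <- replicate(n_perm, {
  idx <- sample(length(pooled), length(a))   # random relabelling
  mean(pooled[idx]) - mean(pooled[-idx])
})

## Two-sided permutation p-value: the "surprisingness" of the observation
mean(abs(perm_diffs) >= abs(obs))
```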

High-throughput omics data are becoming more and more common. A p-value controls the type I error rate for a single test, but omics data have many measured features, few replicates and plenty of noise; if you simply apply a per-test p-value cutoff, then the higher the throughput, the more type I errors (false positives) pile up uncontrolled. This is becoming increasingly important. When I taught a class this week I specifically covered the Bonferroni method, the Benjamini-Hochberg method and the q-value, though the students did not seem very interested. Perhaps one day, when they write a paper and a reviewer asks them for the FDR, they will remember.
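A minimal sketch of the first two corrections using base R's p.adjust(); q-values come from the Bioconductor qvalue package, which is assumed to be installed (toy p-values):

```r
pvals <- c(0.0001, 0.001, 0.008, 0.03, 0.04, 0.2, 0.6)

p.adjust(pvals, method = "bonferroni")  # controls the family-wise error rate
p.adjust(pvals, method = "BH")          # Benjamini-Hochberg, controls the FDR

## q-values (Storey) need the Bioconductor 'qvalue' package and are meant
## for genome-scale vectors of p-values:
# library(qvalue)
# qvalue(pvals)$qvalues
```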

4. The relationship between P value, effect size, and sample size
This needs to be drilled into everyone: we're far too trigger-happy quoting P values, when we should often be quoting P values and effect sizes. Once a P value is significant, its higher significance is sort of meaningless (or rather it compounds effect size things with sample size things, the latter often being about relative frequency). So, if something is significantly correlated/different, then you want to know how much of an effect this observation has. This is not just about GWAS-like statistics; in genomic biology we're all too happy quoting some small P value without realising that, with a million or so points, even very small deviations will be significant. Quote your r2, rhos or proportion of variance explained...
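A quick simulation of that point: with a million points, a negligible correlation still gets a vanishingly small P value (toy data):

```r
set.seed(7)
n <- 1e6
x <- rnorm(n)
y <- 0.005 * x + rnorm(n)   # true correlation is only about 0.005

ct <- cor.test(x, y)
ct$p.value      # tiny p-value despite a negligible effect
ct$estimate     # the effect size (r) is what actually matters
ct$estimate^2   # r^2, the proportion of variance explained (~0.0025%)
```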

I have never worked with GWAS and don't know how those statistics are computed, but from the description this is about power analysis, which is useful for experimental design because it lets you estimate the sample size. Of course, if the sample size is already fixed, then given the significance level and the power you can compute the effect size, that is, how large an effect the experiment can detect. Or, knowing the sample size, the effect size and the significance level, you can compute the power, that is, if the effect exists, how likely you are to detect it. This is not the same as a p-value, which is computed under the assumption that there is no effect (H0).

Power analysis is just these four quantities shuffled around: know any three, and you can solve for the fourth.
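Base R's power.t.test() implements exactly this logic for a t-test; leave out the quantity you want solved for (a minimal sketch, with the effect size expressed as delta in units of the standard deviation):

```r
## Given effect size, significance level and power, solve for n per group
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)

## Given n, significance level and power, solve for the detectable effect size
power.t.test(n = 30, sd = 1, sig.level = 0.05, power = 0.8)

## Given n, effect size and significance level, solve for the power
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)
```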


5. Linear models and PCA.
There is often a tendency to jump to quite complex models, networks or biologically inspired combinations, when our first instinct should be to crack out the well established lm() (linear model) for prediction and princomp() (PCA) for dimensionality reduction. These are old school techniques, and often if you want to talk about statistical fits one needs to make Gaussian assumptions about distributions, but most of the things we do could be done well with a linear model, and most of the correlations we look at could have been found with a PCA biplot. The fact that these are 1970s bits of statistics doesn't mean they don't work well.
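A minimal sketch of the two calls on R's built-in iris data, just to show the workflow:

```r
## Linear model for prediction
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
summary(fit)                        # coefficients, r^2, p-values
predict(fit, newdata = head(iris))

## PCA for dimensionality reduction
pc <- princomp(iris[, 1:4], cor = TRUE)  # cor = TRUE works on the correlation matrix
summary(pc)                         # proportion of variance per component
biplot(pc)                          # the PCA biplot mentioned above
```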

The following passage explains the difference between a linear model (least squares) and PCA:

One may also see PCA as an analogue of the least squares method to find a line that goes as "near" the points as possible (to simplify, let us assume there are just two dimensions). But while the least squares method is asymmetric (the two variables play different roles: they are not interchangeable, we try to predict one from the others, and we measure the distance parallel to one coordinate axis), PCA is symmetric (the distance is measured orthogonally to the line we are looking for).
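A small R sketch of that asymmetry, comparing the least-squares slope of y on x with the slope of the first principal axis (simulated two-dimensional data):

```r
set.seed(1)
x <- rnorm(200)
y <- x + rnorm(200, sd = 0.5)

## Least squares: minimises vertical distances, predicting y from x
coef(lm(y ~ x))[2]

## PCA: the first component minimises orthogonal distances to the line
pc <- prcomp(cbind(x, y))
pc$rotation["y", "PC1"] / pc$rotation["x", "PC1"]   # slope of the first principal axis
```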

John Mark wrote in the comments about what to learn next; I'm noting it down here as well.

The next level, number 6, would be to get beyond P values, and instead compute probability distributions of the quantities of interest. This leads naturally to number 7, which is to delve into the generative models that are currently solved by MCMC methods. This is basically the Bayesian approach. Just as an aside, "non-parametrics" in some newer work is also used to mean models where the number of parameters varies as a consequence of the method.
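As a hedged illustration of "computing the probability distribution of the quantity of interest", here is the simplest possible Bayesian example, a conjugate Beta posterior for a proportion; MCMC only becomes necessary when no such closed form exists (toy counts):

```r
## Toy data: 12 successes out of 40 trials, with a flat Beta(1, 1) prior
successes <- 12
trials    <- 40

## Conjugacy gives the posterior in closed form: Beta(1 + 12, 1 + 28)
post_a <- 1 + successes
post_b <- 1 + trials - successes

## Summaries of the whole posterior, rather than a single P value
qbeta(c(0.025, 0.5, 0.975), post_a, post_b)  # 95% credible interval and median

## Draws from the posterior (what an MCMC sampler would give you numerically)
draws <- rbeta(10000, post_a, post_b)
mean(draws > 0.5)   # posterior probability that the true rate exceeds 0.5
```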

Comments

  1. Great article, nice~~




    ygc Reply:

    It's really not much use; it's only there to protect my own pages, and in all kinds of RSS readers you can still just ctrl-c ➡


  2. These two articles from PLoS Computational Biology are also excellent and well worth recommending:

    A Quick Guide for Developing Effective Bioinformatics Programming Skills

    A Quick Guide to Teaching R Programming to Computational Biology Students


  3. http://ygc.name/2011/06/24/five-things-biologists-should-know-about-statistics/ I recommend this article with tears in my eyes; every biologist who works with large-scale data should read it. Those who keep saying their data are simple should realise that the data they churn out cannot stand even a single glance from a bioinformatics expert: it's a pile of false positives. Tonight I was traumatised by Bonferroni correction. Tears.

