bubble chart by using ggplot2

The visualization presented in Hans Rosling's TED talk was very impressive. FlowingData provides a tutorial on making a bubble chart in R. Here I try to create the same bubble chart using ggplot2.

With the dataset provided by FlowingData, the bubble chart was made with the following code.

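library(ggplot2)
crime <- read.csv("http://datasets.flowingdata.com/crimeRatesByState2008.csv", header=TRUE, sep="\t")
# murder and burglary rates give the x/y position, population the bubble size, state the label
p <- ggplot(crime, aes(murder, burglary, size=population, label=state))
# scale_size() maps population to bubble area; geom_text() draws the labels at a fixed size
p <- p + geom_point(colour="red") + scale_size(range=c(1,20)) + geom_text(size=3)
p + xlab("Murders per 1,000 population") + ylab("Burglaries per 1,000")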

Here is what it looks like.

The avalanche of publications mentioning GO

Gene Ontology (GO) is the de facto standard for annotation of gene products. It has been widely used in biological data mining, and I believe it will play a more central role in the future.

Publications mentioning GO are collected and deposited on the GO FTP site, where they can be accessed (ftp://ftp.geneontology.org/go/doc/).

I counted the number of publications by year and drew a histogram, which shows a remarkable growth trend.

> library(ggplot2)
> gopub <- read.delim("ftp://ftp.geneontology.org/go/doc/biblio-data.txt")
> dim(gopub)
[1] 3626   10
> p <- ggplot(gopub, aes(year))
> p + geom_histogram(aes(y=..count..)) + ggtitle("Publications mentioning GO")
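
The per-year counts behind such a histogram can also be tabulated directly. The following is a minimal sketch on a made-up stand-in data frame: `gopub_toy` and its values are hypothetical, and only the `year` column name is taken from the code above.

```r
# Hypothetical stand-in for the bibliography table; only the 'year'
# column matters here and the values are invented for illustration.
gopub_toy <- data.frame(year = c(2001, 2002, 2002, 2003, 2003, 2003))

# table() counts how many rows fall in each year.
counts <- table(gopub_toy$year)

# Convert to a plain data frame with one row per year.
counts_df <- as.data.frame(counts, stringsAsFactors = FALSE)
names(counts_df) <- c("year", "n")

print(counts_df)
```

On the real data, replacing `gopub_toy` with `gopub` should give the numbers that the histogram bars represent, assuming the file still has a `year` column.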

GOSemSim redesign in terms of S4 classes

I started developing the GOSemSim package two years ago, when I was not very familiar with R. I am very happy to see that people use it and find it helpful.

Over the past two weeks I have been learning S4 and redesigning GOSemSim with S4 classes and methods, and the very first version is now implemented. As I'm not very familiar with S4, the package may need improvement in many aspects.
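
For readers new to S4, here is a minimal sketch of a class with a validity check, a generic, and a method. The class name `SimParams`, its slots, and the `describe` generic are hypothetical and are not taken from GOSemSim's actual design.

```r
library(methods)

# Hypothetical class holding the parameters of a similarity computation.
setClass("SimParams",
  representation(ontology = "character", measure = "character"),
  prototype(ontology = "MF", measure = "Wang"),
  validity = function(object) {
    # Reject anything that is not a single GO ontology code.
    if (length(object@ontology) != 1 ||
        !object@ontology %in% c("MF", "BP", "CC"))
      return("ontology must be one of 'MF', 'BP', 'CC'")
    TRUE
  })

# A generic function and the method that is dispatched for SimParams objects.
setGeneric("describe", function(x, ...) standardGeneric("describe"))
setMethod("describe", "SimParams", function(x, ...) {
  paste("ontology:", x@ontology, "measure:", x@measure)
})

p <- new("SimParams", ontology = "BP")
print(describe(p))
```

Compared with passing loose arguments around, the validity function catches a bad ontology code when the object is created rather than deep inside a computation.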

The newest version of GOSemSim can be installed by:

install.packages("GOSemSim",repos="http://www.bioconductor.org/packages/devel/bioc",type="source")

Here are some examples:

upgrade R - F77 causes compilation error

I tried to compile R 2.12 from source on CentOS, but it threw an error when installing the *cluster* package.

* installing *source* package ‘cluster’ ...
** libs
gcc -std=gnu99 -I/usr/local/lib/R/include  -I/usr/local/include    -fpic  -g -O2 -c clara.c -o clara.o
g77   -fpic  -g -O2 -c daisy.f -o daisy.o
g77   -fpic  -g -O2 -c dysta.f -o dysta.o
gcc -std=gnu99 -I/usr/local/lib/R/include  -I/usr/local/include    -fpic  -g -O2 -c fanny.c -o fanny.o
gcc -std=gnu99 -I/usr/local/lib/R/include  -I/usr/local/include    -fpic  -g -O2 -c init.c -o init.o
g77   -fpic  -g -O2 -c meet.f -o meet.o
g77   -fpic  -g -O2 -c mona.f -o mona.o
gcc -std=gnu99 -I/usr/local/lib/R/include  -I/usr/local/include    -fpic  -g -O2 -c pam.c -o pam.o
gcc -std=gnu99 -I/usr/local/lib/R/include  -I/usr/local/include    -fpic  -g -O2 -c sildist.c -o sildist.o
gcc -std=gnu99 -I/usr/local/lib/R/include  -I/usr/local/include    -fpic  -g -O2 -c spannel.c -o spannel.o
g77   -fpic  -g -O2 -c twins.f -o twins.o
gcc -std=gnu99 -shared -L/usr/local/lib -o cluster.so clara.o daisy.o dysta.o fanny.o init.o meet.o mona.o pam.o sildist.o spannel.o twins.o -L/usr/lib/gcc/i386-redhat-linux/3.4.6 -lg2c -lm
installing to /usr/local/lib/R/library/cluster/libs
** R
** data
**  moving datasets to lazyload DB
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices ...
** testing if installed package can be loaded
Error in dyn.load(file, DLLpath = DLLpath, ...) : 
  unable to load shared object '/usr/local/lib/R/library/cluster/libs/cluster.so':
  /usr/local/lib/R/library/cluster/libs/cluster.so: undefined symbol: cl_daisy_
ERROR: loading failed
* removing ‘/usr/local/lib/R/library/cluster’
* restoring previous ‘/usr/local/lib/R/library/cluster’

The downloaded packages are in
        ‘/tmp/RtmpP1rf0B/downloaded_packages’
Updating HTML index of packages in '.Library'
Warning message:
In install.packages("cluster") :
  installation of package 'cluster' had non-zero exit status


Listing gene IDs from hyperGTest

hyperGTest computes hypergeometric p-values for over- or under-representation of each GO term in the specified category among the specified gene set.

*geneSample* is used as an example.

> geneSample
 [1] "3987"      "11167"     "8683"      "23576"     "80173"     "857"       "64960"     "3178"      "93099"     "100302736" "3916"      "8663"      "3383"      "445582"   
[15] "10564"     "5339"      "6732"      "4678"      "10989"     "55276"     "29127"     "10735"     "51449"     "55720"     "11100"     "2314"      "51204"     "11083"    
[29] "5694"      "6605"     

After using hyperGTest to test GO terms for over-representation, I got the result shown below:

> slotNames(hgOver)
[1] "goDag"         "pvalue.order"  "conditional"   "annotation"    "geneIds"       "testName"      "pvalueCutoff"  "testDirection"
> summary(hgOver)
      GOBPID       Pvalue OddsRatio  ExpCount Count Size                                       Term
1 GO:0044419 0.0002743002  10.32175 0.5988965     5  343 interspecies interaction between organisms
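
The ExpCount and Pvalue columns can be related to the hypergeometric functions in base R. The sketch below takes Count and Size from the summary row, but the universe size and the number of sample genes inside the universe are my assumptions (the post does not state them), picked only so that the expected count lands near the reported ExpCount. It also ignores any conditional variant of the test, so the p-value need not match the table exactly.

```r
# Assumed values (not stated in the post).
universe_size <- 14318  # assumption: number of genes in the test universe
sample_size   <- 25     # assumption: sample genes that fall inside the universe

# Values read from the summary row.
term_size <- 343        # "Size": universe genes annotated to the term
hits      <- 5          # "Count": sample genes annotated to the term

# Expected number of hits if the sample were drawn at random.
exp_count <- sample_size * term_size / universe_size

# Over-representation p-value: P(X >= hits) for a hypergeometric X.
p_over <- phyper(hits - 1, term_size, universe_size - term_size,
                 sample_size, lower.tail = FALSE)

print(c(ExpCount = exp_count, Pvalue = p_over))
```

Using `hits - 1` with `lower.tail = FALSE` gives the probability of seeing at least `hits` annotated genes, which is the usual one-sided over-representation test.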

I want to know which subset of the input genes is annotated to the significant GO term, since this is not reported.

This can be done by using the genome-wide annotation data (org.Hs.eg.db for human in this example) to map Entrez gene IDs to GO IDs.

Since the Gene Ontology is a directed acyclic graph, all genes that are annotated with a child GO term are also implicitly annotated with its parent terms. So org.Hs.egGO2ALLEGS, which includes these inherited annotations, is used for the mapping rather than org.Hs.egGO.

In the example above, we can get the corresponding gene set by:

> geneSample[geneSample %in% get("GO:0044419", org.Hs.egGO2ALLEGS)]
[1] "857"  "3178" "3383" "6732" "5694"
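
To illustrate why the mapping with inherited annotations matters, here is a toy sketch in base R. The term IDs (`T:root`, `T:mid`, ...), gene IDs, and the helper `ancestors` are all hypothetical. It only mimics the idea of propagating annotations up the graph and is not how the annotation package builds its tables.

```r
# Hypothetical toy ontology: each term lists its direct parents.
parents <- list(
  "T:root"  = character(0),
  "T:mid"   = "T:root",
  "T:leaf1" = "T:mid",
  "T:leaf2" = c("T:mid", "T:root")
)

# Hypothetical direct annotations: genes attached to exactly this term.
direct <- list(
  "T:root"  = character(0),
  "T:mid"   = "g1",
  "T:leaf1" = c("g2", "g3"),
  "T:leaf2" = "g4"
)

# All ancestors of a term, found by walking up the parent links.
ancestors <- function(term) {
  ps <- parents[[term]]
  unique(c(ps, unlist(lapply(ps, ancestors))))
}

# For each term, collect its own genes plus those of every descendant.
terms <- names(parents)
all_genes <- lapply(terms, function(term) {
  is_desc <- vapply(terms, function(t) term %in% ancestors(t), logical(1))
  sort(unique(unlist(direct[c(term, terms[is_desc])])))
})
names(all_genes) <- terms

# The same %in% filter as above, against direct and propagated annotations.
gene_sample <- c("g2", "g4", "g9")
print(gene_sample[gene_sample %in% direct[["T:mid"]]])
print(gene_sample[gene_sample %in% all_genes[["T:mid"]]])
```

With only the direct annotations, a sample gene attached to a child term would be missed when filtering on the parent term, which is the situation a significant high-level GO term usually creates.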

The gene set can be further mapped to other identifiers or annotation data with the biomaRt package.