Category Archives: Biology

phylomoji with ggtree

If you search for the hashtag #phylomoji on Twitter, you can find many creative phylogenetic trees built from emoji.

Now with ggtree, you can play #phylomoji in R.

Comparison of clusterProfiler and GSEA-P

Thanks to @mevers for raising this issue and for his efforts in benchmarking clusterProfiler.

He pointed out two issues:

  • outputs from gseGO and GSEA-P overlap poorly.
  • p-values from gseGO are generally smaller and show little variation.

A GSEA analysis takes two inputs: a ranked gene list and a collection of gene sets.

First of all, the gene set collections are very different. The GMT file used in his test is a tiny subset of GO CC, while clusterProfiler used the whole GO CC corpus.

For instance, with his gene list as input, clusterProfiler annotates 195 genes as ribosome, while GSEA-P, using that GMT file, annotates only 38 genes.

Because the gene set collections are so different, I don't believe the comparison can produce any meaningful results.

The first step should be to extend clusterProfiler to support a GMT file as gene set annotation. We can then use identical input (both gene list and gene sets), and benchmarking becomes valuable for detecting issues that are attributable exclusively to the implementation of the GSEA algorithm.
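Using a GMT file as annotation mostly comes down to parsing it into gene sets. Here is a minimal base R sketch, assuming the usual GMT layout (set name, description, then gene IDs, all tab-separated). The function name read_gmt_lines and the toy gene sets are made up for illustration; this is not clusterProfiler code.

```r
# Sketch: parse GMT-formatted lines into a named list of gene sets (base R only).
# Assumed layout per line: set name <TAB> description <TAB> gene1 <TAB> gene2 ...
read_gmt_lines <- function(lines) {
  lines <- lines[nzchar(lines)]                  # skip empty lines
  fields <- strsplit(lines, "\t", fixed = TRUE)
  # drop the name and description columns, keep unique gene IDs
  sets <- lapply(fields, function(x) unique(x[-(1:2)]))
  names(sets) <- vapply(fields, `[`, character(1), 1)
  sets
}

# Hypothetical toy collection; gene symbols are for illustration only
gmt_text <- c(
  "RIBOSOME\tdemo description\tRPL3\tRPL4\tRPS2",
  "NUCLEUS\tdemo description\tTP53\tRPL3"
)
gene_sets <- read_gmt_lines(gmt_text)

# Long format: one row per (term, gene) membership
term2gene <- data.frame(
  term = rep(names(gene_sets), lengths(gene_sets)),
  gene = unlist(gene_sets, use.names = FALSE),
  stringsAsFactors = FALSE
)

# How many genes of an input list does a given set annotate?
my_genes <- c("RPL3", "RPS2", "TP53")            # hypothetical input
n_ribosome <- sum(my_genes %in% gene_sets[["RIBOSOME"]])
```

For a real file, readLines("path/to/file.gmt") would supply the lines argument. Counting annotated genes this way on both sides is a quick check that the two tools really see the same gene sets before comparing p-values.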

use simplify to remove redundancy of enriched GO terms

To simplify enriched GO results, we can use a slim version of GO and analyze it with the enricher function.

Another strategy is to use GOSemSim to calculate the similarity of GO terms and remove highly similar terms, keeping one representative term. To make this feature available to clusterProfiler users, I developed a simplify method to remove redundant GO terms from the output of the enrichGO function.

library(clusterProfiler)
data(geneList, package="DOSE")
# genes whose absolute value in geneList exceeds 2
de <- names(geneList)[abs(geneList) > 2]
# GO Biological Process enrichment
bp <- enrichGO(de, ont="BP")
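
The idea of keeping one representative among highly similar terms can be sketched in base R. This is only an illustration of the strategy, not the code behind simplify: the function name remove_redundant, the 0.7 cutoff, and the toy similarity matrix are all assumptions. In a real analysis the similarities would come from GOSemSim.

```r
# Sketch of similarity-based redundancy removal (not clusterProfiler's code).
# sim:  symmetric term-by-term similarity matrix with row/column names
# pval: named numeric vector of p-values for the same terms
remove_redundant <- function(sim, pval, cutoff = 0.7) {
  ordered <- names(sort(pval))        # most significant term first
  kept <- character(0)
  for (term in ordered) {
    # keep a term only if it is not too similar to any term already kept
    if (all(sim[term, kept] < cutoff)) {
      kept <- c(kept, term)
    }
  }
  kept
}

# Hypothetical toy input: termA and termB are near-duplicates
terms <- c("termA", "termB", "termC")
sim <- matrix(c(1.0, 0.9, 0.2,
                0.9, 1.0, 0.3,
                0.2, 0.3, 1.0),
              nrow = 3, dimnames = list(terms, terms))
pval <- c(termA = 0.001, termB = 0.0005, termC = 0.01)

representatives <- remove_redundant(sim, pval)
```

With this toy input, termB is kept as the representative of the termA/termB pair because it has the smaller p-value, and termC is kept because it is not similar to termB.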


[BioC 3.2] NEWS of my BioC packages

In the BioC 3.2 release, all my packages, including GOSemSim, clusterProfiler, DOSE, ReactomePA, and ChIPseeker, switched from Sweave to R Markdown for package vignettes.


For consistency between GOSemSim and clusterProfiler, 'worm' was deprecated in favor of 'celegans'. As usual, the information content data was updated.


Enrichment results may contain terms that are very general (less informative) and that we do not want to use. In this release, we provide the dropGO function, which can be used to drop selected GO terms or a specific level of GO terms. It can be applied to output from both enrichGO and compareCluster. This was a feature request from @ahorvath.

Another feature request was to visualize GO enrichment results with the GO topology. I implemented the plotGOgraph function by extending topGO to support output from both enrichGO and gseGO.

dotplot was another feature request. It was implemented in DOSE as a general function for visualizing enrichment results, and clusterProfiler imports it.

The merge_result function was implemented to merge enrichment results so that they can be visualized together for comparison. It was developed for comparing functional enrichment in the GTEx paper. An example comparing results from clusterProfiler and DAVID can be found on GitHub.

A section, 'Functional analysis of NGS data', was added to the vignette. The blog post illustrates using the enricher and GSEA functions to analyze user-defined annotation.


In the problem of finding a motif in DNA, I used sliding windows to find the motif. Of course, we could use KMP or Boyer-Moore to speed this up. But if we want to find many patterns in a text, for instance when searching a genome for a collection of known genes, we would prefer to parse the genome once instead of several times. In this case, we need a trie.

Given a collection of strings, their trie (often pronounced "try" to avoid ambiguity with the general term tree) is a rooted tree formed as follows. For every unique first symbol in the strings, an edge is formed connecting the root to a new vertex. This symbol is then used to label the edge.

We may then iterate the process by moving down one level as follows. Say that an edge connecting the root to a node v is labeled with 'A'; then we delete the first symbol from every string in the collection beginning with 'A' and then treat v as our root. We apply this process to all nodes that are adjacent to the root, and then we move down another level and continue.

As a result of this method of construction, the symbols along the edges of any path in the trie from the root to a leaf will spell out a unique string from the collection, as long as no string is a prefix of another in the collection (this would cause the first string to be encoded as a path terminating at an internal node).

Given: A list of at most 100 DNA strings of length at most 100 bp, none of which is a prefix of another.

Return: The adjacency list corresponding to the trie T for these patterns, in the following format. If T has n nodes, first label the root with 1 and then label the remaining nodes with the integers 2 through n in any order you like. Each edge of the adjacency list of T will be encoded by a triple containing the integer representing the edge's parent node, followed by the integer representing the edge's child node, and finally the symbol labeling the edge.
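
The construction described above can be sketched in base R. This is one possible approach rather than a reference solution: it inserts the strings one symbol at a time instead of literally deleting first symbols, which builds the same tree. The function name build_trie and the three test strings are chosen for illustration.

```r
# Sketch: build the trie of a set of strings and return its adjacency list.
# Nodes are numbered in order of creation; the root is node 1.
build_trie <- function(patterns) {
  child_of <- new.env(hash = TRUE)    # key "parent symbol" -> child node id
  parent <- integer(0)
  child <- integer(0)
  symbol <- character(0)
  n_nodes <- 1L
  for (p in patterns) {
    node <- 1L                         # every string starts at the root
    for (s in strsplit(p, "")[[1]]) {
      key <- paste(node, s)
      nxt <- child_of[[key]]           # NULL if this edge does not exist yet
      if (is.null(nxt)) {
        # create a new node and record the edge parent -> child labeled s
        n_nodes <- n_nodes + 1L
        nxt <- n_nodes
        child_of[[key]] <- nxt
        parent <- c(parent, node)
        child <- c(child, nxt)
        symbol <- c(symbol, s)
      }
      node <- nxt
    }
  }
  data.frame(parent, child, symbol, stringsAsFactors = FALSE)
}

edges <- build_trie(c("ATAGA", "ATC", "GAT"))
# one "parent child symbol" triple per line
cat(paste(edges$parent, edges$child, edges$symbol), sep = "\n")
```

For these three strings the trie has 10 nodes and 9 edges, since ATAGA and ATC share the prefix AT. Each string is parsed exactly once, so the running time is proportional to the total length of the input.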
