In the earlier problem of finding a motif in DNA, I used a sliding window to locate the motif. KMP or Boyer-Moore would of course speed that up. But when we want to find many patterns in one text, for instance searching a genome for a collection of known genes, we would rather scan the genome once than once per pattern. For that we need a trie.

Given a collection of strings, their trie (often pronounced "try" to avoid ambiguity with the general term tree) is a rooted tree formed as follows. For every unique first symbol in the strings, an edge is formed connecting the root to a new vertex. This symbol is then used to label the edge.

We may then iterate the process by moving down one level as follows. Say that an edge connecting the root to a node v is labeled with 'A'; then we delete the first symbol from every string in the collection beginning with 'A' and then treat v as our root. We apply this process to all nodes that are adjacent to the root, and then we move down another level and continue.

As a result of this method of construction, the symbols along the edges of any path in the trie from the root to a leaf will spell out a unique string from the collection, as long as no string is a prefix of another in the collection (this would cause the first string to be encoded as a path terminating at an internal node).

Given: A list of at most 100 DNA strings of length at most 100 bp, none of which is a prefix of another.

Return: The adjacency list corresponding to the trie T for these patterns, in the following format. If T has n nodes, first label the root with 1 and then label the remaining nodes with the integers 2 through n in any order you like. Each edge of the adjacency list of T will be encoded by a triple containing the integer representing the edge's parent node, followed by the integer representing the edge's child node, and finally the symbol labeling the edge.
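The construction above can be sketched in Python. This is a minimal sketch, not a reference solution: the function name `build_trie` and the toy input are my own assumptions for illustration. It inserts the strings one at a time rather than level by level, which yields the same trie, and numbers nodes in order of creation with the root as 1.

```python
def build_trie(patterns):
    """Return the trie edges as (parent, child, symbol) triples; the root is node 1."""
    children = {1: {}}  # node id -> {edge symbol: child node id}
    edges = []
    next_id = 2
    for pattern in patterns:
        node = 1  # every pattern is spelled out starting from the root
        for symbol in pattern:
            if symbol not in children[node]:
                # No outgoing edge with this symbol yet: create a new vertex.
                children[node][symbol] = next_id
                children[next_id] = {}
                edges.append((node, next_id, symbol))
                next_id += 1
            node = children[node][symbol]
    return edges


if __name__ == "__main__":
    # Toy input chosen for illustration (hypothetical, not from a dataset).
    for parent, child, symbol in build_trie(["ATAGA", "ATC", "GAT"]):
        print(parent, child, symbol)
```

Each symbol of each pattern is visited once, so the work is proportional to the total length of the patterns. Shared prefixes such as "AT" in the toy input reuse existing edges instead of creating new ones.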

ChIPseq data mining with ChIPseeker

ChIP-seq is rapidly becoming a common technique, and a large number of datasets are available in the public domain. Results from an individual experiment provide only a limited understanding of chromatin interactions, because many factors cooperate to regulate transcription. Unlike other tools, which are designed for a single dataset, ChIPseeker is designed for comparing profiles of ChIP-seq datasets at different levels.

We provide functions to compare profiles of peaks binding to TSS regions, annotation, and enriched functional profiles. More importantly, ChIPseeker incorporates statistical testing of co-occurrence of different ChIP-seq datasets and can be used to identify co-factors.

> library(ChIPseeker)
> ff=getSampleFiles()
> x = enrichPeakOverlap(ff[[5]], unlist(ff[1:4]), nShuffle=10000, pAdjustMethod="BH", chainFile=NULL)
>> permutation test of peak overlap...		 2015-09-24 14:23:43
  |======================================================================| 100%
> x
                                                      qSample
ARmo_0M    GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
ARmo_1nM   GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
ARmo_100nM GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
CBX6_BF    GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
                                                      tSample qLen tLen N_OL
ARmo_0M                       GSM1174480_ARmo_0M_peaks.bed.gz 1663  812    0
ARmo_1nM                     GSM1174481_ARmo_1nM_peaks.bed.gz 1663 2296    8
ARmo_100nM                 GSM1174482_ARmo_100nM_peaks.bed.gz 1663 1359    3
CBX6_BF    GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz 1663 1331  968
               pvalue   p.adjust
ARmo_0M    0.88901110 0.88901110
ARmo_1nM   0.15118488 0.30236976
ARmo_100nM 0.37296270 0.49728360
CBX6_BF    0.00009999 0.00039996


subsetting data in ggtree

Subsetting is commonly used in ggtree, for example to separate internal nodes from tips. We may also want to display annotation on specific node(s)/tip(s).

Some software stores clade information (e.g. bootstrap values) as internal node labels, and we want to manipulate such information separately from the taxa labels.

The current ggplot2 (version 1.0.1, accessed 2015-09-23) supports subset. For instance:

ggtree(tree) + geom_text(aes(label=label), subset=.(!isTip), hjust=-.2)

But this feature was removed in the GitHub version of ggplot2 and will not be available in the next release.

In the GitHub version of ggtree, we implemented geom_text2, geom_point2, and geom_segment2, which work exactly like geom_text, geom_point, and geom_segment respectively, but with subset supported. The syntax is slightly different.





ssh -L 5901: -N -f -l user server_ip_address

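This command creates an SSH tunnel that connects localhost to the VNC server. Screen Sharing in OS X is itself a VNC viewer, so no extra software is needed. Press ⌘+K in Finder and, in the dialog that pops up, enter:


Screen Sharing will then start and connect to VNC. Another advantage of VNC is that programs with a graphical interface can keep running on the remote machine without the local machine having to stay on.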

comic phylogenetic tree with ggtree and comicR

ggtree applies the concepts of the grammar of graphics to phylogenetic tree presentation and makes it easy to add multiple layers of text and even figures on top of a 🌲.

Here, I cartoonize a phylogenetic tree generated by ggtree with comicR, a fun package for generating comic (xkcd-like) graphs in R. Have fun with ggtree and comicR.
