# Category Archives: Biology

## TRIE

In the problem of finding a motif in DNA, I used a sliding window to locate the motif. We could of course use KMP or Boyer-Moore to speed this up. But if we want to find many patterns in one text, for instance searching a genome for a collection of known genes, we would prefer to parse the genome once rather than once per pattern. For this we need a trie.

Problem: http://rosalind.info/problems/trie/
Given a collection of strings, their trie (often pronounced "try" to avoid ambiguity with the general term tree) is a rooted tree formed as follows. For every unique first symbol in the strings, an edge is formed connecting the root to a new vertex. This symbol is then used to label the edge.

We may then iterate the process by moving down one level as follows. Say that an edge connecting the root to a node v is labeled with 'A'; then we delete the first symbol from every string in the collection beginning with 'A' and then treat v as our root. We apply this process to all nodes that are adjacent to the root, and then we move down another level and continue.

As a result of this method of construction, the symbols along the edges of any path in the trie from the root to a leaf will spell out a unique string from the collection, as long as no string is a prefix of another in the collection (this would cause the first string to be encoded as a path terminating at an internal node).

Given: A list of at most 100 DNA strings of length at most 100 bp, none of which is a prefix of another.

Return: The adjacency list corresponding to the trie T for these patterns, in the following format. If T has n nodes, first label the root with 1 and then label the remaining nodes with the integers 2 through n in any order you like. Each edge of the adjacency list of T will be encoded by a triple containing the integer representing the edge's parent node, followed by the integer representing the edge's child node, and finally the symbol labeling the edge.
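
The construction described above can be sketched in R as follows. This is only a minimal sketch, not a reference solution: the function name `build_trie` and the `edges` lookup environment are my own hypothetical choices. It inserts the strings one symbol at a time, which yields the same tree as the level-by-level description, and numbers nodes in the order they are created.

```r
# Build a trie from a character vector of patterns and return its
# adjacency list as a data.frame of (parent, child, symbol) triples.
# build_trie and edges are illustrative names, not part of any package.
build_trie <- function(patterns) {
  parent <- integer(0)
  child  <- integer(0)
  symbol <- character(0)
  edges   <- new.env()   # maps "parent:symbol" to the child node id
  n_nodes <- 1L          # the root is labelled 1

  for (p in patterns) {
    node <- 1L           # every pattern starts at the root
    for (s in strsplit(p, "")[[1]]) {
      key <- paste(node, s, sep = ":")
      nxt <- edges[[key]]
      if (is.null(nxt)) {
        # no edge labelled s leaves this node yet: create a new node
        n_nodes <- n_nodes + 1L
        nxt <- n_nodes
        edges[[key]] <- nxt
        parent <- c(parent, node)
        child  <- c(child, nxt)
        symbol <- c(symbol, s)
      }
      node <- nxt
    }
  }
  data.frame(parent, child, symbol, stringsAsFactors = FALSE)
}

adj <- build_trie(c("ATAGA", "ATC", "GAT"))
# one "parent child symbol" triple per line
write.table(adj, quote = FALSE, row.names = FALSE, col.names = FALSE)
```

Growing the vectors with `c()` is quadratic in the number of edges, which is acceptable here because the input is limited to 100 strings of at most 100 bp.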

## ChIPseq data mining with ChIPseeker

ChIP-seq is rapidly becoming a common technique, and a large number of datasets are available in the public domain. Results from an individual experiment provide only a limited understanding of chromatin interactions, as many factors cooperate to regulate transcription. Unlike other tools that are designed for a single dataset, ChIPseeker is designed for comparing profiles of ChIP-seq datasets at different levels.

We provide functions to compare profiles of peaks binding to TSS regions, annotation, and enriched functional profiles. More importantly, ChIPseeker incorporates statistical testing of co-occurrence of different ChIP-seq datasets and can be used to identify co-factors.

```r
> library(ChIPseeker)
> ff=getSampleFiles()
> x = enrichPeakOverlap(ff[[5]], unlist(ff[1:4]), nShuffle=10000, pAdjustMethod="BH", chainFile=NULL)
>> permutation test of peak overlap...		 2015-09-24 14:23:43
  |======================================================================| 100%
> x
                                                      qSample
ARmo_0M    GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
ARmo_1nM   GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
ARmo_100nM GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
CBX6_BF    GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
                                                      tSample qLen tLen N_OL
ARmo_0M                       GSM1174480_ARmo_0M_peaks.bed.gz 1663  812    0
ARmo_1nM                     GSM1174481_ARmo_1nM_peaks.bed.gz 1663 2296    8
ARmo_100nM                 GSM1174482_ARmo_100nM_peaks.bed.gz 1663 1359    3
CBX6_BF    GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz 1663 1331  968
               pvalue   p.adjust
ARmo_0M    0.88901110 0.88901110
ARmo_1nM   0.15118488 0.30236976
ARmo_100nM 0.37296270 0.49728360
CBX6_BF    0.00009999 0.00039996
```
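
To see where the p.adjust column comes from, the Benjamini-Hochberg ("BH") correction can be reproduced in base R from the four p-values printed above. This is a sketch for illustration only: `bh_adjust` is a hypothetical helper written here, not a ChIPseeker function, and base R's `p.adjust` should give the same numbers.

```r
# p-values copied from the pvalue column of the output above
pvalue <- c(ARmo_0M = 0.88901110, ARmo_1nM = 0.15118488,
            ARmo_100nM = 0.37296270, CBX6_BF = 0.00009999)

# Benjamini-Hochberg by hand (bh_adjust is an illustrative name):
# scale each p-value by m / rank, then walk down from the largest
# p-value taking running minima so adjusted values stay monotone.
bh_adjust <- function(p) {
  m <- length(p)
  o <- order(p, decreasing = TRUE)          # largest p-value first
  adj <- pmin(1, cummin(p[o] * m / (m:1)))  # ranks m, m-1, ..., 1
  setNames(adj[order(o)], names(p))         # back to the input order
}

manual  <- bh_adjust(pvalue)
builtin <- p.adjust(pvalue, method = "BH")
print(cbind(manual, builtin))
```

With only four comparisons the correction is mild, so the CBX6_BF overlap stays well below the usual 0.05 cutoff after adjustment.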

## subsetting data in ggtree

Subsetting is commonly used in ggtree, for example to separate internal nodes from tips. We may also want to display annotation on specific node(s)/tip(s).

Some software may store clade information (e.g. bootstrap values) as internal node labels. We then want to manipulate such information and the taxa labels separately.

The current ggplot2 (version 1.0.1, accessed 2015-09-23) supports a subset argument. For instance:

```r
library(ggplot2)
library(ggtree)
tree=read.tree(text="((A:2,B:2)95:2,(C:2,D:2)100:2);")
ggtree(tree) + geom_text(aes(label=label), subset=.(!isTip), hjust=-.2)
```

But this feature has been removed in the GitHub version of ggplot2 and will not be available in the next release.

In the GitHub version of ggtree, we implemented geom_text2, geom_point2 and geom_segment2, which work exactly like geom_text, geom_point and geom_segment respectively, with subset supported. The syntax is slightly different.
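
As a sketch of that different syntax, assuming a ggtree version that provides geom_text2: the subset condition moves inside `aes()` and is written as a plain logical expression instead of being wrapped in `.()`. The toy tree is the same as in the earlier example, and `ape::read.tree` is called explicitly here only to make clear where the parser comes from.

```r
library(ggplot2)
library(ggtree)

# same toy tree as before; the internal node labels (95, 100)
# hold the clade support values
tree <- ape::read.tree(text = "((A:2,B:2)95:2,(C:2,D:2)100:2);")

# label internal nodes only: subset is given inside aes()
p <- ggtree(tree) +
  geom_text2(aes(label = label, subset = !isTip), hjust = -.2)
print(p)
```

Using `subset = isTip` instead would label only the tips, so the two kinds of label can be styled in separate layers.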

## embedding a subplot in ggplot via subview

I implemented a function, subview, in ggtree that makes it easy to embed a subplot in a ggplot.

An example is shown below:

```r
library(ggplot2)
library(ggtree)

dd <- data.frame(x=LETTERS[1:3], y=1:3)
pie <- ggplot(dd, aes(x=1, y, fill=x)) +
    geom_bar(stat="identity", width=1) +
    coord_polar(theta="y") + theme_tree() +
    xlab(NULL) + ylab(NULL) +
    theme_transparent()

x <- sample(2:9)
y <- sample(2:9)
width <- sample(seq(0.05, 0.15, length.out=length(x)))
height <- width

p <- ggplot(data=data.frame(x=c(0, 10), y=c(0, 10)), aes(x, y))+geom_blank()
print(p)
for (i in seq_along(x)) {
    p %<>% subview(pie, x[i], y[i], width[i], height[i])
    print(p)
}
```

## functional enrichment analysis with NGS data

I found a Bioconductor package, seq2pathway, that can apply functional analysis to NGS data. It consists of two components, seq2gene and gene2pathway. seq2gene converts genomic coordinates to genes, while gene2pathway performs functional analysis at the gene level.

I think it would be interesting to combine seq2gene with clusterProfiler. But it fails to run because it calls an absolute path to the Python installed on the author's computer.
