**Problem**
A matrix is a rectangular table of values divided into rows and columns. An m×n matrix has m rows and n columns. Given a matrix A, we write A_{i,j} to indicate the value found at the intersection of row i and column j.
Say that we have a collection of DNA strings, all having the same length n. Their profile matrix is a 4×n matrix P in which P_{1,j} represents the number of times that 'A' occurs in the jth position of one of the strings, P_{2,j} represents the number of times that C occurs in the jth position, and so on (see below).
A consensus string c is a string of length n formed from our collection by taking the most common symbol at each position; the jth symbol of c therefore corresponds to the symbol having the maximum value in the j-th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.
A T C C A G C T
G G G C A A C T
A T G G A T C T
DNA Strings A A G C A A C C
T T G G A A C T
A T G C C A T T
A T G G C A C T
A 5 1 0 0 5 5 0 0
Profile C 0 0 1 4 2 0 6 1
G 1 1 6 3 0 1 0 0
T 1 5 0 0 0 1 1 6
Consensus A T G C A A C T
Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.
Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)
**Sample Dataset**
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
**Sample Output**
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6

FASTA format was introduced previously in Computing GC content, in which there is no need to store the sequence. In this problem, the situation is a little more complicated since the storage issue should be considered. The storage issue is easy to solve if using high level languages, for instance R, which support memory dynamic allocation.

Here, I will using C to implement readFASTA function. The FASTA file of this problem is easy to parse for all its description lines are in equal length and all the sequence lines are also in equal length. But it is not the true in reality. The description/sequence lines are very common to have unequal lengths, and the sequence lines can contain return characters. All these situations should be considered when implementing a general function.

If we predefined character array to store the description and sequence line, the drawback is obvious. If the array size is not large enough, the function will lost generality, but if it is large, the function will waste a lot of memory.

To solve the storage issue, I use Link List to store description and sequence lines, re-store them in character array and then pack them in SEQ structure. The sequence number in FASTA file is also unknown, so the SEQ structures are also organized as Link List. It's easy to implement a wrapper function to return the SEQ structure in pointer array if necessary.

Read more »

## Recent Comments