Changes

BLAStoise

8,133 bytes added, 20:12, 12 February 2017

→‎Assignment 1

----

'''DNA Edit Distance and Sequence Alignment Tool - Analysis by Jonathan D.'''

This program is a bioinformatics tool which will calculate the edit distance between two sequences. Furthermore, if the option is selected, then the program will also perform a sequence alignment on the two sequences. The two sequences must be given in FASTA format, but there are already ready to use sequences within the test directory of the project.

Original source code can be be found [https://github.com/kbiscanic/bioinformatics here].

'''Compilation'''

To compile the program, navigate to the project’s src directory and run the following command:

$ g++ -std=c++11 -g -pg -O2 BasicEditDistance.cpp main.cpp Parser.cpp Result.cpp Sequence.cpp Solver.cpp SubmatrixCalculator.cpp Writer.cpp

'''Running the Program'''

There are three options to run the program, as noted in the github readme:

b - basic edit distance (Needleman-Wunsch)

d - edit distance (Masek-Paterson)

a - edit distance and alignment (Masek-Paterson)

Initial testing of the program was done by using the "b" option which utilizes the Needleman-Wunsch algorithm to calculate basic edit distance and maximum sequence length of 100 nucleotides. (FASTA files were included in the project under the test directory)

INPUT FILE (test-100.fa):

>Chromosome_3565586_3565685_0:0:0_0:0:0_0/1

ACCGGTTGCCCGCTACATGCTCCAACCATCCGGCGATGGTTACCTGCTGCCGGACTGGTATAGCGCAGAGCCGCGTCGACACCGCGTATCCGTGCCCCCC

>Chromosome_3568561_3568661_0:0:1_0:0:1_1/1

TGGGGATTGCCAGTCCGTCCGGGGAGGTATTCAGAAAGGTACACCGGTCTGTTGATATTCATGTAACAGGTATTAATGATGAAGAAAGGAATGGCAAACA

Run the following command to align the two sequences:

$ ./a.out a ../test/data/test-100.fa test.maf

SCREEN OUTPUT:

Submatrix dimension: 1

String A size: 100

String B size: 100

Submatrices in edit table: 100x100

Allocating 230 locations.

Allocation time: 1e-05s

1 / 5 (submatrices: 45 )

5 / 5 (submatrices: 225 )

Submatrix calculation time: 8.9e-05s

Edit path calculation (Masek-Paterson): 0.000149

TEXT OUTPUT (test.maf)

a score=58

s Chromosome_3565586_3565685_0:0:0_0:0:0_0/1 0 100 + 107 ACCGG-TTGCCCGCTACATGC-TCCAACCATCCGGCGATGGT-TACCTG-CTGCCG-GACTGGTATAGC--GCAGAGCCGCGTCGACACCGCGTATCCGTGCCCCCC

s Chromosome_3568561_3568661_0:0:1_0:0:1_1/1 0 100 + 107 TGGGGATTGCCAG-TCCGTCCGGGGAGGTATTCAG-AAAGGTACACCGGTCTGTTGATATTCATGTAACAGGTATTAATG-ATGAAGAAAG-GAAT--G-GCAAACA

As you can see, the two sequences are now aligned which will show conserved regions of nucleotides between the two.

'''Analysis'''

In order to perform gprof analysis, sequence length of 10000 was used:

$ ./a.out a ../test/data/test-10000.fa test.maf

$ gprof -p -b > FULL_O2.analysis.10000.flt

Flat profile:

Each sample counts as 0.01 seconds.

% cumulative self self total

time seconds seconds calls s/call s/call name

36.59 0.45 0.45 302086997 0.00 0.00 std::vector<int, std::allocator<int> >::operator[](unsigned long)

22.77 0.73 0.28 1 0.28 1.05 Solver::fill_edit_matrix()

13.82 0.90 0.17 201301730 0.00 0.00 std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >::operator[](unsigned long)

12.20 1.05 0.15 25000000 0.00 0.00 SubmatrixCalculator::sumSteps(int)

7.32 1.14 0.09 727566 0.00 0.00 std::vector<int, std::allocator<int> >::size() const

3.25 1.18 0.04 5 0.01 0.01 std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >::vector()

0.81 1.19 0.01 284812 0.00 0.00 std::_Vector_base<int, std::allocator<int> >::_M_deallocate(int*, unsigned long)

Even with a low sequence length of 10000, we can already begin to see that a lot of time is used in the function call Solver::fill_edit_matrix().

If we run the program for a larger sequence of 50000, the following flat profile is generated.

Flat profile:

Each sample counts as 0.01 seconds.

% cumulative self self total

time seconds seconds calls s/call s/call name

85.67 5.61 5.61 1 5.61 5.61 Solver::fill_edit_matrix()

8.09 6.14 0.53 5267025 0.00 0.00 SubmatrixCalculator::calculateFinalSteps(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)

3.05 6.34 0.20 63261681 0.00 0.00 std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >::_M_fill_insert(__gnu_cxx::__normal_iterator<std::vector<int, std::allocator<int> >*, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > >, unsigned long, std::vector<int, std::allocator<int> > const&)

3.05 6.54 0.20 1 0.20 0.93 SubmatrixCalculator::calculate()

At 50000, it is even more clear that the Solver::fill_edit_matrix() function takes up the majority of the execution since it's taking up 85% of the elapsed time.

Below is a snippet of the code:

void Solver::fill_edit_matrix() {

all_columns.resize(row_num + 1, vector<int>(column_num + 1, 0));

all_rows.resize(row_num + 1, vector<int>(column_num + 1, 0));

top_left_costs.resize(row_num + 1, vector<int>(column_num + 1, 0));

int initialVector = 0;

for (int i = 0; i < submatrix_dim; i++){

initialVector = initialVector * 10 + 2; // 222 == "222" == (1, 1, 1)

}

// padding string b step vectors

for (int submatrix_j = 1; submatrix_j <= column_num; submatrix_j++) {

if ((submatrix_j * submatrix_dim - 1) >= string_b_real_size) {

vector<int> temp_vec(submatrix_dim, 0);

for (int i = 0;

i < (string_b_real_size - ((submatrix_j - 1) * submatrix_dim));

i++)

temp_vec[i] = 1;

all_rows[0][submatrix_j] = SubmatrixCalculator::stepsToInt(temp_vec);

} else {

all_rows[0][submatrix_j] = initialVector;

}

// padding string a step vectors

for (int submatrix_i = 1; submatrix_i <= row_num; submatrix_i++) {

if ((submatrix_i * submatrix_dim - 1) >= string_a_real_size) {

vector<int> temp_vec(submatrix_dim, 0);

for (int i = 0;

i < (string_a_real_size - ((submatrix_i - 1) * submatrix_dim));

i++)

temp_vec[i] = 1;

all_columns[submatrix_i][0] =

SubmatrixCalculator::stepsToInt(temp_vec);

} else {

all_columns[submatrix_i][0] = initialVector;

}

for (int submatrix_i = 1; submatrix_i <= row_num; submatrix_i++) {

if ((submatrix_i - 1) * submatrix_dim > string_a_real_size)

top_left_costs[submatrix_i][1] = string_a_real_size;

else

top_left_costs[submatrix_i][1] = (submatrix_i - 1) * submatrix_dim;

for (int submatrix_j = 1; submatrix_j <= column_num; submatrix_j++) {

pair<int, int> final_steps = subm_calc->resultIndex[

// offset calculation

str_a_offsets[submatrix_i] + // left string

str_b_offsets[submatrix_j] + // top string

// left steps

subm_calc->stepOffsets[0][all_columns[submatrix_i][submatrix_j - 1]] +

// top steps

subm_calc->stepOffsets[1][all_rows[submatrix_i - 1][submatrix_j]]

];

all_columns[submatrix_i][submatrix_j] = final_steps.first;

all_rows[submatrix_i][submatrix_j] = final_steps.second;

if (submatrix_j != 1) {

top_left_costs[submatrix_i][submatrix_j] =

top_left_costs[submatrix_i][submatrix_j - 1];

top_left_costs[submatrix_i][submatrix_j] += subm_calc->sumSteps(

all_rows[submatrix_i - 1][submatrix_j - 1]);

}

</nowiki>

=== Assignment 2 ===

=== Assignment 3 ===

Jdesmond1

26

edits

Changes

BLAStoise

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

get involved with CDOT

courses

course projects

links

Tools