EE6435 Programming homework on hidden Markov model

Language: Python. Debugging may take you some time.

Full mark: 70 pts

Mark your calendar. No late work will be graded.

Implementation and application of the Viterbi algorithm

The base composition of most genome sequences is not homogeneous. In particular, GC composition can
vary regionally. In the human genome, for instance, there are "CpG islands" which show a very strong
statistical signal (even stronger at the dinucleotide composition level than the mononucleotide composition
level) and which tend to mark the 5' end of many genes. Therefore the problem of objectively segmenting
a genome sequence into regions of different compositions arises naturally. One way to segment a genome
is with HMM algorithms.

Let's assume that a genome can be modeled as a two-state HMM with states A and B. Assume that the
parameters of this HMM are as follows. State A has an AT-biased emission distribution {0.35, 0.15, 0.15,
0.35} for the probabilities of {A,C,G,T}. State B has a GC-biased emission distribution {0.15, 0.35, 0.35,
0.15}. State A switches to state B with probability 0.001. State B switches back to state A with probability
0.01. Assume that the initial distribution for the HMM is uniform: 0.5 for each of the two states.
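
For illustration only, the parameters above could be held in Python as follows. This is a minimal sketch: the names are hypothetical, and the self-transition probabilities 0.999 and 0.99 are an assumption that each state's outgoing probabilities sum to 1 (no separate end state). The tables are also converted to logarithms, since multiplying about a million probabilities directly would underflow floating point.

```python
import math

# Hypothetical names; the numbers are the ones stated in the text.
SYMBOLS = "ACGT"

INITIAL = {"A": 0.5, "B": 0.5}
TRANSITION = {
    "A": {"A": 0.999, "B": 0.001},  # A -> B with probability 0.001 (A -> A assumed 0.999)
    "B": {"A": 0.01, "B": 0.99},    # B -> A with probability 0.01 (B -> B assumed 0.99)
}
EMISSION = {
    "A": dict(zip(SYMBOLS, (0.35, 0.15, 0.15, 0.35))),  # AT-biased
    "B": dict(zip(SYMBOLS, (0.15, 0.35, 0.35, 0.15))),  # GC-biased
}


def to_log(table):
    """Return a copy of a probability table with natural-log values."""
    return {key: math.log(p) for key, p in table.items()}


LOG_INITIAL = to_log(INITIAL)
LOG_TRANSITION = {state: to_log(row) for state, row in TRANSITION.items()}
LOG_EMISSION = {state: to_log(row) for state, row in EMISSION.items()}
```

A submitted program would presumably load these values from a file rather than define them as constants.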

The format of the hidden Markov model can be found in example.hmm and example.hmm.readme.

The file example.fa contains a simulated 1 Mb genome sequence generated by this HMM. Lower-case
residues were generated by state A; upper-case residues by state B. Note that *.fa denotes the "fasta"
format, the standard format for describing sequences. In fasta files, the first line is a header starting
with ">". The header line contains information about the sequence that follows, and the Viterbi
parsing is applied to the sequence starting on the next line (not the line with ">").
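
A minimal sketch of reading such a file is given below. The function name `parse_fasta` is hypothetical; it takes any iterable of lines (an open file works), treats each ">" line as a header, and joins the sequence lines that follow it.

```python
def parse_fasta(lines):
    """Return a list of (header, sequence) pairs from an iterable of FASTA lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

With a file on disk this could be called as `with open(path) as handle: records = parse_fasta(handle)`. Because example.fa encodes the true state in the letter case, the sequence should be converted to a single case before emission probabilities are looked up.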

Viterbi parsing

Implement a Viterbi parser for the HMM above using Python, including the initialization, matrix fill, and
traceback stages. Your program should output the resulting parse as a series of segments using the
following format:

1 153 state A (or state 1)
154 252 state B (or state 2)
253 1651 state A
(... etc...)

Each line is a contiguous segment in a single state. The format of each line is:
start-position end-position state X

Note that positions are 1-indexed (not 0-indexed) and, as in the example above, the end position is inclusive.
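
The three stages named above (initialization, matrix fill, traceback) can be sketched as follows. This is one possible log-space formulation, not a reference solution: the function name and argument layout are hypothetical, the parameter tables are assumed to hold natural-log probabilities keyed by state name, and the sequence is assumed to contain only the symbols present in the emission tables.

```python
def viterbi(seq, states, log_init, log_trans, log_emit):
    """Return the most probable state path for seq as a list of state names."""
    if not seq:
        return []
    seq = seq.upper()  # case carries the true state in example.fa, not the residue
    indices = range(len(states))

    # Initialization: start in each state and emit the first symbol.
    scores = [log_init[s] + log_emit[s][seq[0]] for s in states]

    # Matrix fill: keep only the previous column of scores, plus one row of
    # back-pointers per position for the traceback.
    backpointers = []
    for symbol in seq[1:]:
        new_scores, pointers = [], []
        for s in states:
            best = max(indices, key=lambda i: scores[i] + log_trans[states[i]][s])
            pointers.append(best)
            new_scores.append(scores[best] + log_trans[states[best]][s] + log_emit[s][symbol])
        scores = new_scores
        backpointers.append(pointers)

    # Traceback: start from the best final state and follow the pointers back.
    current = max(indices, key=lambda i: scores[i])
    path = [current]
    for pointers in reversed(backpointers):
        current = pointers[current]
        path.append(current)
    path.reverse()
    return [states[i] for i in path]
```

Working with sums of logarithms instead of products of probabilities avoids floating-point underflow on long sequences. For a 1 Mb input, the back-pointer table is the main memory cost; a more compact array type may be preferable to nested lists.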

Run your program on the 1 Mb simulated sequence in example.fa. How many segments of the genome
are in state B? Output this information as well. (The true state path is in the file example.positions.)
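
As a sketch of the reporting step, a decoded state path can be collapsed into 1-indexed, inclusive segments and the state B segments counted. The function names and the toy path below are hypothetical and stand in for the output of a Viterbi traceback.

```python
from itertools import groupby


def path_to_segments(path):
    """Collapse a state path into (start, end, state) runs, 1-indexed and inclusive."""
    segments = []
    position = 1
    for state, run in groupby(path):
        length = sum(1 for _ in run)
        segments.append((position, position + length - 1, state))
        position += length
    return segments


def format_segments(segments):
    """Render segments as 'start end state X' lines."""
    return [f"{start} {end} state {state}" for start, end, state in segments]


# Toy path used only for illustration.
toy_segments = path_to_segments(list("AAABBA"))
for line in format_segments(toy_segments):
    print(line)
# 1 3 state A
# 4 5 state B
# 6 6 state A

b_count = sum(1 for _, _, state in toy_segments if state == "B")
print("Segments in state B:", b_count)  # Segments in state B: 1
```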

Name your program as .

Your program should run like this: your-student-ID.py

Don't hard-code any input parameters, because we will change them during testing. For example, my
testing file may be named "test3.txt" or "inputdata.fa", but it will always be in fasta format.
When we test your program, we will run additional input files besides the one provided on
Canvas. Detailed instructions about what to submit are given below.
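
One way to avoid hard-coded inputs is to take the file names from the command line. The sketch below uses `argparse`; the argument names and their order are purely hypothetical, since the exact command line is not reproduced in this handout.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments (names and order are assumptions)."""
    parser = argparse.ArgumentParser(
        description="Viterbi segmentation of a FASTA sequence"
    )
    parser.add_argument("hmm_file", help="HMM parameter file, e.g. example.hmm")
    parser.add_argument("fasta_file", help="input sequence in FASTA format")
    return parser.parse_args(argv)
```

Calling `parse_args()` with no argument reads `sys.argv`, so the same program works unchanged whether the input is named "test3.txt" or "inputdata.fa".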

1. The source code.
2. A readme.txt file: include the complete command line to run your program, and tell us
the environment and version of Python you used.
3. In your source code, clearly comment the parts that read files, generate the table,
initialize the table, fill the table, and perform the traceback.

Grading details:

1. Your program generates correct results on 3 input files (15 pts each, 45 pts total). The three
input files have the same format as example.fa: each is a fasta file that may
contain multiple lines or just two lines.

e.g.
>test1
AACCGGA

>test2
AACGG
AGCT
AAACGTA

2. The format (input, output, no hard-coded inputs) follows the instructions (15 pts,
no partial credit).
3. The output of your program on example.fa, including the number of segments of the genome
that are in state B (10 pts).
