中国高校课件下载中心 》 教学资源 》 大学文库

麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 4 Database Searching

文档信息
资源类别:文库
文档格式:PDF
文档页数:75
文件大小:599.82KB
团购合买:点击进入团购
内容简介
Outline FASTA, Blast searching, Smith-Waterman Psi-Blast Review of Genomic DNA structure Substitution patterns and mutation rates Synonymous and non-Synonymous substitutions Jukes-Cantor Model Kimura's Two-Parameter Model
刷新页面文档预览

7. 91-Lecture #4 Michael Yaffe Database Searching Molecular Phylogenetics ABCD ABc ((A,B)c)D)

7.91 – Lecture #4 Database Searching & Molecular Phylogenetics A A B B C D D (((A,B)C)D) C Michael Yaffe

Outline FASTA, Blast searching, Smith-Waterman ·Psi- Blast Review of genomic dNA structure Substitution patterns and mutation rates Synonymous and non -synonymous substitutions · Jukes- Cantor model Kimura's Two-Parameter Model Molecular clocks Phylogenetic Trees-rooted and unrooted Distance matrix Methods Neighbor-Joining Method and Related Neighbor Methods · Maximum likelihood

Outline • FASTA, Blast searching, Smith-Waterman • Psi-Blast • Review of Genomic DNA structure • Substitution patterns and mutation rates • Synonymous and non-Synonymous substitutions • Jukes-Cantor Model • Kimura’s Two-Parameter Model • Molecular Clocks • Phylogenetic Trees – rooted and unrooted • Distance Matrix Methods • Neighbor-Joining Method and Related Neighbor Methods • Maximum Likelihood

Outline( cont) · Parsimony Branch and bound Heuristic Seaching · Consensus trees Software(PHYLIP, PAUP The tree of life Reading: Mount,p.237-280,283286,291-308

Outline (cont) • Parsimony Branch and Bound Heuristic Seaching • Consensus Trees • Software (PHYLIP, PAUP) • The Tree of Life Reading: Mount, p. 237-280, 283-286, 291-308

Database searching Problem is simple: I want to find homologues to my protein in the database How do i do it Do the obvious - compare my protein against every other protein in the database and look for local alignments by dynamic programming Uh oh! n For k sequences in the 12345678 Database 12345678. this becomes an o(mnk) 12345678 problem 12345678 12345678 12345678 essentially an o(mn) problem

Database Searching Problem is simple: I want to find homologues to my protein in the database How do I do it? Do the obvious – compare my protein against every other protein in the database and look for local alignments by dynamic programming Uh Oh! 1 n For k sequences in the 1 12345678…. Database 12345678…. this becomes an O(mnk) 12345678…. problem! 12345678…. 12345678…. m 12345678…. ….essentially an O(mn) problem

Database Searching Still, this can be done -- 50x slower than blast/FASTA, Smith-Waterman algorithm SSEARCH (tp. virginia. edu/pub/fasta)-do it locally But in the old days, needed a faster method 2 approaches- Blast, FASTA -both heuristic l.e. tried and true)-almost always finds related Proteins but cannot guarantee optimal solution FASTA: Basic Idea Search for matching sequence patterns or words Called k-tuples, which are exact matches of"k"characters between the two sequences i.e. RW= 2-tuple Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE

Database Searching Still, this can be done - ~ 50x slower than Blast/FASTA, Smith-Waterman algorithm… SSEARCH (ftp.virginia.edu/pub/fasta) – do it locally! But in the old days, needed a faster method… 2 approaches – Blast, FASTA – both heuristic (i.e. tried and true) – almost always finds related Proteins but cannot guarantee optimal solution FASTA: Basic Idea 1- Search for matching sequence patterns or words Called k-tuples, which are exact matches of “k” characters between the two sequences i.e. RW = 2-tuple Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE

Database searching FASTA: Basic Idea 2- Repeat for all possible k-tuples i.e. cV=2-tuple Seq 1: AHFYRWNKLCV Seg 2: DRWNLFCVATYWE 3-Make a Hash Table Hashing) that has the position of each k-tuple in each sequence 2-tuple pos in Seg1 pos in Seg2 offset (pos1-p0s2 RW 5 2 CV 10 AH

Database Searching FASTA: Basic Idea 2- Repeat for all possible k-tuples i.e. CV = 2-tuple Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE 3- Make a Hash Table (Hashing) that has the position of each k-tuple in each sequence 2-tuple pos. in Seq1 RW 5 CV 10 AH 1 i.e. pos in Seq 2 Offset (pos1-pos2) 2 3 7 3 ---- ----

Database searching Seq 1: AHFYRWNKLEV Seg 2: DRNLFCVATYWE 3.Make a Hash Table HAshing)that has the position of each k-tuple heach sequence e 2-tuple pos in Seq1 pos in Seg ofset(pos1-pos2 RW CV 10 3 AH 4-Look for words(k-tuples) with same offset These are in-phase and reveal a region of alignment between the two sequences 5-Build a local alignment based on these, extend it outwards Seg 1: AHFYRWNKLCV Seg 2: DRWNLFCVATYWE

Database Searching Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE 3- Make a Hash Table i.e. (Hashing) that has the position of each k-tuple in each sequence 2-tuple pos. in Seq1 pos in Seq 2 Offset (pos1-pos2) 3 3 RW 5 2 CV 10 7 AH 1 ---- ---- 4- Look for words (k-tuples) with same offset These are in-phase and reveal a region of alignment between the two sequences. 5- Build a local alignment based on these, extend it outwards Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE

Database searching With hashing, number of comparisons is proportional To the average sequence length i.e. an o(n) problem) Not an o(mn)problem as in dynamic programming Proteins-ktup=1-2 Nucleotides, ktup=4-6 One big problem- low complexity regions Seq 1: AHFYPPPPPPPPFSER Seq 2: DVATPPPPPPPPPPPNLFK

Database Searching With hashing, number of comparisons is proportional To the average sequence length (i.e. an O(n) problem), Not an O(mn) problem as in dynamic programming. Proteins – ktup = 1-2, Nucleotides, ktup=4-6 One big problem – low complexity regions. Seq 1: AHFYPPPPPPPPFSER Seq 2: DVATPPPPPPPPPPPNLFK

Database searching BLAST Same basic idea as fasta but faster and more sensitivel How? BLAST searches for common words or k-tuples, but limits the search for k-tuples that are most significant by using the log-odds values in the Blosum62 amino acid substitution matrix . e. look for WHK and might accept WHR but not HFK as a possible match(note 8000 possibilities) Repeat for all 3-tuples in the query Search the database for a match to the top 50 3-tuples that match the first query position in the sequence, the second query position, etc Use any match to seed an ungapped alignment (old BLasT)

Database Searching BLAST Same basic idea as FASTA, but faster and more sensitive! How? BLAST searches for common words or k-tuples, but limits the search for k-tuples that are most significant, by using the log-odds values in the Blosum62 amino acid substitution matrix i.e. look for WHK and might accept WHR but not HFK as a possible match (note 8000 possibilities) Repeat for all 3-tuples in the query Search the database for a match to the top 50 3-tuples that match the first query position in the sequence, the second query position, etc. Use any match to seed an ungapped alignment (old BLAST)

Database searching Word length is fixed: 3-tuple for proteins 11-tuple for nucleotides By default, filters out low complexity regions Determine if the alignment is statistically significant calculates the probability of observing a score greater than or equal to your alignment based on extreme value distribution Calculates an E-value expectation value This is the probability of finding an unrelated sequence that shows this good an alignment just by chance. Remember if p=.0001 and my database has 500,000 sequences, I will have an E=50!(normal starting E=10

Database Searching Word length is fixed: 3-tuple for proteins 11-tuple for nucleotides By default, filters out low complexity regions. Determine if the alignment is statistically significant. calculates the probability of observing a score greater than or equal to your alignment based on extreme value distribution. Calculates an E-value = expectation value: This is the probability of finding an unrelated sequence that shows this good an alignment just by chance. Remember if p=.0001 and my database has 500,000 sequences, I will have an E=50! (normal starting E=10)

刷新页面下载完整文档
VIP每日下载上限内不扣除下载券和下载次数;
按次数下载不扣除下载券;
注册用户24小时内重复下载只扣除一次;
顺序:VIP每日次数-->可用次数-->下载券;
相关文档