麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 4 Database Searching

7. 91-Lecture #4 Michael Yaffe Database Searching Molecular Phylogenetics ABCD ABc ((A,B)c)D)
7.91 – Lecture #4 Database Searching & Molecular Phylogenetics A A B B C D D (((A,B)C)D) C Michael Yaffe

Outline FASTA, Blast searching, Smith-Waterman ·Psi- Blast Review of genomic dNA structure Substitution patterns and mutation rates Synonymous and non -synonymous substitutions · Jukes- Cantor model Kimura's Two-Parameter Model Molecular clocks Phylogenetic Trees-rooted and unrooted Distance matrix Methods Neighbor-Joining Method and Related Neighbor Methods · Maximum likelihood
Outline • FASTA, Blast searching, Smith-Waterman • Psi-Blast • Review of Genomic DNA structure • Substitution patterns and mutation rates • Synonymous and non-Synonymous substitutions • Jukes-Cantor Model • Kimura’s Two-Parameter Model • Molecular Clocks • Phylogenetic Trees – rooted and unrooted • Distance Matrix Methods • Neighbor-Joining Method and Related Neighbor Methods • Maximum Likelihood

Outline( cont) · Parsimony Branch and bound Heuristic Seaching · Consensus trees Software(PHYLIP, PAUP The tree of life Reading: Mount,p.237-280,283286,291-308
Outline (cont) • Parsimony Branch and Bound Heuristic Seaching • Consensus Trees • Software (PHYLIP, PAUP) • The Tree of Life Reading: Mount, p. 237-280, 283-286, 291-308

Database searching Problem is simple: I want to find homologues to my protein in the database How do i do it Do the obvious - compare my protein against every other protein in the database and look for local alignments by dynamic programming Uh oh! n For k sequences in the 12345678 Database 12345678. this becomes an o(mnk) 12345678 problem 12345678 12345678 12345678 essentially an o(mn) problem
Database Searching Problem is simple: I want to find homologues to my protein in the database How do I do it? Do the obvious – compare my protein against every other protein in the database and look for local alignments by dynamic programming Uh Oh! 1 n For k sequences in the 1 12345678…. Database 12345678…. this becomes an O(mnk) 12345678…. problem! 12345678…. 12345678…. m 12345678…. ….essentially an O(mn) problem

Database Searching Still, this can be done -- 50x slower than blast/FASTA, Smith-Waterman algorithm SSEARCH (tp. virginia. edu/pub/fasta)-do it locally But in the old days, needed a faster method 2 approaches- Blast, FASTA -both heuristic l.e. tried and true)-almost always finds related Proteins but cannot guarantee optimal solution FASTA: Basic Idea Search for matching sequence patterns or words Called k-tuples, which are exact matches of"k"characters between the two sequences i.e. RW= 2-tuple Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE
Database Searching Still, this can be done - ~ 50x slower than Blast/FASTA, Smith-Waterman algorithm… SSEARCH (ftp.virginia.edu/pub/fasta) – do it locally! But in the old days, needed a faster method… 2 approaches – Blast, FASTA – both heuristic (i.e. tried and true) – almost always finds related Proteins but cannot guarantee optimal solution FASTA: Basic Idea 1- Search for matching sequence patterns or words Called k-tuples, which are exact matches of “k” characters between the two sequences i.e. RW = 2-tuple Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE

Database searching FASTA: Basic Idea 2- Repeat for all possible k-tuples i.e. cV=2-tuple Seq 1: AHFYRWNKLCV Seg 2: DRWNLFCVATYWE 3-Make a Hash Table Hashing) that has the position of each k-tuple in each sequence 2-tuple pos in Seg1 pos in Seg2 offset (pos1-p0s2 RW 5 2 CV 10 AH
Database Searching FASTA: Basic Idea 2- Repeat for all possible k-tuples i.e. CV = 2-tuple Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE 3- Make a Hash Table (Hashing) that has the position of each k-tuple in each sequence 2-tuple pos. in Seq1 RW 5 CV 10 AH 1 i.e. pos in Seq 2 Offset (pos1-pos2) 2 3 7 3 ---- ----

Database searching Seq 1: AHFYRWNKLEV Seg 2: DRNLFCVATYWE 3.Make a Hash Table HAshing)that has the position of each k-tuple heach sequence e 2-tuple pos in Seq1 pos in Seg ofset(pos1-pos2 RW CV 10 3 AH 4-Look for words(k-tuples) with same offset These are in-phase and reveal a region of alignment between the two sequences 5-Build a local alignment based on these, extend it outwards Seg 1: AHFYRWNKLCV Seg 2: DRWNLFCVATYWE
Database Searching Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE 3- Make a Hash Table i.e. (Hashing) that has the position of each k-tuple in each sequence 2-tuple pos. in Seq1 pos in Seq 2 Offset (pos1-pos2) 3 3 RW 5 2 CV 10 7 AH 1 ---- ---- 4- Look for words (k-tuples) with same offset These are in-phase and reveal a region of alignment between the two sequences. 5- Build a local alignment based on these, extend it outwards Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE

Database searching With hashing, number of comparisons is proportional To the average sequence length i.e. an o(n) problem) Not an o(mn)problem as in dynamic programming Proteins-ktup=1-2 Nucleotides, ktup=4-6 One big problem- low complexity regions Seq 1: AHFYPPPPPPPPFSER Seq 2: DVATPPPPPPPPPPPNLFK
Database Searching With hashing, number of comparisons is proportional To the average sequence length (i.e. an O(n) problem), Not an O(mn) problem as in dynamic programming. Proteins – ktup = 1-2, Nucleotides, ktup=4-6 One big problem – low complexity regions. Seq 1: AHFYPPPPPPPPFSER Seq 2: DVATPPPPPPPPPPPNLFK

Database searching BLAST Same basic idea as fasta but faster and more sensitivel How? BLAST searches for common words or k-tuples, but limits the search for k-tuples that are most significant by using the log-odds values in the Blosum62 amino acid substitution matrix . e. look for WHK and might accept WHR but not HFK as a possible match(note 8000 possibilities) Repeat for all 3-tuples in the query Search the database for a match to the top 50 3-tuples that match the first query position in the sequence, the second query position, etc Use any match to seed an ungapped alignment (old BLasT)
Database Searching BLAST Same basic idea as FASTA, but faster and more sensitive! How? BLAST searches for common words or k-tuples, but limits the search for k-tuples that are most significant, by using the log-odds values in the Blosum62 amino acid substitution matrix i.e. look for WHK and might accept WHR but not HFK as a possible match (note 8000 possibilities) Repeat for all 3-tuples in the query Search the database for a match to the top 50 3-tuples that match the first query position in the sequence, the second query position, etc. Use any match to seed an ungapped alignment (old BLAST)

Database searching Word length is fixed: 3-tuple for proteins 11-tuple for nucleotides By default, filters out low complexity regions Determine if the alignment is statistically significant calculates the probability of observing a score greater than or equal to your alignment based on extreme value distribution Calculates an E-value expectation value This is the probability of finding an unrelated sequence that shows this good an alignment just by chance. Remember if p=.0001 and my database has 500,000 sequences, I will have an E=50!(normal starting E=10
Database Searching Word length is fixed: 3-tuple for proteins 11-tuple for nucleotides By default, filters out low complexity regions. Determine if the alignment is statistically significant. calculates the probability of observing a score greater than or equal to your alignment based on extreme value distribution. Calculates an E-value = expectation value: This is the probability of finding an unrelated sequence that shows this good an alignment just by chance. Remember if p=.0001 and my database has 500,000 sequences, I will have an E=50! (normal starting E=10)
按次数下载不扣除下载券;
注册用户24小时内重复下载只扣除一次;
顺序:VIP每日次数-->可用次数-->下载券;
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 2 More Pairwise Sequence Comparisons.pdf
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 3 More Multiple Sequence Alignment.pdf
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 1 Michael Yaffe Introduction to Bioinformatics.pdf
- 《微生物遗传学》第四章 基因工程技术在改进微生物.ppt
- 《分子生物学》课程教学资源(练习题)试题详解(含参考答案).doc
- 南京军区南京总医院:《组织芯片应用的现状与前景》讲义.pdf
- 《酶学》课程教学资源(讲义)第四章 酶的结构和功能.doc
- 《酶学》课程教学资源(讲义)第十一章 酶在医学方面的应用.doc
- 《酶学》课程教学资源(讲义)第六章 多种因素对酶反应速度的影响.doc
- 《酶学》课程教学资源(讲义)第八章 酶的别构效应.doc
- 《酶学》课程教学资源(讲义)第五章 酶催化动力学基础.doc
- 《酶学》课程教学资源(讲义)第二章 酶的一般性质和分类.doc
- 《酶学》课程教学资源(讲义)第九章 固定化生物催化剂.doc
- 《酶学》课程教学资源(讲义)第三章 酶活性的测定及分离纯化.doc
- 《酶学》课程教学资源(讲义)第七章 多底物酶反应动力学.doc
- 《酶学》课程教学资源(讲义)第一章 绪论.doc
- 孝感学院:《植物解剖学》第四章 种子植物的繁殖和繁殖器官.ppt
- 孝感学院:《植物解剖学》第三讲 叶.ppt
- 孝感学院:《植物解剖学》第一讲 花的重要组成结构及形成.ppt
- 《现代生物学导论》课程教学资源(PPT课件)第四章 生命的基本化学组成.ppt
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 5 Molecular Phylogenetics.pdf
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 2 The Language of genomics.pdf
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 1 Genome Sequencing.pdf
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 3 Review of DNA Seq.pdf
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 6 Predicting rna Secondary structure.pdf
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 4 Organization of topics.pdf
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 6 Structure Prediction.pdf
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 5 Markov models.pdf
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 5 Review -Homology Modeling.pdf
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 1 Review of protein structure hierarchy.pdf
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 1 How are X-ray crystal structures.pdf
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 3 For a molecular simulation or model.pdf
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 2 Comparing protein Structures.pdf
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 7 The protein interactome.pdf
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 7 DNA Microarrays Clustering.pdf
- 麻省理工大学:《Foundations of Biology》课程教学资源(英文版)Lecture 6 Ab initio structure prediction.pdf
- 《植物与植物生理学》课程PPT教学课件(高职高专)第三章 植物的矿质营养.ppt
- 《植物与植物生理学》课程PPT教学课件(高职高专)第二章 植物的水分代谢.ppt
- 《植物与植物生理学》课程PPT教学课件(高职高专)第五章 植物的呼吸作用.ppt
- 《植物与植物生理学》课程PPT教学课件(高职高专)第四章 植物的光合作用.ppt