Landmark-Based Speech Recognition

Landmark-Based Speech Recognition
The Marriage of High-Dimensional Machine Learning Techniques with Modern Linguistic Representations
Mark Hasegawa-Johnson, jhasegaw@uiuc.edu
Research performed in collaboration with James Baker (Carnegie Mellon), Sarah Borys (Illinois), Ken Chen (Illinois), Emily Coogan (Illinois), Steven Greenberg (Berkeley), Amit Juneja (Maryland), Katrin Kirchhoff (Washington), Karen Livescu (MIT), Srividya Mohan (Johns Hopkins), Jen Muller (Dept. of Defense), Kemal Sonmez (SRI), and Tianyu Wang (Georgia Tech)

What are Landmarks?
• Time-frequency regions of high mutual information between phone and signal (maxima of I(phone label; acoustics(t,f)))
• Acoustic events with similar importance in all languages, and across all speaking styles
• Acoustic events that can be detected even in extremely noisy environments
• Where do these things happen?
– Syllable Onset ≈ Consonant Release
– Syllable Nucleus ≈ Vowel Center
– Syllable Coda ≈ Consonant Closure
(I(phone;acoustics) experiment: Hasegawa-Johnson, 2000)
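As a concrete (hypothetical) illustration of the first bullet, such a mutual-information map can be estimated from frame-aligned phone labels and a quantized spectrogram; landmarks fall where the map peaks. This is a minimal sketch with illustrative names only, not the code behind the 2000 experiment:

# Sketch of estimating I(phone label; acoustics(t+offset, f)) from
# frame-aligned phone labels and a quantized spectrogram. Peaks of the
# resulting map mark high-information (landmark) regions.
import numpy as np

def mutual_information(xs, ys):
    """I(X;Y) in bits between two aligned discrete sequences."""
    n = len(xs)
    joint, px, py = {}, {}, {}
    for x, y in zip(xs, ys):
        joint[(x, y)] = joint.get((x, y), 0) + 1
        px[x] = px.get(x, 0) + 1
        py[y] = py.get(y, 0) + 1
    return sum(c / n * np.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def mi_map(phone_labels, spectrogram, n_bins=8, max_offset=10):
    """MI between the phone label at frame t and the quantized band
    energy at frame t+offset, for every (offset, frequency band) cell."""
    n_frames, n_bands = spectrogram.shape
    edges = np.quantile(spectrogram, np.linspace(0, 1, n_bins + 1)[1:-1])
    q = np.digitize(spectrogram, edges)          # quantize to n_bins levels
    mi = np.zeros((2 * max_offset + 1, n_bands))
    for i, off in enumerate(range(-max_offset, max_offset + 1)):
        lo, hi = max(0, -off), min(n_frames, n_frames - off)
        for f in range(n_bands):
            mi[i, f] = mutual_information(phone_labels[lo:hi],
                                          q[lo + off:hi + off, f])
    return mi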

Landmark-Based Speech Recognition
[Figure: a word-lattice hypothesis ("… backed up …") with words, times, and scores; its syllable structure (ONSET, NUCLEUS, CODA per syllable); and pronunciation variants: … backed up …, … backtup …, … back up …, … backt ihp …, … wackt ihp …]

Talk Outline
Overview
1. Acoustic Modeling
– Speech data and acoustic features
– Landmark detection
– Estimation of real-valued “distinctive features” using support vector machines (SVM)
2. Pronunciation Modeling
– A Dynamic Bayesian Network (DBN) implementation of Articulatory Phonology
– A Discriminative Pronunciation Model implemented using Maximum Entropy (MaxEnt)
3. Technological Evaluation
– Rescoring of word-lattice output from an HMM-based recognizer
– Errors that we fixed: channel noise, laughter, etcetera
– New errors that we caused: pronunciation models trained on 3 hours can’t compete with triphone models trained on 3000 hours
– Future Plans

Overview
• History
– Research described in this talk was performed between June 30 and August 17, 2004, at the Johns Hopkins summer workshop WS04
• Scientific Goal
– To use high-dimensional machine learning technologies (SVM, DBN) to create representations capable of learning, from data, the types of speech knowledge that humans exhibit in psychophysical speech perception experiments
• Technological Goal
– Long-term: to create a better speech recognizer
– Short-term: lattice rescoring, applied to word lattices produced by SRI’s NN/HMM hybrid

Overview of Systems to be Described
[System diagram: MFCC (5 ms & 1 ms frame period), formant, and phonetic & auditory model parameters are concatenated over 4-15 frames and fed to the acoustic model (SVMs), producing p(landmark|SVM). A pronunciation model (DBN or MaxEnt) maps these to p(SVM|word), using word labels and start & end times from the first-pass ASR word lattice. Final rescoring is a log-linear score combination of p(MFCC,PLP|word), p(SVM|word), and p(word|words).]
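The rescoring stage can be pictured with a short sketch of log-linear score combination: each lattice hypothesis carries per-stream log-probabilities, the combined score is a weighted sum, and the best-scoring hypothesis wins. The stream names, weights, and numbers below are invented for the example:

# Log-linear rescoring sketch. Stream weights would be tuned on dev data.
def rescore(hypotheses, weights):
    return max(hypotheses,
               key=lambda h: sum(w * h[s] for s, w in weights.items()))

weights = {'logp_hmm': 1.0, 'logp_lm': 8.0, 'logp_svm': 0.5}
hyps = [
    {'words': 'backed up', 'logp_hmm': -120.3, 'logp_lm': -14.2, 'logp_svm': -9.8},
    {'words': 'back up',   'logp_hmm': -121.1, 'logp_lm': -13.5, 'logp_svm': -12.4},
]
print(rescore(hyps, weights)['words'])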

I. Acoustic Modeling
• Goal: Learn precise and generalizable models of the acoustic boundary associated with each distinctive feature.
• Methods:
– Large input vector space (many acoustic feature types)
– Regularized binary classifiers (SVMs)
– SVM outputs “smoothed” using dynamic programming
– SVM outputs converted to posterior probability estimates once/5ms using histogram
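The last step can be pictured with a minimal sketch, assuming a held-out set of raw SVM discriminant outputs with 0/1 landmark labels: scores are binned, each bin stores the empirical fraction of positive frames, and that fraction serves as p(landmark | SVM output) for new frames. Names here are illustrative, not the WS04 code:

# Histogram calibration sketch: quantile-bin held-out SVM outputs and
# use the per-bin positive rate as the posterior, once per 5 ms frame.
import numpy as np

def fit_histogram_posterior(scores, labels, n_bins=20):
    """scores: raw SVM outputs; labels: 0/1 numpy array, same length."""
    edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1))[1:-1]
    bins = np.digitize(scores, edges)            # bin index 0..n_bins-1
    posterior = np.array([labels[bins == b].mean() if np.any(bins == b)
                          else 0.5 for b in range(n_bins)])
    return lambda s: posterior[np.digitize(s, edges)]

# Usage (hypothetical arrays):
# to_posterior = fit_histogram_posterior(dev_scores, dev_labels)
# p_landmark = to_posterior(test_scores)   # one estimate per 5 ms frame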

Speech Databases

                    Size      Phonetic Transcr.   Word Lattices
NTIMIT              14 hrs    manual              -
WS96&97             3.5 hrs   manual              -
SWB1 WS04 subset    12 hrs    auto (SRI)          BBN
Eval01              10 hrs    -                   BBN & SRI
RT03 Dev            6 hrs     -                   SRI
RT03 Eval           6 hrs     -                   SRI

Acoustic and Auditory Features
• MFCCs, 25 ms window (standard ASR features)
• Spectral shape: energy, spectral tilt, and spectral compactness, once/millisecond
• Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths (Zheng & Hasegawa-Johnson, ICSLP 2004)
• Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures; Bitar & Espy-Wilson, 1996)
• Rate-place model of neural response fields in the cat auditory cortex (Carlyon & Shamma, JASA 2003)
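For orientation only, here is a rough front-end sketch in the spirit of the first two bullets, using librosa for MFCCs plus simple per-millisecond spectral-shape measures. It approximates the listed feature types loosely; the MUSIC formant tracker, acoustic-phonetic parameters, and auditory-cortex model are not reproduced, and the file name is hypothetical:

# Rough front-end sketch (not the WS04 front end): 25 ms-window MFCCs,
# plus per-millisecond energy, spectral tilt, and a compactness proxy.
import numpy as np
import librosa

y, sr = librosa.load('utterance.wav', sr=16000)  # hypothetical file
n_fft = int(0.025 * sr)                          # 25 ms analysis window

# MFCCs at a 5 ms hop (standard ASR features)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=n_fft, hop_length=int(0.005 * sr))

# Spectral-shape measures once per millisecond
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=int(0.001 * sr)))
freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
energy = (S ** 2).sum(axis=0)                      # frame energy
tilt = np.polyfit(freqs, np.log(S + 1e-10), 1)[0]  # log-spectrum slope per frame
centroid = (freqs[:, None] * S).sum(axis=0) / (S.sum(axis=0) + 1e-10)
spread = np.sqrt(((freqs[:, None] - centroid) ** 2 * S).sum(axis=0)
                 / (S.sum(axis=0) + 1e-10))        # small spread = compact spectrum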

What are Distinctive Features? What are Landmarks?
• Distinctive feature =
– a binary partition of the phonemes (Jakobson, 1952)
– … that compactly describes pronunciation variability (Halle)
– … and correlates with distinct acoustic cues (Stevens)
• Landmark = change in the value of a manner feature
– [+sonorant] to [–sonorant], [–sonorant] to [+sonorant]
– 5 manner features: [sonorant, consonantal, continuant, syllabic, silence]
• Place and voicing features: SVMs are only trained at landmarks
– Primary articulator: lips, tongue blade, or tongue body
– Features of primary articulator: anterior, strident
– Features of secondary articulator: nasal, voiced
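The landmark definition above translates directly into a frame-level change detector. A minimal sketch, assuming per-frame binary manner-feature values are available (the feature table and example frames are illustrative):

# A landmark is declared wherever a binary manner feature changes value
# between consecutive frames, per the definition above.
import numpy as np

MANNER = ['sonorant', 'consonantal', 'continuant', 'syllabic', 'silence']

def find_landmarks(manner_frames):
    """manner_frames: (n_frames, n_features) 0/1 array of manner features.
    Returns (frame, feature, '+' or '-') for every value change."""
    changes = np.diff(manner_frames.astype(int), axis=0)
    return [(t + 1, MANNER[f], '+' if changes[t, f] > 0 else '-')
            for t, f in zip(*np.nonzero(changes))]

# Example: silence -> fricative -> vowel
frames = np.array([[0, 0, 0, 0, 1],   # silence
                   [0, 1, 1, 0, 0],   # fricative
                   [1, 0, 1, 1, 0]])  # vowel
print(find_landmarks(frames))
# (1, 'silence', '-') marks the consonant release (syllable onset);
# (2, 'syllabic', '+') marks the start of the vowel (syllable nucleus)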