Tsinghua University: Mandarin Pronunciation Variation Modeling

Mandarin Pronunciation Variation Modeling
Thomas Fang Zheng
Center of Speech Technology, State Key Lab of Intelligent Technology and Systems
Department of Computer Science & Technology, Tsinghua University
http://sp.cs.tsinghua.edu.cn/fzheng
NCMMSC'01, 20-22 Nov. 2001, Shenzhen, China

Slide 2. Motivation
❑ In spontaneous speech, the pronunciation of individual words varies. There are often
  ❖ Sound changes, and
  ❖ Phone changes.
  Changes include insertion, deletion, and substitution.
  ❖ For Chinese
    ➢ an additional accent problem arises even when people are speaking Mandarin, due to different dialect backgrounds (Chinese has 7 major dialects)
    ➢ colloquialism, grammar, style
❑ Goal: modelling the pronunciation variations
  ❖ Establishing a corpus with spontaneous phenomena, because we need to know what the canonical phones change to.
  ❖ Finding solutions to pronunciation modelling, both theoretically and practically

Slide 3. Overview

| Authors | Paper (Source) | Database | Method | WER |
|---|---|---|---|---|
| T. Fukada, Y. Sagisaka (ATR, Japan) | Automatic generation of a pronunciation dictionary based on a pronunciation network (EuroSpeech'97) | Japanese Spontaneous | ANN Prediction | 75.54 %, 67.44 % |
| M-K Liu, Bo Xu (NLPR, China) | Mandarin accent adaptation based on CI/CD pronunciation modeling (ICASSP'2000) | Shanghai Accent (Intel) | Confusion Matrix | 45.13 %, 40.24 % |
| M. Saraclar (CLSP, JHU), H. Nock (CUED, Cam., UK) | Pronunciation modeling by sharing Gaussian densities across phonetic models (EuroSpeech'99) | Switchboard | Gaussian Sharing | 50.10 %, 48.70 % |
| K. Ma, G. Zavaliagkos (GTE / BBN, USA) | Pronunciation modeling for large vocabulary conversational speech recognition (ICSLP'98) | Switchboard, Callhome | Lexical Adaptation | 54.60 %, 53.49 % |
| M. Riley (AT&T Labs), W. Byrne (CLSP, JHU) | Stochastic pronunciation modelling from hand-labelled phonetic corpora (Speech Communication, 1999 (29)) | TIMIT + ICSI | Decision Tree | 44.66 %, 44.05 % |
| D. Povey, P.C. Woodland (CUED, Cambridge, UK) | Improved discriminative training techniques for large vocabulary continuous speech recognition (ICASSP'2001) | NAB, Switchboard | Discriminant Training | 46.60 %, 44.30 % |
| T. Hain, P.C. Woodland (CUED, Cambridge, UK) | New features in the CU-HTK system for transcription of conversational telephone speech (ICASSP'2001) | NIST Hub5E (Telephone) | VTLN, MMIE | 51.60 %, 47.00 % |

Slide 4. Necessity to establish a new annotated spontaneous speech corpus
❑ The existing databases (incl. Broadcast News, CallHome, CallFriend, ...) do not cover all the Chinese spoken language phenomena
  ❖ Sound changes: voiced, unvoiced, nasalization, ...
  ❖ Phone changes: retroflexed, OOV-phoneme, ...
❑ The existing databases do not contain the pronunciation variation information needed for bootstrap training
❑ A Chinese Annotated Spontaneous Speech (CASS) Corpus was established before WS00 on LSP at JHU
  ❖ Completely spontaneous (discourses, lectures, ...)
  ❖ Remarkable background noise, accent background, ...
  ❖ Recorded onto tapes and then digitized

Slide 5. Chinese Annotated Spontaneous Speech (CASS) Corpus
❑ CASS w/ Five-Tier Transcription
  ❖ Character Level: base form
  ❖ Syllable (or Pinyin) Level (w/ tone): base form
  ❖ Initial/Final (IF) Level: w/ time boundary for base form
  ❖ SAMPA-C Level: surface form
  ❖ Miscellaneous Level: used for garbage modeling
    ➢ Lengthening, breathing, laughing, coughing, disfluency, noise, silence, murmur (unclear), modal, smack, non-Chinese
  ❖ Example (the sentence means "Let's get to know a few more people")
    Character:     我 们 多 认 识 点 人
    Syllable:      wo3 men0 duo1 ren4 shi0 dian3 ren2
    CASS Syllable: wo3 men0 duo1 ren4 shi0 dianr3 ren2
    IF:            uo m @_n t uo z` @_n s` i` t iE_n z` @_n
    GIF:           uo @_n t_v uo z` @_n s`_v t_v ia` z` @_n
    Misc:          noise mum
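As a rough illustration of how the base-form and surface-form syllable tiers in the example can be compared, the following Python sketch stores three of the tiers and reports the positions where the spoken syllable differs from the canonical one. The class name `TieredUtterance` and the method `syllable_changes` are hypothetical names chosen for this sketch; they are not part of the CASS tools.

```python
from dataclasses import dataclass


@dataclass
class TieredUtterance:
    """Hypothetical container for three of the five CASS tiers."""
    characters: list       # Character tier (base form)
    syllables: list        # canonical Pinyin syllables with tone digit
    cass_syllables: list   # Pinyin syllables as actually spoken

    def syllable_changes(self):
        """Return (index, canonical, spoken) for every syllable that changed."""
        if len(self.syllables) != len(self.cass_syllables):
            raise ValueError("syllable tiers must have the same length")
        return [
            (i, base, spoken)
            for i, (base, spoken) in enumerate(zip(self.syllables, self.cass_syllables))
            if base != spoken
        ]


# Tiers copied from the slide's example sentence.
utt = TieredUtterance(
    characters=list("我们多认识点人"),
    syllables="wo3 men0 duo1 ren4 shi0 dian3 ren2".split(),
    cass_syllables="wo3 men0 duo1 ren4 shi0 dianr3 ren2".split(),
)
changes = utt.syllable_changes()
print(changes)  # [(5, 'dian3', 'dianr3')]
```

The only difference in this example is the retroflexed "dianr3" in place of the canonical "dian3".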

Slide 6. SAMPA-C: Machine Readable IPA
❑ Phonologic Consonants: 23
❑ Phonologic Vowels: 9
❑ Initials: 21
❑ Finals: 38
❑ Retroflexed Finals: 38
❑ Tones and Silences
❑ Sound Changes
❑ Spontaneous Phenomenon Labels

Slide 7. Key Points in Pronunciation Modeling (PM) (1)
❑ Choosing and generating the speech recognition unit (SRU) set
  ❖ The set should describe the phone changes and sound changes well
  ❖ Could be syllable, semi-syllable, or INITIAL/FINAL
❑ Constructing a multi-pronunciation lexicon (MPL)
  ❖ A syllable-to-SRU lexicon that reflects the relation between the grammatical units and the acoustic models
❑ Acoustically modeling spontaneous speech
  ❖ Theoretical framework
  ❖ CD modeling; confusion matrix; data-driven

Slide 8. Key Points in PM (2)
❑ Customizing the decoding algorithm according to the new lexicon
  ❖ Improved time-synchronous search algorithm to reduce the path expansion (caused by CD modeling)
  ❖ A*-based tree-trellis search algorithm to score multiple pronunciation variations simultaneously in the path
❑ Modifying the statistical language model

$$\hat{W} = \arg\max_{W} P(X \mid W)\, P(W)$$

$$\hat{W} = \arg\max_{W = \mathrm{Baseform}(V)} P(X \mid V)\, P(V)$$

$$\hat{W} = \arg\max_{W = \mathrm{Baseform}(V)} P(X \mid V)\, P(V \mid W)\, P(W)$$
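To make the last decoding criterion concrete, here is a minimal Python sketch that scores every (baseform W, surface form V) pair as log P(X|V) + log P(V|W) + log P(W) and keeps the best one. It reads X as the acoustic observation, W as a baseform, and V as a surface pronunciation, which is an interpretation of the slide rather than something it states. All numbers are invented toy values, and `decode` is a hypothetical helper; this exhaustive loop is not the search algorithm described in the talk.

```python
import math

# Hypothetical toy scores; none of these numbers come from the slides.
# log P(X | V): acoustic log-likelihood of each surface form V.
acoustic_logp = {
    "ts`_h AN": -12.0,
    "ts`_h_v AN": -10.5,
    "ts_h AN": -13.0,
}
# P(V | W): pronunciation probabilities for each baseform W.
pron_prob = {
    "chang": {"ts`_h AN": 0.8, "ts`_h_v AN": 0.2},
    "cang": {"ts_h AN": 1.0},
}
# P(W): language-model probability of each baseform.
lm_prob = {"chang": 0.6, "cang": 0.4}


def decode(acoustic_logp, pron_prob, lm_prob):
    """Return (score, W, V) maximizing log P(X|V) + log P(V|W) + log P(W)."""
    best = None
    for word, variants in pron_prob.items():
        for surface, p_v_given_w in variants.items():
            score = (
                acoustic_logp[surface]
                + math.log(p_v_given_w)
                + math.log(lm_prob[word])
            )
            if best is None or score > best[0]:
                best = (score, word, surface)
    return best


score, word, surface = decode(acoustic_logp, pron_prob, lm_prob)
print(word, "|", surface)  # chang | ts`_h_v AN
```

With these toy values the voiced variant wins: its better acoustic score outweighs its lower pronunciation probability.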

Slide 9. Establishment of Multi-Pron. Lexicon
❑ Two major approaches
  ❖ Defined by linguists and phoneticians
  ❖ Data-driven: confusion matrix, rewrite rules, decision tree, ...
❑ Our method:
  ❖ Find all possible pronunciations in SAMPA-C from the database
  ❖ Reduce the size of the lexicon according to occurrence frequencies
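The two steps of the method can be sketched as follows: count every surface pronunciation observed for each syllable, then drop the rare ones. This is only an illustration under stated assumptions. The function `build_lexicon`, the relative-frequency threshold `min_rel_freq`, the renormalization after pruning, and the toy observations are all hypothetical; the slides do not say which pruning criterion was used.

```python
from collections import Counter, defaultdict


def build_lexicon(observations, min_rel_freq=0.1):
    """Build {syllable: {surface_form: probability}} from (syllable, surface) pairs.

    Variants whose relative frequency is below min_rel_freq are dropped
    (an assumed criterion), and the remaining ones are renormalized.
    """
    counts = defaultdict(Counter)
    for syllable, surface in observations:
        counts[syllable][surface] += 1

    lexicon = {}
    for syllable, variants in counts.items():
        total = sum(variants.values())
        kept = {s: n for s, n in variants.items() if n / total >= min_rel_freq}
        if not kept:  # always keep at least the most frequent variant
            top, n = variants.most_common(1)[0]
            kept = {top: n}
        kept_total = sum(kept.values())
        lexicon[syllable] = {s: n / kept_total for s, n in kept.items()}
    return lexicon


# Invented toy observations: 20 tokens of the syllable "chang".
observations = (
    [("chang", "ts`_h AN")] * 16
    + [("chang", "ts`_h_v AN")] * 3
    + [("chang", "ts`_v AN")] * 1
)
lexicon = build_lexicon(observations)
print(sorted(lexicon["chang"]))
```

In this toy run the variant seen once in 20 tokens falls below the 10 % threshold and is removed, leaving two pronunciations for "chang".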

Slide 10. Surface form for IF and Syllable
❑ Learning pronunciations
  ❖ Definition of Generalized Initial-Finals (GIFs): collect all of them and choose the most frequent ones as GIFs (each entry: Pinyin unit, SAMPA-C form, description)
    ➢ z ts : canonical
    ➢ z ts_v : voiced
    ➢ z ts` : changed to 'zh'
    ➢ z ts`_v : changed to voiced 'zh'
    ➢ e 7 : canonical
    ➢ e 7` : retroflexed or changed to 'er'
    ➢ e @ : changed
  ❖ Definition of Generalized Syllables (GSs), the lexicon: define them according to the GIF set. This is a probabilistic lexicon, with each entry weighted by P([GIF_i] GIF_f | Syllable)
    ➢ chang [0.7850] ts`_h AN
    ➢ chang [0.1215] ts`_h_v AN
    ➢ chang [0.0280] ts`_v AN
    ➢ chang [0.0187] AN
    ➢ chang [0.0187] z` AN
    ➢ chang [0.0093] iAN
    ➢ chang [0.0093] ts_h AN
    ➢ chang [0.0093] ts`_h UN
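As a small sanity check on the probabilistic lexicon, the sketch below loads the eight "chang" entries listed on the slide, confirms that their probabilities sum to roughly 1, and selects the most probable variants until a chosen share of the probability mass is covered. The 95 % coverage target and the helper `top_variants` are assumptions made for illustration; the slides do not specify such a cutoff.

```python
# The eight lexicon entries for "chang", copied from the slide:
# (probability, SAMPA-C surface form).
chang_entries = [
    (0.7850, "ts`_h AN"),
    (0.1215, "ts`_h_v AN"),
    (0.0280, "ts`_v AN"),
    (0.0187, "AN"),
    (0.0187, "z` AN"),
    (0.0093, "iAN"),
    (0.0093, "ts_h AN"),
    (0.0093, "ts`_h UN"),
]


def top_variants(entries, coverage=0.95):
    """Keep the most probable variants until `coverage` of the mass is reached.

    The coverage cutoff is a hypothetical criterion, not taken from the slides.
    """
    kept, mass = [], 0.0
    # sorted() is stable, so ties keep their order from the slide.
    for prob, surface in sorted(entries, key=lambda e: -e[0]):
        if mass >= coverage:
            break
        kept.append((prob, surface))
        mass += prob
    return kept, mass


total_mass = sum(p for p, _ in chang_entries)
kept, covered = top_variants(chang_entries)
print(round(total_mass, 4), len(kept), round(covered, 4))  # 0.9998 4 0.9532
```

The listed probabilities add up to 0.9998, the small gap being consistent with rounding to four decimals, and the four most probable variants already cover about 95 % of the mass.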
