
Deep Natural Language Processing course slides (Natural language processing with deep learning): 09 Language Model & Distributed Representation (6/6)



Natural language processing with deep learning
Language Model & Distributed Representation (6)
Chen Li, cli@xjtu.edu.cn, Xi'an Jiaotong University, 2023


Outline
1. Pre-training LM
2. GPT
3. BERT
4. T5


Advanced LM l Taxonomy of pre-training LM

Contextual?
• Non-Contextual: CBOW, Skip-Gram, GloVe
• Contextual: ELMo, GPT, BERT

Architectures
• LSTM: ELMo, CoVe
• Transformer Enc.: BERT, SpanBERT, XLNet, RoBERTa
• Transformer Dec.: GPT, GPT-2
• Transformer: MASS, BART, XNLG, mBART

Task Types
• Supervised: MT: CoVe
• Unsupervised/Self-Supervised:
  - LM: ELMo, GPT, GPT-2, UniLM
  - MLM: BERT, SpanBERT, RoBERTa, XLM-R
    - TLM: XLM
    - Seq2Seq MLM: MASS, T5
  - PLM: XLNet
  - DAE: BART
  - CTL: RTD (CBOW-NS, ELECTRA), NSP (BERT, UniLM), SOP (ALBERT, StructBERT)
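Among the task types above, the two most widely used self-supervised objectives are standard language modeling (LM, as in GPT) and masked language modeling (MLM, as in BERT). The minimal Python sketch below, with made-up token IDs and a hypothetical MASK_ID, only illustrates how the two objectives derive training targets from the same token sequence; it is an assumption for illustration, not material from the slides.

```python
import random

# Toy token IDs standing in for the sentence "the cat sat on the mat".
tokens = [12, 305, 77, 41, 12, 518]
MASK_ID = 0  # hypothetical [MASK] token id

# Causal LM (GPT-style): predict the next token at every position.
lm_inputs = tokens[:-1]    # [12, 305, 77, 41, 12]
lm_targets = tokens[1:]    # [305, 77, 41, 12, 518]

# Masked LM (BERT-style): corrupt ~15% of positions, predict only those.
mlm_inputs, mlm_targets = [], []
for tok in tokens:
    if random.random() < 0.15:
        mlm_inputs.append(MASK_ID)   # replace with [MASK]
        mlm_targets.append(tok)      # model must recover the original token
    else:
        mlm_inputs.append(tok)
        mlm_targets.append(-100)     # position ignored by the loss

print(lm_inputs, lm_targets)
print(mlm_inputs, mlm_targets)
```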

Advanced LM l Taxonomy of pre-training LM

Extensions
• Knowledge-Enriched: ERNIE (THU), KnowBERT, K-BERT, SentiLR, KEPLER, WKLM
• Multilingual:
  - XLU: mBERT, Unicoder, XLM, XLM-R, MultiFit
  - XLG: MASS, mBART, XNLG
• Language-Specific: ERNIE (Baidu), BERT-wwm-Chinese, NEZHA, ZEN, BERTje, CamemBERT, FlauBERT, RobBERT
• Multi-Modal:
  - Image: ViLBERT, LXMERT, VisualBERT, B2T2, VL-BERT
  - Video: VideoBERT, CBT
  - Speech: SpeechBERT
• Domain-Specific: SentiLR, BioBERT, SciBERT, PatentBERT

Model Compression
• Model Pruning: CompressingBERT
• Quantization: Q-BERT, Q8BERT
• Parameter Sharing: ALBERT
• Distillation: DistilBERT, TinyBERT, MiniLM
• Module Replacing: BERT-of-Theseus
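Several of the compression techniques listed above (distillation, parameter sharing) are available as ready-made checkpoints. As a rough sketch, assuming the Hugging Face transformers library is installed (the library and checkpoint names are assumptions, not mentioned in the slides), one can compare a distilled model with its full-size counterpart:

```python
from transformers import AutoModel

def num_params(model):
    """Count trainable and non-trainable parameters of a model."""
    return sum(p.numel() for p in model.parameters())

# Common public checkpoints, assumed here only for illustration.
bert = AutoModel.from_pretrained("bert-base-uncased")
distil = AutoModel.from_pretrained("distilbert-base-uncased")

print(f"BERT:       {num_params(bert) / 1e6:.0f}M parameters")
print(f"DistilBERT: {num_params(distil) / 1e6:.0f}M parameters")  # noticeably smaller
```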

Pre-training LM l Pretraining Language Models with three architectures

The neural architecture influences the type of pretraining, and natural use cases.

Decoders
• Language models! What we've seen so far.
• Nice to generate from; can't condition on future words.

Encoders
• Gets bidirectional context – can condition on the future!
• Wait, how do we pretrain them?

Encoder-Decoders
• Good parts of decoders and encoders?
• What's the best way to pretrain them?
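Each architecture family corresponds to a well-known line of models (decoder-only GPT, encoder-only BERT, encoder-decoder T5). As a minimal sketch, assuming the Hugging Face transformers library is available (the library and checkpoint names are assumptions, not part of the slides), the three types can be instantiated as follows:

```python
from transformers import (
    AutoModelForCausalLM,    # decoder-only: left-to-right language model
    AutoModelForMaskedLM,    # encoder-only: bidirectional masked language model
    AutoModelForSeq2SeqLM,   # encoder-decoder: sequence-to-sequence model
)

decoder = AutoModelForCausalLM.from_pretrained("gpt2")                # GPT-style
encoder = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")   # BERT-style
enc_dec = AutoModelForSeq2SeqLM.from_pretrained("t5-small")           # T5-style
```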

Pre-training LM l Pretraining decoders

When using language-model-pretrained decoders, we can ignore that they were trained to model $p(w_t \mid w_{1:t-1})$.

We can finetune them by training a classifier on the last word's hidden state:

$h_1, \dots, h_T = \text{Decoder}(w_1, \dots, w_T)$
$y \sim A h_T + b$

where $A$ and $b$ are randomly initialized and specified by the downstream task. Gradients backpropagate through the whole network.

[Note how the linear layer hasn't been pretrained and must be learned from scratch.]
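A minimal PyTorch sketch of this finetuning setup is given below. The toy stand-in decoder, its dimensions, and the class names are assumptions for illustration; the point is that only the linear layer ($A$, $b$) is freshly initialized, while gradients flow through the entire pretrained decoder.

```python
import torch
import torch.nn as nn

class DecoderClassifier(nn.Module):
    """Linear classifier on the last hidden state of a pretrained decoder: y = A h_T + b."""
    def __init__(self, pretrained_decoder, hidden_size, num_classes):
        super().__init__()
        self.decoder = pretrained_decoder                   # pretrained weights, finetuned
        self.linear = nn.Linear(hidden_size, num_classes)   # A, b: randomly initialized

    def forward(self, tokens):
        hidden = self.decoder(tokens)       # h_1, ..., h_T with shape (batch, T, hidden)
        h_last = hidden[:, -1, :]           # last word's hidden state h_T
        return self.linear(h_last)          # logits y = A h_T + b

# Stand-in "decoder" so the sketch runs on its own; in practice this would be
# a language-model-pretrained Transformer decoder loaded from a checkpoint.
toy_decoder = nn.Sequential(
    nn.Embedding(1000, 64),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
        num_layers=2,
    ),
)

model = DecoderClassifier(toy_decoder, hidden_size=64, num_classes=2)
tokens = torch.randint(0, 1000, (8, 16))   # batch of 8 sequences, 16 tokens each
labels = torch.randint(0, 2, (8,))         # downstream classification labels

loss = nn.functional.cross_entropy(model(tokens), labels)
loss.backward()  # gradients reach both the new linear head and the whole decoder
```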
