Guangdong University of Technology, Machine Learning course teaching resources (lecture slides), Lecture 19: ViT and Improvements to the Attention Mechanism (Various Types of Attention)

Various Types of Attention, by Hung-yi Lee (李宏毅)

Prerequisite: [Machine Learning 2021] Self-attention Mechanism (Part 1) https://youtu.be/hYdO9CscNes and [Machine Learning 2021] Self-attention Mechanism (Part 2) https://youtu.be/gmsMY5kc-zw

To Learn More …
• Efficient Transformers: A Survey https://arxiv.org/abs/2009.06732
• Long Range Arena: A Benchmark for Efficient Transformers https://arxiv.org/abs/2011.04006
[Figure: Long Range Arena scatter plot comparing efficient Transformer variants (Big Bird, Synthesizer, Performer, Linformer, Reformer, Sinkhorn Transformer, Linear Transformer, Local Attention, ETC) by speed (examples per sec), alongside the survey's taxonomy of approaches: recurrence, memory, low rank / kernels, fixed/factorized/random patterns, learnable patterns]

How to make self-attention efficient?
For a sequence of length N, every query attends to every key, so the attention matrix is N × N.
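A minimal single-head sketch (in PyTorch) of full self-attention, to make the N × N cost concrete; the shapes, the scaling by sqrt(d), and all variable names below are illustrative rather than taken from the slides:

import torch

def full_self_attention(q, k, v):
    # q, k, v: (N, d); builds the full N x N attention matrix
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (N, N): every query against every key
    weights = torch.softmax(scores, dim=-1)       # row-wise softmax over the keys
    return weights @ v                            # (N, d)

q = k = v = torch.randn(8, 64)
out = full_self_attention(q, k, v)                # fine at N = 8 ...

N = 256 * 256                                     # ... but for an image, N = 65,536
print(f"attention matrix entries: {N * N:,}")     # about 4.3e9: the memory and compute bottleneck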

Notice
• Self-attention is only one module in a larger network.
• Self-attention dominates the computation when N is large.
• These efficient variants are usually developed for image processing, where N = 256 × 256 = 65,536, so the attention matrix has roughly 4.3 × 10^9 entries.
[Figure: standard Transformer encoder-decoder diagram with input/output embeddings, positional encoding, (masked) multi-head attention, feed-forward layers, and add & norm blocks]

Skip Some Calculations with Human Knowledge
Can we use human knowledge to fill in some values of the attention matrix directly, so that they never need to be computed?

Local Attention / Truncated Attention
Each query only calculates attention weights for nearby keys; positions outside this local window are set to 0. The result is similar to a CNN, since every position only sees its neighbours.
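A sketch of local attention expressed as a mask over the attention matrix; the window size w is an illustrative parameter, and a real efficient implementation would skip the masked entries rather than compute and then discard them:

import torch

def local_attention_mask(n, w):
    # True where query i may attend to key j, i.e. |i - j| <= w
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]).abs() <= w

def masked_attention(q, k, v, mask):
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))  # masked positions become 0 after softmax
    return torch.softmax(scores, dim=-1) @ v

n, d, w = 10, 16, 2
q = k = v = torch.randn(n, d)
out = masked_attention(q, k, v, local_attention_mask(n, w))  # (n, d)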

Stride Attention
Instead of its immediate neighbours, each query attends to keys a fixed stride apart (for example, every second or every third position).
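The same masking idea with one possible strided pattern (attending to positions a multiple of the stride away); the stride value is only an example and the exact pattern can vary:

import torch

def stride_attention_mask(n, stride):
    # True where the distance between query i and key j is a multiple of the stride
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]) % stride == 0

print(stride_attention_mask(8, 3).int())  # visualize the pattern; plug into masked_attention above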

Global Attention
Add special tokens to the original sequence (a special token is "the village chief (里長伯) among the tokens"):
• A special token attends to every token → it collects global information.
• A special token is attended by every token → every token can obtain global information from it.
• There is no attention between non-special tokens.
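A sketch of the corresponding mask, assuming the first g positions of the sequence are the added special tokens (g is an illustrative parameter):

import torch

def global_attention_mask(n, g):
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:g, :] = True   # special tokens attend to every token
    mask[:, :g] = True   # every token attends to the special tokens
    return mask          # all other pairs: no attention

print(global_attention_mask(6, 2).int())  # visualize; plug into masked_attention above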

Many Different Choices …
Different heads can use different patterns. ("Only kids make choices": an adult takes them all, i.e. combine several patterns in one model.)
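A sketch of combining the patterns above by giving each head its own mask; the sizes, the pattern-to-head assignment, and the specific window, stride, and special-token choices are all assumptions for illustration:

import torch

def multi_pattern_attention(q, k, v, masks):
    # q, k, v: (heads, n, d); masks: (heads, n, n) boolean, one sparsity pattern per head
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    scores = scores.masked_fill(~masks, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v      # (heads, n, d)

n, d, heads = 12, 8, 3
idx = torch.arange(n)
diff = idx[:, None] - idx[None, :]
local_mask = diff.abs() <= 2                      # head 0: local window of +/- 2
stride_mask = diff % 3 == 0                       # head 1: every 3rd position
global_mask = torch.zeros(n, n, dtype=torch.bool)
global_mask[0, :] = True
global_mask[:, 0] = True                          # head 2: position 0 acts as the special token
masks = torch.stack([local_mask, stride_mask, global_mask])

q = k = v = torch.randn(heads, n, d)
out = multi_pattern_attention(q, k, v, masks)     # (heads, n, d)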
