Guangdong University of Technology, Machine Learning course teaching resources (lecture slides), Lecture 19: ViT and Improvements to the Attention Mechanism (Various Types of Attention)

Various Types of Attention, by Hung-yi Lee (李宏毅)

Prerequisite: [Machine Learning 2021] Self-attention Mechanism (Part 1) https://youtu.be/hYdO9CscNes and [Machine Learning 2021] Self-attention Mechanism (Part 2) https://youtu.be/gmsMY5kc-zw

To Learn More …
• Efficient Transformers: A Survey https://arxiv.org/abs/2009.06732
• Long Range Arena: A Benchmark for Efficient Transformers https://arxiv.org/abs/2011.04006
[Figure: Long Range Arena scatter plot comparing efficient Transformer variants (Big Bird, Synthesizer, Performer, Linformer, Reformer, Sinkhorn Transformer, Linear Transformer, Local Attention, ETC) by speed (examples per sec), alongside the survey's taxonomy of approaches: recurrence, memory, low rank / kernels, fixed/factorized/random patterns, learnable patterns]

How to make self-attention efficient?
For a sequence of length N, every query attends to every key, so the attention matrix is N × N.
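A minimal single-head sketch (in PyTorch) of full self-attention, to make the N × N cost concrete; the shapes, the scaling by sqrt(d), and all variable names below are illustrative rather than taken from the slides:

import torch

def full_self_attention(q, k, v):
    # q, k, v: (N, d); builds the full N x N attention matrix
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (N, N): every query against every key
    weights = torch.softmax(scores, dim=-1)       # row-wise softmax over the keys
    return weights @ v                            # (N, d)

q = k = v = torch.randn(8, 64)
out = full_self_attention(q, k, v)                # fine at N = 8 ...

N = 256 * 256                                     # ... but for an image, N = 65,536
print(f"attention matrix entries: {N * N:,}")     # about 4.3e9: the memory and compute bottleneck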

Notice
• Self-attention is only one module in a larger network.
• Self-attention dominates the computation when N is large.
• These efficient variants are usually developed for image processing, where N = 256 × 256 = 65,536, so the attention matrix has roughly 4.3 × 10^9 entries.
[Figure: standard Transformer encoder-decoder diagram with input/output embeddings, positional encoding, (masked) multi-head attention, feed-forward layers, and add & norm blocks]

Skip Some Calculations with Human Knowledge
Can we use human knowledge to fill in some values of the attention matrix directly, so that they never need to be computed?

Local Attention / Truncated Attention
Each query only calculates attention weights for nearby keys; positions outside this local window are set to 0. The result is similar to a CNN, since every position only sees its neighbours.
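A sketch of local attention expressed as a mask over the attention matrix; the window size w is an illustrative parameter, and a real efficient implementation would skip the masked entries rather than compute and then discard them:

import torch

def local_attention_mask(n, w):
    # True where query i may attend to key j, i.e. |i - j| <= w
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]).abs() <= w

def masked_attention(q, k, v, mask):
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))  # masked positions become 0 after softmax
    return torch.softmax(scores, dim=-1) @ v

n, d, w = 10, 16, 2
q = k = v = torch.randn(n, d)
out = masked_attention(q, k, v, local_attention_mask(n, w))  # (n, d)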

Stride Attention
Instead of its immediate neighbours, each query attends to keys a fixed stride apart (for example, every second or every third position).
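The same masking idea with one possible strided pattern (attending to positions a multiple of the stride away); the stride value is only an example and the exact pattern can vary:

import torch

def stride_attention_mask(n, stride):
    # True where the distance between query i and key j is a multiple of the stride
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]) % stride == 0

print(stride_attention_mask(8, 3).int())  # visualize the pattern; plug into masked_attention above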

Global Attention
Add special tokens to the original sequence (a special token is "the village chief (里長伯) among the tokens"):
• A special token attends to every token → it collects global information.
• A special token is attended by every token → every token can obtain global information from it.
• There is no attention between non-special tokens.
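A sketch of the corresponding mask, assuming the first g positions of the sequence are the added special tokens (g is an illustrative parameter):

import torch

def global_attention_mask(n, g):
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:g, :] = True   # special tokens attend to every token
    mask[:, :g] = True   # every token attends to the special tokens
    return mask          # all other pairs: no attention

print(global_attention_mask(6, 2).int())  # visualize; plug into masked_attention above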

Many Different Choices …
Different heads can use different patterns. ("Only kids make choices": an adult takes them all, i.e. combine several patterns in one model.)
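A sketch of combining the patterns above by giving each head its own mask; the sizes, the pattern-to-head assignment, and the specific window, stride, and special-token choices are all assumptions for illustration:

import torch

def multi_pattern_attention(q, k, v, masks):
    # q, k, v: (heads, n, d); masks: (heads, n, n) boolean, one sparsity pattern per head
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    scores = scores.masked_fill(~masks, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v      # (heads, n, d)

n, d, heads = 12, 8, 3
idx = torch.arange(n)
diff = idx[:, None] - idx[None, :]
local_mask = diff.abs() <= 2                      # head 0: local window of +/- 2
stride_mask = diff % 3 == 0                       # head 1: every 3rd position
global_mask = torch.zeros(n, n, dtype=torch.bool)
global_mask[0, :] = True
global_mask[:, 0] = True                          # head 2: position 0 acts as the special token
masks = torch.stack([local_mask, stride_mask, global_mask])

q = k = v = torch.randn(heads, n, d)
out = multi_pattern_attention(q, k, v, masks)     # (heads, n, d)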
