
Guangdong University of Technology: Machine Learning course materials (lecture slides), Lecture 10: Optimization of Neural Networks (vanishing and exploding gradients, Batch Normalization)


Quick Introduction of Batch Normalization
Hung-yi Lee (李宏毅)

Changing Landscape

Consider a single neuron $y = b + w_1 x_1 + w_2 x_2$, with error $e$ between the output $y$ and the target $\hat{y}$, and total loss $L = \sum e$.

When $x_1$ takes small values ($1, 2, \dots$), a change $+\Delta w_1$ produces only a small $+\Delta y$, hence a small $+\Delta e$ and a small $+\Delta L$: the loss surface is smooth along the $w_1$ direction.

Changing Landscape

When $x_2$ takes large values ($100, 200, \dots$), the same change $+\Delta w_2$ produces a large $+\Delta y$, a large $+\Delta e$, and a large $+\Delta L$: the loss surface is steep along the $w_2$ direction.

Input dimensions with different ranges therefore create a loss landscape that is smooth in one direction and steep in another, which is hard to optimize; giving all dimensions the same range makes the landscape more uniform.
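A one-step derivation (a sketch, assuming the linear neuron above and a squared error $e = (y - \hat{y})^2$, which the slide does not state explicitly) makes the effect concrete: the gradient with respect to each weight is proportional to its input.

    % gradient of the squared error w.r.t. weight w_i,
    % assuming y = b + w_1 x_1 + w_2 x_2 and e = (y - \hat{y})^2
    \[
      \frac{\partial e}{\partial w_i}
      = \frac{\partial e}{\partial y}\,\frac{\partial y}{\partial w_i}
      = 2\,(y - \hat{y})\,x_i
    \]

So with $x_1 \in \{1, 2, \dots\}$ the slope along $w_1$ stays small, while $x_2 \in \{100, 200, \dots\}$ makes the slope along $w_2$ roughly a hundred times larger.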

Feature Normalization

Given training examples $x^1, x^2, \dots, x^R$, compute for each dimension $i$ the mean $m_i$ and the standard deviation $\sigma_i$ over all examples, then normalize:

$\tilde{x}_i^r \leftarrow \dfrac{x_i^r - m_i}{\sigma_i}$

After this, the means of all dimensions are 0 and the variances are all 1. In general, feature normalization makes gradient descent converge faster.
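A minimal NumPy sketch of this normalization (the data values and array shapes are illustrative, not from the slides):

    import numpy as np

    # R = 3 examples (rows x^1, x^2, x^3), each with 2 dimensions
    X = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

    m = X.mean(axis=0)           # per-dimension mean m_i
    sigma = X.std(axis=0)        # per-dimension standard deviation sigma_i

    X_tilde = (X - m) / sigma    # x~_i^r <- (x_i^r - m_i) / sigma_i

    print(X_tilde.mean(axis=0))  # ~0 in every dimension
    print(X_tilde.var(axis=0))   # ~1 in every dimension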

Considering Deep Learning

The normalized inputs $\tilde{x}^1, \tilde{x}^2, \tilde{x}^3$ pass through $W^1$ to give $z^1, z^2, z^3$, through a sigmoid to give $a^1, a^2, a^3$, then through $W^2$, and so on. The different dimensions of $z$ again have different ranges, so the deeper layers are also difficult to optimize: the intermediate outputs also need normalization.

Considering Deep Learning

Over the batch $z^1, z^2, z^3$, compute the element-wise mean and standard deviation:

$\mu = \dfrac{1}{3}\sum_{i=1}^{3} z^i \qquad \sigma = \sqrt{\dfrac{1}{3}\sum_{i=1}^{3} \left(z^i - \mu\right)^2}$

Considering Deep Learning

Normalize each $z^i$ with the batch statistics (element-wise):

$\tilde{z}^i = \dfrac{z^i - \mu}{\sigma}$

then apply the sigmoid to obtain $a^i$. This is Batch Normalization; it requires a batch. Note that $\mu$ and $\sigma$ depend on all the $z^i$, so the examples in the batch are coupled: the whole batch is processed as one large network!
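A sketch of this normalization step for a batch of three activation vectors (the shapes and the small stability constant eps are illustrative assumptions, not from the slide):

    import numpy as np

    Z = np.random.randn(3, 4)   # a batch of 3 activation vectors z^1, z^2, z^3

    mu = Z.mean(axis=0)                            # mu = (1/3) sum_i z^i
    sigma = np.sqrt(((Z - mu) ** 2).mean(axis=0))  # element-wise standard deviation
    eps = 1e-5                                     # avoids division by zero (assumption)

    Z_tilde = (Z - mu) / (sigma + eps)             # z~^i = (z^i - mu) / sigma

Because $\mu$ and $\sigma$ are computed from the whole batch, the gradient for one example flows through every other example in the batch: this is the sense in which the slide calls the batch one large network.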

Batch Normalization

After normalizing, apply a learnable element-wise scale $\gamma$ and shift $\beta$, so the network is not forced to produce zero-mean, unit-variance activations:

$\hat{z}^i = \gamma \odot \tilde{z}^i + \beta \qquad \tilde{z}^i = \dfrac{z^i - \mu}{\sigma}$

$\mu$ and $\sigma$ depend on the $z^i$ in the batch; $\gamma$ and $\beta$ are learned together with the other network parameters.
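Putting the pieces together, a self-contained training-mode sketch (initializing gamma to ones and beta to zeros is a common convention, an assumption not stated on the slide):

    import numpy as np

    def batch_norm_train(Z, gamma, beta, eps=1e-5):
        # Z: batch of activations, shape (N, D); gamma, beta: shape (D,)
        mu = Z.mean(axis=0)
        sigma = np.sqrt(((Z - mu) ** 2).mean(axis=0))
        Z_tilde = (Z - mu) / (sigma + eps)   # z~^i = (z^i - mu) / sigma
        return gamma * Z_tilde + beta        # z^^i = gamma (.) z~^i + beta

    Z = np.random.randn(3, 4)
    out = batch_norm_train(Z, gamma=np.ones(4), beta=np.zeros(4))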

Batch Normalization – Testing

$\tilde{z} = \dfrac{z - \mu}{\sigma}$, but $\mu$ and $\sigma$ come from a batch, and we do not always have a batch at the testing stage. Instead, compute moving averages $\bar{\mu}$ and $\bar{\sigma}$ of the batch statistics $\mu^1, \mu^2, \dots, \mu^t$ during training:

$\bar{\mu} \leftarrow p\,\bar{\mu} + (1 - p)\,\mu^t$

and use $\bar{\mu}$ and $\bar{\sigma}$ in place of the batch statistics at testing time.
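A sketch of this bookkeeping (the momentum value p = 0.9 and the shapes are illustrative assumptions):

    import numpy as np

    p = 0.9                                      # moving-average momentum (assumption)
    mu_bar, sigma_bar = np.zeros(4), np.ones(4)  # running statistics

    # During training: after computing the statistics of batch t,
    # fold them into the running averages.
    for _ in range(100):                         # stand-in for the training loop
        Z = np.random.randn(3, 4)                # a training batch (illustrative)
        mu, sigma = Z.mean(axis=0), Z.std(axis=0)
        mu_bar = p * mu_bar + (1 - p) * mu       # mu-bar <- p mu-bar + (1 - p) mu^t
        sigma_bar = p * sigma_bar + (1 - p) * sigma

    # At testing time, normalize a single example with the stored statistics:
    z_test = np.random.randn(4)
    z_norm = (z_test - mu_bar) / sigma_bar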

Batch Normalization

Original paper: https://arxiv.org/abs/1502.03167

[Figure from the original paper: validation accuracy (roughly 0.4 to 0.8) versus training steps (5M to 30M) for Inception, BN-Baseline, BN-x5, BN-x30, and BN-x5-Sigmoid, with markers for the steps needed to match Inception's accuracy; the BN variants reach it in far fewer steps.]
