
SVM Tutorial

Zoya Gavrilov

Just the basics with a little bit of spoon-feeding...

[Figure: Linearly separable data from two classes. The decision boundary $w^T x + b = 0$ lies midway between the margin boundaries $w^T x + b = 1$ and $w^T x + b = -1$; the support vectors are the data points lying on the margin boundaries, and the margin is the distance between those two boundaries.]
1 Simplest case: linearly-separable data, binary classification

Goal: we want to find the hyperplane (i.e. decision boundary) linearly separating our classes. Our boundary will have equation: $w^T x + b = 0$.

Anything above the decision boundary should have label 1, i.e., $x_i$ s.t. $w^T x_i + b > 0$ will have corresponding $y_i = 1$.

Similarly, anything below the decision boundary should have label $-1$, i.e., $x_i$ s.t. $w^T x_i + b < 0$ will have corresponding $y_i = -1$.
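As a concrete illustration of this labelling rule (not part of the original tutorial), here is a minimal numpy sketch with made-up values of $w$ and $b$; the predicted label of each point is simply the sign of $w^T x + b$:

```python
import numpy as np

# Hypothetical, hand-picked parameters of a separating hyperplane in R^2.
w = np.array([1.0, -2.0])
b = 0.5

# A few toy points; their labels come from which side of w^T x + b = 0 they fall on.
X = np.array([[3.0, 0.0],    # w^T x + b =  3.5  -> label +1
              [0.0, 2.0],    # w^T x + b = -3.5  -> label -1
              [1.0, 1.0]])   # w^T x + b = -0.5  -> label -1

y_pred = np.sign(X @ w + b)
print(y_pred)  # [ 1. -1. -1.]
```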

The reason for this labelling scheme is that it lets us condense the formulation for the decision function to $f(x) = \mathrm{sign}(w^T x + b)$, since $f(x) = +1$ for all $x$ above the boundary, and $f(x) = -1$ for all $x$ below the boundary. Thus, we can figure out if an instance has been classified properly by checking that $y(w^T x + b) \geq 1$ (which will be the case as long as $y$ and $w^T x + b$ are either both positive or both negative).

You'll notice that we will now have some space between our decision boundary and the nearest data points of either class. Thus, let's rescale the data such that anything on or above the boundary $w^T x + b = 1$ is of one class (with label 1), and anything on or below the boundary $w^T x + b = -1$ is of the other class (with label $-1$).

What is the distance between these newly added boundaries?

First note that the two lines are parallel, and thus share their parameters $w, b$. Pick an arbitrary point $x_1$ to lie on the line $w^T x + b = -1$. Then the closest point on the line $w^T x + b = 1$ to $x_1$ is the point $x_2 = x_1 + \lambda w$ (since the closest point will always lie on the perpendicular; recall that the vector $w$ is perpendicular to both lines). Using this formulation, $\lambda w$ will be the line segment connecting $x_1$ and $x_2$, and thus $\lambda \|w\|$, the distance between $x_1$ and $x_2$, is the shortest distance between the two lines/boundaries. Solving for $\lambda$:

$$w^T x_2 + b = 1 \quad \text{where } x_2 = x_1 + \lambda w$$
$$\Rightarrow w^T (x_1 + \lambda w) + b = 1$$
$$\Rightarrow w^T x_1 + b + \lambda w^T w = 1 \quad \text{where } w^T x_1 + b = -1$$
$$\Rightarrow -1 + \lambda w^T w = 1$$
$$\Rightarrow \lambda w^T w = 2$$
$$\Rightarrow \lambda = \frac{2}{w^T w} = \frac{2}{\|w\|^2}$$

And so, the distance $\lambda \|w\|$ is $\frac{2}{\|w\|^2}\|w\| = \frac{2}{\|w\|} = \frac{2}{\sqrt{w^T w}}$.
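A quick numerical sanity check of this result (added here for illustration, using arbitrary example values of $w$ and $b$ that are not from the text): pick a point on $w^T x + b = -1$, step to the other boundary along $w$, and confirm the distance travelled equals $2/\|w\|$:

```python
import numpy as np

# Arbitrary example hyperplane parameters in R^2 (purely illustrative).
w = np.array([3.0, 4.0])   # ||w|| = 5
b = 1.0

# Pick a point x1 on the boundary w^T x + b = -1.
# With x1 = t * w, we need t * w^T w + b = -1  =>  t = (-1 - b) / (w^T w).
t = (-1.0 - b) / (w @ w)
x1 = t * w
assert np.isclose(w @ x1 + b, -1.0)

# Step along w by lambda = 2 / (w^T w) to reach the boundary w^T x + b = 1.
lam = 2.0 / (w @ w)
x2 = x1 + lam * w
assert np.isclose(w @ x2 + b, 1.0)

# The distance between the boundaries matches 2 / ||w||.
print(np.linalg.norm(x2 - x1), 2.0 / np.linalg.norm(w))  # both approximately 0.4
```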

It's intuitive that we would want to maximize the distance between the two boundaries demarcating the classes (Why? We want to be as sure as possible that we are not making classification mistakes, and thus we want our data points from the two classes to lie as far away from each other as possible). This distance is called the margin, so what we want to do is to obtain the maximal margin.

Thus, we want to maximize $\frac{2}{\sqrt{w^T w}}$, which is equivalent to minimizing $\frac{\sqrt{w^T w}}{2}$, which is in turn equivalent to minimizing $\frac{w^T w}{2}$ (since square root is a monotonic function).

This quadratic programming problem is expressed as:

$$\min_{w,b} \frac{w^T w}{2} \quad \text{subject to: } y_i(w^T x_i + b) \geq 1 \ (\forall \text{ data points } x_i).$$

2 Soft-margin extension

Consider the case that your data isn't perfectly linearly separable. For instance, maybe you aren't guaranteed that all your data points are correctly labelled, so you want to allow some data points of one class to appear on the other side of the boundary.

We can introduce slack variables - an $\epsilon_i \geq 0$ for each $x_i$. Our quadratic programming problem becomes:

$$\min_{w,b,\epsilon} \frac{w^T w}{2} + C\sum_i \epsilon_i \quad \text{subject to: } y_i(w^T x_i + b) \geq 1 - \epsilon_i \text{ and } \epsilon_i \geq 0 \ (\forall \text{ data points } x_i).$$

3 Nonlinear decision boundary

Mapping your data vectors, $x_i$, into a higher-dimensional (even infinite-dimensional) feature space may make them linearly separable in that space (whereas they may not be linearly separable in the original space). The formulation of the quadratic programming problem is as above, but with all $x_i$ replaced with $\phi(x_i)$, where $\phi$ provides the higher-dimensional mapping. So we have the standard SVM formulation:

$$\min_{w,b,\epsilon} \frac{w^T w}{2} + C\sum_i \epsilon_i \quad \text{subject to: } y_i(w^T \phi(x_i) + b) \geq 1 - \epsilon_i \text{ and } \epsilon_i \geq 0 \ (\forall \text{ data points } x_i).$$
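This soft-margin, kernelized formulation is what off-the-shelf SVM solvers implement. Purely as an illustration (a sketch using scikit-learn, which is my choice here and not prescribed by the tutorial), `C` below is the slack penalty from the objective above, and the `rbf` kernel plays the role of the implicit mapping $\phi$:

```python
# A minimal sketch assuming scikit-learn is available.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy 2-class data that is not linearly separable: one class inside a ring, one outside.
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)

# C is the soft-margin penalty from w^T w / 2 + C * sum_i eps_i;
# kernel='rbf' corresponds to an implicit (infinite-dimensional) feature map phi.
clf = SVC(C=1.0, kernel='rbf')
clf.fit(X, y)

print(clf.score(X, y))             # training accuracy
print(clf.support_vectors_.shape)  # only the support vectors are retained after training
```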

4 Reformulating as a Lagrangian

We can introduce Lagrange multipliers to represent the condition: $y_i(w^T \phi(x_i) + b)$ must be as close to 1 as possible.

This condition is captured by: $\max_{\alpha_i \geq 0} \alpha_i[1 - y_i(w^T \phi(x_i) + b)]$

This ensures that when $y_i(w^T \phi(x_i) + b) \geq 1$, the expression above is maximal when $\alpha_i = 0$ (since $[1 - y_i(w^T \phi(x_i) + b)]$ ends up being non-positive). Otherwise, $y_i(w^T \phi(x_i) + b) < 1$, so $[1 - y_i(w^T \phi(x_i) + b)]$ is a positive value, and the expression is maximal when $\alpha_i \to \infty$. This has the effect of penalizing any misclassified data points, while assigning 0 penalty to properly classified instances.

We thus have the following formulation:

$$\min_{w,b}\left[\frac{w^T w}{2} + \sum_i \max_{\alpha_i \geq 0} \alpha_i[1 - y_i(w^T \phi(x_i) + b)]\right]$$

To allow for slack (soft-margin), preventing the $\alpha$ variables from going to $\infty$, we can impose constraints on the Lagrange multipliers to lie within: $0 \leq \alpha_i \leq C$.

We can define the dual problem by interchanging the max and min as follows (i.e. we minimize after fixing alpha):

$$\max_{\alpha \geq 0}\left[\min_{w,b} J(w,b;\alpha)\right] \quad \text{where } J(w,b;\alpha) = \frac{w^T w}{2} + \sum_i \alpha_i[1 - y_i(w^T \phi(x_i) + b)]$$

Since we're solving an optimization problem, we set $\frac{\partial J}{\partial w} = 0$ and discover that the optimal setting of $w$ is $\sum_i \alpha_i y_i \phi(x_i)$, while setting $\frac{\partial J}{\partial b} = 0$ yields the constraint $\sum_i \alpha_i y_i = 0$.

Thus, after substituting and simplifying, we get:

$$\min_{w,b} J(w,b;\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j)$$

And thus our dual is:

$$\max_{\alpha \geq 0}\left[\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j)\right]$$

Subject to: $\sum_i \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$.
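To make the dual concrete, here is a hedged sketch (not from the tutorial, and assuming the cvxopt library) that solves exactly this box-constrained quadratic program. It is written in cvxopt's minimization form: minimize $\frac{1}{2}\alpha^T Q \alpha - \mathbf{1}^T \alpha$ with $Q_{ij} = y_i y_j K(x_i, x_j)$, subject to $\sum_i \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$. In practice one would use a dedicated solver such as SMO, but the mapping to a generic QP is instructive.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, C, kernel):
    """Solve max_a sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j K(x_i, x_j)
    subject to sum_i a_i y_i = 0 and 0 <= a_i <= C."""
    n = X.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])

    P = matrix(np.outer(y, y) * K)                    # Q_ij = y_i y_j K(x_i, x_j)
    q = matrix(-np.ones(n))                           # minimize 1/2 a^T Q a - 1^T a
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))    # encodes -a_i <= 0 and a_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.astype(float).reshape(1, -1))        # equality constraint sum_i a_i y_i = 0
    b = matrix(0.0)

    solvers.options['show_progress'] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()                 # the optimal alpha_i

# Usage with a linear kernel on toy data:
# alpha = svm_dual(X, y, C=1.0, kernel=lambda u, v: u @ v)
```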

5 Kernel Trick

Because we're working in a higher-dimensional space (and potentially even an infinite-dimensional space), calculating $\phi(x_i)^T \phi(x_j)$ may be intractable. However, it turns out that there are special kernel functions that operate on the lower-dimensional vectors $x_i$ and $x_j$ to produce a value equivalent to the dot-product of the higher-dimensional vectors.

For instance, consider the function $\phi : \mathbb{R}^3 \mapsto \mathbb{R}^{10}$, where:

$$\phi(x) = \left(1, \sqrt{2}x^{(1)}, \sqrt{2}x^{(2)}, \sqrt{2}x^{(3)}, [x^{(1)}]^2, [x^{(2)}]^2, [x^{(3)}]^2, \sqrt{2}x^{(1)}x^{(2)}, \sqrt{2}x^{(1)}x^{(3)}, \sqrt{2}x^{(2)}x^{(3)}\right)$$

Instead, we have the kernel trick, which tells us that $K(x_i, x_j) = (1 + x_i^T x_j)^2 = \phi(x_i)^T \phi(x_j)$ for the given $\phi$. Thus, we can simplify our calculations.

Re-writing the dual in terms of the kernel yields:

$$\max_{\alpha \geq 0}\left[\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)\right]$$
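A tiny numerical check of this identity (added for illustration, not part of the original text): expand $\phi$ explicitly for two vectors in $\mathbb{R}^3$ and compare $\phi(x)^T \phi(z)$ with $(1 + x^T z)^2$:

```python
import numpy as np

def phi(x):
    """The explicit R^3 -> R^10 feature map from the text."""
    x1, x2, x3 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s*x1, s*x2, s*x3,
                     x1**2, x2**2, x3**2,
                     s*x1*x2, s*x1*x3, s*x2*x3])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

lhs = (1.0 + x @ z) ** 2     # kernel evaluated on the original 3-d vectors
rhs = phi(x) @ phi(z)        # dot product evaluated in the 10-d feature space
print(lhs, rhs)              # both equal 30.25
assert np.isclose(lhs, rhs)
```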
6 Decision function

To classify a novel instance $x$ once you've learned the optimal $\alpha_i$ parameters, all you have to do is calculate $f(x) = \mathrm{sign}(w^T \phi(x) + b) = \mathrm{sign}\left(\sum_i \alpha_i y_i K(x_i, x) + b\right)$ (by setting $w = \sum_i \alpha_i y_i \phi(x_i)$ and using the kernel trick).

Note that $\alpha_i$ is only non-zero for instances $\phi(x_i)$ on or near the boundary - those are called the support vectors, since they alone specify the decision boundary. We can toss out the other data points once training is complete. Thus, we only sum over the $x_i$ which constitute the support vectors.
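Continuing the earlier dual-solver sketch (again an illustration, not code from the tutorial), the decision function needs only the support vectors, their labels, their $\alpha_i$, and $b$; a hypothetical `decision_function` helper might look like:

```python
import numpy as np

def decision_function(x, support_X, support_y, support_alpha, b, kernel):
    """f(x) = sign(sum_i alpha_i y_i K(x_i, x) + b), summing only over support vectors."""
    s = sum(a * yi * kernel(xi, x)
            for a, yi, xi in zip(support_alpha, support_y, support_X))
    return np.sign(s + b)

# Example usage with a linear kernel and made-up support vectors:
# y_hat = decision_function(x_new, support_X, support_y, support_alpha, b,
#                           kernel=lambda u, v: u @ v)
```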

7 Learn more

A great resource is the MLSS 2006 (Machine Learning Summer School) talk by Chih-Jen Lin, available at: videolectures.net/mlss06tw_lin_svm/. Make sure to download his presentation slides. Starting with the basics, the talk continues on to discussing practical issues, parameter/kernel selection, as well as more advanced topics.

The typical reference for such matters: Pattern Recognition and Machine Learning by Christopher M. Bishop.