Big Data (《大数据》) Course Teaching Resources (Reference): Parallel and Distributed Stochastic Learning - Towards Scalable Learning for Big Data Intelligence (Nanjing University: Wu-Jun Li)

Parallel and Distributed Stochastic Learning - Towards Scalable Learning for Big Data Intelligence

Wu-Jun Li (liwujun@nju.edu.cn)
LAMDA Group
Department of Computer Science and Technology, Nanjing University
National Key Laboratory for Novel Software Technology

Dec 10, 2016

Outline

1. Introduction
2. AsySVRG
3. SCOPE
4. Conclusion

Machine Learning

Supervised Learning: Given a set of training examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, supervised learning tries to solve the following regularized empirical risk minimization problem:
$$\min_{\mathbf{w}} f(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n f_i(\mathbf{w}),$$
where $f_i(\mathbf{w})$ is the loss function (plus some regularization term) defined on example $i$, and $\mathbf{w}$ is the parameter to learn.

Examples:
- Logistic regression: $f(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n [\log(1 + e^{-y_i \mathbf{x}_i^T \mathbf{w}}) + \frac{\lambda}{2} \|\mathbf{w}\|^2]$
- SVM: $f(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n [\max\{0, 1 - y_i \mathbf{x}_i^T \mathbf{w}\} + \frac{\lambda}{2} \|\mathbf{w}\|^2]$
- Deep learning models

Unsupervised Learning: Many unsupervised learning models, such as PCA and matrix factorization, can also be reformulated as similar problems.
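To make the regularized ERM objective concrete, here is a minimal NumPy sketch (an illustration added to these notes, not from the original slides) that evaluates the logistic-regression objective $f(\mathbf{w})$ and its full gradient; the names X, y, and lam are assumed placeholders for the data matrix, label vector, and regularization weight.

```python
import numpy as np

def logistic_objective(w, X, y, lam):
    """Regularized logistic-regression ERM objective:
    f(w) = (1/n) * sum_i [ log(1 + exp(-y_i * x_i^T w)) + (lam/2) * ||w||^2 ]."""
    margins = y * (X @ w)                        # y_i * x_i^T w for every example
    loss = np.mean(np.log1p(np.exp(-margins)))   # average logistic loss
    reg = 0.5 * lam * np.dot(w, w)               # L2 regularization term
    return loss + reg

def logistic_full_gradient(w, X, y, lam):
    """Full gradient (1/n) * sum_i grad f_i(w), as used by deterministic GD."""
    n = X.shape[0]
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))         # per-example derivative of the loss w.r.t. x_i^T w
    return (X.T @ coeff) / n + lam * w
```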

Machine Learning for Big Data

For big data applications, first-order methods have become much more popular than higher-order methods for learning (optimization). Gradient descent methods are the most representative first-order methods.

(Deterministic) gradient descent (GD):
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta_t \left[ \frac{1}{n} \sum_{i=1}^n \nabla f_i(\mathbf{w}_t) \right],$$
where $t$ is the iteration number.
- Linear convergence rate: $O(\rho^t)$
- Iteration cost is $O(n)$

Stochastic gradient descent (SGD): In the $t$-th iteration, randomly choose an example $i_t \in \{1, 2, \ldots, n\}$, then update
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta_t \nabla f_{i_t}(\mathbf{w}_t).$$
- Iteration cost is $O(1)$
- The convergence rate is sublinear: $O(1/t)$
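The sketch below (again an added illustration, not the authors' code) contrasts one deterministic GD step, which touches all $n$ examples, with one SGD step, which touches a single randomly chosen example; grad_fi is an assumed helper computing $\nabla f_i(\mathbf{w})$ for regularized logistic regression.

```python
import numpy as np

def grad_fi(w, x_i, y_i, lam):
    """Gradient of one regularized logistic loss term f_i(w)."""
    m = y_i * np.dot(x_i, w)
    return (-y_i / (1.0 + np.exp(m))) * x_i + lam * w

def gd_step(w, X, y, lam, eta):
    """One deterministic GD step: averages all n per-example gradients (O(n) cost)."""
    n = X.shape[0]
    full_grad = sum(grad_fi(w, X[i], y[i], lam) for i in range(n)) / n
    return w - eta * full_grad

def sgd_step(w, X, y, lam, eta, rng):
    """One SGD step: uses a single randomly chosen example (O(1) cost)."""
    i = rng.integers(X.shape[0])
    return w - eta * grad_fi(w, X[i], y[i], lam)
```

The $O(n)$ versus $O(1)$ per-iteration costs on the slide correspond to the loop over all examples in gd_step versus the single random draw in sgd_step.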

Stochastic Learning for Big Data

Researchers recently proposed improved versions of SGD: SAG [Roux et al., NIPS 2012], SDCA [Shalev-Shwartz and Zhang, JMLR 2013], SVRG [Johnson and Zhang, NIPS 2013].

Number of gradient ($\nabla f_i$) evaluations to reach $\epsilon$-accuracy for smooth and strongly convex problems:
- GD: $O(n\kappa \log(\frac{1}{\epsilon}))$
- SGD: $O(\kappa \cdot \frac{1}{\epsilon})$
- SAG: $O(n \log(\frac{1}{\epsilon}))$ when $n \geq 8\kappa$
- SDCA: $O((n + \kappa) \log(\frac{1}{\epsilon}))$
- SVRG: $O((n + \kappa) \log(\frac{1}{\epsilon}))$

where $\kappa = \frac{L}{\mu} > 1$ is the condition number of the objective function.

Stochastic Learning:
- Stochastic GD
- Stochastic coordinate descent
- Stochastic dual coordinate ascent
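Since SVRG is the building block for AsySVRG in the next part of the talk, a minimal single-threaded SVRG sketch is given below (an illustration added here, not the authors' implementation); grad_fi(i, w) is an assumed callable returning $\nabla f_i(\mathbf{w})$.

```python
import numpy as np

def svrg(grad_fi, n, w0, eta, num_epochs, M, seed=0):
    """Serial SVRG sketch.

    Each outer epoch computes the full gradient once at a snapshot u0, then
    runs M cheap inner updates using the variance-reduced direction
    v = grad f_i(u) - grad f_i(u0) + grad f(u0).
    """
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(num_epochs):
        u0 = w.copy()
        full_grad = sum(grad_fi(i, u0) for i in range(n)) / n   # one full pass per epoch
        u = u0.copy()
        for _ in range(M):
            i = rng.integers(n)
            v = grad_fi(i, u) - grad_fi(i, u0) + full_grad      # unbiased, variance-reduced
            u = u - eta * v
        w = u                                                   # new snapshot for the next epoch
    return w
```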

Parallel and Distributed Stochastic Learning

To further improve the learning scalability (speed):
- Parallel stochastic learning: one machine with multiple cores and a shared memory
- Distributed stochastic learning: a cluster with multiple machines

Key issues: cooperation
- Parallel stochastic learning: lock vs. lock-free (waiting cost and lock cost)
- Distributed stochastic learning: synchronous vs. asynchronous (waiting cost and communication cost)
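To make the lock vs. lock-free trade-off concrete, here is a small Python threading sketch (an added illustration with assumed names w, lock, and updates, not from the slides): both workers apply stochastic updates to a shared parameter vector, one guarding every write with a threading.Lock (paying waiting and lock cost) and one writing without any lock, in the style of lock-free methods.

```python
import threading
import numpy as np

w = np.zeros(10)            # shared parameter vector in shared memory
lock = threading.Lock()

def locked_worker(updates):
    """Every write acquires the lock: no interleaving, but pays waiting + lock cost."""
    for delta in updates:
        with lock:
            w[:] = w - delta

def lockfree_worker(updates):
    """Writes without any lock: no waiting cost, but reads and writes may interleave."""
    for delta in updates:
        w[:] = w - delta

# Example: run two lock-free workers on random updates.
rng = np.random.default_rng(0)
updates = [0.01 * rng.normal(size=10) for _ in range(100)]
threads = [threading.Thread(target=lockfree_worker, args=(updates,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```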

Our Contributions

Parallel stochastic learning: AsySVRG
- Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Distributed stochastic learning: SCOPE
- Scalable Composite Optimization for Learning

AsySVRG

Outline

1. Introduction
2. AsySVRG
3. SCOPE
4. Conclusion

AsySVRG: Motivation and Contribution

Motivation:
- Existing asynchronous parallel SGD: Hogwild! [Recht et al. 2011] and PASSCoDe [Hsieh, Yu, and Dhillon 2015]
- No parallel methods for SVRG.
- Lock-free: empirically effective, but no theoretical proof.

Contribution:
- A fast asynchronous method to parallelize SVRG, called AsySVRG.
- A lock-free parallel strategy for both read and write.
- Linear convergence rate with theoretical proof.
- Outperforms Hogwild! in experiments.

AsySVRG is the first lock-free parallel SGD method with theoretical proof of convergence.

AsySVRG: a multi-thread version of SVRG

Initialization: $p$ threads, initialize $\mathbf{w}_0$, $\eta$;
for $t = 0, 1, 2, \ldots$ do
    $\mathbf{u}_0 = \mathbf{w}_t$;
    All threads compute the full gradient in parallel: $\nabla f(\mathbf{u}_0) = \frac{1}{n} \sum_{i=1}^n \nabla f_i(\mathbf{u}_0)$;
    $\mathbf{u} = \mathbf{w}_t$;
    For each thread, do:
        for $m = 1$ to $M$ do
            Read the current value of $\mathbf{u}$, denoted $\hat{\mathbf{u}}$, from the shared memory, and randomly pick an $i$ from $\{1, \ldots, n\}$;
            Compute the update vector: $\hat{\mathbf{v}} = \nabla f_i(\hat{\mathbf{u}}) - \nabla f_i(\mathbf{u}_0) + \nabla f(\mathbf{u}_0)$;
            $\mathbf{u} \leftarrow \mathbf{u} - \eta \hat{\mathbf{v}}$;
        end for
    Take $\mathbf{w}_{t+1}$ to be the current value of $\mathbf{u}$ in the shared memory;
end for
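To show how the lock-free shared-memory updates above can be organized in code, here is a minimal multi-threaded sketch (an added illustration under assumed names, not the authors' implementation). For brevity the full-gradient step is written serially even though the algorithm parallelizes it across threads, and CPython's GIL means the sketch mimics the structure of AsySVRG rather than delivering real parallel speedup; grad_fi(i, w) is an assumed callable returning $\nabla f_i(\mathbf{w})$.

```python
import threading
import numpy as np

def asy_svrg(grad_fi, n, w0, eta, num_epochs, M, p):
    """Multi-threaded AsySVRG sketch: p threads share the iterate u and
    update it without any lock, following the lock-free strategy above."""
    w = w0.copy()
    for _ in range(num_epochs):
        u0 = w.copy()
        full_grad = sum(grad_fi(i, u0) for i in range(n)) / n   # shared full gradient at the snapshot
        u = w.copy()                                            # shared iterate (simulated shared memory)

        def worker(seed):
            rng = np.random.default_rng(seed)
            for _ in range(M):
                u_hat = u.copy()                                # lock-free read of the shared iterate
                i = rng.integers(n)
                v_hat = grad_fi(i, u_hat) - grad_fi(i, u0) + full_grad
                u[:] = u - eta * v_hat                          # lock-free write back to shared memory

        threads = [threading.Thread(target=worker, args=(s,)) for s in range(p)]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        w = u.copy()                                            # w_{t+1}: current value in shared memory
    return w
```

As stated on the previous slide, the point of AsySVRG is that even with such unsynchronized reads and writes of the shared iterate, the method provably retains a linear convergence rate.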