University of Electronic Science and Technology of China: "Big Data Analysis and Mining" course teaching resources (lecture slides), Lecture 7 Hadoop-Spark

Lecture 7 Hadoop/Spark

What is Hadoop
• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
  • Large datasets → terabytes or petabytes of data
  • Large clusters → hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google's MapReduce
• Hadoop is based on a simple programming model called MapReduce
• Hadoop is based on a simple data model; any data will fit

Design Principles of Hadoop
• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware
  • A large number of low-end, cheap machines working in parallel to solve a computing problem
• This is in contrast to parallel DBs
  • A small number of high-end, expensive machines

Divide and Conquer
[Figure: "Work" is partitioned into w1, w2, w3, each handled by a "worker"; the workers' outputs are combined into "Result".]
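
The partition/worker/combine pattern on this slide can be sketched in a few lines of Python. This is only an illustration on one machine using threads, not how Hadoop runs jobs. The function names (`partition`, `worker`, `combine`, `divide_and_conquer`) and the choice of summation as the "work" are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(work, n_parts):
    """Split `work` into n_parts contiguous chunks of near-equal size."""
    size, rem = divmod(len(work), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < rem else 0)
        chunks.append(work[start:end])
        start = end
    return chunks


def worker(chunk):
    """Process one chunk independently of the others (here: sum it)."""
    return sum(chunk)


def combine(partials):
    """Merge the per-worker results into the final result."""
    return sum(partials)


def divide_and_conquer(work, n_workers=3):
    chunks = partition(work, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, chunks))
    return combine(partials)


if __name__ == "__main__":
    print(divide_and_conquer(list(range(1, 101))))  # 5050
```

The sketch works because summation can be computed on each chunk separately and the partial results merged afterwards. Work that does not decompose this way needs a different combine step.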

It's a bit more complex…
• Fundamental issues: scheduling, data distribution, synchronization, inter-process communication, robustness, fault tolerance, …
• Different programming models: message passing, shared memory
  [Figure: processes P1 to P5 exchanging messages (message passing) versus P1 to P5 attached to a common memory (shared memory).]
• Architectural issues: Flynn's taxonomy (SIMD, MIMD, etc.), network topology, bisection bandwidth, UMA vs. NUMA, cache coherence
• Different programming constructs: mutexes, condition variables, barriers, …; masters/slaves, producers/consumers, work queues, …
• Common problems: livelock, deadlock, data starvation, priority inversion, …; dining philosophers, sleeping barbers, cigarette smokers, …
• The reality: the programmer shoulders the burden of managing concurrency…
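
To make the two programming models concrete, here is a minimal sketch using Python threads. The names `shared_memory_sum` and `message_passing_sum` are made up for this example. The first version needs an explicit mutex around the shared total. The second shares no state and has each worker send its result over a queue.

```python
import queue
import threading


def shared_memory_sum(values, n_threads=4):
    """Shared-memory style: all threads update one total, guarded by a mutex."""
    total = 0
    lock = threading.Lock()

    def work(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # without the lock, concurrent updates could be lost
            total += partial

    threads = [threading.Thread(target=work, args=(values[i::n_threads],))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing_sum(values, n_threads=4):
    """Message-passing style: workers share nothing and send results to a queue."""
    outbox = queue.Queue()

    def work(chunk):
        outbox.put(sum(chunk))

    threads = [threading.Thread(target=work, args=(values[i::n_threads],))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Exactly one message per worker is expected.
    return sum(outbox.get() for _ in range(n_threads))
```

In both versions the programmer still has to start, synchronize, and join the workers by hand, which is the burden the slide refers to.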

Source: Ricardo Guimarães Herrmann

Design Principles of Hadoop
• Automatic parallelization & distribution
  • Hidden from the end user
• Fault tolerance and automatic recovery
  • Nodes/tasks will fail and will recover automatically
• Clean and simple programming abstraction
  • Users only provide two functions: "map" and "reduce"
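
As an illustration of the "two functions" abstraction, the sketch below counts words with a user-supplied `map_fn` and `reduce_fn`. The driver `run_mapreduce` is a hypothetical, single-process stand-in for the framework: it does the grouping by key (the "shuffle") that Hadoop would do across the cluster. It is not the Hadoop API.

```python
from collections import defaultdict


def map_fn(_key, line):
    """User-provided map: emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """User-provided reduce: add up all counts emitted for one word."""
    yield word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver (assumption for this sketch): map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # shuffle: group values by key
    output = []
    for out_key in sorted(groups):
        output.extend(reduce_fn(out_key, groups[out_key]))
    return output


lines = ["the quick brown fox", "the lazy dog", "the fox"]
result = run_mapreduce(enumerate(lines), map_fn, reduce_fn)
```

Here `result` is a list of `(word, count)` pairs sorted by word. Everything outside `map_fn` and `reduce_fn` is what the framework takes off the user's hands.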

Distributed File System
• Don't move data to workers… move workers to the data!
  • Store data on the local disks of nodes in the cluster
  • Start up the workers on the node that holds the data locally
• Why?
  • Not enough RAM to hold all the data in memory
  • Disk access is slow, but disk throughput is reasonable
• A distributed file system is the answer
  • GFS (Google File System)
  • HDFS for Hadoop (= GFS clone)
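
"Move workers to the data" can be sketched as a placement rule: run each task on a node that already stores the block it reads, and fall back to a remote node only when no such node has a free slot. The function `assign_tasks`, the node and block names, and the slot counts below are all hypothetical. This is a greedy toy, not Hadoop's actual scheduler.

```python
def assign_tasks(block_locations, free_slots):
    """Greedy data-local placement (illustrative only).

    block_locations: {block_id: [nodes that store a copy of the block]}
    free_slots:      {node: number of free task slots}
    Returns {block_id: (chosen_node, is_data_local)}.
    """
    slots = dict(free_slots)  # work on a copy; do not mutate the caller's dict
    assignment = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            # Prefer a node that already holds the data.
            node, is_local = max(local, key=lambda n: slots[n]), True
        else:
            # No data-local slot is free: the block must be read remotely.
            candidates = [n for n, s in slots.items() if s > 0]
            if not candidates:
                raise RuntimeError("no free task slots in the cluster")
            node, is_local = max(candidates, key=lambda n: slots[n]), False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


blocks = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n3"]}
slots = {"n1": 2, "n2": 1, "n3": 0, "n4": 1}
assignment = assign_tasks(blocks, slots)
```

In this example `b1` and `b2` run on `n1`, which stores both blocks, while `b3` has to run remotely because `n3` has no free slot.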

Hadoop: How it Works

Hadoop
[Figure: Hadoop project components: HBase, Pig, Hive, Chukwa, MapReduce, HDFS, ZooKeeper, Core, Avro]
Hadoop's stated mission (Doug Cutting interview): commoditize infrastructure for web-scale, data-intensive applications